Message ID | 1638356341-17014-1-git-send-email-huangzhaoyang@gmail.com
---|---
State | New
Series | [RFC] mm: count zram read/write into PSI_IO_WAIT
There is no chance for zram reading/writing to be counted in PSI_IO_WAIT
so far, as zram deals with the request directly in the current context
without invoking submit_bio and io_schedule.

On Wed, Dec 1, 2021 at 6:59 PM Huangzhaoyang <huangzhaoyang@gmail.com> wrote:
>
> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
>
> Have zram reading/writing be counted in PSI_IO_WAIT.
>
> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> ---
>  drivers/block/zram/zram_drv.c | 18 ++++++++++++++++--
>  1 file changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index fcaf275..b0e4766 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -34,6 +34,7 @@
>  #include <linux/debugfs.h>
>  #include <linux/cpuhotplug.h>
>  #include <linux/part_stat.h>
> +#include <linux/psi.h>
>
>  #include "zram_drv.h"
>
> @@ -1246,7 +1247,9 @@ static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
>  					zram_get_element(zram, index),
>  					bio, partial_io);
>  	}
> -
> +#ifdef CONFIG_PSI
> +	psi_task_change(current, 0, TSK_IOWAIT);
> +#endif
>  	handle = zram_get_handle(zram, index);
>  	if (!handle || zram_test_flag(zram, index, ZRAM_SAME)) {
>  		unsigned long value;
> @@ -1257,6 +1260,9 @@ static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
>  		zram_fill_page(mem, PAGE_SIZE, value);
>  		kunmap_atomic(mem);
>  		zram_slot_unlock(zram, index);
> +#ifdef CONFIG_PSI
> +		psi_task_change(current, TSK_IOWAIT, 0);
> +#endif
>  		return 0;
>  	}
>
> @@ -1284,6 +1290,9 @@ static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
>  	if (WARN_ON(ret))
>  		pr_err("Decompression failed! err=%d, page=%u\n", ret, index);
>
> +#ifdef CONFIG_PSI
> +	psi_task_change(current, TSK_IOWAIT, 0);
> +#endif
>  	return ret;
>  }
>
> @@ -1471,7 +1480,13 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
>  		vec.bv_offset = 0;
>  	}
>
> +#ifdef CONFIG_PSI
> +	psi_task_change(current, 0, TSK_IOWAIT);
> +#endif
>  	ret = __zram_bvec_write(zram, &vec, index, bio);
> +#ifdef CONFIG_PSI
> +	psi_task_change(current, TSK_IOWAIT, 0);
> +#endif
>  out:
>  	if (is_partial_io(bvec))
>  		__free_page(page);
> @@ -1607,7 +1622,6 @@ static blk_qc_t zram_submit_bio(struct bio *bio)
>  		atomic64_inc(&zram->stats.invalid_io);
>  		goto error;
>  	}
> -
>  	__zram_make_request(zram, bio);
>  	return BLK_QC_T_NONE;
>
> --
> 1.9.1
>
On Wed, Dec 01, 2021 at 07:12:30PM +0800, Zhaoyang Huang wrote:
> There is no chance for zram reading/writing to be counted in
> PSI_IO_WAIT so far, as zram deals with the request directly in the
> current context without invoking submit_bio and io_schedule.

Hm, but you're also not waiting for a real io device - during which
the CPU could be doing something else. You're waiting for
decompression. The thread also isn't in D-state during that time. What
scenario would benefit from this accounting? How is IO pressure from
comp/decomp paths actionable to you?

What about when you use zram with disk writeback enabled, and you see
a mix of decompression and actual disk IO? Wouldn't you want to be
able to tell the two apart, to see if you're short on CPU or short on
IO bandwidth in this setup? Your patch would make that impossible.

This needs a much more comprehensive changelog.

> > @@ -1246,7 +1247,9 @@ static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
> >  					zram_get_element(zram, index),
> >  					bio, partial_io);
> >  	}
> > -
> > +#ifdef CONFIG_PSI
> > +	psi_task_change(current, 0, TSK_IOWAIT);
> > +#endif

Add psi_iostall_enter() and psi_iostall_leave() helpers that encapsulate
the ifdefs.
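A minimal sketch of what such helpers could look like, assuming the
psi_iostall_enter()/psi_iostall_leave() names from the comment above
(they are not an existing kernel API, and the placement is illustrative
only):

/*
 * Hypothetical helpers, e.g. in include/linux/psi.h.  Names come from
 * the review comment; this is a sketch, not existing kernel code.
 */
#ifdef CONFIG_PSI
static inline void psi_iostall_enter(void)
{
	/* mark current as waiting on IO so the time is charged to PSI_IO */
	psi_task_change(current, 0, TSK_IOWAIT);
}

static inline void psi_iostall_leave(void)
{
	/* clear the IO-wait state set by psi_iostall_enter() */
	psi_task_change(current, TSK_IOWAIT, 0);
}
#else
static inline void psi_iostall_enter(void) { }
static inline void psi_iostall_leave(void) { }
#endif

With CONFIG_PSI=n both helpers compile away, so callers such as
__zram_bvec_read() and zram_bvec_write() would not need any #ifdef
blocks of their own.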
On Fri, Dec 3, 2021 at 12:28 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Dec 01, 2021 at 07:12:30PM +0800, Zhaoyang Huang wrote:
> > There is no chance for zram reading/writing to be counted in
> > PSI_IO_WAIT so far, as zram deals with the request directly in the
> > current context without invoking submit_bio and io_schedule.
>
> Hm, but you're also not waiting for a real io device - during which
> the CPU could be doing something else. You're waiting for
> decompression. The thread also isn't in D-state during that time. What
> scenario would benefit from this accounting? How is IO pressure from
> comp/decomp paths actionable to you?
No. Block-device-related D-state is already counted via
psi_dequeue(io_wait). What I am proposing here is to NOT ignore the
non-productive time caused by huge numbers of in-context swap-ins and
swap-outs (zram-like). This can help make IO pressure more accurate and
consistent with the PSWPIN/PSWPOUT counters. It is like counting the IO
time within filemap_fault->wait_on_page_bit_common as a memory stall,
which raises memory pressure because of IO.
>
> What about when you use zram with disk writeback enabled, and you see
> a mix of decompression and actual disk IO? Wouldn't you want to be
> able to tell the two apart, to see if you're short on CPU or short on
> IO bandwidth in this setup? Your patch would make that impossible.
OK. Would it be better to start the IO counting from pageout()? Both
bdev-backed and ram-backed swap would benefit from it.
>
> This needs a much more comprehensive changelog.
>
> > > @@ -1246,7 +1247,9 @@ static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
> > >  					zram_get_element(zram, index),
> > >  					bio, partial_io);
> > >  	}
> > > -
> > > +#ifdef CONFIG_PSI
> > > +	psi_task_change(current, 0, TSK_IOWAIT);
> > > +#endif
>
> Add psi_iostall_enter() and psi_iostall_leave() helpers that encapsulate
> the ifdefs.
OK.
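For illustration only, a rough sketch of where that alternative
annotation might sit, using the hypothetical psi_iostall_enter()/
psi_iostall_leave() helpers sketched earlier in the thread; pageout()
in mm/vmscan.c is heavily simplified here, and whether accounting
belongs at this level at all is exactly what is being debated:

/* Heavily simplified sketch of pageout() with the proposed annotation */
static pageout_t pageout(struct page *page, struct address_space *mapping)
{
	struct writeback_control wbc = {
		.sync_mode	= WB_SYNC_NONE,
		.nr_to_write	= SWAP_CLUSTER_MAX,
	};
	int res;

	/* ... mapping / dirty-page checks elided ... */

	SetPageReclaim(page);
	psi_iostall_enter();	/* would cover bdev- and ram-backed swap alike */
	res = mapping->a_ops->writepage(page, &wbc);
	psi_iostall_leave();
	if (res < 0)
		handle_write_error(mapping, page, res);

	/* ... AOP_WRITEPAGE_ACTIVATE and PG_reclaim handling elided ... */
	return PAGE_SUCCESS;
}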
On Fri, Dec 03, 2021 at 05:16:47PM +0800, Zhaoyang Huang wrote:
> On Fri, Dec 3, 2021 at 12:28 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, Dec 01, 2021 at 07:12:30PM +0800, Zhaoyang Huang wrote:
> > > There is no chance for zram reading/writing to be counted in
> > > PSI_IO_WAIT so far, as zram deals with the request directly in the
> > > current context without invoking submit_bio and io_schedule.
> >
> > Hm, but you're also not waiting for a real io device - during which
> > the CPU could be doing something else. You're waiting for
> > decompression. The thread also isn't in D-state during that time. What
> > scenario would benefit from this accounting? How is IO pressure from
> > comp/decomp paths actionable to you?
> No. Block-device-related D-state is already counted via
> psi_dequeue(io_wait). What I am proposing here is to NOT ignore the
> non-productive time caused by huge numbers of in-context swap-ins and
> swap-outs (zram-like). This can help make IO pressure more accurate and
> consistent with the PSWPIN/PSWPOUT counters. It is like counting the IO
> time within filemap_fault->wait_on_page_bit_common as a memory stall,
> which raises memory pressure because of IO.

It's not ignored; it shows up as memory pressure, because those delays
occur due to a lack of memory. On the other hand, having a faster IO
device would make no difference to the time spent on compression and
decompression. Counting this time as IO pressure makes no sense to me.

I'm against merging this patch.
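For context, the existing convention being referred to here: delays
caused by a lack of memory are bracketed with psi_memstall_enter()/
psi_memstall_leave() from <linux/psi.h>, so the stall time is charged
to memory pressure rather than IO pressure. The wrapper below is a
schematic illustration with a hypothetical function name, not actual
upstream code:

#include <linux/psi.h>
#include <linux/pagemap.h>

/* Hypothetical helper: charge a refault/swap-in wait to memory pressure */
static void wait_for_refaulting_page(struct page *page)
{
	unsigned long pflags;

	psi_memstall_enter(&pflags);	/* time from here counts as memstall */
	wait_on_page_locked(page);	/* e.g. waiting for the read IO to finish */
	psi_memstall_leave(&pflags);
}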
Zhaoyang Huang writes:
>No. Block-device-related D-state is already counted via
>psi_dequeue(io_wait). What I am proposing here is to NOT ignore the
>non-productive time caused by huge numbers of in-context swap-ins and
>swap-outs (zram-like). This can help make IO pressure more accurate and
>consistent with the PSWPIN/PSWPOUT counters. It is like counting the IO
>time within filemap_fault->wait_on_page_bit_common as a memory stall,
>which raises memory pressure because of IO.

I think part of the confusion here is that the name "io" doesn't really
just mean "io", it means "disk I/O". As in, we are targeting real,
physical or network disk I/O.

Of course, we can only do what's reasonable if the device we're
accounting for is layers upon layers eventually leading to a
memory-backed device, but _intentionally_ polluting that with more
memory-bound accesses doesn't make any sense when we already have
separate accounting for memory. Why would anyone want that?

I'm with Johannes here, I think this would actively make memory
pressure monitoring less useful. This is a NAK from my perspective as
someone who actually uses these things in production.
Chris Down writes:
>I'm with Johannes here, I think this would actively make memory
>pressure monitoring less useful. This is a NAK from my perspective as
>someone who actually uses these things in production.

s/memory/io/
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index fcaf275..b0e4766 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -34,6 +34,7 @@
 #include <linux/debugfs.h>
 #include <linux/cpuhotplug.h>
 #include <linux/part_stat.h>
+#include <linux/psi.h>
 
 #include "zram_drv.h"
 
@@ -1246,7 +1247,9 @@ static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
 					zram_get_element(zram, index),
 					bio, partial_io);
 	}
-
+#ifdef CONFIG_PSI
+	psi_task_change(current, 0, TSK_IOWAIT);
+#endif
 	handle = zram_get_handle(zram, index);
 	if (!handle || zram_test_flag(zram, index, ZRAM_SAME)) {
 		unsigned long value;
@@ -1257,6 +1260,9 @@ static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
 		zram_fill_page(mem, PAGE_SIZE, value);
 		kunmap_atomic(mem);
 		zram_slot_unlock(zram, index);
+#ifdef CONFIG_PSI
+		psi_task_change(current, TSK_IOWAIT, 0);
+#endif
 		return 0;
 	}
 
@@ -1284,6 +1290,9 @@ static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
 	if (WARN_ON(ret))
 		pr_err("Decompression failed! err=%d, page=%u\n", ret, index);
 
+#ifdef CONFIG_PSI
+	psi_task_change(current, TSK_IOWAIT, 0);
+#endif
 	return ret;
 }
 
@@ -1471,7 +1480,13 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
 		vec.bv_offset = 0;
 	}
 
+#ifdef CONFIG_PSI
+	psi_task_change(current, 0, TSK_IOWAIT);
+#endif
 	ret = __zram_bvec_write(zram, &vec, index, bio);
+#ifdef CONFIG_PSI
+	psi_task_change(current, TSK_IOWAIT, 0);
+#endif
 out:
 	if (is_partial_io(bvec))
 		__free_page(page);
@@ -1607,7 +1622,6 @@ static blk_qc_t zram_submit_bio(struct bio *bio)
 		atomic64_inc(&zram->stats.invalid_io);
 		goto error;
 	}
-
 	__zram_make_request(zram, bio);
 	return BLK_QC_T_NONE;
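If the helper suggestion from the review were adopted, the write-side
hunk above could collapse to something like the following sketch
(psi_iostall_enter()/psi_iostall_leave() are the hypothetical helpers
discussed in the thread, not an existing kernel API; hunk offsets
omitted on purpose):

@@ zram_bvec_write @@
 		vec.bv_offset = 0;
 	}
 
+	psi_iostall_enter();	/* charge the in-context compression as IO wait */
 	ret = __zram_bvec_write(zram, &vec, index, bio);
+	psi_iostall_leave();
 out:
 	if (is_partial_io(bvec))
 		__free_page(page);

Functionally this would match the open-coded version when CONFIG_PSI is
enabled, while the annotation becomes a no-op automatically when it is
not, so the #ifdef blocks disappear from the zram fast path.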