block/psi: make PSI annotations of submit_bio only work for file pages

Message ID 20220316063927.2128383-1-yang.yang29@zte.com.cn (mailing list archive)
State New, archived

Commit Message

CGEL March 16, 2022, 6:39 a.m. UTC
From: Yang Yang <yang.yang29@zte.com.cn>

psi tracks the time spent submitting the IO of refaulting file pages
and anonymous pages[1]. But now that we track refaulting anonymous pages
in swap_readpage[2][3], there is no need to also track refaulting
anonymous pages in submit_bio.

So this patch reduces redundant calls to psi_memstall_enter, and makes
it easier to track refaulting file pages and anonymous pages
separately.

[1] commit b8e24a9300b0 ("block: annotate refault stalls from IO submission")
[2] commit 937790699be9 ("mm/page_io.c: annotate refault stalls from swap_readpage")
[3] commit 2b413a1a728f ("mm: page_io: fix psi memory pressure error on cold swapins")

Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
---
 block/bio.c               | 9 +++++----
 block/blk-core.c          | 2 +-
 fs/direct-io.c            | 2 +-
 include/linux/blk_types.h | 2 +-
 4 files changed, 8 insertions(+), 7 deletions(-)
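
For context, the anonymous-page tracking referenced in [2] and [3] wraps
the swap read submission directly in a psi memstall section. A rough
sketch, simplified from mm/page_io.c as of those commits (not the
verbatim kernel code):

	int swap_readpage(struct page *page, bool synchronous)
	{
		unsigned long pflags;
		/* Per [3], only thrashing (workingset) cold swapins
		 * count as a memory stall. */
		bool workingset = PageWorkingset(page);
		int ret = 0;

		if (workingset)
			psi_memstall_enter(&pflags);

		/* ... allocate and submit the bio for the swap slot ... */

		if (workingset)
			psi_memstall_leave(&pflags);
		return ret;
	}

With the stall accounted here, the BIO_WORKINGSET annotation in
submit_bio() is redundant for anonymous pages, which is what the patch
relies on.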

Comments

Christoph Hellwig March 16, 2022, 8:48 a.m. UTC | #1
> @@ -1035,8 +1035,9 @@ void __bio_add_page(struct bio *bio, struct page *page,
>  	bio->bi_iter.bi_size += len;
>  	bio->bi_vcnt++;
>  
> -	if (!bio_flagged(bio, BIO_WORKINGSET) && unlikely(PageWorkingset(page)))
> -		bio_set_flag(bio, BIO_WORKINGSET);
> +	if (!bio_flagged(bio, BIO_WORKINGSET_FILE) &&
> +	    unlikely(PageWorkingset(page)) && !PageSwapBacked(page))
> +		bio_set_flag(bio, BIO_WORKINGSET_FILE);

This needs to go out of the block I/O fast path, not grow even more
checks.
CGEL March 17, 2022, 2:18 a.m. UTC | #2
On Wed, Mar 16, 2022 at 01:48:05AM -0700, Christoph Hellwig wrote:
> > @@ -1035,8 +1035,9 @@ void __bio_add_page(struct bio *bio, struct page *page,
> >  	bio->bi_iter.bi_size += len;
> >  	bio->bi_vcnt++;
> >  
> > -	if (!bio_flagged(bio, BIO_WORKINGSET) && unlikely(PageWorkingset(page)))
> > -		bio_set_flag(bio, BIO_WORKINGSET);
> > +	if (!bio_flagged(bio, BIO_WORKINGSET_FILE) &&
> > +	    unlikely(PageWorkingset(page)) && !PageSwapBacked(page))
> > +		bio_set_flag(bio, BIO_WORKINGSET_FILE);
> 
> This needs to go out of the block I/O fast path, not grow even more
> checks.

Thanks for your reply.

First, Johannes Weiner has already stated his position, see:
https://lore.kernel.org/all/Yio17pXawRuuVJFO@cmpxchg.org/

Second, I understand your concern, but there seems to be no better
way to do workingset delay accounting for file pages, because there is
no single function through which file page submissions pass; the kernel
does this in multiple subsystems.

Third, this patch doesn't make things worse; in fact it reduces
unnecessary psi accounting in submit_bio.
Johannes Weiner March 21, 2022, 2:33 p.m. UTC | #3
On Wed, Mar 16, 2022 at 06:39:28AM +0000, cgel.zte@gmail.com wrote:
> From: Yang Yang <yang.yang29@zte.com.cn>
> 
> psi tracks the time spent submitting the IO of refaulting file pages
> and anonymous pages[1]. But now that we track refaulting anonymous pages
> in swap_readpage[2][3], there is no need to also track refaulting
> anonymous pages in submit_bio.
> 
> So this patch reduces redundant calls to psi_memstall_enter, and makes
> it easier to track refaulting file pages and anonymous pages
> separately.

I don't think this is an improvement.

psi_memstall_enter() will check current->in_memstall once, detect the
nested call, and bail. Your patch checks PageSwapBacked for every page
being added. It's more branches for less robust code.
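
For reference, the nested-call detection sits at the top of
psi_memstall_enter(); roughly, paraphrased from kernel/sched/psi.c
(details vary between kernel versions):

	void psi_memstall_enter(unsigned long *flags)
	{
		if (static_branch_likely(&psi_disabled))
			return;

		/* Remember whether the task is already in a memstall
		 * section; a nested call sees *flags set and bails. */
		*flags = current->in_memstall;
		if (*flags)
			return;

		/* ... otherwise flag the task and update group state ... */
	}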
CGEL March 22, 2022, 2:47 a.m. UTC | #4
On Mon, Mar 21, 2022 at 10:33:20AM -0400, Johannes Weiner wrote:
> On Wed, Mar 16, 2022 at 06:39:28AM +0000, cgel.zte@gmail.com wrote:
> > [...]
> 
> I don't think this is an improvement.
> 
> psi_memstall_enter() will check current->in_memstall once, detect the
> nested call, and bail. Your patch checks PageSwapBacked for every page
> being added. It's more branches for less robust code.

We are also working on a new patch to classify the different causes of
psi_memstall_enter(): reclaim, thrashing, compaction, etc. This will help
users tune sysctls; for example, if a user sees high compaction delay,
they may adjust the THP sysctls to reduce it.

To support that, we need to distinguish what causes the psi stall in
submit_bio(), and this patch does that job.
CGEL March 22, 2022, 3:44 a.m. UTC | #5
On Mon, Mar 21, 2022 at 10:33:20AM -0400, Johannes Weiner wrote:
> On Wed, Mar 16, 2022 at 06:39:28AM +0000, cgel.zte@gmail.com wrote:
> > [...]
> 
> I don't think this is an improvement.
> 
> psi_memstall_enter() will check current->in_memstall once, detect the
> nested call, and bail. Your patch checks PageSwapBacked for every page
> being added. It's more branches for less robust code.

And the PageSwapBacked() check comes after unlikely(PageWorkingset(page)),
so I think the impact is small.
Johannes Weiner March 22, 2022, 1:27 p.m. UTC | #6
On Tue, Mar 22, 2022 at 02:47:42AM +0000, CGEL wrote:
> > [...]
> 
> We are also working on a new patch to classify the different causes of
> psi_memstall_enter(): reclaim, thrashing, compaction, etc. This will help
> users tune sysctls; for example, if a user sees high compaction delay,
> they may adjust the THP sysctls to reduce it.
> 
> To support that, we need to distinguish what causes the psi stall in
> submit_bio(), and this patch does that job.

Please submit these patches together then. On its own, this patch
isn't desirable.
CGEL March 23, 2022, 6:10 a.m. UTC | #7
On Tue, Mar 22, 2022 at 09:27:58AM -0400, Johannes Weiner wrote:
> [...]
> 
> Please submit these patches together then. On its own, this patch
> isn't desirable.
I think this patch has independent value on its own; let me try to
explain better.

1) This patch doesn't make things worse, and may even be better.
After this patch, swap workingset handling is simpler, and file workingset
handling just has one more check, as below.
Before this patch, handling swap workingset:
	a) in swap_readpage(), call psi_memstall_enter() ->
	b) in __bio_add_page(), test whether bio_set_flag() should be called, true ->
	c) in __bio_add_page(), call bio_set_flag()
	d) in submit_bio(), test whether psi_memstall_enter() should be called, true ->
	e) call psi_memstall_enter(), which detects the nested call and bails
	f) call bio_clear_flag() if needed
Before this patch, handling file page workingset:
	a) in __bio_add_page(), test whether bio_set_flag() should be called, true ->
	...
	b) call bio_clear_flag() if needed
After this patch, handling swap workingset:
	a) in swap_readpage(), call psi_memstall_enter() ->
	b) in __bio_add_page(), test whether bio_set_flag() should be called (one more check), false, so return
	c) in submit_bio(), test whether psi_memstall_enter() should be called, false, so return
After this patch, handling file page workingset:
	a) in __bio_add_page(), test whether bio_set_flag() should be called (one more check), true ->
	...
	b) call bio_clear_flag() if needed

2) This patch helps tools like kprobes trace the different workingset types.
After this patch, a workingset stall in submit_bio() can only be caused by
file pages.

3) This patch will help future code evolution, such as psi stall
classification, or getdelays support for counting file page workingset
submissions.
CGEL March 30, 2022, 8:34 a.m. UTC | #8
On Wed, Mar 23, 2022 at 06:11:03AM +0000, CGEL wrote:
> [...]

Looking forward to your review, thanks!
Johannes Weiner March 30, 2022, 1 p.m. UTC | #9
On Wed, Mar 30, 2022 at 08:34:08AM +0000, CGEL wrote:
> On Wed, Mar 23, 2022 at 06:11:03AM +0000, CGEL wrote:
> > [...]
> > I think this patch has independent value on its own; let me try to
> > explain better.

You missed the point about it complicating semantics.

Right now, the bio layer annotates stalls from queue contention. This
is very simple. The swap code has relied on it in the past. It doesn't
now, but that doesn't change what the concept is at the bio layer.

Your patch explicitly codifies that the MM handles swap IOs, and the
lower bio layer handles files. This asymmetry is ugly and error prone.

If you want type distinction, we should move it all into MM code, like
Christoph is saying, where swap code handles anon refaults and the page
cache code handles file refaults. This would be my preferred layering,
and my original patch did that: https://lkml.org/lkml/2019/7/22/1070.

But it was NAKed, and I had to agree with the argument. The page cache
code is not very centralized, and the place where we deal with
individual pages (and detect refaults) and where we submit bios (where
the stalls occur) is spread out into multiple filesystems. There are
180 submit_bio() calls in fs/; you'd have to audit which ones are for
page cache submissions, and then add stall annotations or use a new
submit_bio_cache() wrapper that handles it. Changes in the filesystem
could easily miss this protocol and silently break pressure detection.

I would prefer the annotations to be at this level, I just don't see
how to do it cleanly/robustly. Maybe Christoph has an idea, he
understands the fs side much better than I do.
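
To make that concrete: submit_bio_cache() is hypothetical, but a minimal
sketch might simply mirror the BIO_WORKINGSET handling that submit_bio()
does today (see the block/blk-core.c hunk below):

	/* Hypothetical wrapper -- does not exist in the kernel. */
	static void submit_bio_cache(struct bio *bio)
	{
		if (unlikely(bio_op(bio) == REQ_OP_READ &&
			     bio_flagged(bio, BIO_WORKINGSET))) {
			unsigned long pflags;

			/* Count submission time as a memory stall. */
			psi_memstall_enter(&pflags);
			submit_bio(bio);
			psi_memstall_leave(&pflags);
			return;
		}
		submit_bio(bio);
	}

Every page cache submission site in fs/ would then have to be converted
to call it, which is exactly the audit problem described above.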
Christoph Hellwig March 30, 2022, 1:04 p.m. UTC | #10
On Wed, Mar 30, 2022 at 09:00:46AM -0400, Johannes Weiner wrote:
> If you want type distinction, we should move it all into MM code, like
> Christoph is saying, where swap code handles anon refaults and the page
> cache code handles file refaults. This would be my preferred layering,
> and my original patch did that: https://lkml.org/lkml/2019/7/22/1070.

FYI, I started redoing that version and I think with all the cleanups
to filemap.c and the readahead code this can be done fairly nicely now:

http://git.infradead.org/users/hch/block.git/commitdiff/666abb29c6db870d3941acc5ac19e83fbc72cfd4
Jens Axboe March 30, 2022, 1:49 p.m. UTC | #11
On 3/30/22 7:04 AM, Christoph Hellwig wrote:
> On Wed, Mar 30, 2022 at 09:00:46AM -0400, Johannes Weiner wrote:
>> If you want type distinction, we should move it all into MM code, like
>> Christoph is saying, where swap code handles anon refaults and the page
>> cache code handles file refaults. This would be my preferred layering,
>> and my original patch did that: https://lkml.org/lkml/2019/7/22/1070.
> 
> FYI, I started redoing that version and I think with all the cleanups
> to filemap.c and the readahead code this can be done fairly nicely now:
> 
> http://git.infradead.org/users/hch/block.git/commitdiff/666abb29c6db870d3941acc5ac19e83fbc72cfd4

This looks way better than hiding it deep down in the block layer.
Johannes Weiner March 30, 2022, 3:45 p.m. UTC | #12
On Wed, Mar 30, 2022 at 06:04:08AM -0700, Christoph Hellwig wrote:
> On Wed, Mar 30, 2022 at 09:00:46AM -0400, Johannes Weiner wrote:
> > If you want type distinction, we should move it all into MM code, like
> > Christoph is saying, where swap code handles anon refaults and the page
> > cache code handles file refaults. This would be my preferred layering,
> > and my original patch did that: https://lkml.org/lkml/2019/7/22/1070.
> 
> FYI, I started redoing that version and I think with all the cleanups
> to filemap.c and the readahead code this can be done fairly nicely now:
> 
> http://git.infradead.org/users/hch/block.git/commitdiff/666abb29c6db870d3941acc5ac19e83fbc72cfd4

Yes, it's definitely much nicer now with the MM instantiating the
pages for ->readpage(s).

But AFAICS this breaks compressed btrfs (and erofs?) because those
still do additional add_to_page_cache_lru() and bio submissions.
Christoph Hellwig March 30, 2022, 3:54 p.m. UTC | #13
On Wed, Mar 30, 2022 at 11:45:56AM -0400, Johannes Weiner wrote:
> > FYI, I started redoing that version and I think with all the cleanups
> > to filemap.c and the readahead code this can be done fairly nicely now:
> > 
> > http://git.infradead.org/users/hch/block.git/commitdiff/666abb29c6db870d3941acc5ac19e83fbc72cfd4
> 
> Yes, it's definitely much nicer now with the MM instantiating the
> pages for ->readpage(s).
> 
> But AFAICS this breaks compressed btrfs (and erofs?) because those
> still do additional add_to_page_cache_lru() and bio submissions.

In btrfs, add_ra_bio_pages only passes freshly allocated pages to
add_to_page_cache_lru.  These can't really have PageWorkingset set,
can they?  In erofs they can also come from a local page pool, but
I think otherwise the same applies.
Johannes Weiner March 30, 2022, 4:17 p.m. UTC | #14
On Wed, Mar 30, 2022 at 08:54:09AM -0700, Christoph Hellwig wrote:
> On Wed, Mar 30, 2022 at 11:45:56AM -0400, Johannes Weiner wrote:
> > > FYI, I started redoing that version and I think with all the cleanups
> > > to filemap.c and the readahead code this can be done fairly nicely now:
> > > 
> > > http://git.infradead.org/users/hch/block.git/commitdiff/666abb29c6db870d3941acc5ac19e83fbc72cfd4
> > 
> > Yes, it's definitely much nicer now with the MM instantiating the
> > pages for ->readpage(s).
> > 
> > But AFAICS this breaks compressed btrfs (and erofs?) because those
> > still do additional add_to_page_cache_lru() and bio submissions.
> 
> In btrfs, add_ra_bio_pages only passes freshly allocated pages to
> add_to_page_cache_lru.  These can't really have PageWorkingset set,
> can they?  In erofs they can also come from a local page pool, but
> I think otherwise the same applies.

It's add_to_page_cache_lru() that sets the flag.

Basically, when a PageWorkingset (hot) page gets reclaimed, the bit is
stored in the vacated tree slot. When the entry is brought back in,
add_to_page_cache_lru() transfers it to the newly allocated page.

add_to_page_cache_lru()
  filemap_add_folio()
    __filemap_add_folio(mapping, folio, index, gfp, &shadow)
      *shadow = *slot
      *slot = folio
  if (shadow)
    workingset_refault(folio, shadow)
      folio_set_workingset()
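
A simplified sketch of that round trip, based on mm/workingset.c
(pack_shadow()/unpack_shadow() are the real helpers; their surrounding
arguments and bookkeeping are elided here):

	/* Eviction: encode whether the folio was hot into the shadow
	 * entry left behind in the vacated tree slot. */
	void *workingset_eviction(struct folio *folio /* , ... */)
	{
		/* ... look up memcg id, compute eviction timestamp ... */
		return pack_shadow(memcgid, pgdat, eviction,
				   folio_test_workingset(folio));
	}

	/* Refault: __filemap_add_folio() hands back the displaced
	 * shadow, and the bit is restored on the new folio. */
	void workingset_refault(struct folio *folio, void *shadow)
	{
		bool workingset;

		unpack_shadow(shadow, &memcgid, &pgdat, &eviction,
			      &workingset);
		/* ... refault distance bookkeeping ... */
		if (workingset)
			folio_set_workingset(folio);
	}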
CGEL March 31, 2022, 2:21 a.m. UTC | #15
On Wed, Mar 30, 2022 at 09:00:46AM -0400, Johannes Weiner wrote:
> [...]
> 
> You missed the point about it complicating semantics.
> 
> Right now, the bio layer annotates stalls from queue contention. This
> is very simple. The swap code has relied on it in the past. It doesn't
> now, but that doesn't change what the concept is at the bio layer.
> 
> Your patch explicitly codifies that the MM handles swap IOs, and the
> lower bio layer handles files. This asymmetry is ugly and error prone.
>
Yes, this asymmetry is bad; we don't think this patch is perfect either.

But just as you said below, page cache creation is spread out across
multiple filesystems; if we move psi_memstall_enter() to the upper layer,
it could mean horrible repetition and would also be error prone. I think
the primary problem is that the page cache code is not centralized; until
that is solved (not likely?), we may have to make a compromise, and I
think this patch is a simple one.

Thanks.

> If you want type distinction, we should move it all into MM code, like
> Christoph is saying, where swap code handles anon refaults and the page
> cache code handles file refaults. This would be my preferred layering,
> and my original patch did that: https://lkml.org/lkml/2019/7/22/1070.
> 
> But it was NAKed, and I had to agree with the argument. The page cache
> code is not very centralized, and the place where we deal with
> individual pages (and detect refaults) and where we submit bios (where
> the stalls occur) is spread out into multiple filesystems. There are
> 180 submit_bio() calls in fs/; you'd have to audit which ones are for
> page cache submissions, and then add stall annotations or use a new
> submit_bio_cache() wrapper that handles it. Changes in the filesystem
> could easily miss this protocol and silently break pressure detection.
>
> I would prefer the annotations to be at this level, I just don't see
> how to do it cleanly/robustly. Maybe Christoph has an idea, he
> understands the fs side much better than I do.
Christoph Hellwig March 31, 2022, 5:15 a.m. UTC | #16
On Wed, Mar 30, 2022 at 12:17:09PM -0400, Johannes Weiner wrote:
> It's add_to_page_cache_lru() that sets the flag.
> 
> Basically, when a PageWorkingset (hot) page gets reclaimed, the bit is
> stored in the vacated tree slot. When the entry is brought back in,
> add_to_page_cache_lru() transfers it to the newly allocated page.

Ok.  In this case my patch didn't quite do the right thing for readahead
either.  But that does leave a question for the btrfs compressed
case, which only adds extra pages to a read to readahead a bigger
cluster size - that is, these pages are not read at the request of the
VM.  Does it really make sense to do PSI accounting for them in that
case?
Johannes Weiner March 31, 2022, 7:07 p.m. UTC | #17
On Wed, Mar 30, 2022 at 10:15:32PM -0700, Christoph Hellwig wrote:
> On Wed, Mar 30, 2022 at 12:17:09PM -0400, Johannes Weiner wrote:
> > It's add_to_page_cache_lru() that sets the flag.
> > 
> > Basically, when a PageWorkingset (hot) page gets reclaimed, the bit is
> > stored in the vacated tree slot. When the entry is brought back in,
> > add_to_page_cache_lru() transfers it to the newly allocated page.
> 
> Ok.  In this case my patch didn't quite do the right thing for readahead
> either.  But that does leave a question for the btrfs compressed
> > case, which only adds extra pages to a read to readahead a bigger
> > cluster size - that is, these pages are not read at the request of the
> VM.  Does it really make sense to do PSI accounting for them in that
> case?

I think it does.

I suppose it's an argument about readahead pages in general, which
technically the workload itself doesn't commission explicitly. But
those pages are still triggered by a nearby access, their reads
contribute to device utilization, and if they're PageWorkingset it
means they're only being read because there is a lack of memory.

In a perfect world, readahead would stop when memory or IO are
contended. But it doesn't, and the stalls it can inject into the
workload are as real as stalls from directly requested reads.

Patch

diff --git a/block/bio.c b/block/bio.c
index 3c57b3ba727d..54b60be4f3b0 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1035,8 +1035,9 @@  void __bio_add_page(struct bio *bio, struct page *page,
 	bio->bi_iter.bi_size += len;
 	bio->bi_vcnt++;
 
-	if (!bio_flagged(bio, BIO_WORKINGSET) && unlikely(PageWorkingset(page)))
-		bio_set_flag(bio, BIO_WORKINGSET);
+	if (!bio_flagged(bio, BIO_WORKINGSET_FILE) &&
+	    unlikely(PageWorkingset(page)) && !PageSwapBacked(page))
+		bio_set_flag(bio, BIO_WORKINGSET_FILE);
 }
 EXPORT_SYMBOL_GPL(__bio_add_page);
 
@@ -1254,7 +1255,7 @@  static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
  * is returned only if 0 pages could be pinned.
  *
  * It's intended for direct IO, so doesn't do PSI tracking, the caller is
- * responsible for setting BIO_WORKINGSET if necessary.
+ * responsible for setting BIO_WORKINGSET_FILE if necessary.
  */
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
@@ -1274,7 +1275,7 @@  int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
 
 	/* don't account direct I/O as memory stall */
-	bio_clear_flag(bio, BIO_WORKINGSET);
+	bio_clear_flag(bio, BIO_WORKINGSET_FILE);
 	return bio->bi_vcnt ? 0 : ret;
 }
 EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
diff --git a/block/blk-core.c b/block/blk-core.c
index ddac62aebc55..9a955323734b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -918,7 +918,7 @@  void submit_bio(struct bio *bio)
 	 * part of overall IO time.
 	 */
 	if (unlikely(bio_op(bio) == REQ_OP_READ &&
-	    bio_flagged(bio, BIO_WORKINGSET))) {
+	    bio_flagged(bio, BIO_WORKINGSET_FILE))) {
 		unsigned long pflags;
 
 		psi_memstall_enter(&pflags);
diff --git a/fs/direct-io.c b/fs/direct-io.c
index aef06e607b40..7cdec50fb27b 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -420,7 +420,7 @@  static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
 
 	bio->bi_private = dio;
 	/* don't account direct I/O as memory stall */
-	bio_clear_flag(bio, BIO_WORKINGSET);
+	bio_clear_flag(bio, BIO_WORKINGSET_FILE);
 
 	spin_lock_irqsave(&dio->bio_lock, flags);
 	dio->refcount++;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 0c7a9a1f06c8..a1aaba4767e9 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -314,7 +314,7 @@  enum {
 	BIO_NO_PAGE_REF,	/* don't put release vec pages */
 	BIO_CLONED,		/* doesn't own data */
 	BIO_BOUNCED,		/* bio is a bounce bio */
-	BIO_WORKINGSET,		/* contains userspace workingset pages */
+	BIO_WORKINGSET_FILE,	/* contains userspace workingset file pages */
 	BIO_QUIET,		/* Make BIO Quiet */
 	BIO_CHAIN,		/* chained bio, ->bi_remaining in effect */
 	BIO_REFFED,		/* bio has elevated ->bi_cnt */