diff mbox series

[3/5] btrfs: add manual PSI accounting for compressed reads

Message ID 20220915094200.139713-4-hch@lst.de (mailing list archive)
State New
Headers show
Series [1/5] mm: add PSI accounting around ->read_folio and ->readahead calls | expand

Commit Message

Christoph Hellwig Sept. 15, 2022, 9:41 a.m. UTC
btrfs compressed reads try to always read the entire compressed chunk,
even if only a subset is requested.  Currently this is covered by the
magic PSI accounting underneath submit_bio, but that is about to go
away. Instead add manual psi_memstall_{enter,leave} annotations.

Note that for readahead this really should be using readahead_expand,
but the additionals reads are also done for plain ->read_folio where
readahead_expand can't work, so this overall logic is left as-is for
now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/btrfs/compression.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

Comments

Thorsten Leemhuis Nov. 3, 2022, 10:46 a.m. UTC | #1
Hi Christoph!

On 15.09.22 11:41, Christoph Hellwig wrote:
> btrfs compressed reads try to always read the entire compressed chunk,
> even if only a subset is requested.  Currently this is covered by the
> magic PSI accounting underneath submit_bio, but that is about to go
> away. Instead add manual psi_memstall_{enter,leave} annotations.
> 
> Note that for readahead this really should be using readahead_expand,
> but the additionals reads are also done for plain ->read_folio where
> readahead_expand can't work, so this overall logic is left as-is for
> now.

It seems this patch makes systemd-oomd overreact on my day-to-day
machine and aggressively kill applications. I'm not the only one that
noticed such a behavior with 6.1 pre-releases:
https://bugzilla.redhat.com/show_bug.cgi?id=2133829
https://bugzilla.redhat.com/show_bug.cgi?id=2134971

I think I have a pretty reliable way to trigger the issue that involves
starting the apps that I normally use and a VM that I occasionally use,
which up to now never resulted in such a behaviour.

On master as of today (8e5423e991e8) I can trigger the problem within a
minute or two. But I fail to trigger it with v6.0.6 or when I revert
4088a47e78f9 ("btrfs: add manual PSI accounting for compressed reads").
And yes, I use btrfs with compression for / and /home/.

See [1] for a log msg from systemd-oomd.

Note, I had some trouble with bisecting[2]. This series looked
suspicious, so I removed it completely ontop of master and the problem
went away. Then I tried reverting only 4088a47e78f9 which helped, too.
Let me know if you want me to try another combination or need more data.

Ciao, Thorsten


[1] just one example:
```
> 10:52:29 t14s systemd-oomd[1261]: Considered 60 cgroups for killing, top candidates were:
> 10:52:29 t14s systemd-oomd[1261]:         Path: /system.slice/packagekit.service
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Pressure Limit: 0.00%
> 10:52:29 t14s systemd-oomd[1261]:                 Pressure: Avg10: 93.66 Avg60: 38.22 Avg300: 9.48 Total: 29s
> 10:52:29 t14s systemd-oomd[1261]:                 Current Memory Usage: 276.9M
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Min: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Low: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Pgscan: 181098
> 10:52:29 t14s systemd-oomd[1261]:                 Last Pgscan: 178926
> 10:52:29 t14s systemd-oomd[1261]:         Path: /system.slice/firewalld.service
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Pressure Limit: 0.00%
> 10:52:29 t14s systemd-oomd[1261]:                 Pressure: Avg10: 0.00 Avg60: 0.00 Avg300: 0.00 Total: 0
> 10:52:29 t14s systemd-oomd[1261]:                 Current Memory Usage: 34.6M
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Min: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Low: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Pgscan: 13035
> 10:52:29 t14s systemd-oomd[1261]:                 Last Pgscan: 12854
> 10:52:29 t14s systemd-oomd[1261]:         Path: /system.slice/sssd-kcm.service
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Pressure Limit: 0.00%
> 10:52:29 t14s systemd-oomd[1261]:                 Pressure: Avg10: 0.00 Avg60: 0.00 Avg300: 0.00 Total: 184us
> 10:52:29 t14s systemd-oomd[1261]:                 Current Memory Usage: 32.9M
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Min: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Low: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Pgscan: 7667
> 10:52:29 t14s systemd-oomd[1261]:                 Last Pgscan: 7501
> 10:52:29 t14s systemd-oomd[1261]:         Path: /system.slice/systemd-journald.service
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Pressure Limit: 0.00%
> 10:52:29 t14s systemd-oomd[1261]:                 Pressure: Avg10: 0.00 Avg60: 0.00 Avg300: 0.00 Total: 8ms
> 10:52:29 t14s systemd-oomd[1261]:                 Current Memory Usage: 14.5M
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Min: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Low: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Pgscan: 13020
> 10:52:29 t14s systemd-oomd[1261]:                 Last Pgscan: 12914
> 10:52:29 t14s systemd-oomd[1261]:         Path: /system.slice/libvirtd.service
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Pressure Limit: 0.00%
> 10:52:29 t14s systemd-oomd[1261]:                 Pressure: Avg10: 0.00 Avg60: 0.00 Avg300: 0.00 Total: 0
> 10:52:29 t14s systemd-oomd[1261]:                 Current Memory Usage: 18.9M
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Min: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Low: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Pgscan: 12983
> 10:52:29 t14s systemd-oomd[1261]:                 Last Pgscan: 12896
> 10:52:29 t14s systemd-oomd[1261]:         Path: /system.slice/geoclue.service
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Pressure Limit: 0.00%
> 10:52:29 t14s systemd-oomd[1261]:                 Pressure: Avg10: 0.00 Avg60: 0.00 Avg300: 0.00 Total: 0
> 10:52:29 t14s systemd-oomd[1261]:                 Current Memory Usage: 18.0M
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Min: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Low: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Pgscan: 3625
> 10:52:29 t14s systemd-oomd[1261]:                 Last Pgscan: 3550
> 10:52:29 t14s systemd-oomd[1261]:         Path: /system.slice/polkit.service
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Pressure Limit: 0.00%
> 10:52:29 t14s systemd-oomd[1261]:                 Pressure: Avg10: 0.00 Avg60: 0.00 Avg300: 0.00 Total: 2ms
> 10:52:29 t14s systemd-oomd[1261]:                 Current Memory Usage: 15.9M
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Min: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Low: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Pgscan: 10664
> 10:52:29 t14s systemd-oomd[1261]:                 Last Pgscan: 10596
> 10:52:29 t14s systemd-oomd[1261]:         Path: /system.slice/NetworkManager.service
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Pressure Limit: 0.00%
> 10:52:29 t14s systemd-oomd[1261]:                 Pressure: Avg10: 0.00 Avg60: 0.00 Avg300: 0.00 Total: 3ms
> 10:52:29 t14s systemd-oomd[1261]:                 Current Memory Usage: 6.6M
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Min: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Low: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Pgscan: 2515
> 10:52:29 t14s systemd-oomd[1261]:                 Last Pgscan: 2492
> 10:52:29 t14s systemd-oomd[1261]:         Path: /system.slice/abrt-xorg.service
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Pressure Limit: 0.00%
> 10:52:29 t14s systemd-oomd[1261]:                 Pressure: Avg10: 0.00 Avg60: 0.00 Avg300: 0.00 Total: 0
> 10:52:29 t14s systemd-oomd[1261]:                 Current Memory Usage: 5.2M
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Min: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Low: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Pgscan: 35154
> 10:52:29 t14s systemd-oomd[1261]:                 Last Pgscan: 35131
> 10:52:29 t14s systemd-oomd[1261]:         Path: /system.slice/dbus-broker.service
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Pressure Limit: 0.00%
> 10:52:29 t14s systemd-oomd[1261]:                 Pressure: Avg10: 0.00 Avg60: 0.00 Avg300: 0.00 Total: 0
> 10:52:29 t14s systemd-oomd[1261]:                 Current Memory Usage: 7.5M
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Min: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Low: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Pgscan: 1183
> 10:52:29 t14s systemd-oomd[1261]:                 Last Pgscan: 1161
> 10:52:29 t14s systemd-oomd[1261]: Killed /system.slice/packagekit.service due to memory pressure for /system.slice being 91.73% > 50.00% for > 20s with reclaim activity
```

[2]

I have no idea what went wrong. 0a78a376ef3c (the last merge before the
one with this series) was fine, but c6ea70604249 (which afaics basically
is the next commit if I understand things right) was not. I tried
reverting it, which should give me the merge base (v6.0-rc2), but it was
broken, too. I guess I must have done something wrong, but I have no
idea what, but I tried again and got the same result. :-/

/me must be missing something and/or not understand git properly...


> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Acked-by: David Sterba <dsterba@suse.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  fs/btrfs/compression.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index e84d22c5c6a83..370788b9b1249 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -15,6 +15,7 @@
>  #include <linux/string.h>
>  #include <linux/backing-dev.h>
>  #include <linux/writeback.h>
> +#include <linux/psi.h>
>  #include <linux/slab.h>
>  #include <linux/sched/mm.h>
>  #include <linux/log2.h>
> @@ -519,7 +520,8 @@ static u64 bio_end_offset(struct bio *bio)
>   */
>  static noinline int add_ra_bio_pages(struct inode *inode,
>  				     u64 compressed_end,
> -				     struct compressed_bio *cb)
> +				     struct compressed_bio *cb,
> +				     unsigned long *pflags)
>  {
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	unsigned long end_index;
> @@ -588,6 +590,9 @@ static noinline int add_ra_bio_pages(struct inode *inode,
>  			continue;
>  		}
>  
> +		if (PageWorkingset(page))
> +			psi_memstall_enter(pflags);
> +
>  		ret = set_page_extent_mapped(page);
>  		if (ret < 0) {
>  			unlock_page(page);
> @@ -674,6 +679,8 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
>  	u64 em_len;
>  	u64 em_start;
>  	struct extent_map *em;
> +	/* Initialize to 1 to make skip psi_memstall_leave unless needed */
> +	unsigned long pflags = 1;
>  	blk_status_t ret;
>  	int ret2;
>  	int i;
> @@ -729,7 +736,7 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
>  		goto fail;
>  	}
>  
> -	add_ra_bio_pages(inode, em_start + em_len, cb);
> +	add_ra_bio_pages(inode, em_start + em_len, cb, &pflags);
>  
>  	/* include any pages we added in add_ra-bio_pages */
>  	cb->len = bio->bi_iter.bi_size;
> @@ -810,6 +817,9 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
>  		}
>  	}
>  
> +	if (!pflags)
> +		psi_memstall_leave(&pflags);
> +
>  	if (refcount_dec_and_test(&cb->pending_ios))
>  		finish_compressed_bio_read(cb);
>  	return;
Thorsten Leemhuis Nov. 3, 2022, 11:08 a.m. UTC | #2
On 03.11.22 11:46, Thorsten Leemhuis wrote:
> On 15.09.22 11:41, Christoph Hellwig wrote:
>> btrfs compressed reads try to always read the entire compressed chunk,
>> even if only a subset is requested.  Currently this is covered by the
>> magic PSI accounting underneath submit_bio, but that is about to go
>> away. Instead add manual psi_memstall_{enter,leave} annotations.
>>
>> Note that for readahead this really should be using readahead_expand,
>> but the additionals reads are also done for plain ->read_folio where
>> readahead_expand can't work, so this overall logic is left as-is for
>> now.
> 
> It seems this patch makes systemd-oomd overreact on my day-to-day
> machine and aggressively kill applications. I'm not the only one that
> noticed such a behavior with 6.1 pre-releases:
> https://bugzilla.redhat.com/show_bug.cgi?id=2133829
> https://bugzilla.redhat.com/show_bug.cgi?id=2134971

Great, the kernel's regression tracker reports a regression and forgets
to tell his regression tracking bot about it to ensure it's tracked... :-D

#regzbot ^introduced 4088a47e78f9
#regzbot title mm/btrfs: systemd-oomd overreacting due to PSI changes
for Btrfs
#regzbot ignore-activity

Ciao, Thorsten
Christoph Hellwig Nov. 3, 2022, 12:40 p.m. UTC | #3
On Thu, Nov 03, 2022 at 11:46:52AM +0100, Thorsten Leemhuis wrote:
> It seems this patch makes systemd-oomd overreact on my day-to-day
> machine and aggressively kill applications. I'm not the only one that
> noticed such a behavior with 6.1 pre-releases:
> https://bugzilla.redhat.com/show_bug.cgi?id=2133829
> https://bugzilla.redhat.com/show_bug.cgi?id=2134971
> 
> I think I have a pretty reliable way to trigger the issue that involves
> starting the apps that I normally use and a VM that I occasionally use,
> which up to now never resulted in such a behaviour.
> 
> On master as of today (8e5423e991e8) I can trigger the problem within a
> minute or two. But I fail to trigger it with v6.0.6 or when I revert
> 4088a47e78f9 ("btrfs: add manual PSI accounting for compressed reads").
> And yes, I use btrfs with compression for / and /home/.

So, I did in fact not want to include this patch because it is a little
iffy and includes PSI accounting for reads where btrfs just does
aggresive readaround for compression, but Johannes asked for it to be
added.  I'd be perfectly fine with just reverting it.
Johannes Weiner Nov. 3, 2022, 10:20 p.m. UTC | #4
On Thu, Nov 03, 2022 at 11:46:52AM +0100, Thorsten Leemhuis wrote:
> Hi Christoph!
> 
> On 15.09.22 11:41, Christoph Hellwig wrote:
> > btrfs compressed reads try to always read the entire compressed chunk,
> > even if only a subset is requested.  Currently this is covered by the
> > magic PSI accounting underneath submit_bio, but that is about to go
> > away. Instead add manual psi_memstall_{enter,leave} annotations.
> > 
> > Note that for readahead this really should be using readahead_expand,
> > but the additionals reads are also done for plain ->read_folio where
> > readahead_expand can't work, so this overall logic is left as-is for
> > now.
> 
> It seems this patch makes systemd-oomd overreact on my day-to-day
> machine and aggressively kill applications. I'm not the only one that
> noticed such a behavior with 6.1 pre-releases:
> https://bugzilla.redhat.com/show_bug.cgi?id=2133829
> https://bugzilla.redhat.com/show_bug.cgi?id=2134971
> 
> I think I have a pretty reliable way to trigger the issue that involves
> starting the apps that I normally use and a VM that I occasionally use,
> which up to now never resulted in such a behaviour.
> 
> On master as of today (8e5423e991e8) I can trigger the problem within a
> minute or two. But I fail to trigger it with v6.0.6 or when I revert
> 4088a47e78f9 ("btrfs: add manual PSI accounting for compressed reads").
> And yes, I use btrfs with compression for / and /home/.
> 
> See [1] for a log msg from systemd-oomd.
> 
> Note, I had some trouble with bisecting[2]. This series looked
> suspicious, so I removed it completely ontop of master and the problem
> went away. Then I tried reverting only 4088a47e78f9 which helped, too.
> Let me know if you want me to try another combination or need more data.

Oh, I think I see the bug. We can leak pressure state from the bio
submission, which causes the task to permanently drive up pressure.

Can you try this patch?

From 499e5cab7b39fc4c90a0f96e33cdc03274b316fd Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 3 Nov 2022 17:34:31 -0400
Subject: [PATCH] fs: btrfs: fix leaked psi pressure state

When psi annotations were added to to btrfs compression reads, the psi
state tracking over add_ra_bio_pages and btrfs_submit_compressed_read
was faulty. The task can remain in a stall state after the read. This
results in incorrectly elevated pressure, which triggers OOM kills.

pflags record the *previous* memstall state when we enter a new
one. The code tried to initialize pflags to 1, and then optimize the
leave call when we either didn't enter a memstall, or were already
inside a nested stall. However, there can be multiple PageWorkingset
pages in the bio, at which point it's that path itself that re-enters
the state and overwrites pflags. This causes us to miss the exit.

Enter the stall only once if needed, then unwind correctly.

Reported-by: Thorsten Leemhuis <linux@leemhuis.info>
Fixes: 4088a47e78f9 btrfs: add manual PSI accounting for compressed reads
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/btrfs/compression.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index f1f051ad3147..e6635fe70067 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -512,7 +512,7 @@ static u64 bio_end_offset(struct bio *bio)
 static noinline int add_ra_bio_pages(struct inode *inode,
 				     u64 compressed_end,
 				     struct compressed_bio *cb,
-				     unsigned long *pflags)
+				     int *memstall, unsigned long *pflags)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	unsigned long end_index;
@@ -581,8 +581,10 @@ static noinline int add_ra_bio_pages(struct inode *inode,
 			continue;
 		}
 
-		if (PageWorkingset(page))
+		if (!*memstall && PageWorkingset(page)) {
 			psi_memstall_enter(pflags);
+			*memstall = 1;
+		}
 
 		ret = set_page_extent_mapped(page);
 		if (ret < 0) {
@@ -670,8 +672,8 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 	u64 em_len;
 	u64 em_start;
 	struct extent_map *em;
-	/* Initialize to 1 to make skip psi_memstall_leave unless needed */
-	unsigned long pflags = 1;
+	unsigned long pflags;
+	int memstall = 0;
 	blk_status_t ret;
 	int ret2;
 	int i;
@@ -727,7 +729,7 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 		goto fail;
 	}
 
-	add_ra_bio_pages(inode, em_start + em_len, cb, &pflags);
+	add_ra_bio_pages(inode, em_start + em_len, cb, &memstall, &pflags);
 
 	/* include any pages we added in add_ra-bio_pages */
 	cb->len = bio->bi_iter.bi_size;
@@ -807,7 +809,7 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 		}
 	}
 
-	if (!pflags)
+	if (memstall)
 		psi_memstall_leave(&pflags);
 
 	if (refcount_dec_and_test(&cb->pending_ios))
Thorsten Leemhuis Nov. 4, 2022, 7:32 a.m. UTC | #5
On 03.11.22 23:20, Johannes Weiner wrote:
> On Thu, Nov 03, 2022 at 11:46:52AM +0100, Thorsten Leemhuis wrote:
>> On 15.09.22 11:41, Christoph Hellwig wrote:
>>> btrfs compressed reads try to always read the entire compressed chunk,
>>> even if only a subset is requested.  Currently this is covered by the
>>> magic PSI accounting underneath submit_bio, but that is about to go
>>> away. Instead add manual psi_memstall_{enter,leave} annotations.
>>>
>>> Note that for readahead this really should be using readahead_expand,
>>> but the additionals reads are also done for plain ->read_folio where
>>> readahead_expand can't work, so this overall logic is left as-is for
>>> now.
>>
>> It seems this patch makes systemd-oomd overreact on my day-to-day
>> machine and aggressively kill applications. I'm not the only one that
>> noticed such a behavior with 6.1 pre-releases:
>> https://bugzilla.redhat.com/show_bug.cgi?id=2133829
>> https://bugzilla.redhat.com/show_bug.cgi?id=2134971
> [...]
>> On master as of today (8e5423e991e8) I can trigger the problem within a
>> minute or two. But I fail to trigger it with v6.0.6 or when I revert
>> 4088a47e78f9 ("btrfs: add manual PSI accounting for compressed reads").
>> And yes, I use btrfs with compression for / and /home/.
> [...]
> 
> Oh, I think I see the bug. We can leak pressure state from the bio
> submission, which causes the task to permanently drive up pressure.

Thx for looking into this.

> Can you try this patch?

It apparently does the trick -- at least my test setup that usually
triggers the bug within a minute or two survived for nearly an hour now, so:

Tested-by: Thorsten Leemhuis <linux@leemhuis.info>

Can you please also add this tag to help future archeologists, as
explained by the kernel docs (for details see
Documentation/process/submitting-patches.rst and
Documentation/process/5.Posting.rst):

Link:
https://lore.kernel.org/r/d20a0a85-e415-cf78-27f9-77dd7a94bc8d@leemhuis.info/

It also will make my regression tracking bot see further postings of
this patch and mark the issue as resolved once the patch lands in mainline.

tia and thx again for the patch!

Ciao, Thorsten

>>From 499e5cab7b39fc4c90a0f96e33cdc03274b316fd Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Thu, 3 Nov 2022 17:34:31 -0400
> Subject: [PATCH] fs: btrfs: fix leaked psi pressure state
> 
> When psi annotations were added to to btrfs compression reads, the psi
> state tracking over add_ra_bio_pages and btrfs_submit_compressed_read
> was faulty. The task can remain in a stall state after the read. This
> results in incorrectly elevated pressure, which triggers OOM kills.
> 
> pflags record the *previous* memstall state when we enter a new
> one. The code tried to initialize pflags to 1, and then optimize the
> leave call when we either didn't enter a memstall, or were already
> inside a nested stall. However, there can be multiple PageWorkingset
> pages in the bio, at which point it's that path itself that re-enters
> the state and overwrites pflags. This causes us to miss the exit.
> 
> Enter the stall only once if needed, then unwind correctly.
> 
> Reported-by: Thorsten Leemhuis <linux@leemhuis.info>
> Fixes: 4088a47e78f9 btrfs: add manual PSI accounting for compressed reads
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  fs/btrfs/compression.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index f1f051ad3147..e6635fe70067 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -512,7 +512,7 @@ static u64 bio_end_offset(struct bio *bio)
>  static noinline int add_ra_bio_pages(struct inode *inode,
>  				     u64 compressed_end,
>  				     struct compressed_bio *cb,
> -				     unsigned long *pflags)
> +				     int *memstall, unsigned long *pflags)
>  {
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	unsigned long end_index;
> @@ -581,8 +581,10 @@ static noinline int add_ra_bio_pages(struct inode *inode,
>  			continue;
>  		}
>  
> -		if (PageWorkingset(page))
> +		if (!*memstall && PageWorkingset(page)) {
>  			psi_memstall_enter(pflags);
> +			*memstall = 1;
> +		}
>  
>  		ret = set_page_extent_mapped(page);
>  		if (ret < 0) {
> @@ -670,8 +672,8 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
>  	u64 em_len;
>  	u64 em_start;
>  	struct extent_map *em;
> -	/* Initialize to 1 to make skip psi_memstall_leave unless needed */
> -	unsigned long pflags = 1;
> +	unsigned long pflags;
> +	int memstall = 0;
>  	blk_status_t ret;
>  	int ret2;
>  	int i;
> @@ -727,7 +729,7 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
>  		goto fail;
>  	}
>  
> -	add_ra_bio_pages(inode, em_start + em_len, cb, &pflags);
> +	add_ra_bio_pages(inode, em_start + em_len, cb, &memstall, &pflags);
>  
>  	/* include any pages we added in add_ra-bio_pages */
>  	cb->len = bio->bi_iter.bi_size;
> @@ -807,7 +809,7 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
>  		}
>  	}
>  
> -	if (!pflags)
> +	if (memstall)
>  		psi_memstall_leave(&pflags);
>  
>  	if (refcount_dec_and_test(&cb->pending_ios))
Johannes Weiner Nov. 4, 2022, 12:36 p.m. UTC | #6
On Fri, Nov 04, 2022 at 08:32:22AM +0100, Thorsten Leemhuis wrote:
> On 03.11.22 23:20, Johannes Weiner wrote:
> > Can you try this patch?
> 
> It apparently does the trick -- at least my test setup that usually
> triggers the bug within a minute or two survived for nearly an hour now, so:
> 
> Tested-by: Thorsten Leemhuis <linux@leemhuis.info>

Great, thanks Thorsten.

> Can you please also add this tag to help future archeologists, as
> explained by the kernel docs (for details see
> Documentation/process/submitting-patches.rst and
> Documentation/process/5.Posting.rst):
> 
> Link:
> https://lore.kernel.org/r/d20a0a85-e415-cf78-27f9-77dd7a94bc8d@leemhuis.info/
> 
> It also will make my regression tracking bot see further postings of
> this patch and mark the issue as resolved once the patch lands in mainline.

Done.

Looks like erofs has the same issue, I included a fix for that.

Andrew would you mind picking this up and sending it Linusward? Jens
routed the series originally, but I believe he is out today.

Thanks

From b668b261ed18105e91745f3d7676b6bca968476d Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 3 Nov 2022 17:34:31 -0400
Subject: [PATCH] fs: fix leaked psi pressure state

When psi annotations were added to to btrfs compression reads, the psi
state tracking over add_ra_bio_pages and btrfs_submit_compressed_read
was faulty. A pressure state, once entered, is never left. This
results in incorrectly elevated pressure, which triggers OOM kills.

pflags record the *previous* memstall state when we enter a new
one. The code tried to initialize pflags to 1, and then optimize the
leave call when we either didn't enter a memstall, or were already
inside a nested stall. However, there can be multiple PageWorkingset
pages in the bio, at which point it's that path itself that enters
repeatedly and overwrites pflags. This causes us to miss the exit.

Enter the stall only once if needed, then unwind correctly.

erofs has the same problem, fix that up too. And move the memstall
exit past submit_bio() to restore submit accounting originally added
by b8e24a9300b0 ("block: annotate refault stalls from IO submission").

Fixes: 4088a47e78f9 ("btrfs: add manual PSI accounting for compressed reads")
Fixes: 99486c511f68 ("erofs: add manual PSI accounting for the compressed address space")
Fixes: 118f3663fbc6 ("block: remove PSI accounting from the bio layer")
Link: https://lore.kernel.org/r/d20a0a85-e415-cf78-27f9-77dd7a94bc8d@leemhuis.info/
Reported-by: Thorsten Leemhuis <linux@leemhuis.info>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Thorsten Leemhuis <linux@leemhuis.info>
---
 fs/btrfs/compression.c | 14 ++++++++------
 fs/erofs/zdata.c       | 18 +++++++++++-------
 2 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index f1f051ad3147..e6635fe70067 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -512,7 +512,7 @@ static u64 bio_end_offset(struct bio *bio)
 static noinline int add_ra_bio_pages(struct inode *inode,
 				     u64 compressed_end,
 				     struct compressed_bio *cb,
-				     unsigned long *pflags)
+				     int *memstall, unsigned long *pflags)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	unsigned long end_index;
@@ -581,8 +581,10 @@ static noinline int add_ra_bio_pages(struct inode *inode,
 			continue;
 		}
 
-		if (PageWorkingset(page))
+		if (!*memstall && PageWorkingset(page)) {
 			psi_memstall_enter(pflags);
+			*memstall = 1;
+		}
 
 		ret = set_page_extent_mapped(page);
 		if (ret < 0) {
@@ -670,8 +672,8 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 	u64 em_len;
 	u64 em_start;
 	struct extent_map *em;
-	/* Initialize to 1 to make skip psi_memstall_leave unless needed */
-	unsigned long pflags = 1;
+	unsigned long pflags;
+	int memstall = 0;
 	blk_status_t ret;
 	int ret2;
 	int i;
@@ -727,7 +729,7 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 		goto fail;
 	}
 
-	add_ra_bio_pages(inode, em_start + em_len, cb, &pflags);
+	add_ra_bio_pages(inode, em_start + em_len, cb, &memstall, &pflags);
 
 	/* include any pages we added in add_ra-bio_pages */
 	cb->len = bio->bi_iter.bi_size;
@@ -807,7 +809,7 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 		}
 	}
 
-	if (!pflags)
+	if (memstall)
 		psi_memstall_leave(&pflags);
 
 	if (refcount_dec_and_test(&cb->pending_ios))
diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index c7f24fc7efd5..064a166324a7 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -1412,8 +1412,8 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,
 	struct block_device *last_bdev;
 	unsigned int nr_bios = 0;
 	struct bio *bio = NULL;
-	/* initialize to 1 to make skip psi_memstall_leave unless needed */
-	unsigned long pflags = 1;
+	unsigned long pflags;
+	int memstall = 0;
 
 	bi_private = jobqueueset_init(sb, q, fgq, force_fg);
 	qtail[JQ_BYPASS] = &q[JQ_BYPASS]->head;
@@ -1463,14 +1463,18 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,
 			if (bio && (cur != last_index + 1 ||
 				    last_bdev != mdev.m_bdev)) {
 submit_bio_retry:
-				if (!pflags)
-					psi_memstall_leave(&pflags);
 				submit_bio(bio);
+				if (memstall) {
+					psi_memstall_leave(&pflags);
+					memstall = 0;
+				}
 				bio = NULL;
 			}
 
-			if (unlikely(PageWorkingset(page)))
+			if (unlikely(PageWorkingset(page)) && !memstall) {
 				psi_memstall_enter(&pflags);
+				memstall = 1;
+			}
 
 			if (!bio) {
 				bio = bio_alloc(mdev.m_bdev, BIO_MAX_VECS,
@@ -1500,9 +1504,9 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,
 	} while (owned_head != Z_EROFS_PCLUSTER_TAIL);
 
 	if (bio) {
-		if (!pflags)
-			psi_memstall_leave(&pflags);
 		submit_bio(bio);
+		if (memstall)
+			psi_memstall_leave(&pflags);
 	}
 
 	/*
diff mbox series

Patch

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index e84d22c5c6a83..370788b9b1249 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -15,6 +15,7 @@ 
 #include <linux/string.h>
 #include <linux/backing-dev.h>
 #include <linux/writeback.h>
+#include <linux/psi.h>
 #include <linux/slab.h>
 #include <linux/sched/mm.h>
 #include <linux/log2.h>
@@ -519,7 +520,8 @@  static u64 bio_end_offset(struct bio *bio)
  */
 static noinline int add_ra_bio_pages(struct inode *inode,
 				     u64 compressed_end,
-				     struct compressed_bio *cb)
+				     struct compressed_bio *cb,
+				     unsigned long *pflags)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	unsigned long end_index;
@@ -588,6 +590,9 @@  static noinline int add_ra_bio_pages(struct inode *inode,
 			continue;
 		}
 
+		if (PageWorkingset(page))
+			psi_memstall_enter(pflags);
+
 		ret = set_page_extent_mapped(page);
 		if (ret < 0) {
 			unlock_page(page);
@@ -674,6 +679,8 @@  void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 	u64 em_len;
 	u64 em_start;
 	struct extent_map *em;
+	/* Initialize to 1 to make skip psi_memstall_leave unless needed */
+	unsigned long pflags = 1;
 	blk_status_t ret;
 	int ret2;
 	int i;
@@ -729,7 +736,7 @@  void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 		goto fail;
 	}
 
-	add_ra_bio_pages(inode, em_start + em_len, cb);
+	add_ra_bio_pages(inode, em_start + em_len, cb, &pflags);
 
 	/* include any pages we added in add_ra-bio_pages */
 	cb->len = bio->bi_iter.bi_size;
@@ -810,6 +817,9 @@  void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 		}
 	}
 
+	if (!pflags)
+		psi_memstall_leave(&pflags);
+
 	if (refcount_dec_and_test(&cb->pending_ios))
 		finish_compressed_bio_read(cb);
 	return;