Message ID | b5f8d32e8b4fb1bfb3a88f5373530980f1d2f523.1723798569.git.wqu@suse.com (mailing list archive) |
---|---|
State | New, archived |
Series | [RFC] btrfs: only run extent map shrinker inside cleaner kthread |
On Fri, Aug 16, 2024 at 9:58 AM Qu Wenruo <wqu@suse.com> wrote:
>
> There are still reports about the extent map shrinker causing high CPU
> usage.
>
> It turns out that the callbacks can be very frequent, even inside
> kswapd (and we can have multiple kswapd threads if the system has
> several nodes).

You mean numa nodes?
Are you sure that's the problem, and not something like tasks waiting
a long time for a kswapd task because it's doing extent map removal?

> XFS, the only other fs implementing the reclaim callbacks, does it
> very differently by ensuring the following conditions:
>
> - Make sure there is only one reclaim work queued
>
> - Add a delay before queuing the reclaim workload
>   The default delay is 18s (60% of xfs_syncd_centisecs)
>
> In btrfs, there is already a context which is very similar to the XFS
> conditions: the cleaner kthread.
>
> There is only one cleaner kthread for the fs, and it's woken
> periodically (the same time interval as the commit interval, which is
> 30s by default).
>
> So it's much better to run the extent map shrinker inside the cleaner
> kthread.

No please, don't do it in the cleaner.

There's already a lot of different work that the cleaner does, from
running delayed iputs, to defragging inodes, removing deleted roots,
etc. Adding the shrinker there only increases the delay for those
tasks, and those tasks increase the delay for the shrinker to run and
free memory.

I'd rather have this done as an unbounded work queue item like we do
for space reclaim.
In fact one thing I have in my todo list is to get rid of the cleaner
and have all work done in work queues (it's not as straightforward as
it may seem).

> And make the free_cached_objects() callback only record the latest
> number to free.
>
> Link: https://lore.kernel.org/linux-btrfs/3df4acd616a07ef4d2dc6bad668701504b412ffc.camel@intelfx.name/
> Link: https://lore.kernel.org/linux-btrfs/c30fd6b3-ca7a-4759-8a53-d42878bf84f7@gmail.com/
> Reported-by: Ivan Shapovalov <intelfx@intelfx.name>
> Reported-by: Jannik Glückert <jannik.glueckert@gmail.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Reason for RFC:
>
> I do not have an environment which can trigger the reclaim that
> frequently, thus more tests would be appreciated.

So I'd rather have time spent analysing things and testing them than
having hypothetical test patches.

> If one wants to test, please use this branch:
> https://github.com/adam900710/linux.git em_shrink_freq
>
> And enable CONFIG_BTRFS_DEBUG (since the previous patch hides the
> shrinker behind DEBUG builds).
>
> [...]
>
> @@ -2417,15 +2411,13 @@ static long btrfs_free_cached_objects(struct super_block *sb, struct shrink_cont
>         struct btrfs_fs_info *fs_info = btrfs_sb(sb);
>
>         /*
> -        * We may be called from any task trying to allocate memory and we don't
> -        * want to slow it down with scanning and dropping extent maps. It would
> -        * also cause heavy lock contention if many tasks concurrently enter
> -        * here. Therefore only allow kswapd tasks to scan and drop extent maps.
> +        * Only record the latest nr_to_scan value. The real scan will happen
> +        * at cleaner kthread.
> +        * As free_cached_objects() can be triggered very frequently, it's
> +        * not practical to scan the whole fs to reclaim extent maps.
>          */
> -       if (!current_is_kswapd())
> -               return 0;
> -
> -       return btrfs_free_extent_maps(fs_info, nr_to_scan);
> +       WRITE_ONCE(fs_info->extent_map_shrinker_nr_to_scan, nr_to_scan);

This is never reset to 0, so the shrinker will run even when we don't
need to free extent maps anymore, doing extra work, causing some
fsyncs to go to the slow path, etc.

If you can wait about 1 week for me to be back from vacation, I can
continue with plans I have for the shrinker, and just disable the
shrinker in the meanwhile.

Thanks Qu.

> [...]
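A minimal sketch of the unbounded work queue approach suggested above. This is not the actual patch: the btrfs_fs_info fields em_shrinker_work and em_shrinker_queued are hypothetical additions assumed here for illustration.

/*
 * Hypothetical additions to struct btrfs_fs_info (not in the patch):
 *	struct work_struct em_shrinker_work;
 *	atomic_t em_shrinker_queued;
 */

static void btrfs_extent_map_shrinker_worker(struct work_struct *work)
{
	struct btrfs_fs_info *fs_info =
		container_of(work, struct btrfs_fs_info, em_shrinker_work);

	btrfs_free_extent_maps(fs_info);
	/* Let the next free_cached_objects() call queue us again. */
	atomic_set(&fs_info->em_shrinker_queued, 0);
}

static void btrfs_queue_extent_map_shrinker(struct btrfs_fs_info *fs_info,
					    long nr_to_scan)
{
	WRITE_ONCE(fs_info->extent_map_shrinker_nr_to_scan, nr_to_scan);
	/* Guarantee at most one queued instance of the work. */
	if (atomic_cmpxchg(&fs_info->em_shrinker_queued, 0, 1) == 0)
		queue_work(system_unbound_wq, &fs_info->em_shrinker_work);
}

The point of system_unbound_wq here is that the scan can run on any CPU and for as long as needed without stalling per-CPU workqueue items, and the hot shrinker callback only pays for a store plus, at most, one cmpxchg.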
On 2024/8/16 20:39, Filipe Manana wrote:
> On Fri, Aug 16, 2024 at 9:58 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> There are still reports about the extent map shrinker causing high CPU
>> usage.
>>
>> It turns out that the callbacks can be very frequent, even inside
>> kswapd (and we can have multiple kswapd threads if the system has
>> several nodes).
>
> You mean numa nodes?
> Are you sure that's the problem, and not something like tasks waiting
> a long time for a kswapd task because it's doing extent map removal?

I'm pretty sure this is not the root cause.

>> [...]
>>
>> So it's much better to run the extent map shrinker inside the cleaner
>> kthread.
>
> No please, don't do it in the cleaner.
>
> There's already a lot of different work that the cleaner does, from
> running delayed iputs, to defragging inodes, removing deleted roots,
> etc. Adding the shrinker there only increases the delay for those
> tasks, and those tasks increase the delay for the shrinker to run and
> free memory.
>
> I'd rather have this done as an unbounded work queue item like we do
> for space reclaim.
> In fact one thing I have in my todo list is to get rid of the cleaner
> and have all work done in work queues (it's not as straightforward as
> it may seem).

OK, that also works fine for me, although it would be better to also
introduce the extra delay, just like XFS.

>> [...]
>>
>> +       WRITE_ONCE(fs_info->extent_map_shrinker_nr_to_scan, nr_to_scan);
>
> This is never reset to 0, so the shrinker will run even when we don't
> need to free extent maps anymore, doing extra work, causing some
> fsyncs to go to the slow path, etc.

This will be updated by any newer free_cached_objects() callback, with
the calculated nr_to_scan.

If there is really nothing to free, it will be set to 0, as
nr_cached_objects() will return 0.

> If you can wait about 1 week for me to be back from vacation, I can
> continue with plans I have for the shrinker, and just disable the
> shrinker in the meanwhile.

Sure, that also works for me.

Thanks,
Qu

> Thanks Qu.
>
>> [...]
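A minimal sketch of the extra delay Qu asks for above, assuming a hypothetical struct delayed_work field em_reclaim_work in btrfs_fs_info. Note that queue_delayed_work() does nothing and returns false when the work is already pending, so it provides the "only one reclaim work queued" property from the commit message for free.

/* Sketch only: em_reclaim_work is a hypothetical delayed_work field. */
#define BTRFS_EM_RECLAIM_DELAY	(18 * HZ)	/* mirroring the XFS default */

static void btrfs_queue_em_reclaim(struct btrfs_fs_info *fs_info)
{
	/*
	 * A no-op if the work is already pending: at most one reclaim
	 * run is ever queued, and it runs no sooner than the delay
	 * measured from the first request.
	 */
	queue_delayed_work(system_unbound_wq, &fs_info->em_reclaim_work,
			   BTRFS_EM_RECLAIM_DELAY);
}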
On 2024/8/17 07:22, Qu Wenruo wrote:
> On 2024/8/16 20:39, Filipe Manana wrote:
>> [...]
>>> +       WRITE_ONCE(fs_info->extent_map_shrinker_nr_to_scan, nr_to_scan);
>>
>> This is never reset to 0, so the shrinker will run even when we don't
>> need to free extent maps anymore, doing extra work, causing some
>> fsyncs to go to the slow path, etc.
>
> This will be updated by any newer free_cached_objects() callback, with
> the calculated nr_to_scan.
>
> If there is really nothing to free, it will be set to 0, as
> nr_cached_objects() will return 0.

My bad, this is not going to work: if nr_cached_objects() returns 0,
free_cached_objects() will not be called.

So you're totally right, this is not the correct way to go.

Thanks,
Qu

>> If you can wait about 1 week for me to be back from vacation, I can
>> continue with plans I have for the shrinker, and just disable the
>> shrinker in the meanwhile.
>
> Sure, that also works for me.
>
> [...]
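For context, the VFS behaviour Qu refers to, heavily condensed from super_cache_scan() in fs/super.c (locking, the dentry/inode LRU scan, and the proportional scaling of fs_objects are elided here):

static unsigned long super_cache_scan(struct shrinker *shrink,
				      struct shrink_control *sc)
{
	struct super_block *sb = shrink->private_data;
	long fs_objects = 0;
	long freed = 0;

	if (sb->s_op->nr_cached_objects)
		fs_objects = sb->s_op->nr_cached_objects(sb, sc);

	/* ... dentry and inode LRU scanning elided ... */

	/* A report of 0 objects means the free callback is skipped. */
	if (fs_objects) {
		sc->nr_to_scan = fs_objects + 1;
		freed += sb->s_op->free_cached_objects(sb, sc);
	}
	return freed;
}

So a filesystem whose nr_cached_objects() reports 0 never gets a free_cached_objects() call in which it could reset its own bookkeeping, which is exactly why the stale nr_to_scan would never be cleared.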
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a6f5441e62d1..624dd7552e0f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1542,6 +1542,9 @@ static int cleaner_kthread(void *arg)
                 * space.
                 */
                btrfs_reclaim_bgs(fs_info);
+
+               if (IS_ENABLED(CONFIG_BTRFS_DEBUG))
+                       btrfs_free_extent_maps(fs_info);
 sleep:
                clear_and_wake_up_bit(BTRFS_FS_CLEANER_RUNNING, &fs_info->flags);
                if (kthread_should_park())
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 25d191f1ac10..1491429f9386 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -1248,13 +1248,14 @@ static long btrfs_scan_root(struct btrfs_root *root, struct btrfs_em_shrink_ctx
        return nr_dropped;
 }
 
-long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan)
+long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info)
 {
        struct btrfs_em_shrink_ctx ctx;
        u64 start_root_id;
        u64 next_root_id;
        bool cycled = false;
        long nr_dropped = 0;
+       long nr_to_scan = READ_ONCE(fs_info->extent_map_shrinker_nr_to_scan);
 
        ctx.scanned = 0;
        ctx.nr_to_scan = nr_to_scan;
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index 5154a8f1d26c..070621b4467f 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -189,6 +189,6 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode,
 int btrfs_replace_extent_map_range(struct btrfs_inode *inode,
                                   struct extent_map *new_em,
                                   bool modified);
-long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan);
+long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info);
 
 #endif
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 3d6d4b503220..a594c8309693 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -636,6 +636,7 @@ struct btrfs_fs_info {
        spinlock_t extent_map_shrinker_lock;
        u64 extent_map_shrinker_last_root;
        u64 extent_map_shrinker_last_ino;
+       long extent_map_shrinker_nr_to_scan;
 
        /* Protected by 'trans_lock'. */
        struct list_head dirty_cowonly_roots;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 98fa0f382480..5d9958063ddd 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2402,13 +2402,7 @@ static long btrfs_nr_cached_objects(struct super_block *sb, struct shrink_contro
 
        trace_btrfs_extent_map_shrinker_count(fs_info, nr);
 
-       /*
-        * Only report the real number for DEBUG builds, as there are reports of
-        * serious performance degradation caused by too frequent shrinks.
-        */
-       if (IS_ENABLED(CONFIG_BTRFS_DEBUG))
-               return nr;
-       return 0;
+       return nr;
 }
 
 static long btrfs_free_cached_objects(struct super_block *sb, struct shrink_control *sc)
@@ -2417,15 +2411,13 @@ static long btrfs_free_cached_objects(struct super_block *sb, struct shrink_cont
        struct btrfs_fs_info *fs_info = btrfs_sb(sb);
 
        /*
-        * We may be called from any task trying to allocate memory and we don't
-        * want to slow it down with scanning and dropping extent maps. It would
-        * also cause heavy lock contention if many tasks concurrently enter
-        * here. Therefore only allow kswapd tasks to scan and drop extent maps.
+        * Only record the latest nr_to_scan value. The real scan will happen
+        * at cleaner kthread.
+        * As free_cached_objects() can be triggered very frequently, it's
+        * not practical to scan the whole fs to reclaim extent maps.
         */
-       if (!current_is_kswapd())
-               return 0;
-
-       return btrfs_free_extent_maps(fs_info, nr_to_scan);
+       WRITE_ONCE(fs_info->extent_map_shrinker_nr_to_scan, nr_to_scan);
+       return 0;
 }
 
 static const struct super_operations btrfs_super_ops = {
-- 
2.46.0
There are still reports about the extent map shrinker causing high CPU
usage.

It turns out that the callbacks can be very frequent, even inside
kswapd (and we can have multiple kswapd threads if the system has
several nodes).

XFS, the only other fs implementing the reclaim callbacks, does it very
differently by ensuring the following conditions:

- Make sure there is only one reclaim work queued

- Add a delay before queuing the reclaim workload
  The default delay is 18s (60% of xfs_syncd_centisecs)

In btrfs, there is already a context which is very similar to the XFS
conditions: the cleaner kthread.

There is only one cleaner kthread for the fs, and it's woken
periodically (the same time interval as the commit interval, which is
30s by default).

So it's much better to run the extent map shrinker inside the cleaner
kthread.

And make the free_cached_objects() callback only record the latest
number to free.

Link: https://lore.kernel.org/linux-btrfs/3df4acd616a07ef4d2dc6bad668701504b412ffc.camel@intelfx.name/
Link: https://lore.kernel.org/linux-btrfs/c30fd6b3-ca7a-4759-8a53-d42878bf84f7@gmail.com/
Reported-by: Ivan Shapovalov <intelfx@intelfx.name>
Reported-by: Jannik Glückert <jannik.glueckert@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Reason for RFC:

I do not have an environment which can trigger the reclaim that
frequently, thus more tests would be appreciated.

If one wants to test, please use this branch:
https://github.com/adam900710/linux.git em_shrink_freq

And enable CONFIG_BTRFS_DEBUG (since the previous patch hides the
shrinker behind DEBUG builds).

---
 fs/btrfs/disk-io.c    |  3 +++
 fs/btrfs/extent_map.c |  3 ++-
 fs/btrfs/extent_map.h |  2 +-
 fs/btrfs/fs.h         |  1 +
 fs/btrfs/super.c      | 22 +++++++---------------
 5 files changed, 14 insertions(+), 17 deletions(-)
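A standalone illustration of the record-and-defer pattern this patch uses, with hypothetical names throughout. WRITE_ONCE()/READ_ONCE() only prevent the compiler from tearing or caching the plain long access; they impose no ordering, which is fine here since the value is merely a hint.

static long em_nr_to_scan;	/* hypothetical stand-in for the fs_info field */

static void scan_extent_maps(long nr)
{
	/* Hypothetical stand-in for the real extent map scan. */
}

/* Hot path: the shrinker callback only publishes the latest target. */
static long record_nr_to_scan(long nr_to_scan)
{
	WRITE_ONCE(em_nr_to_scan, nr_to_scan);
	return 0;	/* report nothing freed; the real work is deferred */
}

/* Cold path: the periodic worker consumes whatever was last published. */
static void deferred_em_scan(void)
{
	long nr = READ_ONCE(em_nr_to_scan);

	if (nr > 0)
		scan_extent_maps(nr);
}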