btrfs: Don't block system suspend during fstrim

Message ID	20240822164908.4957-1-luca.stefani.ge1@gmail.com (mailing list archive)
State	New
Headers	show Received: from mail-wm1-f46.google.com (mail-wm1-f46.google.com [209.85.128.46]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6E518182DF; Thu, 22 Aug 2024 16:50:21 +0000 (UTC) From: Luca Stefani <luca.stefani.ge1@gmail.com> To: Cc: Luca Stefani <luca.stefani.ge1@gmail.com>, Chris Mason <clm@fb.com>, Josef Bacik <josef@toxicpanda.com>, David Sterba <dsterba@suse.com>, linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH] btrfs: Don't block system suspend during fstrim Date: Thu, 22 Aug 2024 18:48:21 +0200 Message-ID: <20240822164908.4957-1-luca.stefani.ge1@gmail.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	btrfs: Don't block system suspend during fstrim \| expand btrfs: Don't block system suspend during fstrim

Luca Stefani Aug. 22, 2024, 4:48 p.m. UTC

Sometimes the system isn't able to suspend because
the task responsible for trimming the device isn't
able to finish in time.

Since discard isn't a critical call it can be interrupted
at any time, we can simply report the amount of discarded
bytes in such cases and stop the trim.

Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219180
Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
---
I have no idea if that's correct, just something I implemented
looking at the same solution made in ext4 by 5229a658f645.

The patch in itself seems to solve the issue.

repro is as follows:
sudo /sbin/fstrim --listed-in /etc/fstab:/proc/self/mountinfo --verbose --quiet-unsupported &
sudo ./sleepgraph.py -m mem -rtcwake 5

[836563.289069] PM: suspend exit
[836563.909298] PM: suspend entry (s2idle)
[836563.935447] Filesystems sync: 0.026 seconds
[836563.951391] Freezing user space processes
[836583.958957] Freezing user space processes failed after 20.007 seconds (1 tasks refusing to freeze, wq_busy=0):
[836583.959582] task:fstrim          state:D stack:0     pid:241865 tgid:241865 ppid:241864 flags:0x00004006
[836583.959592] Call Trace:
[836583.959595]  <TASK>
[836583.959600]  __schedule+0x400/0x1720
[836583.959612]  ? mod_delayed_work_on+0xa4/0xb0
[836583.959622]  ? srso_alias_return_thunk+0x5/0xfbef5
[836583.959628]  ? srso_alias_return_thunk+0x5/0xfbef5
[836583.959631]  ? blk_mq_flush_plug_list.part.0+0x1e3/0x610
[836583.959640]  schedule+0x27/0xf0
[836583.959644]  schedule_timeout+0x12f/0x160
[836583.959652]  io_schedule_timeout+0x51/0x70
[836583.959657]  wait_for_completion_io+0x8a/0x160
[836583.959663]  submit_bio_wait+0x60/0x90
[836583.959671]  blkdev_issue_discard+0x91/0x100
[836583.959680]  btrfs_issue_discard+0xc4/0x140
[836583.959689]  btrfs_discard_extent+0x241/0x2a0
[836583.959695]  ? srso_alias_return_thunk+0x5/0xfbef5
[836583.959702]  do_trimming+0xd2/0x240
[836583.959712]  trim_bitmaps+0x350/0x4c0
[836583.959723]  btrfs_trim_block_group+0xb8/0x110
[836583.959729]  btrfs_trim_fs+0x118/0x440
[836583.959734]  ? srso_alias_return_thunk+0x5/0xfbef5
[836583.959738]  ? security_capable+0x41/0x70
[836583.959746]  btrfs_ioctl_fitrim+0x113/0x180
[836583.959752]  btrfs_ioctl+0xdaf/0x2670
[836583.959759]  ? srso_alias_return_thunk+0x5/0xfbef5
[836583.959763]  ? ioctl_has_perm.constprop.0.isra.0+0xd8/0x130
[836583.959774]  __x64_sys_ioctl+0x94/0xd0
[836583.959782]  do_syscall_64+0x82/0x160
[836583.959790]  ? srso_alias_return_thunk+0x5/0xfbef5
[836583.959793]  ? syscall_exit_to_user_mode+0x72/0x220
[836583.959799]  ? srso_alias_return_thunk+0x5/0xfbef5
[836583.959802]  ? do_syscall_64+0x8e/0x160
[836583.959807]  ? srso_alias_return_thunk+0x5/0xfbef5
[836583.959811]  ? do_sys_openat2+0x9c/0xe0
[836583.959821]  ? srso_alias_return_thunk+0x5/0xfbef5
[836583.959825]  ? syscall_exit_to_user_mode+0x72/0x220
[836583.959828]  ? srso_alias_return_thunk+0x5/0xfbef5
[836583.959832]  ? do_syscall_64+0x8e/0x160
[836583.959835]  ? syscall_exit_to_user_mode+0x72/0x220
[836583.959838]  ? srso_alias_return_thunk+0x5/0xfbef5
[836583.959842]  ? do_syscall_64+0x8e/0x160
[836583.959845]  ? srso_alias_return_thunk+0x5/0xfbef5
[836583.959849]  ? do_syscall_64+0x8e/0x160
[836583.959851]  ? srso_alias_return_thunk+0x5/0xfbef5
[836583.959855]  ? do_syscall_64+0x8e/0x160
[836583.959858]  ? srso_alias_return_thunk+0x5/0xfbef5
[836583.959861]  ? do_syscall_64+0x8e/0x160
[836583.959864]  ? srso_alias_return_thunk+0x5/0xfbef5
[836583.959868]  ? srso_alias_return_thunk+0x5/0xfbef5
[836583.959873]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[836583.959878] RIP: 0033:0x7f3e4261af2d
[836583.959944] RSP: 002b:00007ffec002f400 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[836583.959950] RAX: ffffffffffffffda RBX: 00007ffec002f570 RCX: 00007f3e4261af2d
[836583.959952] RDX: 00007ffec002f470 RSI: 00000000c0185879 RDI: 0000000000000003
[836583.959955] RBP: 00007ffec002f450 R08: 0000562d74da7010 R09: 00007ffec002e7f2
[836583.959957] R10: 0000000000000000 R11: 0000000000000246 R12: 0000562d74daafc0
[836583.959960] R13: 0000000000000003 R14: 0000562d74daa970 R15: 0000562d74daad40
[836583.959967]  </TASK>
---
 fs/btrfs/extent-tree.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

Luca Stefani Sept. 2, 2024, 8:32 a.m. UTC | #1

Any update on this? It's not critical but I'd like to know if it's in 
some part proper.
Thanks, Luca.
> Sometimes the system isn't able to suspend because
> the task responsible for trimming the device isn't
> able to finish in time.
>
> Since discard isn't a critical call it can be interrupted
> at any time, we can simply report the amount of discarded
> bytes in such cases and stop the trim.
>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219180
> Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
> ---
> I have no idea if that's correct, just something I implemented
> looking at the same solution made in ext4 by 5229a658f645.
>
> The patch in itself seems to solve the issue.
>
> repro is as follows:
> sudo /sbin/fstrim --listed-in /etc/fstab:/proc/self/mountinfo --verbose --quiet-unsupported &
> sudo ./sleepgraph.py -m mem -rtcwake 5
>
> [836563.289069] PM: suspend exit
> [836563.909298] PM: suspend entry (s2idle)
> [836563.935447] Filesystems sync: 0.026 seconds
> [836563.951391] Freezing user space processes
> [836583.958957] Freezing user space processes failed after 20.007 seconds (1 tasks refusing to freeze, wq_busy=0):
> [836583.959582] task:fstrim          state:D stack:0     pid:241865 tgid:241865 ppid:241864 flags:0x00004006
> [836583.959592] Call Trace:
> [836583.959595]  <TASK>
> [836583.959600]  __schedule+0x400/0x1720
> [836583.959612]  ? mod_delayed_work_on+0xa4/0xb0
> [836583.959622]  ? srso_alias_return_thunk+0x5/0xfbef5
> [836583.959628]  ? srso_alias_return_thunk+0x5/0xfbef5
> [836583.959631]  ? blk_mq_flush_plug_list.part.0+0x1e3/0x610
> [836583.959640]  schedule+0x27/0xf0
> [836583.959644]  schedule_timeout+0x12f/0x160
> [836583.959652]  io_schedule_timeout+0x51/0x70
> [836583.959657]  wait_for_completion_io+0x8a/0x160
> [836583.959663]  submit_bio_wait+0x60/0x90
> [836583.959671]  blkdev_issue_discard+0x91/0x100
> [836583.959680]  btrfs_issue_discard+0xc4/0x140
> [836583.959689]  btrfs_discard_extent+0x241/0x2a0
> [836583.959695]  ? srso_alias_return_thunk+0x5/0xfbef5
> [836583.959702]  do_trimming+0xd2/0x240
> [836583.959712]  trim_bitmaps+0x350/0x4c0
> [836583.959723]  btrfs_trim_block_group+0xb8/0x110
> [836583.959729]  btrfs_trim_fs+0x118/0x440
> [836583.959734]  ? srso_alias_return_thunk+0x5/0xfbef5
> [836583.959738]  ? security_capable+0x41/0x70
> [836583.959746]  btrfs_ioctl_fitrim+0x113/0x180
> [836583.959752]  btrfs_ioctl+0xdaf/0x2670
> [836583.959759]  ? srso_alias_return_thunk+0x5/0xfbef5
> [836583.959763]  ? ioctl_has_perm.constprop.0.isra.0+0xd8/0x130
> [836583.959774]  __x64_sys_ioctl+0x94/0xd0
> [836583.959782]  do_syscall_64+0x82/0x160
> [836583.959790]  ? srso_alias_return_thunk+0x5/0xfbef5
> [836583.959793]  ? syscall_exit_to_user_mode+0x72/0x220
> [836583.959799]  ? srso_alias_return_thunk+0x5/0xfbef5
> [836583.959802]  ? do_syscall_64+0x8e/0x160
> [836583.959807]  ? srso_alias_return_thunk+0x5/0xfbef5
> [836583.959811]  ? do_sys_openat2+0x9c/0xe0
> [836583.959821]  ? srso_alias_return_thunk+0x5/0xfbef5
> [836583.959825]  ? syscall_exit_to_user_mode+0x72/0x220
> [836583.959828]  ? srso_alias_return_thunk+0x5/0xfbef5
> [836583.959832]  ? do_syscall_64+0x8e/0x160
> [836583.959835]  ? syscall_exit_to_user_mode+0x72/0x220
> [836583.959838]  ? srso_alias_return_thunk+0x5/0xfbef5
> [836583.959842]  ? do_syscall_64+0x8e/0x160
> [836583.959845]  ? srso_alias_return_thunk+0x5/0xfbef5
> [836583.959849]  ? do_syscall_64+0x8e/0x160
> [836583.959851]  ? srso_alias_return_thunk+0x5/0xfbef5
> [836583.959855]  ? do_syscall_64+0x8e/0x160
> [836583.959858]  ? srso_alias_return_thunk+0x5/0xfbef5
> [836583.959861]  ? do_syscall_64+0x8e/0x160
> [836583.959864]  ? srso_alias_return_thunk+0x5/0xfbef5
> [836583.959868]  ? srso_alias_return_thunk+0x5/0xfbef5
> [836583.959873]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [836583.959878] RIP: 0033:0x7f3e4261af2d
> [836583.959944] RSP: 002b:00007ffec002f400 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [836583.959950] RAX: ffffffffffffffda RBX: 00007ffec002f570 RCX: 00007f3e4261af2d
> [836583.959952] RDX: 00007ffec002f470 RSI: 00000000c0185879 RDI: 0000000000000003
> [836583.959955] RBP: 00007ffec002f450 R08: 0000562d74da7010 R09: 00007ffec002e7f2
> [836583.959957] R10: 0000000000000000 R11: 0000000000000246 R12: 0000562d74daafc0
> [836583.959960] R13: 0000000000000003 R14: 0000562d74daa970 R15: 0000562d74daad40
> [836583.959967]  </TASK>
> ---
>   fs/btrfs/extent-tree.c | 20 ++++++++++++++++----
>   1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index feec49e6f9c8..7e4c1d4f2f7c 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -16,6 +16,7 @@
>   #include <linux/percpu_counter.h>
>   #include <linux/lockdep.h>
>   #include <linux/crc32c.h>
> +#include <linux/freezer.h>
>   #include "ctree.h"
>   #include "extent-tree.h"
>   #include "transaction.h"
> @@ -6361,6 +6362,11 @@ void btrfs_error_unpin_extent_range(struct btrfs_fs_info *fs_info, u64 start, u6
>   	unpin_extent_range(fs_info, start, end, false);
>   }
>   
> +static bool btrfs_trim_interrupted(void)
> +{
> +	return fatal_signal_pending(current) || freezing(current);
> +}
> +
>   /*
>    * It used to be that old block groups would be left around forever.
>    * Iterating over them would be enough to trim unused space.  Since we
> @@ -6459,8 +6465,8 @@ static int btrfs_trim_free_extents(struct btrfs_device *device, u64 *trimmed)
>   		start += len;
>   		*trimmed += bytes;
>   
> -		if (fatal_signal_pending(current)) {
> -			ret = -ERESTARTSYS;
> +		if (btrfs_trim_interrupted()) {
> +			ret = 0;
>   			break;
>   		}
>   
> @@ -6508,6 +6514,9 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range)
>   
>   	cache = btrfs_lookup_first_block_group(fs_info, range->start);
>   	for (; cache; cache = btrfs_next_block_group(cache)) {
> +		if (btrfs_trim_interrupted())
> +			break;
> +
>   		if (cache->start >= range_end) {
>   			btrfs_put_block_group(cache);
>   			break;
> @@ -6547,17 +6556,20 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range)
>   
>   	mutex_lock(&fs_devices->device_list_mutex);
>   	list_for_each_entry(device, &fs_devices->devices, dev_list) {
> +		if (btrfs_trim_interrupted())
> +			break;
> +
>   		if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
>   			continue;
>   
>   		ret = btrfs_trim_free_extents(device, &group_trimmed);
> +
> +		trimmed += group_trimmed;
>   		if (ret) {
>   			dev_failed++;
>   			dev_ret = ret;
>   			break;
>   		}
> -
> -		trimmed += group_trimmed;
>   	}
>   	mutex_unlock(&fs_devices->device_list_mutex);
>

Qu Wenruo Sept. 2, 2024, 8:49 a.m. UTC | #2

在 2024/9/2 18:02, Luca Stefani 写道:
> Any update on this? It's not critical but I'd like to know if it's in
> some part proper.
> Thanks, Luca.

Sorry I didn't see your patch in the list, thus sent a different fix for
it later:

https://lore.kernel.org/linux-btrfs/20240830185113.GW25962@twin.jikos.cz/T/#t

>> Sometimes the system isn't able to suspend because
>> the task responsible for trimming the device isn't
>> able to finish in time.
>>
>> Since discard isn't a critical call it can be interrupted
>> at any time, we can simply report the amount of discarded
>> bytes in such cases and stop the trim.
>>
>> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219180
>> Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
>> ---
>> I have no idea if that's correct, just something I implemented
>> looking at the same solution made in ext4 by 5229a658f645.
>>
>> The patch in itself seems to solve the issue.
>>
>> repro is as follows:
>> sudo /sbin/fstrim --listed-in /etc/fstab:/proc/self/mountinfo
>> --verbose --quiet-unsupported &
>> sudo ./sleepgraph.py -m mem -rtcwake 5
>>
>> [836563.289069] PM: suspend exit
>> [836563.909298] PM: suspend entry (s2idle)
>> [836563.935447] Filesystems sync: 0.026 seconds
>> [836563.951391] Freezing user space processes
>> [836583.958957] Freezing user space processes failed after 20.007
>> seconds (1 tasks refusing to freeze, wq_busy=0):
>> [836583.959582] task:fstrim          state:D stack:0     pid:241865
>> tgid:241865 ppid:241864 flags:0x00004006
>> [836583.959592] Call Trace:
>> [836583.959595]  <TASK>
>> [836583.959600]  __schedule+0x400/0x1720
>> [836583.959612]  ? mod_delayed_work_on+0xa4/0xb0
>> [836583.959622]  ? srso_alias_return_thunk+0x5/0xfbef5
>> [836583.959628]  ? srso_alias_return_thunk+0x5/0xfbef5
>> [836583.959631]  ? blk_mq_flush_plug_list.part.0+0x1e3/0x610
>> [836583.959640]  schedule+0x27/0xf0
>> [836583.959644]  schedule_timeout+0x12f/0x160
>> [836583.959652]  io_schedule_timeout+0x51/0x70
>> [836583.959657]  wait_for_completion_io+0x8a/0x160
>> [836583.959663]  submit_bio_wait+0x60/0x90
>> [836583.959671]  blkdev_issue_discard+0x91/0x100
>> [836583.959680]  btrfs_issue_discard+0xc4/0x140
>> [836583.959689]  btrfs_discard_extent+0x241/0x2a0
>> [836583.959695]  ? srso_alias_return_thunk+0x5/0xfbef5
>> [836583.959702]  do_trimming+0xd2/0x240
>> [836583.959712]  trim_bitmaps+0x350/0x4c0
>> [836583.959723]  btrfs_trim_block_group+0xb8/0x110
>> [836583.959729]  btrfs_trim_fs+0x118/0x440
>> [836583.959734]  ? srso_alias_return_thunk+0x5/0xfbef5
>> [836583.959738]  ? security_capable+0x41/0x70
>> [836583.959746]  btrfs_ioctl_fitrim+0x113/0x180
>> [836583.959752]  btrfs_ioctl+0xdaf/0x2670
>> [836583.959759]  ? srso_alias_return_thunk+0x5/0xfbef5
>> [836583.959763]  ? ioctl_has_perm.constprop.0.isra.0+0xd8/0x130
>> [836583.959774]  __x64_sys_ioctl+0x94/0xd0
>> [836583.959782]  do_syscall_64+0x82/0x160
>> [836583.959790]  ? srso_alias_return_thunk+0x5/0xfbef5
>> [836583.959793]  ? syscall_exit_to_user_mode+0x72/0x220
>> [836583.959799]  ? srso_alias_return_thunk+0x5/0xfbef5
>> [836583.959802]  ? do_syscall_64+0x8e/0x160
>> [836583.959807]  ? srso_alias_return_thunk+0x5/0xfbef5
>> [836583.959811]  ? do_sys_openat2+0x9c/0xe0
>> [836583.959821]  ? srso_alias_return_thunk+0x5/0xfbef5
>> [836583.959825]  ? syscall_exit_to_user_mode+0x72/0x220
>> [836583.959828]  ? srso_alias_return_thunk+0x5/0xfbef5
>> [836583.959832]  ? do_syscall_64+0x8e/0x160
>> [836583.959835]  ? syscall_exit_to_user_mode+0x72/0x220
>> [836583.959838]  ? srso_alias_return_thunk+0x5/0xfbef5
>> [836583.959842]  ? do_syscall_64+0x8e/0x160
>> [836583.959845]  ? srso_alias_return_thunk+0x5/0xfbef5
>> [836583.959849]  ? do_syscall_64+0x8e/0x160
>> [836583.959851]  ? srso_alias_return_thunk+0x5/0xfbef5
>> [836583.959855]  ? do_syscall_64+0x8e/0x160
>> [836583.959858]  ? srso_alias_return_thunk+0x5/0xfbef5
>> [836583.959861]  ? do_syscall_64+0x8e/0x160
>> [836583.959864]  ? srso_alias_return_thunk+0x5/0xfbef5
>> [836583.959868]  ? srso_alias_return_thunk+0x5/0xfbef5
>> [836583.959873]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>> [836583.959878] RIP: 0033:0x7f3e4261af2d
>> [836583.959944] RSP: 002b:00007ffec002f400 EFLAGS: 00000246 ORIG_RAX:
>> 0000000000000010
>> [836583.959950] RAX: ffffffffffffffda RBX: 00007ffec002f570 RCX:
>> 00007f3e4261af2d
>> [836583.959952] RDX: 00007ffec002f470 RSI: 00000000c0185879 RDI:
>> 0000000000000003
>> [836583.959955] RBP: 00007ffec002f450 R08: 0000562d74da7010 R09:
>> 00007ffec002e7f2
>> [836583.959957] R10: 0000000000000000 R11: 0000000000000246 R12:
>> 0000562d74daafc0
>> [836583.959960] R13: 0000000000000003 R14: 0000562d74daa970 R15:
>> 0000562d74daad40
>> [836583.959967]  </TASK>
>> ---
>>   fs/btrfs/extent-tree.c | 20 ++++++++++++++++----
>>   1 file changed, 16 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index feec49e6f9c8..7e4c1d4f2f7c 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -16,6 +16,7 @@
>>   #include <linux/percpu_counter.h>
>>   #include <linux/lockdep.h>
>>   #include <linux/crc32c.h>
>> +#include <linux/freezer.h>
>>   #include "ctree.h"
>>   #include "extent-tree.h"
>>   #include "transaction.h"
>> @@ -6361,6 +6362,11 @@ void btrfs_error_unpin_extent_range(struct
>> btrfs_fs_info *fs_info, u64 start, u6
>>       unpin_extent_range(fs_info, start, end, false);
>>   }
>> +static bool btrfs_trim_interrupted(void)
>> +{
>> +    return fatal_signal_pending(current) || freezing(current);
>> +}
>> +
>>   /*
>>    * It used to be that old block groups would be left around forever.
>>    * Iterating over them would be enough to trim unused space.  Since we
>> @@ -6459,8 +6465,8 @@ static int btrfs_trim_free_extents(struct
>> btrfs_device *device, u64 *trimmed)
>>           start += len;
>>           *trimmed += bytes;
>> -        if (fatal_signal_pending(current)) {
>> -            ret = -ERESTARTSYS;
>> +        if (btrfs_trim_interrupted()) {
>> +            ret = 0;
>>               break;

Here we should still return the same error number other than 0, to let
the caller know the operation is interrupted, other than finished normally.

>>           }
>> @@ -6508,6 +6514,9 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info,
>> struct fstrim_range *range)
>>       cache = btrfs_lookup_first_block_group(fs_info, range->start);
>>       for (; cache; cache = btrfs_next_block_group(cache)) {
>> +        if (btrfs_trim_interrupted())
>> +            break;
>> +

The same here.

>>           if (cache->start >= range_end) {
>>               btrfs_put_block_group(cache);
>>               break;
>> @@ -6547,17 +6556,20 @@ int btrfs_trim_fs(struct btrfs_fs_info
>> *fs_info, struct fstrim_range *range)
>>       mutex_lock(&fs_devices->device_list_mutex);
>>       list_for_each_entry(device, &fs_devices->devices, dev_list) {
>> +        if (btrfs_trim_interrupted())
>> +            break;
>> +

The same here.

Furthermore, I think we may not need the extra checks.

The fstrim is based on block groups, and a block group is normally 1GiB,
at most 10GiB (for RAID0/5/6/10 only), thus exiting at each block group
boundary should be enough to meet the hibernation/suspension timeout.



>>           if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
>>               continue;
>>           ret = btrfs_trim_free_extents(device, &group_trimmed);
>> +
>> +        trimmed += group_trimmed;
>>           if (ret) {
>>               dev_failed++;
>>               dev_ret = ret;
>>               break;
>>           }
>> -
>> -        trimmed += group_trimmed;

Any special reason moving the code here?

Thanks,
Qu

>>       }
>>       mutex_unlock(&fs_devices->device_list_mutex);
>

Qu Wenruo Sept. 2, 2024, 9:12 a.m. UTC | #3

在 2024/9/2 18:30, Luca Stefani 写道:
>
>
> On Mon, 2 Sept 2024 at 10:49, Qu Wenruo <quwenruo.btrfs@gmx.com
> <mailto:quwenruo.btrfs@gmx.com>> wrote:
>
>
>
>     在 2024/9/2 18:02, Luca Stefani 写道:
>      > Any update on this? It's not critical but I'd like to know if it's in
>      > some part proper.
>      > Thanks, Luca.
>
>     Sorry I didn't see your patch in the list, thus sent a different fix for
>     it later:
>
>     https://lore.kernel.org/linux-btrfs/20240830185113.GW25962@twin.jikos.cz/T/#t <https://lore.kernel.org/linux-btrfs/20240830185113.GW25962@twin.jikos.cz/T/#t>
>
>      >> Sometimes the system isn't able to suspend because
>      >> the task responsible for trimming the device isn't
>      >> able to finish in time.
>      >>
>      >> Since discard isn't a critical call it can be interrupted
>      >> at any time, we can simply report the amount of discarded
>      >> bytes in such cases and stop the trim.
>      >>
>      >> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219180
>     <https://bugzilla.kernel.org/show_bug.cgi?id=219180>
>      >> Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com
>     <mailto:luca.stefani.ge1@gmail.com>>
>      >> ---
>      >> I have no idea if that's correct, just something I implemented
>      >> looking at the same solution made in ext4 by 5229a658f645.
>      >>
>      >> The patch in itself seems to solve the issue.
>      >>
>      >> repro is as follows:
>      >> sudo /sbin/fstrim --listed-in /etc/fstab:/proc/self/mountinfo
>      >> --verbose --quiet-unsupported &
>      >> sudo ./sleepgraph.py -m mem -rtcwake 5
>      >>
>      >> [836563.289069] PM: suspend exit
>      >> [836563.909298] PM: suspend entry (s2idle)
>      >> [836563.935447] Filesystems sync: 0.026 seconds
>      >> [836563.951391] Freezing user space processes
>      >> [836583.958957] Freezing user space processes failed after 20.007
>      >> seconds (1 tasks refusing to freeze, wq_busy=0):
>      >> [836583.959582] task:fstrim          state:D stack:0     pid:241865
>      >> tgid:241865 ppid:241864 flags:0x00004006
>      >> [836583.959592] Call Trace:
>      >> [836583.959595]  <TASK>
>      >> [836583.959600]  __schedule+0x400/0x1720
>      >> [836583.959612]  ? mod_delayed_work_on+0xa4/0xb0
>      >> [836583.959622]  ? srso_alias_return_thunk+0x5/0xfbef5
>      >> [836583.959628]  ? srso_alias_return_thunk+0x5/0xfbef5
>      >> [836583.959631]  ? blk_mq_flush_plug_list.part.0+0x1e3/0x610
>      >> [836583.959640]  schedule+0x27/0xf0
>      >> [836583.959644]  schedule_timeout+0x12f/0x160
>      >> [836583.959652]  io_schedule_timeout+0x51/0x70
>      >> [836583.959657]  wait_for_completion_io+0x8a/0x160
>      >> [836583.959663]  submit_bio_wait+0x60/0x90
>      >> [836583.959671]  blkdev_issue_discard+0x91/0x100
>      >> [836583.959680]  btrfs_issue_discard+0xc4/0x140
>      >> [836583.959689]  btrfs_discard_extent+0x241/0x2a0
>      >> [836583.959695]  ? srso_alias_return_thunk+0x5/0xfbef5
>      >> [836583.959702]  do_trimming+0xd2/0x240
>      >> [836583.959712]  trim_bitmaps+0x350/0x4c0
>      >> [836583.959723]  btrfs_trim_block_group+0xb8/0x110
>      >> [836583.959729]  btrfs_trim_fs+0x118/0x440
>      >> [836583.959734]  ? srso_alias_return_thunk+0x5/0xfbef5
>      >> [836583.959738]  ? security_capable+0x41/0x70
>      >> [836583.959746]  btrfs_ioctl_fitrim+0x113/0x180
>      >> [836583.959752]  btrfs_ioctl+0xdaf/0x2670
>      >> [836583.959759]  ? srso_alias_return_thunk+0x5/0xfbef5
>      >> [836583.959763]  ? ioctl_has_perm.constprop.0.isra.0+0xd8/0x130
>      >> [836583.959774]  __x64_sys_ioctl+0x94/0xd0
>      >> [836583.959782]  do_syscall_64+0x82/0x160
>      >> [836583.959790]  ? srso_alias_return_thunk+0x5/0xfbef5
>      >> [836583.959793]  ? syscall_exit_to_user_mode+0x72/0x220
>      >> [836583.959799]  ? srso_alias_return_thunk+0x5/0xfbef5
>      >> [836583.959802]  ? do_syscall_64+0x8e/0x160
>      >> [836583.959807]  ? srso_alias_return_thunk+0x5/0xfbef5
>      >> [836583.959811]  ? do_sys_openat2+0x9c/0xe0
>      >> [836583.959821]  ? srso_alias_return_thunk+0x5/0xfbef5
>      >> [836583.959825]  ? syscall_exit_to_user_mode+0x72/0x220
>      >> [836583.959828]  ? srso_alias_return_thunk+0x5/0xfbef5
>      >> [836583.959832]  ? do_syscall_64+0x8e/0x160
>      >> [836583.959835]  ? syscall_exit_to_user_mode+0x72/0x220
>      >> [836583.959838]  ? srso_alias_return_thunk+0x5/0xfbef5
>      >> [836583.959842]  ? do_syscall_64+0x8e/0x160
>      >> [836583.959845]  ? srso_alias_return_thunk+0x5/0xfbef5
>      >> [836583.959849]  ? do_syscall_64+0x8e/0x160
>      >> [836583.959851]  ? srso_alias_return_thunk+0x5/0xfbef5
>      >> [836583.959855]  ? do_syscall_64+0x8e/0x160
>      >> [836583.959858]  ? srso_alias_return_thunk+0x5/0xfbef5
>      >> [836583.959861]  ? do_syscall_64+0x8e/0x160
>      >> [836583.959864]  ? srso_alias_return_thunk+0x5/0xfbef5
>      >> [836583.959868]  ? srso_alias_return_thunk+0x5/0xfbef5
>      >> [836583.959873]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>      >> [836583.959878] RIP: 0033:0x7f3e4261af2d
>      >> [836583.959944] RSP: 002b:00007ffec002f400 EFLAGS: 00000246
>     ORIG_RAX:
>      >> 0000000000000010
>      >> [836583.959950] RAX: ffffffffffffffda RBX: 00007ffec002f570 RCX:
>      >> 00007f3e4261af2d
>      >> [836583.959952] RDX: 00007ffec002f470 RSI: 00000000c0185879 RDI:
>      >> 0000000000000003
>      >> [836583.959955] RBP: 00007ffec002f450 R08: 0000562d74da7010 R09:
>      >> 00007ffec002e7f2
>      >> [836583.959957] R10: 0000000000000000 R11: 0000000000000246 R12:
>      >> 0000562d74daafc0
>      >> [836583.959960] R13: 0000000000000003 R14: 0000562d74daa970 R15:
>      >> 0000562d74daad40
>      >> [836583.959967]  </TASK>
>      >> ---
>      >>   fs/btrfs/extent-tree.c | 20 ++++++++++++++++----
>      >>   1 file changed, 16 insertions(+), 4 deletions(-)
>      >>
>      >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>      >> index feec49e6f9c8..7e4c1d4f2f7c 100644
>      >> --- a/fs/btrfs/extent-tree.c
>      >> +++ b/fs/btrfs/extent-tree.c
>      >> @@ -16,6 +16,7 @@
>      >>   #include <linux/percpu_counter.h>
>      >>   #include <linux/lockdep.h>
>      >>   #include <linux/crc32c.h>
>      >> +#include <linux/freezer.h>
>      >>   #include "ctree.h"
>      >>   #include "extent-tree.h"
>      >>   #include "transaction.h"
>      >> @@ -6361,6 +6362,11 @@ void btrfs_error_unpin_extent_range(struct
>      >> btrfs_fs_info *fs_info, u64 start, u6
>      >>       unpin_extent_range(fs_info, start, end, false);
>      >>   }
>      >> +static bool btrfs_trim_interrupted(void)
>      >> +{
>      >> +    return fatal_signal_pending(current) || freezing(current);
>      >> +}
>      >> +
>      >>   /*
>      >>    * It used to be that old block groups would be left around
>     forever.
>      >>    * Iterating over them would be enough to trim unused space.
>     Since we
>      >> @@ -6459,8 +6465,8 @@ static int btrfs_trim_free_extents(struct
>      >> btrfs_device *device, u64 *trimmed)
>      >>           start += len;
>      >>           *trimmed += bytes;
>      >> -        if (fatal_signal_pending(current)) {
>      >> -            ret = -ERESTARTSYS;
>      >> +        if (btrfs_trim_interrupted()) {
>      >> +            ret = 0;
>      >>               break;
>
>     Here we should still return the same error number other than 0, to let
>     the caller know the operation is interrupted, other than finished
>     normally.
>
> Here I was following how ext4 did it, my explanation for that was that
> the kernel
> may have still discarded some data before the thread was interrupted
> thus it made
> sense to report success.

In that case, progress is reported through fstrim_range structure, not
through the return value.
Even if we returned some error code, the fstrim_range::len is still
updated to indicate the progress.

So we need to keep the error code.

>
>
>      >>           }
>      >> @@ -6508,6 +6514,9 @@ int btrfs_trim_fs(struct btrfs_fs_info
>     *fs_info,
>      >> struct fstrim_range *range)
>      >>       cache = btrfs_lookup_first_block_group(fs_info, range->start);
>      >>       for (; cache; cache = btrfs_next_block_group(cache)) {
>      >> +        if (btrfs_trim_interrupted())
>      >> +            break;
>      >> +
>
>     The same here.
>
>      >>           if (cache->start >= range_end) {
>      >>               btrfs_put_block_group(cache);
>      >>               break;
>      >> @@ -6547,17 +6556,20 @@ int btrfs_trim_fs(struct btrfs_fs_info
>      >> *fs_info, struct fstrim_range *range)
>      >>       mutex_lock(&fs_devices->device_list_mutex);
>      >>       list_for_each_entry(device, &fs_devices->devices, dev_list) {
>      >> +        if (btrfs_trim_interrupted())
>      >> +            break;
>      >> +
>
>     The same here.
>
>     Furthermore, I think we may not need the extra checks.
>
>     The fstrim is based on block groups, and a block group is normally 1GiB,
>     at most 10GiB (for RAID0/5/6/10 only), thus exiting at each block group
>     boundary should be enough to meet the hibernation/suspension timeout.
>
> That's probably true, but 10seconds here wasn't enough and forcing the
> early return in the other cases
> was also required.
> I tried the current patch you linked earlier in my testing and that was
> the conclusion that led to me adding more checks.

My bad, I forgot that we have free extents trimming, which is not
limited to block group boundary, and that may be the root cause.

So in that case your extra checks are indeed needed.

Just need to change the return value.
>
>
>
>
>      >>           if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
>      >>               continue;
>      >>           ret = btrfs_trim_free_extents(device, &group_trimmed);
>      >> +
>      >> +        trimmed += group_trimmed;
>      >>           if (ret) {
>      >>               dev_failed++;
>      >>               dev_ret = ret;
>      >>               break;
>      >>           }
>      >> -
>      >> -        trimmed += group_trimmed;
>
>     Any special reason moving the code here?
>
> Same as not returning errno before in case of interrupt, I checked the
> code paths and it's still
> possible to trim some data (group_trimmed != 0) even in case of failure.

Oh, then it's fine.

Except the return code, everything looks fine to me now.

Just please update the commit message to explicitly mention that, we
have a free extent discarding phase, which can trim a lot of unallocated
space, and there is no limits on the trim size (unlike the block group
part).

Thanks,
Qu
>
>     Thanks,
>     Qu
>
>      >>       }
>      >>       mutex_unlock(&fs_devices->device_list_mutex);
>      >
>

Luca Stefani Sept. 2, 2024, 9:14 a.m. UTC | #4

Re-sending, was discarded because of HTML blocks, sorry!

On 02/09/24 10:49, Qu Wenruo wrote:
> 
> 
> 在 2024/9/2 18:02, Luca Stefani 写道:
>> Any update on this? It's not critical but I'd like to know if it's in
>> some part proper.
>> Thanks, Luca.
> 
> Sorry I didn't see your patch in the list, thus sent a different fix for
> it later:
> 
> https://lore.kernel.org/linux-btrfs/20240830185113.GW25962@twin.jikos.cz/T/#t
> 
No worries, glad it got picked up!
>>> Sometimes the system isn't able to suspend because
>>> the task responsible for trimming the device isn't
>>> able to finish in time.
>>>
>>> Since discard isn't a critical call it can be interrupted
>>> at any time, we can simply report the amount of discarded
>>> bytes in such cases and stop the trim.
>>>
>>> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219180
>>> Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
>>> ---
>>> I have no idea if that's correct, just something I implemented
>>> looking at the same solution made in ext4 by 5229a658f645.
>>>
>>> The patch in itself seems to solve the issue.
>>>
>>> repro is as follows:
>>> sudo /sbin/fstrim --listed-in /etc/fstab:/proc/self/mountinfo
>>> --verbose --quiet-unsupported &
>>> sudo ./sleepgraph.py -m mem -rtcwake 5
>>>
>>> [836563.289069] PM: suspend exit
>>> [836563.909298] PM: suspend entry (s2idle)
>>> [836563.935447] Filesystems sync: 0.026 seconds
>>> [836563.951391] Freezing user space processes
>>> [836583.958957] Freezing user space processes failed after 20.007
>>> seconds (1 tasks refusing to freeze, wq_busy=0):
>>> [836583.959582] task:fstrim          state:D stack:0     pid:241865
>>> tgid:241865 ppid:241864 flags:0x00004006
>>> [836583.959592] Call Trace:
>>> [836583.959595]  <TASK>
>>> [836583.959600]  __schedule+0x400/0x1720
>>> [836583.959612]  ? mod_delayed_work_on+0xa4/0xb0
>>> [836583.959622]  ? srso_alias_return_thunk+0x5/0xfbef5
>>> [836583.959628]  ? srso_alias_return_thunk+0x5/0xfbef5
>>> [836583.959631]  ? blk_mq_flush_plug_list.part.0+0x1e3/0x610
>>> [836583.959640]  schedule+0x27/0xf0
>>> [836583.959644]  schedule_timeout+0x12f/0x160
>>> [836583.959652]  io_schedule_timeout+0x51/0x70
>>> [836583.959657]  wait_for_completion_io+0x8a/0x160
>>> [836583.959663]  submit_bio_wait+0x60/0x90
>>> [836583.959671]  blkdev_issue_discard+0x91/0x100
>>> [836583.959680]  btrfs_issue_discard+0xc4/0x140
>>> [836583.959689]  btrfs_discard_extent+0x241/0x2a0
>>> [836583.959695]  ? srso_alias_return_thunk+0x5/0xfbef5
>>> [836583.959702]  do_trimming+0xd2/0x240
>>> [836583.959712]  trim_bitmaps+0x350/0x4c0
>>> [836583.959723]  btrfs_trim_block_group+0xb8/0x110
>>> [836583.959729]  btrfs_trim_fs+0x118/0x440
>>> [836583.959734]  ? srso_alias_return_thunk+0x5/0xfbef5
>>> [836583.959738]  ? security_capable+0x41/0x70
>>> [836583.959746]  btrfs_ioctl_fitrim+0x113/0x180
>>> [836583.959752]  btrfs_ioctl+0xdaf/0x2670
>>> [836583.959759]  ? srso_alias_return_thunk+0x5/0xfbef5
>>> [836583.959763]  ? ioctl_has_perm.constprop.0.isra.0+0xd8/0x130
>>> [836583.959774]  __x64_sys_ioctl+0x94/0xd0
>>> [836583.959782]  do_syscall_64+0x82/0x160
>>> [836583.959790]  ? srso_alias_return_thunk+0x5/0xfbef5
>>> [836583.959793]  ? syscall_exit_to_user_mode+0x72/0x220
>>> [836583.959799]  ? srso_alias_return_thunk+0x5/0xfbef5
>>> [836583.959802]  ? do_syscall_64+0x8e/0x160
>>> [836583.959807]  ? srso_alias_return_thunk+0x5/0xfbef5
>>> [836583.959811]  ? do_sys_openat2+0x9c/0xe0
>>> [836583.959821]  ? srso_alias_return_thunk+0x5/0xfbef5
>>> [836583.959825]  ? syscall_exit_to_user_mode+0x72/0x220
>>> [836583.959828]  ? srso_alias_return_thunk+0x5/0xfbef5
>>> [836583.959832]  ? do_syscall_64+0x8e/0x160
>>> [836583.959835]  ? syscall_exit_to_user_mode+0x72/0x220
>>> [836583.959838]  ? srso_alias_return_thunk+0x5/0xfbef5
>>> [836583.959842]  ? do_syscall_64+0x8e/0x160
>>> [836583.959845]  ? srso_alias_return_thunk+0x5/0xfbef5
>>> [836583.959849]  ? do_syscall_64+0x8e/0x160
>>> [836583.959851]  ? srso_alias_return_thunk+0x5/0xfbef5
>>> [836583.959855]  ? do_syscall_64+0x8e/0x160
>>> [836583.959858]  ? srso_alias_return_thunk+0x5/0xfbef5
>>> [836583.959861]  ? do_syscall_64+0x8e/0x160
>>> [836583.959864]  ? srso_alias_return_thunk+0x5/0xfbef5
>>> [836583.959868]  ? srso_alias_return_thunk+0x5/0xfbef5
>>> [836583.959873]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>> [836583.959878] RIP: 0033:0x7f3e4261af2d
>>> [836583.959944] RSP: 002b:00007ffec002f400 EFLAGS: 00000246 ORIG_RAX:
>>> 0000000000000010
>>> [836583.959950] RAX: ffffffffffffffda RBX: 00007ffec002f570 RCX:
>>> 00007f3e4261af2d
>>> [836583.959952] RDX: 00007ffec002f470 RSI: 00000000c0185879 RDI:
>>> 0000000000000003
>>> [836583.959955] RBP: 00007ffec002f450 R08: 0000562d74da7010 R09:
>>> 00007ffec002e7f2
>>> [836583.959957] R10: 0000000000000000 R11: 0000000000000246 R12:
>>> 0000562d74daafc0
>>> [836583.959960] R13: 0000000000000003 R14: 0000562d74daa970 R15:
>>> 0000562d74daad40
>>> [836583.959967]  </TASK>
>>> ---
>>>   fs/btrfs/extent-tree.c | 20 ++++++++++++++++----
>>>   1 file changed, 16 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index feec49e6f9c8..7e4c1d4f2f7c 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -16,6 +16,7 @@
>>>   #include <linux/percpu_counter.h>
>>>   #include <linux/lockdep.h>
>>>   #include <linux/crc32c.h>
>>> +#include <linux/freezer.h>
>>>   #include "ctree.h"
>>>   #include "extent-tree.h"
>>>   #include "transaction.h"
>>> @@ -6361,6 +6362,11 @@ void btrfs_error_unpin_extent_range(struct
>>> btrfs_fs_info *fs_info, u64 start, u6
>>>       unpin_extent_range(fs_info, start, end, false);
>>>   }
>>> +static bool btrfs_trim_interrupted(void)
>>> +{
>>> +    return fatal_signal_pending(current) || freezing(current);
>>> +}
>>> +
>>>   /*
>>>    * It used to be that old block groups would be left around forever.
>>>    * Iterating over them would be enough to trim unused space.  Since we
>>> @@ -6459,8 +6465,8 @@ static int btrfs_trim_free_extents(struct
>>> btrfs_device *device, u64 *trimmed)
>>>           start += len;
>>>           *trimmed += bytes;
>>> -        if (fatal_signal_pending(current)) {
>>> -            ret = -ERESTARTSYS;
>>> +        if (btrfs_trim_interrupted()) {
>>> +            ret = 0;
>>>               break;
> 
> Here we should still return the same error number other than 0, to let
> the caller know the operation is interrupted, other than finished normally.
> 
Here I was following how ext4 did it, my explanation for that was that 
the kernel may have still discarded some data before the thread was 
interrupted thus it made sense to report success.
>>>           }
>>> @@ -6508,6 +6514,9 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info,
>>> struct fstrim_range *range)
>>>       cache = btrfs_lookup_first_block_group(fs_info, range->start);
>>>       for (; cache; cache = btrfs_next_block_group(cache)) {
>>> +        if (btrfs_trim_interrupted())
>>> +            break;
>>> +
> 
> The same here.
> 
ditto
>>>           if (cache->start >= range_end) {
>>>               btrfs_put_block_group(cache);
>>>               break;
>>> @@ -6547,17 +6556,20 @@ int btrfs_trim_fs(struct btrfs_fs_info
>>> *fs_info, struct fstrim_range *range)
>>>       mutex_lock(&fs_devices->device_list_mutex);
>>>       list_for_each_entry(device, &fs_devices->devices, dev_list) {
>>> +        if (btrfs_trim_interrupted())
>>> +            break;
>>> +
> 
> The same here.
> 
> Furthermore, I think we may not need the extra checks.
> 
> The fstrim is based on block groups, and a block group is normally 1GiB,
> at most 10GiB (for RAID0/5/6/10 only), thus exiting at each block group
> boundary should be enough to meet the hibernation/suspension timeout.
> 
> 
That's probably true, but 10 seconds here wasn't enough and forcing the 
early return in the other cases was also required.
I tried the current patch you linked earlier in my testings and that was 
the conclusion that led to me adding more checks.
> 
>>>           if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
>>>               continue;
>>>           ret = btrfs_trim_free_extents(device, &group_trimmed);
>>> +
>>> +        trimmed += group_trimmed;
>>>           if (ret) {
>>>               dev_failed++;
>>>               dev_ret = ret;
>>>               break;
>>>           }
>>> -
>>> -        trimmed += group_trimmed;
> 
> Any special reason moving the code here?

Same as not returning errno before in case of interrupt.
I checked the code paths and it's still possible to trim some data 
(group_trimmed != 0) even in case of failure.

> 
> Thanks,
> Qu
> 
>>>       }
>>>       mutex_unlock(&fs_devices->device_list_mutex);
>>

Thanks, Luca.

Qu Wenruo Sept. 2, 2024, 9:17 a.m. UTC | #5

在 2024/9/2 18:42, Qu Wenruo 写道:
> 
> 
> 在 2024/9/2 18:30, Luca Stefani 写道:
>>
>>
>> On Mon, 2 Sept 2024 at 10:49, Qu Wenruo <quwenruo.btrfs@gmx.com
>> <mailto:quwenruo.btrfs@gmx.com>> wrote:
>>
>>
>>
>>     在 2024/9/2 18:02, Luca Stefani 写道:
>>      > Any update on this? It's not critical but I'd like to know if 
>> it's in
>>      > some part proper.
>>      > Thanks, Luca.
>>
>>     Sorry I didn't see your patch in the list, thus sent a different 
>> fix for
>>     it later:
>>
>>     
>> https://lore.kernel.org/linux-btrfs/20240830185113.GW25962@twin.jikos.cz/T/#t <https://lore.kernel.org/linux-btrfs/20240830185113.GW25962@twin.jikos.cz/T/#t>
>>
>>      >> Sometimes the system isn't able to suspend because
>>      >> the task responsible for trimming the device isn't
>>      >> able to finish in time.
>>      >>
>>      >> Since discard isn't a critical call it can be interrupted
>>      >> at any time, we can simply report the amount of discarded
>>      >> bytes in such cases and stop the trim.
>>      >>
>>      >> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219180
>>     <https://bugzilla.kernel.org/show_bug.cgi?id=219180>
>>      >> Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com
>>     <mailto:luca.stefani.ge1@gmail.com>>
>>      >> ---
>>      >> I have no idea if that's correct, just something I implemented
>>      >> looking at the same solution made in ext4 by 5229a658f645.
>>      >>
>>      >> The patch in itself seems to solve the issue.
>>      >>
>>      >> repro is as follows:
>>      >> sudo /sbin/fstrim --listed-in /etc/fstab:/proc/self/mountinfo
>>      >> --verbose --quiet-unsupported &
>>      >> sudo ./sleepgraph.py -m mem -rtcwake 5
>>      >>
>>      >> [836563.289069] PM: suspend exit
>>      >> [836563.909298] PM: suspend entry (s2idle)
>>      >> [836563.935447] Filesystems sync: 0.026 seconds
>>      >> [836563.951391] Freezing user space processes
>>      >> [836583.958957] Freezing user space processes failed after 20.007
>>      >> seconds (1 tasks refusing to freeze, wq_busy=0):
>>      >> [836583.959582] task:fstrim          state:D stack:0     
>> pid:241865
>>      >> tgid:241865 ppid:241864 flags:0x00004006
>>      >> [836583.959592] Call Trace:
>>      >> [836583.959595]  <TASK>
>>      >> [836583.959600]  __schedule+0x400/0x1720
>>      >> [836583.959612]  ? mod_delayed_work_on+0xa4/0xb0
>>      >> [836583.959622]  ? srso_alias_return_thunk+0x5/0xfbef5
>>      >> [836583.959628]  ? srso_alias_return_thunk+0x5/0xfbef5
>>      >> [836583.959631]  ? blk_mq_flush_plug_list.part.0+0x1e3/0x610
>>      >> [836583.959640]  schedule+0x27/0xf0
>>      >> [836583.959644]  schedule_timeout+0x12f/0x160
>>      >> [836583.959652]  io_schedule_timeout+0x51/0x70
>>      >> [836583.959657]  wait_for_completion_io+0x8a/0x160
>>      >> [836583.959663]  submit_bio_wait+0x60/0x90
>>      >> [836583.959671]  blkdev_issue_discard+0x91/0x100
>>      >> [836583.959680]  btrfs_issue_discard+0xc4/0x140
>>      >> [836583.959689]  btrfs_discard_extent+0x241/0x2a0
>>      >> [836583.959695]  ? srso_alias_return_thunk+0x5/0xfbef5
>>      >> [836583.959702]  do_trimming+0xd2/0x240
>>      >> [836583.959712]  trim_bitmaps+0x350/0x4c0
>>      >> [836583.959723]  btrfs_trim_block_group+0xb8/0x110
>>      >> [836583.959729]  btrfs_trim_fs+0x118/0x440
>>      >> [836583.959734]  ? srso_alias_return_thunk+0x5/0xfbef5
>>      >> [836583.959738]  ? security_capable+0x41/0x70
>>      >> [836583.959746]  btrfs_ioctl_fitrim+0x113/0x180
>>      >> [836583.959752]  btrfs_ioctl+0xdaf/0x2670
>>      >> [836583.959759]  ? srso_alias_return_thunk+0x5/0xfbef5
>>      >> [836583.959763]  ? ioctl_has_perm.constprop.0.isra.0+0xd8/0x130
>>      >> [836583.959774]  __x64_sys_ioctl+0x94/0xd0
>>      >> [836583.959782]  do_syscall_64+0x82/0x160
>>      >> [836583.959790]  ? srso_alias_return_thunk+0x5/0xfbef5
>>      >> [836583.959793]  ? syscall_exit_to_user_mode+0x72/0x220
>>      >> [836583.959799]  ? srso_alias_return_thunk+0x5/0xfbef5
>>      >> [836583.959802]  ? do_syscall_64+0x8e/0x160
>>      >> [836583.959807]  ? srso_alias_return_thunk+0x5/0xfbef5
>>      >> [836583.959811]  ? do_sys_openat2+0x9c/0xe0
>>      >> [836583.959821]  ? srso_alias_return_thunk+0x5/0xfbef5
>>      >> [836583.959825]  ? syscall_exit_to_user_mode+0x72/0x220
>>      >> [836583.959828]  ? srso_alias_return_thunk+0x5/0xfbef5
>>      >> [836583.959832]  ? do_syscall_64+0x8e/0x160
>>      >> [836583.959835]  ? syscall_exit_to_user_mode+0x72/0x220
>>      >> [836583.959838]  ? srso_alias_return_thunk+0x5/0xfbef5
>>      >> [836583.959842]  ? do_syscall_64+0x8e/0x160
>>      >> [836583.959845]  ? srso_alias_return_thunk+0x5/0xfbef5
>>      >> [836583.959849]  ? do_syscall_64+0x8e/0x160
>>      >> [836583.959851]  ? srso_alias_return_thunk+0x5/0xfbef5
>>      >> [836583.959855]  ? do_syscall_64+0x8e/0x160
>>      >> [836583.959858]  ? srso_alias_return_thunk+0x5/0xfbef5
>>      >> [836583.959861]  ? do_syscall_64+0x8e/0x160
>>      >> [836583.959864]  ? srso_alias_return_thunk+0x5/0xfbef5
>>      >> [836583.959868]  ? srso_alias_return_thunk+0x5/0xfbef5
>>      >> [836583.959873]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>      >> [836583.959878] RIP: 0033:0x7f3e4261af2d
>>      >> [836583.959944] RSP: 002b:00007ffec002f400 EFLAGS: 00000246
>>     ORIG_RAX:
>>      >> 0000000000000010
>>      >> [836583.959950] RAX: ffffffffffffffda RBX: 00007ffec002f570 RCX:
>>      >> 00007f3e4261af2d
>>      >> [836583.959952] RDX: 00007ffec002f470 RSI: 00000000c0185879 RDI:
>>      >> 0000000000000003
>>      >> [836583.959955] RBP: 00007ffec002f450 R08: 0000562d74da7010 R09:
>>      >> 00007ffec002e7f2
>>      >> [836583.959957] R10: 0000000000000000 R11: 0000000000000246 R12:
>>      >> 0000562d74daafc0
>>      >> [836583.959960] R13: 0000000000000003 R14: 0000562d74daa970 R15:
>>      >> 0000562d74daad40
>>      >> [836583.959967]  </TASK>
>>      >> ---
>>      >>   fs/btrfs/extent-tree.c | 20 ++++++++++++++++----
>>      >>   1 file changed, 16 insertions(+), 4 deletions(-)
>>      >>
>>      >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>      >> index feec49e6f9c8..7e4c1d4f2f7c 100644
>>      >> --- a/fs/btrfs/extent-tree.c
>>      >> +++ b/fs/btrfs/extent-tree.c
>>      >> @@ -16,6 +16,7 @@
>>      >>   #include <linux/percpu_counter.h>
>>      >>   #include <linux/lockdep.h>
>>      >>   #include <linux/crc32c.h>
>>      >> +#include <linux/freezer.h>
>>      >>   #include "ctree.h"
>>      >>   #include "extent-tree.h"
>>      >>   #include "transaction.h"
>>      >> @@ -6361,6 +6362,11 @@ void btrfs_error_unpin_extent_range(struct
>>      >> btrfs_fs_info *fs_info, u64 start, u6
>>      >>       unpin_extent_range(fs_info, start, end, false);
>>      >>   }
>>      >> +static bool btrfs_trim_interrupted(void)
>>      >> +{
>>      >> +    return fatal_signal_pending(current) || freezing(current);
>>      >> +}
>>      >> +
>>      >>   /*
>>      >>    * It used to be that old block groups would be left around
>>     forever.
>>      >>    * Iterating over them would be enough to trim unused space.
>>     Since we
>>      >> @@ -6459,8 +6465,8 @@ static int btrfs_trim_free_extents(struct
>>      >> btrfs_device *device, u64 *trimmed)
>>      >>           start += len;
>>      >>           *trimmed += bytes;
>>      >> -        if (fatal_signal_pending(current)) {
>>      >> -            ret = -ERESTARTSYS;
>>      >> +        if (btrfs_trim_interrupted()) {
>>      >> +            ret = 0;
>>      >>               break;
>>
>>     Here we should still return the same error number other than 0, to 
>> let
>>     the caller know the operation is interrupted, other than finished
>>     normally.
>>
>> Here I was following how ext4 did it, my explanation for that was that
>> the kernel
>> may have still discarded some data before the thread was interrupted
>> thus it made
>> sense to report success.
> 
> In that case, progress is reported through fstrim_range structure, not
> through the return value.
> Even if we returned some error code, the fstrim_range::len is still
> updated to indicate the progress.
> 
> So we need to keep the error code.
> 
>>
>>
>>      >>           }
>>      >> @@ -6508,6 +6514,9 @@ int btrfs_trim_fs(struct btrfs_fs_info
>>     *fs_info,
>>      >> struct fstrim_range *range)
>>      >>       cache = btrfs_lookup_first_block_group(fs_info, 
>> range->start);
>>      >>       for (; cache; cache = btrfs_next_block_group(cache)) {
>>      >> +        if (btrfs_trim_interrupted())
>>      >> +            break;
>>      >> +
>>
>>     The same here.
>>
>>      >>           if (cache->start >= range_end) {
>>      >>               btrfs_put_block_group(cache);
>>      >>               break;
>>      >> @@ -6547,17 +6556,20 @@ int btrfs_trim_fs(struct btrfs_fs_info
>>      >> *fs_info, struct fstrim_range *range)
>>      >>       mutex_lock(&fs_devices->device_list_mutex);
>>      >>       list_for_each_entry(device, &fs_devices->devices, 
>> dev_list) {
>>      >> +        if (btrfs_trim_interrupted())
>>      >> +            break;
>>      >> +
>>
>>     The same here.
>>
>>     Furthermore, I think we may not need the extra checks.
>>
>>     The fstrim is based on block groups, and a block group is normally 
>> 1GiB,
>>     at most 10GiB (for RAID0/5/6/10 only), thus exiting at each block 
>> group
>>     boundary should be enough to meet the hibernation/suspension timeout.
>>
>> That's probably true, but 10seconds here wasn't enough and forcing the
>> early return in the other cases
>> was also required.
>> I tried the current patch you linked earlier in my testing and that was
>> the conclusion that led to me adding more checks.
> 
> My bad, I forgot that we have free extents trimming, which is not
> limited to block group boundary, and that may be the root cause.
> 
> So in that case your extra checks are indeed needed.
> 
> Just need to change the return value.
>>
>>
>>
>>
>>      >>           if (test_bit(BTRFS_DEV_STATE_MISSING, 
>> &device->dev_state))
>>      >>               continue;
>>      >>           ret = btrfs_trim_free_extents(device, &group_trimmed);
>>      >> +
>>      >> +        trimmed += group_trimmed;
>>      >>           if (ret) {
>>      >>               dev_failed++;
>>      >>               dev_ret = ret;
>>      >>               break;
>>      >>           }
>>      >> -
>>      >> -        trimmed += group_trimmed;
>>
>>     Any special reason moving the code here?
>>
>> Same as not returning errno before in case of interrupt, I checked the
>> code paths and it's still
>> possible to trim some data (group_trimmed != 0) even in case of failure.
> 
> Oh, then it's fine.
> 
> Except the return code, everything looks fine to me now.

Forgot to mention that, even for error case, we should copy the 
fstrim_range structure to the ioctl parameter to indicate any progress 
we made.

Thanks,
Qu
> 
> Just please update the commit message to explicitly mention that, we
> have a free extent discarding phase, which can trim a lot of unallocated
> space, and there is no limits on the trim size (unlike the block group
> part).
> 
> Thanks,
> Qu
>>
>>     Thanks,
>>     Qu
>>
>>      >>       }
>>      >>       mutex_unlock(&fs_devices->device_list_mutex);
>>      >
>>
>

Luca Stefani Sept. 2, 2024, 9:31 a.m. UTC | #6

On 02/09/24 11:17, Qu Wenruo wrote:
> 
> 
> 在 2024/9/2 18:42, Qu Wenruo 写道:
>>
>>
>> 在 2024/9/2 18:30, Luca Stefani 写道:
>>>
>>>
>>> On Mon, 2 Sept 2024 at 10:49, Qu Wenruo <quwenruo.btrfs@gmx.com
>>> <mailto:quwenruo.btrfs@gmx.com>> wrote:
>>>
>>>
>>>
>>>     在 2024/9/2 18:02, Luca Stefani 写道:
>>>      > Any update on this? It's not critical but I'd like to know if 
>>> it's in
>>>      > some part proper.
>>>      > Thanks, Luca.
>>>
>>>     Sorry I didn't see your patch in the list, thus sent a different 
>>> fix for
>>>     it later:
>>>
>>> https://lore.kernel.org/linux-btrfs/20240830185113.GW25962@twin.jikos.cz/T/#t <https://lore.kernel.org/linux-btrfs/20240830185113.GW25962@twin.jikos.cz/T/#t>
>>>
>>>      >> Sometimes the system isn't able to suspend because
>>>      >> the task responsible for trimming the device isn't
>>>      >> able to finish in time.
>>>      >>
>>>      >> Since discard isn't a critical call it can be interrupted
>>>      >> at any time, we can simply report the amount of discarded
>>>      >> bytes in such cases and stop the trim.
>>>      >>
>>>      >> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219180
>>>     <https://bugzilla.kernel.org/show_bug.cgi?id=219180>
>>>      >> Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com
>>>     <mailto:luca.stefani.ge1@gmail.com>>
>>>      >> ---
>>>      >> I have no idea if that's correct, just something I implemented
>>>      >> looking at the same solution made in ext4 by 5229a658f645.
>>>      >>
>>>      >> The patch in itself seems to solve the issue.
>>>      >>
>>>      >> repro is as follows:
>>>      >> sudo /sbin/fstrim --listed-in /etc/fstab:/proc/self/mountinfo
>>>      >> --verbose --quiet-unsupported &
>>>      >> sudo ./sleepgraph.py -m mem -rtcwake 5
>>>      >>
>>>      >> [836563.289069] PM: suspend exit
>>>      >> [836563.909298] PM: suspend entry (s2idle)
>>>      >> [836563.935447] Filesystems sync: 0.026 seconds
>>>      >> [836563.951391] Freezing user space processes
>>>      >> [836583.958957] Freezing user space processes failed after 
>>> 20.007
>>>      >> seconds (1 tasks refusing to freeze, wq_busy=0):
>>>      >> [836583.959582] task:fstrim          state:D stack:0 pid:241865
>>>      >> tgid:241865 ppid:241864 flags:0x00004006
>>>      >> [836583.959592] Call Trace:
>>>      >> [836583.959595]  <TASK>
>>>      >> [836583.959600]  __schedule+0x400/0x1720
>>>      >> [836583.959612]  ? mod_delayed_work_on+0xa4/0xb0
>>>      >> [836583.959622]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959628]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959631]  ? blk_mq_flush_plug_list.part.0+0x1e3/0x610
>>>      >> [836583.959640]  schedule+0x27/0xf0
>>>      >> [836583.959644]  schedule_timeout+0x12f/0x160
>>>      >> [836583.959652]  io_schedule_timeout+0x51/0x70
>>>      >> [836583.959657]  wait_for_completion_io+0x8a/0x160
>>>      >> [836583.959663]  submit_bio_wait+0x60/0x90
>>>      >> [836583.959671]  blkdev_issue_discard+0x91/0x100
>>>      >> [836583.959680]  btrfs_issue_discard+0xc4/0x140
>>>      >> [836583.959689]  btrfs_discard_extent+0x241/0x2a0
>>>      >> [836583.959695]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959702]  do_trimming+0xd2/0x240
>>>      >> [836583.959712]  trim_bitmaps+0x350/0x4c0
>>>      >> [836583.959723]  btrfs_trim_block_group+0xb8/0x110
>>>      >> [836583.959729]  btrfs_trim_fs+0x118/0x440
>>>      >> [836583.959734]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959738]  ? security_capable+0x41/0x70
>>>      >> [836583.959746]  btrfs_ioctl_fitrim+0x113/0x180
>>>      >> [836583.959752]  btrfs_ioctl+0xdaf/0x2670
>>>      >> [836583.959759]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959763]  ? ioctl_has_perm.constprop.0.isra.0+0xd8/0x130
>>>      >> [836583.959774]  __x64_sys_ioctl+0x94/0xd0
>>>      >> [836583.959782]  do_syscall_64+0x82/0x160
>>>      >> [836583.959790]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959793]  ? syscall_exit_to_user_mode+0x72/0x220
>>>      >> [836583.959799]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959802]  ? do_syscall_64+0x8e/0x160
>>>      >> [836583.959807]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959811]  ? do_sys_openat2+0x9c/0xe0
>>>      >> [836583.959821]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959825]  ? syscall_exit_to_user_mode+0x72/0x220
>>>      >> [836583.959828]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959832]  ? do_syscall_64+0x8e/0x160
>>>      >> [836583.959835]  ? syscall_exit_to_user_mode+0x72/0x220
>>>      >> [836583.959838]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959842]  ? do_syscall_64+0x8e/0x160
>>>      >> [836583.959845]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959849]  ? do_syscall_64+0x8e/0x160
>>>      >> [836583.959851]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959855]  ? do_syscall_64+0x8e/0x160
>>>      >> [836583.959858]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959861]  ? do_syscall_64+0x8e/0x160
>>>      >> [836583.959864]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959868]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959873]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>      >> [836583.959878] RIP: 0033:0x7f3e4261af2d
>>>      >> [836583.959944] RSP: 002b:00007ffec002f400 EFLAGS: 00000246
>>>     ORIG_RAX:
>>>      >> 0000000000000010
>>>      >> [836583.959950] RAX: ffffffffffffffda RBX: 00007ffec002f570 RCX:
>>>      >> 00007f3e4261af2d
>>>      >> [836583.959952] RDX: 00007ffec002f470 RSI: 00000000c0185879 RDI:
>>>      >> 0000000000000003
>>>      >> [836583.959955] RBP: 00007ffec002f450 R08: 0000562d74da7010 R09:
>>>      >> 00007ffec002e7f2
>>>      >> [836583.959957] R10: 0000000000000000 R11: 0000000000000246 R12:
>>>      >> 0000562d74daafc0
>>>      >> [836583.959960] R13: 0000000000000003 R14: 0000562d74daa970 R15:
>>>      >> 0000562d74daad40
>>>      >> [836583.959967]  </TASK>
>>>      >> ---
>>>      >>   fs/btrfs/extent-tree.c | 20 ++++++++++++++++----
>>>      >>   1 file changed, 16 insertions(+), 4 deletions(-)
>>>      >>
>>>      >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>>      >> index feec49e6f9c8..7e4c1d4f2f7c 100644
>>>      >> --- a/fs/btrfs/extent-tree.c
>>>      >> +++ b/fs/btrfs/extent-tree.c
>>>      >> @@ -16,6 +16,7 @@
>>>      >>   #include <linux/percpu_counter.h>
>>>      >>   #include <linux/lockdep.h>
>>>      >>   #include <linux/crc32c.h>
>>>      >> +#include <linux/freezer.h>
>>>      >>   #include "ctree.h"
>>>      >>   #include "extent-tree.h"
>>>      >>   #include "transaction.h"
>>>      >> @@ -6361,6 +6362,11 @@ void 
>>> btrfs_error_unpin_extent_range(struct
>>>      >> btrfs_fs_info *fs_info, u64 start, u6
>>>      >>       unpin_extent_range(fs_info, start, end, false);
>>>      >>   }
>>>      >> +static bool btrfs_trim_interrupted(void)
>>>      >> +{
>>>      >> +    return fatal_signal_pending(current) || freezing(current);
>>>      >> +}
>>>      >> +
>>>      >>   /*
>>>      >>    * It used to be that old block groups would be left around
>>>     forever.
>>>      >>    * Iterating over them would be enough to trim unused space.
>>>     Since we
>>>      >> @@ -6459,8 +6465,8 @@ static int btrfs_trim_free_extents(struct
>>>      >> btrfs_device *device, u64 *trimmed)
>>>      >>           start += len;
>>>      >>           *trimmed += bytes;
>>>      >> -        if (fatal_signal_pending(current)) {
>>>      >> -            ret = -ERESTARTSYS;
>>>      >> +        if (btrfs_trim_interrupted()) {
>>>      >> +            ret = 0;
>>>      >>               break;
>>>
>>>     Here we should still return the same error number other than 0, 
>>> to let
>>>     the caller know the operation is interrupted, other than finished
>>>     normally.
>>>
>>> Here I was following how ext4 did it, my explanation for that was that
>>> the kernel
>>> may have still discarded some data before the thread was interrupted
>>> thus it made
>>> sense to report success.
>>
>> In that case, progress is reported through fstrim_range structure, not
>> through the return value.
>> Even if we returned some error code, the fstrim_range::len is still
>> updated to indicate the progress.
>>
>> So we need to keep the error code.
>>
>>>
>>>
>>>      >>           }
>>>      >> @@ -6508,6 +6514,9 @@ int btrfs_trim_fs(struct btrfs_fs_info
>>>     *fs_info,
>>>      >> struct fstrim_range *range)
>>>      >>       cache = btrfs_lookup_first_block_group(fs_info, 
>>> range->start);
>>>      >>       for (; cache; cache = btrfs_next_block_group(cache)) {
>>>      >> +        if (btrfs_trim_interrupted())
>>>      >> +            break;
>>>      >> +
>>>
>>>     The same here.
>>>
>>>      >>           if (cache->start >= range_end) {
>>>      >>               btrfs_put_block_group(cache);
>>>      >>               break;
>>>      >> @@ -6547,17 +6556,20 @@ int btrfs_trim_fs(struct btrfs_fs_info
>>>      >> *fs_info, struct fstrim_range *range)
>>>      >>       mutex_lock(&fs_devices->device_list_mutex);
>>>      >>       list_for_each_entry(device, &fs_devices->devices, 
>>> dev_list) {
>>>      >> +        if (btrfs_trim_interrupted())
>>>      >> +            break;
>>>      >> +
>>>
>>>     The same here.
>>>
>>>     Furthermore, I think we may not need the extra checks.
>>>
>>>     The fstrim is based on block groups, and a block group is 
>>> normally 1GiB,
>>>     at most 10GiB (for RAID0/5/6/10 only), thus exiting at each block 
>>> group
>>>     boundary should be enough to meet the hibernation/suspension 
>>> timeout.
>>>
>>> That's probably true, but 10seconds here wasn't enough and forcing the
>>> early return in the other cases
>>> was also required.
>>> I tried the current patch you linked earlier in my testing and that was
>>> the conclusion that led to me adding more checks.
>>
>> My bad, I forgot that we have free extents trimming, which is not
>> limited to block group boundary, and that may be the root cause.
>>
>> So in that case your extra checks are indeed needed.
>>
>> Just need to change the return value.
>>>
>>>
>>>
>>>
>>>      >>           if (test_bit(BTRFS_DEV_STATE_MISSING, 
>>> &device->dev_state))
>>>      >>               continue;
>>>      >>           ret = btrfs_trim_free_extents(device, &group_trimmed);
>>>      >> +
>>>      >> +        trimmed += group_trimmed;
>>>      >>           if (ret) {
>>>      >>               dev_failed++;
>>>      >>               dev_ret = ret;
>>>      >>               break;
>>>      >>           }
>>>      >> -
>>>      >> -        trimmed += group_trimmed;
>>>
>>>     Any special reason moving the code here?
>>>
>>> Same as not returning errno before in case of interrupt, I checked the
>>> code paths and it's still
>>> possible to trim some data (group_trimmed != 0) even in case of failure.
>>
>> Oh, then it's fine.
>>
>> Except the return code, everything looks fine to me now.
> 
> Forgot to mention that, even for error case, we should copy the 
> fstrim_range structure to the ioctl parameter to indicate any progress 
> we made.
This seems to be already the case.
range->len = trimmed; is always executed regardless of previous failures
and there doesn't seem to be any early return.

Will try adding back the errno and try the repro.

Thanks.
> 
> Thanks,
> Qu
>>
>> Just please update the commit message to explicitly mention that, we
>> have a free extent discarding phase, which can trim a lot of unallocated
>> space, and there is no limits on the trim size (unlike the block group
>> part).
>>
>> Thanks,
>> Qu
>>>
>>>     Thanks,
>>>     Qu
>>>
>>>      >>       }
>>>      >>       mutex_unlock(&fs_devices->device_list_mutex);
>>>      >
>>>
>>

Qu Wenruo Sept. 2, 2024, 9:40 a.m. UTC | #7

在 2024/9/2 19:01, Luca Stefani 写道:
[...]
>>>
>>> Oh, then it's fine.
>>>
>>> Except the return code, everything looks fine to me now.
>>
>> Forgot to mention that, even for error case, we should copy the 
>> fstrim_range structure to the ioctl parameter to indicate any progress 
>> we made.
> This seems to be already the case.
> range->len = trimmed; is always executed regardless of previous failures
> and there doesn't seem to be any early return.

What I mean is inside btrfs_ioctl_fitrim(), at the end if 
btrfs_trim_fs() returned error (including interrupted), copy_to_user() 
will not be call, that's the problem needs to be solved, as long as we 
return error for interrupted cases.

Thanks,
Qu
> 
> Will try adding back the errno and try the repro.
> 
> Thanks.
>>
>> Thanks,
>> Qu
>>>
>>> Just please update the commit message to explicitly mention that, we
>>> have a free extent discarding phase, which can trim a lot of unallocated
>>> space, and there is no limits on the trim size (unlike the block group
>>> part).
>>>
>>> Thanks,
>>> Qu
>>>>
>>>>     Thanks,
>>>>     Qu
>>>>
>>>>      >>       }
>>>>      >>       mutex_unlock(&fs_devices->device_list_mutex);
>>>>      >
>>>>
>>>
>

Qu Wenruo Sept. 2, 2024, 11:01 a.m. UTC | #8

在 2024/9/2 18:47, Qu Wenruo 写道:
[...]
> Forgot to mention that, even for error case, we should copy the
> fstrim_range structure to the ioctl parameter to indicate any progress
> we made.

Sorry to bother you again, I should have notice it earlier.

There is another possible cause of the huge delay for freezing, that's
the blkdev_issue_discard() calls inside btrfs_issue_discard() itself.

The problem here is, we can have a very large disk, e.g. 8TiB device,
mostly empty.

In that case, although we will do the split according to our super block
locations, the last super block ends at 256G, we can submit a huge
discard for the range [256G, 8T), causing a super large delay.

So the proper way here is to limit the size of each discard (e.g. limit
it to 1GiB, just as the chunk stripe size limit), and do the check after
each 1GiB discard.

So this may be a larger problem than we thought.

I would recommend to split the fix into the following parts:

- Simple small fixes
   Like always update the fstrim_range structure, no matter the return
   value.

- Proper discard size split and new freezing checks

Thanks,
Qu
>
> Thanks,
> Qu
>>
>> Just please update the commit message to explicitly mention that, we
>> have a free extent discarding phase, which can trim a lot of unallocated
>> space, and there is no limits on the trim size (unlike the block group
>> part).
>>
>> Thanks,
>> Qu
>>>
>>>     Thanks,
>>>     Qu
>>>
>>>      >>       }
>>>      >>       mutex_unlock(&fs_devices->device_list_mutex);
>>>      >
>>>
>>

Luca Stefani Sept. 2, 2024, 11:10 a.m. UTC | #9

On 02/09/24 13:01, Qu Wenruo wrote:
> 
> 
> 在 2024/9/2 18:47, Qu Wenruo 写道:
> [...]
>> Forgot to mention that, even for error case, we should copy the
>> fstrim_range structure to the ioctl parameter to indicate any progress
>> we made.
> 
> Sorry to bother you again, I should have notice it earlier.
> 
> There is another possible cause of the huge delay for freezing, that's
> the blkdev_issue_discard() calls inside btrfs_issue_discard() itself.
> 
> The problem here is, we can have a very large disk, e.g. 8TiB device,
> mostly empty.
> 
> In that case, although we will do the split according to our super block
> locations, the last super block ends at 256G, we can submit a huge
> discard for the range [256G, 8T), causing a super large delay.
> 
> So the proper way here is to limit the size of each discard (e.g. limit
> it to 1GiB, just as the chunk stripe size limit), and do the check after
> each 1GiB discard.
> 
> So this may be a larger problem than we thought.
> 
> I would recommend to split the fix into the following parts:
> 
> - Simple small fixes
>    Like always update the fstrim_range structure, no matter the return
>    value.
Sure, that's already done. Will upload separately.
> 
> - Proper discard size split and new freezing checks
I'll try to do the first part, and fallback to the mailing list for help 
in case of failure, thanks.
> 
> Thanks,
> Qu
>>
>> Thanks,
>> Qu
>>>
>>> Just please update the commit message to explicitly mention that, we
>>> have a free extent discarding phase, which can trim a lot of unallocated
>>> space, and there is no limits on the trim size (unlike the block group
>>> part).
>>>
>>> Thanks,
>>> Qu
>>>>
>>>>     Thanks,
>>>>     Qu
>>>>
>>>>      >>       }
>>>>      >>       mutex_unlock(&fs_devices->device_list_mutex);
>>>>      >
>>>>
>>>

btrfs: Don't block system suspend during fstrim

Commit Message

Comments

Patch