Message ID | 20240220204127.1937085-1-kbusch@meta.com |
---|---|
State | New, archived |
Series | blk-lib: let user kill a zeroout process |
On 2/20/24 12:41, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> If a user runs something like `blkdiscard -z /dev/sda`, and the device
> does not have an efficient write zero offload, the kernel will dispatch
> long chains of bio's using the ZERO_PAGE for the entire capacity of the
> device. If the capacity is very large, this process could take a long
> time in an uninterruptible state, which the user may want to abort.
> Check between batches for the user's request to kill the process so they
> don't need to wait potentially many hours.
>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>  block/blk-lib.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index e59c3069e8351..d5c334aa98e0d 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -190,6 +190,8 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
>  			break;
>  		}
>  		cond_resched();
> +		if (fatal_signal_pending(current))
> +			break;
>  	}
>
>  	*biop = bio;

We are exiting before completion of the whole I/O request; does it make
sense to return 0 == success if the I/O is interrupted by the signal?
That means the I/O is not completed, hence it is actually an error. Can
we return -EINTR when you are handling the outstanding signal?

Something like this, untested :-

linux-block (for-next) # git diff
diff --git a/block/blk-lib.c b/block/blk-lib.c
index e59c3069e835..bfadd9eecc6e 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -172,6 +172,7 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
 	struct bio *bio = *biop;
 	int bi_size = 0;
 	unsigned int sz;
+	int ret = 0;
 
 	if (bdev_read_only(bdev))
 		return -EPERM;
@@ -190,10 +191,14 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
 			break;
 		}
 		cond_resched();
+		if (fatal_signal_pending(current)) {
+			ret = -EINTR;
+			break;
+		}
 	}
 
 	*biop = bio;
-	return 0;
+	return ret;
 }
 
 /**

-ck

On Wed, Feb 21, 2024 at 03:05:30AM +0000, Chaitanya Kulkarni wrote:
> On 2/20/24 12:41, Keith Busch wrote:
> > From: Keith Busch <kbusch@kernel.org>
> > @@ -190,6 +190,8 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
> >  			break;
> >  		}
> >  		cond_resched();
> > +		if (fatal_signal_pending(current))
> > +			break;
> >  	}
> >
> >  	*biop = bio;
>
> We are exiting before completion of the whole I/O request; does it make
> sense to return 0 == success if the I/O is interrupted by the signal?
> That means the I/O is not completed, hence it is actually an error. Can
> we return -EINTR when you are handling the outstanding signal?

I initially thought so too, but it doesn't matter. Once the process
returns to user space, the signal kills it and the exit status reflects
accordingly. That's true even before this patch, but it would just take
longer for the exit.

On Tue, Feb 20, 2024 at 08:16:29PM -0700, Keith Busch wrote:
> On Wed, Feb 21, 2024 at 03:05:30AM +0000, Chaitanya Kulkarni wrote:
> > On 2/20/24 12:41, Keith Busch wrote:
> > > From: Keith Busch <kbusch@kernel.org>
> > > @@ -190,6 +190,8 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
> > >  			break;
> > >  		}
> > >  		cond_resched();
> > > +		if (fatal_signal_pending(current))
> > > +			break;
> > >  	}
> > >
> > >  	*biop = bio;
> >
> > We are exiting before completion of the whole I/O request; does it make
> > sense to return 0 == success if the I/O is interrupted by the signal?
> > That means the I/O is not completed, hence it is actually an error. Can
> > we return -EINTR when you are handling the outstanding signal?
>
> I initially thought so too, but it doesn't matter. Once the process
> returns to user space, the signal kills it and the exit status reflects
> accordingly. That's true even before this patch, but it would just take
> longer for the exit.

Also consider that we have bio's in flight here, and an error return
will skip waiting for them. The kill signal handling here doesn't abort
inflight requests (that's too hard); it just prevents creating and
waiting for the rest of them, which could be millions.

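For reference, a simplified sketch of the caller-side pattern referred to above. The wrapper name is made up for illustration and the body is paraphrased rather than the verbatim blk-lib.c source: __blkdev_issue_zero_pages() builds and submits a chain of bios, handing the chain's tail back through *biop, and the caller then waits on the whole chain; an early error return from the chain-building loop would bypass that wait while earlier bios are still in flight.

static int zeroout_fallback_sketch(struct block_device *bdev, sector_t sector,
		sector_t nr_sects, gfp_t gfp_mask)
{
	struct bio *bio = NULL;
	int ret;

	/* Builds and submits the bio chain; bio ends up as the chain tail. */
	ret = __blkdev_issue_zero_pages(bdev, sector, nr_sects, gfp_mask, &bio);

	/*
	 * Waiting on the tail covers the entire chain via bio chaining.
	 * Returning an error above (e.g. -EINTR) would skip this wait even
	 * though earlier bios in the chain are already in flight.
	 */
	if (ret == 0 && bio) {
		ret = submit_bio_wait(bio);
		bio_put(bio);
	}
	return ret;
}
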
On 2/20/24 19:21, Keith Busch wrote:
> On Tue, Feb 20, 2024 at 08:16:29PM -0700, Keith Busch wrote:
>> On Wed, Feb 21, 2024 at 03:05:30AM +0000, Chaitanya Kulkarni wrote:
>>> On 2/20/24 12:41, Keith Busch wrote:
>>>> From: Keith Busch <kbusch@kernel.org>
>>>> @@ -190,6 +190,8 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
>>>>  			break;
>>>>  		}
>>>>  		cond_resched();
>>>> +		if (fatal_signal_pending(current))
>>>> +			break;
>>>>  	}
>>>>
>>>>  	*biop = bio;
>>> We are exiting before completion of the whole I/O request; does it make
>>> sense to return 0 == success if the I/O is interrupted by the signal?
>>> That means the I/O is not completed, hence it is actually an error. Can
>>> we return -EINTR when you are handling the outstanding signal?
>> I initially thought so too, but it doesn't matter. Once the process
>> returns to user space, the signal kills it and the exit status reflects
>> accordingly. That's true even before this patch, but it would just take
>> longer for the exit.
> Also consider that we have bio's in flight here, and an error return
> will skip waiting for them. The kill signal handling here doesn't abort
> inflight requests (that's too hard); it just prevents creating and
> waiting for the rest of them, which could be millions.

A comment would be nice but not necessary; irrespective of that, looks good :-

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>

-ck

On 2/21/24 02:11, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> If a user runs something like `blkdiscard -z /dev/sda`, and the device
> does not have an efficient write zero offload, the kernel will dispatch
> long chains of bio's using the ZERO_PAGE for the entire capacity of the
> device. If the capacity is very large, this process could take a long
> time in an uninterruptible state, which the user may want to abort.
> Check between batches for the user's request to kill the process so they
> don't need to wait potentially many hours.
>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>  block/blk-lib.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index e59c3069e8351..d5c334aa98e0d 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -190,6 +190,8 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
>  			break;
>  		}
>  		cond_resched();
> +		if (fatal_signal_pending(current))
> +			break;
>  	}
>
>  	*biop = bio;

I have an NVMe device which supports the write zeroes offload. The total
size of this disk is ~1.5 TB. When I tried to zero out the whole NVMe
drive it took more than 10 minutes. Please see below:

# cat /sys/block/nvme0n1/queue/write_zeroes_max_bytes
8388608

# nvme id-ns /dev/nvme0n1 -H
NVME Identify Namespace 1:
nsze    : 0x1749a956
ncap    : 0x1749a956
nuse    : 0x1749a956
<snip>
nvmcap  : 1600321314816
<snip>
LBA Format  0 : Metadata Size: 0 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best (in use)
<snip>

# time blkdiscard -z /dev/nvme0n1

real    10m27.514s
user    0m0.000s
sys     0m0.369s

So shouldn't we add the same code (allowing the user to kill the process)
under __blkdev_issue_write_zeroes()? I think even though a drive supports
the write zeroes offload, if the drive has a very large capacity then it
would still take a lot of time to zero out the complete drive. Yes, the
time required may not be in hours in this case, but it could be in tens
of minutes depending on the drive capacity.

Thanks,
--Nilay

On Wed, Feb 21, 2024 at 11:16 AM Keith Busch <kbusch@kernel.org> wrote:
>
> On Wed, Feb 21, 2024 at 03:05:30AM +0000, Chaitanya Kulkarni wrote:
> > On 2/20/24 12:41, Keith Busch wrote:
> > > From: Keith Busch <kbusch@kernel.org>
> > > @@ -190,6 +190,8 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
> > >  			break;
> > >  		}
> > >  		cond_resched();
> > > +		if (fatal_signal_pending(current))
> > > +			break;
> > >  	}
> > >
> > >  	*biop = bio;
> >
> > We are exiting before completion of the whole I/O request; does it make
> > sense to return 0 == success if the I/O is interrupted by the signal?
> > That means the I/O is not completed, hence it is actually an error. Can
> > we return -EINTR when you are handling the outstanding signal?
>
> I initially thought so too, but it doesn't matter. Once the process
> returns to user space, the signal kills it and the exit status reflects
> accordingly. That's true even before this patch, but it would just take
> longer for the exit.

The zeroout API could be called from FS code in user context, and this
way may confuse the FS code, given that it returns successfully but the
zeroing actually did not complete.

Thanks,

On 2/21/24 00:54, Ming Lei wrote:
> On Wed, Feb 21, 2024 at 11:16 AM Keith Busch <kbusch@kernel.org> wrote:
>> On Wed, Feb 21, 2024 at 03:05:30AM +0000, Chaitanya Kulkarni wrote:
>>> On 2/20/24 12:41, Keith Busch wrote:
>>>> From: Keith Busch <kbusch@kernel.org>
>>>> @@ -190,6 +190,8 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
>>>>  			break;
>>>>  		}
>>>>  		cond_resched();
>>>> +		if (fatal_signal_pending(current))
>>>> +			break;
>>>>  	}
>>>>
>>>>  	*biop = bio;
>>> We are exiting before completion of the whole I/O request; does it make
>>> sense to return 0 == success if the I/O is interrupted by the signal?
>>> That means the I/O is not completed, hence it is actually an error. Can
>>> we return -EINTR when you are handling the outstanding signal?
>> I initially thought so too, but it doesn't matter. Once the process
>> returns to user space, the signal kills it and the exit status reflects
>> accordingly. That's true even before this patch, but it would just take
>> longer for the exit.
> The zeroout API could be called from FS code in user context, and this
> way may confuse the FS code, given that it returns successfully but the
> zeroing actually did not complete.
>
> Thanks,
>

That is why I asked for returning -EINTR initially, but it seems like it
will not have any effect, given that the process is about to exit?

Note that it may also get called from other places, e.g. the NVMeOF
target or any future callers :-

nvmet_bdev_execute_write_zeroes()
 __blkdev_issue_zeroout()
  __blkdev_issue_zero_pages()

-ck

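To illustrate the concern about in-kernel callers, a hypothetical caller sketch; the function and the fs_mark_extent_initialized() helper are made up for illustration and not taken from any real filesystem. The point is that such a caller cannot tell "fully zeroed" from "cut short by a fatal signal" if both paths return 0.

/* Hypothetical filesystem-style caller, for illustration only. */
static int fs_zero_new_extent(struct block_device *bdev, sector_t start,
		sector_t nr_sects)
{
	int err;

	err = blkdev_issue_zeroout(bdev, start, nr_sects, GFP_KERNEL, 0);
	if (err)
		return err;	/* range may still contain stale data */

	/*
	 * Reached on a 0 return. If the zeroout was silently interrupted by
	 * a fatal signal, this would record a partially-zeroed range as
	 * fully initialized.
	 */
	fs_mark_extent_initialized(start, nr_sects);	/* made-up helper */
	return 0;
}
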
On 2/21/24 00:46, Nilay Shroff wrote:
>
> On 2/21/24 02:11, Keith Busch wrote:
>> From: Keith Busch <kbusch@kernel.org>
>>
>> If a user runs something like `blkdiscard -z /dev/sda`, and the device
>> does not have an efficient write zero offload, the kernel will dispatch
>> long chains of bio's using the ZERO_PAGE for the entire capacity of the
>> device. If the capacity is very large, this process could take a long
>> time in an uninterruptible state, which the user may want to abort.
>> Check between batches for the user's request to kill the process so they
>> don't need to wait potentially many hours.
>>
>> Signed-off-by: Keith Busch <kbusch@kernel.org>
>> ---
>>  block/blk-lib.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/block/blk-lib.c b/block/blk-lib.c
>> index e59c3069e8351..d5c334aa98e0d 100644
>> --- a/block/blk-lib.c
>> +++ b/block/blk-lib.c
>> @@ -190,6 +190,8 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
>>  			break;
>>  		}
>>  		cond_resched();
>> +		if (fatal_signal_pending(current))
>> +			break;
>>  	}
>>
>>  	*biop = bio;
> I have an NVMe device which supports the write zeroes offload. The total
> size of this disk is ~1.5 TB. When I tried to zero out the whole NVMe
> drive it took more than 10 minutes. Please see below:
>
> # cat /sys/block/nvme0n1/queue/write_zeroes_max_bytes
> 8388608
>
> # nvme id-ns /dev/nvme0n1 -H
> NVME Identify Namespace 1:
> nsze    : 0x1749a956
> ncap    : 0x1749a956
> nuse    : 0x1749a956
> <snip>
> nvmcap  : 1600321314816
> <snip>
> LBA Format  0 : Metadata Size: 0 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best (in use)
> <snip>
>
> # time blkdiscard -z /dev/nvme0n1
>
> real    10m27.514s
> user    0m0.000s
> sys     0m0.369s
>
> So shouldn't we add the same code (allowing the user to kill the process)
> under __blkdev_issue_write_zeroes()? I think even though a drive supports
> the write zeroes offload, if the drive has a very large capacity then it
> would still take a lot of time to zero out the complete drive. Yes, the
> time required may not be in hours in this case, but it could be in tens
> of minutes depending on the drive capacity.
>
> Thanks,
> --Nilay
>

This is a long-standing problem with discard and write-zeroes; if we are
going this route then we might as well make this change for the rest of
the callers in blk-lib.c ...

-ck

On Wed, Feb 21, 2024 at 02:16:23PM +0530, Nilay Shroff wrote:
> On 2/21/24 02:11, Keith Busch wrote:
> > From: Keith Busch <kbusch@kernel.org>
>
> # time blkdiscard -z /dev/nvme0n1
>
> real    10m27.514s
> user    0m0.000s
> sys     0m0.369s
>
> So shouldn't we add the same code (allowing the user to kill the process)
> under __blkdev_issue_write_zeroes()? I think even though a drive supports
> the write zeroes offload, if the drive has a very large capacity then it
> would still take a lot of time to zero out the complete drive. Yes, the
> time required may not be in hours in this case, but it could be in tens
> of minutes depending on the drive capacity.

Yeah, that's long enough to change your mind and not want to wait around
for it to proceed anyway. Between that and the filesystem usage, looks
like I have more things to consider for v2.

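For reference, an untested sketch of what the analogous check in __blkdev_issue_write_zeroes() might look like. The surrounding loop body and variable names (max_sectors, flags) are paraphrased rather than quoted from for-next; only the final two lines are the proposed addition, mirroring the patch above.

	while (nr_sects) {
		unsigned int len = min_t(sector_t, nr_sects, max_sectors);

		/* One REQ_OP_WRITE_ZEROES bio per device-limited chunk. */
		bio = blk_next_bio(bio, bdev, 0, REQ_OP_WRITE_ZEROES, gfp_mask);
		bio->bi_iter.bi_sector = sector;
		if (flags & BLKDEV_ZERO_NOUNMAP)
			bio->bi_opf |= REQ_NOUNMAP;
		bio->bi_iter.bi_size = len << SECTOR_SHIFT;

		nr_sects -= len;
		sector += len;
		cond_resched();

		/* proposed: same early exit as __blkdev_issue_zero_pages() */
		if (fatal_signal_pending(current))
			break;
	}
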
diff --git a/block/blk-lib.c b/block/blk-lib.c
index e59c3069e8351..d5c334aa98e0d 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -190,6 +190,8 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
 			break;
 		}
 		cond_resched();
+		if (fatal_signal_pending(current))
+			break;
 	}
 
 	*biop = bio;