blk-lib: let user kill a zeroout process

Message ID 20240220204127.1937085-1-kbusch@meta.com (mailing list archive)
State New, archived

Commit Message

Keith Busch Feb. 20, 2024, 8:41 p.m. UTC
From: Keith Busch <kbusch@kernel.org>

If a user runs something like `blkdiscard -z /dev/sda`, and the device
does not have an efficient write zero offload, the kernel will dispatch
long chains of bios using the ZERO_PAGE for the entire capacity of the
device. If the capacity is very large, this process could take a long
time in an uninterruptible state, which the user may want to abort.
Check between batches for the user's request to kill the process so they
don't need to wait potentially many hours.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/blk-lib.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Chaitanya Kulkarni Feb. 21, 2024, 3:05 a.m. UTC | #1
On 2/20/24 12:41, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> If a user runs something like `blkdiscard -z /dev/sda`, and the device
> does not have an efficient write zero offload, the kernel will dispatch
> long chains of bio's using the ZERO_PAGE for the entire capacity of the
> device. If the capacity is very large, this process could take a long
> time in an uninterruptable state, which the user may want to abort.
> Check between batches for the user's request to kill the process so they
> don't need to wait potentially many hours.
>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>   block/blk-lib.c | 2 ++
>   1 file changed, 2 insertions(+)
>
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index e59c3069e8351..d5c334aa98e0d 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -190,6 +190,8 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
>   				break;
>   		}
>   		cond_resched();
> +		if (fatal_signal_pending(current))
> +			break;
>   	}
>   
>   	*biop = bio;

We are exiting before the whole I/O request completes. Does it make
sense to return 0 (success) if the I/O is interrupted by a signal?
The I/O did not complete, so it is actually an error. Can we return
-EINTR when handling the outstanding signal?

Something like this (untested):

linux-block (for-next) # git diff
diff --git a/block/blk-lib.c b/block/blk-lib.c
index e59c3069e835..bfadd9eecc6e 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -172,6 +172,7 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
         struct bio *bio = *biop;
         int bi_size = 0;
         unsigned int sz;
+       int ret = 0;

         if (bdev_read_only(bdev))
                 return -EPERM;
@@ -190,10 +191,14 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
                                 break;
                 }
                 cond_resched();
+               if (fatal_signal_pending(current)) {
+                       ret = -EINTR;
+                       break;
+               }
         }

         *biop = bio;
-       return 0;
+       return ret;
  }

  /**


-ck
Keith Busch Feb. 21, 2024, 3:16 a.m. UTC | #2
On Wed, Feb 21, 2024 at 03:05:30AM +0000, Chaitanya Kulkarni wrote:
> On 2/20/24 12:41, Keith Busch wrote:
> > From: Keith Busch <kbusch@kernel.org>
> > @@ -190,6 +190,8 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
> >   				break;
> >   		}
> >   		cond_resched();
> > +		if (fatal_signal_pending(current))
> > +			break;
> >   	}
> >   
> >   	*biop = bio;
> 
> We are exiting before completion of the whole I/O request, does it makes
> sense to return 0 == success if I/O is interrupted by the signal ?
> that means I/O not completed, hence it is actually an error, can we return
> the -EINTR when you are handling outstanding signal ?

I initially thought so too, but it doesn't matter. Once the process
returns to user space, the signal kills it and the exit status reflects
accordingly. That's true even before this patch, but it would just take
longer for the exit.
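
The point about the exit status can be checked from user space. The following is a minimal sketch, assuming a POSIX system; the function name killed_child_term_signal is hypothetical and the blocked child only stands in for a process stuck in a long zeroout. It is not taken from the patch.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that blocks forever, kill it with SIGKILL, and report how
 * it terminated. Returns the terminating signal number, or -1 on any
 * other outcome. (Hypothetical helper, for illustration only.)
 */
int killed_child_term_signal(void)
{
	pid_t pid = fork();
	int status;

	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Child: stand-in for a process stuck in a long syscall. */
		for (;;)
			pause();
	}
	if (kill(pid, SIGKILL) < 0)
		return -1;
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	/* The wait status encodes the signal, not a syscall return value. */
	return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

The parent only ever sees the terminating signal in the wait status, so whatever value the interrupted syscall would have returned is never observed by that process.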
Keith Busch Feb. 21, 2024, 3:21 a.m. UTC | #3
On Tue, Feb 20, 2024 at 08:16:29PM -0700, Keith Busch wrote:
> On Wed, Feb 21, 2024 at 03:05:30AM +0000, Chaitanya Kulkarni wrote:
> > On 2/20/24 12:41, Keith Busch wrote:
> > > From: Keith Busch <kbusch@kernel.org>
> > > @@ -190,6 +190,8 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
> > >   				break;
> > >   		}
> > >   		cond_resched();
> > > +		if (fatal_signal_pending(current))
> > > +			break;
> > >   	}
> > >   
> > >   	*biop = bio;
> > 
> > We are exiting before completion of the whole I/O request, does it makes
> > sense to return 0 == success if I/O is interrupted by the signal ?
> > that means I/O not completed, hence it is actually an error, can we return
> > the -EINTR when you are handling outstanding signal ?
> 
> I initially thought so too, but it doesn't matter. Once the process
> returns to user space, the signal kills it and the exit status reflects
> accordingly. That's true even before this patch, but it would just take
> longer for the exit.

Also consider that we have bio's in flight here, and an error return
will skip waiting for them. The kill signal handling here doesn't abort
inflight requests (that's too hard); it just prevents creating and
waiting for the rest of them, which could be millions.
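
The trade-off described here can be sketched with a small userspace model. Everything in it is an assumption made for illustration: the struct, the function names, and the counter that stands in for fatal_signal_pending(current) are hypothetical, and the caller is modeled only as "wait for in-flight I/O when the helper returns 0", as the message above describes.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct zero_model {
	long issued;      /* batches submitted so far */
	long waited;      /* batches the caller waited on */
	long kill_after;  /* "fatal signal" becomes pending after this many */
};

/* Stand-in for fatal_signal_pending(current). */
bool model_signal_pending(const struct zero_model *m)
{
	return m->issued >= m->kill_after;
}

/* Models the batch loop; err_on_signal selects the -EINTR variant. */
int model_issue_zero_pages(struct zero_model *m, long nr_batches,
			   bool err_on_signal)
{
	while (m->issued < nr_batches) {
		m->issued++;		/* submit one batch */
		if (model_signal_pending(m))
			return err_on_signal ? -EINTR : 0;
	}
	return 0;
}

/* Models the caller: it waits for in-flight I/O only on a 0 return. */
int model_issue_zeroout(struct zero_model *m, long nr_batches,
			bool err_on_signal)
{
	int ret = model_issue_zero_pages(m, nr_batches, err_on_signal);

	if (ret == 0)
		m->waited = m->issued;
	return ret;
}
```

In this model, breaking out with 0 stops issuing new batches but still waits for the ones already submitted, while returning -EINTR leaves the submitted batches unwaited.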
Chaitanya Kulkarni Feb. 21, 2024, 5:32 a.m. UTC | #4
On 2/20/24 19:21, Keith Busch wrote:
> On Tue, Feb 20, 2024 at 08:16:29PM -0700, Keith Busch wrote:
>> On Wed, Feb 21, 2024 at 03:05:30AM +0000, Chaitanya Kulkarni wrote:
>>> On 2/20/24 12:41, Keith Busch wrote:
>>>> From: Keith Busch <kbusch@kernel.org>
>>>> @@ -190,6 +190,8 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
>>>>    				break;
>>>>    		}
>>>>    		cond_resched();
>>>> +		if (fatal_signal_pending(current))
>>>> +			break;
>>>>    	}
>>>>    
>>>>    	*biop = bio;
>>> We are exiting before completion of the whole I/O request, does it makes
>>> sense to return 0 == success if I/O is interrupted by the signal ?
>>> that means I/O not completed, hence it is actually an error, can we return
>>> the -EINTR when you are handling outstanding signal ?
>> I initially thought so too, but it doesn't matter. Once the process
>> returns to user space, the signal kills it and the exit status reflects
>> accordingly. That's true even before this patch, but it would just take
>> longer for the exit.
> Also consider that we have bio's in flight here, and an error return
> will skip waiting for them. The kill signal handling here doesn't abort
> inflight requests (that's too hard); it just prevents creating and
> waiting for the rest of them, which could be millions.

A comment would be nice but is not necessary. Regardless of that,
this looks good:

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>

-ck
Nilay Shroff Feb. 21, 2024, 8:46 a.m. UTC | #5
On 2/21/24 02:11, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> If a user runs something like `blkdiscard -z /dev/sda`, and the device
> does not have an efficient write zero offload, the kernel will dispatch
> long chains of bio's using the ZERO_PAGE for the entire capacity of the
> device. If the capacity is very large, this process could take a long
> time in an uninterruptable state, which the user may want to abort.
> Check between batches for the user's request to kill the process so they
> don't need to wait potentially many hours.
> 
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>  block/blk-lib.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index e59c3069e8351..d5c334aa98e0d 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -190,6 +190,8 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
>  				break;
>  		}
>  		cond_resched();
> +		if (fatal_signal_pending(current))
> +			break;
>  	}
>  
>  	*biop = bio;

I have an NVMe device which supports write zero offload. The total size of this disk is ~1.5 TB.
When I tried to zero out the whole NVMe drive, it took more than 10 minutes. Please see below:

# cat /sys/block/nvme0n1/queue/write_zeroes_max_bytes 
8388608

# nvme id-ns /dev/nvme0n1 -H 
NVME Identify Namespace 1:
nsze    : 0x1749a956
ncap    : 0x1749a956
nuse    : 0x1749a956
<snip>
nvmcap  : 1600321314816
<snip>
LBA Format  0 : Metadata Size: 0   bytes - Data Size: 4096 bytes - Relative Performance: 0 Best (in use)
<snip>

# time blkdiscard -z /dev/nvme0n1 

real	10m27.514s
user	0m0.000s
sys	0m0.369s

So shouldn't we add the same check (allowing the user to kill the process) in
__blkdev_issue_write_zeroes()? Even though a drive supports "write zero offload", if the
drive has a very large capacity, it would still take a long time to zero out the complete drive.
The time required may not be hours in this case, but it could be tens of minutes depending
on the drive capacity.

Thanks,
--Nilay
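
The numbers above can be cross-checked with a little arithmetic. This is a minimal sketch, assuming each offloaded command covers at most write_zeroes_max_bytes; the helper names are hypothetical, and the real number of commands may be higher if requests are split further.

```c
#include <stdint.h>

/* Namespace size in bytes: logical block count times LBA data size. */
uint64_t ns_bytes(uint64_t nsze, uint64_t lba_size)
{
	return nsze * lba_size;
}

/* Minimum number of commands if each covers at most max_bytes (round up). */
uint64_t min_write_zeroes_cmds(uint64_t capacity, uint64_t max_bytes)
{
	return (capacity + max_bytes - 1) / max_bytes;
}
```

With the values reported above, ns_bytes(0x1749a956, 4096) is 1600321314816, which matches nvmcap, and min_write_zeroes_cmds(1600321314816, 8388608) is 190774. So even with the offload, zeroing the whole namespace takes on the order of 190,000 commands.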
Ming Lei Feb. 21, 2024, 8:54 a.m. UTC | #6
On Wed, Feb 21, 2024 at 11:16 AM Keith Busch <kbusch@kernel.org> wrote:
>
> On Wed, Feb 21, 2024 at 03:05:30AM +0000, Chaitanya Kulkarni wrote:
> > On 2/20/24 12:41, Keith Busch wrote:
> > > From: Keith Busch <kbusch@kernel.org>
> > > @@ -190,6 +190,8 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
> > >                             break;
> > >             }
> > >             cond_resched();
> > > +           if (fatal_signal_pending(current))
> > > +                   break;
> > >     }
> > >
> > >     *biop = bio;
> >
> > We are exiting before completion of the whole I/O request, does it makes
> > sense to return 0 == success if I/O is interrupted by the signal ?
> > that means I/O not completed, hence it is actually an error, can we return
> > the -EINTR when you are handling outstanding signal ?
>
> I initially thought so too, but it doesn't matter. Once the process
> returns to user space, the signal kills it and the exit status reflects
> accordingly. That's true even before this patch, but it would just take
> longer for the exit.

The zeroout API can be called from FS code in user context, and this approach
may confuse the FS code: the call returns success even though the zeroout did
not actually complete.

Thanks,
Chaitanya Kulkarni Feb. 21, 2024, 9:45 a.m. UTC | #7
On 2/21/24 00:54, Ming Lei wrote:
> On Wed, Feb 21, 2024 at 11:16 AM Keith Busch <kbusch@kernel.org> wrote:
>> On Wed, Feb 21, 2024 at 03:05:30AM +0000, Chaitanya Kulkarni wrote:
>>> On 2/20/24 12:41, Keith Busch wrote:
>>>> From: Keith Busch <kbusch@kernel.org>
>>>> @@ -190,6 +190,8 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
>>>>                              break;
>>>>              }
>>>>              cond_resched();
>>>> +           if (fatal_signal_pending(current))
>>>> +                   break;
>>>>      }
>>>>
>>>>      *biop = bio;
>>> We are exiting before completion of the whole I/O request, does it makes
>>> sense to return 0 == success if I/O is interrupted by the signal ?
>>> that means I/O not completed, hence it is actually an error, can we return
>>> the -EINTR when you are handling outstanding signal ?
>> I initially thought so too, but it doesn't matter. Once the process
>> returns to user space, the signal kills it and the exit status reflects
>> accordingly. That's true even before this patch, but it would just take
>> longer for the exit.
> The zeroout API could be called from FS code in user context, and this way
> may confuse FS code, given it returns successfully, but actually it does not.
>
> Thanks,
>

That is why I asked for returning -EINTR initially, but it seems it will
not have any effect, given that the process is about to exit?

Note that it may also get called from other places, e.g. the NVMeOF target
or any future callers:

nvmet_bdev_execute_write_zeroes()
     __blkdev_issue_zeroout()
         __blkdev_issue_zero_pages()

-ck
Chaitanya Kulkarni Feb. 21, 2024, 9:46 a.m. UTC | #8
On 2/21/24 00:46, Nilay Shroff wrote:
>
> On 2/21/24 02:11, Keith Busch wrote:
>> From: Keith Busch <kbusch@kernel.org>
>>
>> If a user runs something like `blkdiscard -z /dev/sda`, and the device
>> does not have an efficient write zero offload, the kernel will dispatch
>> long chains of bio's using the ZERO_PAGE for the entire capacity of the
>> device. If the capacity is very large, this process could take a long
>> time in an uninterruptable state, which the user may want to abort.
>> Check between batches for the user's request to kill the process so they
>> don't need to wait potentially many hours.
>>
>> Signed-off-by: Keith Busch <kbusch@kernel.org>
>> ---
>>   block/blk-lib.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/block/blk-lib.c b/block/blk-lib.c
>> index e59c3069e8351..d5c334aa98e0d 100644
>> --- a/block/blk-lib.c
>> +++ b/block/blk-lib.c
>> @@ -190,6 +190,8 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev,
>>   				break;
>>   		}
>>   		cond_resched();
>> +		if (fatal_signal_pending(current))
>> +			break;
>>   	}
>>   
>>   	*biop = bio;
> I have an NVMe device which supports write offload. The total size of this disk is ~1.5 TB.
> When I tried zero out the whole NVMe drive it took more than 10 minutes. Please see below:
>
> # cat /sys/block/nvme0n1/queue/write_zeroes_max_bytes
> 8388608
>
> # nvme id-ns /dev/nvme0n1 -H
> NVME Identify Namespace 1:
> nsze    : 0x1749a956
> ncap    : 0x1749a956
> nuse    : 0x1749a956
> <snip>
> nvmcap  : 1600321314816
> <snip>
> LBA Format  0 : Metadata Size: 0   bytes - Data Size: 4096 bytes - Relative Performance: 0 Best (in use)
> <snip>
>
> # time blkdiscard -z /dev/nvme0n1
>
> real	10m27.514s
> user	0m0.000s
> sys	0m0.369s
>
> So shouldn't we need to add the same code (allowing user to kill the process) under
> __blkdev_issue_write_zeroes()? I think even though a drive supports "write zero offload", if
> drive has a very large capacity then it would take up a lot of time to zero out the complete drive.
> Yes the time required may not be in hours in this case but it could be in tens of minutes depending
> on the drive capacity.
>
> Thanks,
> --Nilay
>
>

This is a long-standing problem with discard and write-zeroes. If we are
going this route, then we might as well make this change for the rest of
the callers in blk-lib.c.

-ck
Keith Busch Feb. 21, 2024, 4:27 p.m. UTC | #9
On Wed, Feb 21, 2024 at 02:16:23PM +0530, Nilay Shroff wrote:
> On 2/21/24 02:11, Keith Busch wrote:
> > From: Keith Busch <kbusch@kernel.org>
> 
> # time blkdiscard -z /dev/nvme0n1 
> 
> real	10m27.514s
> user	0m0.000s
> sys	0m0.369s
> 
> So shouldn't we need to add the same code (allowing user to kill the process) under 
> __blkdev_issue_write_zeroes()? I think even though a drive supports "write zero offload", if
> drive has a very large capacity then it would take up a lot of time to zero out the complete drive. 
> Yes the time required may not be in hours in this case but it could be in tens of minutes depending 
> on the drive capacity.

Yeah, that's long enough to change your mind and not want to wait around
for it to proceed anyway. Between that and the filesystem usage, looks
like I have more things to consider for v2.

Patch

diff --git a/block/blk-lib.c b/block/blk-lib.c
index e59c3069e8351..d5c334aa98e0d 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -190,6 +190,8 @@  static int __blkdev_issue_zero_pages(struct block_device *bdev,
 				break;
 		}
 		cond_resched();
+		if (fatal_signal_pending(current))
+			break;
 	}
 
 	*biop = bio;