
[4/4] ublk: improve handling of saturated queues when ublk server exits

Message ID 20250325-ublk_timeout-v1-4-262f0121a7bd@purestorage.com (mailing list archive)
State New
Series ublk: improve handling of saturated queues when ublk server exits

Commit Message

Uday Shankar March 25, 2025, 10:19 p.m. UTC
There are currently two ways in which ublk server exit is detected by
ublk_drv:

1. uring_cmd cancellation. If there are any outstanding uring_cmds which
   have not been completed to the ublk server when it exits, io_uring
   calls the uring_cmd callback with a special cancellation flag as the
   issuing task is exiting.
2. I/O timeout. This is needed in addition to the above to handle the
   "saturated queue" case, when all I/Os for a given queue are in the
   ublk server, and therefore there are no outstanding uring_cmds to
   cancel when the ublk server exits.

The second method detects ublk server exit only after a long delay
(~30s, the default timeout assigned by the block layer). Any
applications using the ublk device will be left hanging for these 30s
before seeing an error/knowing anything went wrong. This problem is
illustrated by running the new test_generic_02 against a ublk_drv which
doesn't have the fix:

selftests: ublk: test_generic_02.sh
dev id is 0
dd: error writing '/dev/ublkb0': Input/output error
1+0 records in
0+0 records out
0 bytes copied, 30.0611 s, 0.0 kB/s
DEAD
dd took 31 seconds to exit (>= 5s tolerance)!
generic_02 : [FAIL]

Fix this by instead handling the saturated queue case in the ublk
character file release callback. This happens during ublk server exit
and handles the issue much more quickly than an I/O timeout:

selftests: ublk: test_generic_02.sh
dev id is 0
dd: error writing '/dev/ublkb0': Input/output error
1+0 records in
0+0 records out
0 bytes copied, 0.0376731 s, 0.0 kB/s
DEAD
generic_02 : [PASS]

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
---
 drivers/block/ublk_drv.c                        | 40 +++++++++++------------
 tools/testing/selftests/ublk/Makefile           |  1 +
 tools/testing/selftests/ublk/kublk.c            |  3 ++
 tools/testing/selftests/ublk/kublk.h            |  3 ++
 tools/testing/selftests/ublk/null.c             |  4 +++
 tools/testing/selftests/ublk/test_generic_02.sh | 43 +++++++++++++++++++++++++
 6 files changed, 72 insertions(+), 22 deletions(-)

Comments

Ming Lei March 26, 2025, 5:38 a.m. UTC | #1
On Tue, Mar 25, 2025 at 04:19:34PM -0600, Uday Shankar wrote:
> There are currently two ways in which ublk server exit is detected by
> ublk_drv:
> 
> 1. uring_cmd cancellation. If there are any outstanding uring_cmds which
>    have not been completed to the ublk server when it exits, io_uring
>    calls the uring_cmd callback with a special cancellation flag as the
>    issuing task is exiting.
> 2. I/O timeout. This is needed in addition to the above to handle the
>    "saturated queue" case, when all I/Os for a given queue are in the
>    ublk server, and therefore there are no outstanding uring_cmds to
>    cancel when the ublk server exits.
> 
> The second method detects ublk server exit only after a long delay
> (~30s, the default timeout assigned by the block layer). Any
> applications using the ublk device will be left hanging for these 30s
> before seeing an error/knowing anything went wrong. This problem is
> illustrated by running the new test_generic_02 against a ublk_drv which
> doesn't have the fix:
> 
> selftests: ublk: test_generic_02.sh
> dev id is 0
> dd: error writing '/dev/ublkb0': Input/output error
> 1+0 records in
> 0+0 records out
> 0 bytes copied, 30.0611 s, 0.0 kB/s
> DEAD
> dd took 31 seconds to exit (>= 5s tolerance)!
> generic_02 : [FAIL]
> 
> Fix this by instead handling the saturated queue case in the ublk
> character file release callback. This happens during ublk server exit
> and handles the issue much more quickly than an I/O timeout:

Another solution is to override the default 30sec 'timeout'.
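
For illustration only (not something posted in this thread): such an override
would presumably be a one-line change wherever the ublk disk's request queue is
configured, assuming blk_queue_rq_timeout() remains the interface for setting a
per-queue request timeout:

	/* sketch: shrink the default 30s request timeout to, say, 1 second */
	blk_queue_rq_timeout(ub->ub_disk->queue, 1 * HZ);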

> 
> selftests: ublk: test_generic_02.sh
> dev id is 0
> dd: error writing '/dev/ublkb0': Input/output error
> 1+0 records in
> 0+0 records out
> 0 bytes copied, 0.0376731 s, 0.0 kB/s
> DEAD
> generic_02 : [PASS]
> 
> Signed-off-by: Uday Shankar <ushankar@purestorage.com>
> ---
>  drivers/block/ublk_drv.c                        | 40 +++++++++++------------
>  tools/testing/selftests/ublk/Makefile           |  1 +
>  tools/testing/selftests/ublk/kublk.c            |  3 ++
>  tools/testing/selftests/ublk/kublk.h            |  3 ++
>  tools/testing/selftests/ublk/null.c             |  4 +++
>  tools/testing/selftests/ublk/test_generic_02.sh | 43 +++++++++++++++++++++++++
>  6 files changed, 72 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index c060da409ed8a888b7e414c9065efd2cbd6d57d7..1816b2cac01056dc9d01455759594af43c5f78d6 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -1247,8 +1247,6 @@ static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq)
>  static enum blk_eh_timer_return ublk_timeout(struct request *rq)
>  {
>  	struct ublk_queue *ubq = rq->mq_hctx->driver_data;
> -	unsigned int nr_inflight = 0;
> -	int i;
>  
>  	if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) {
>  		if (!ubq->timeout) {
> @@ -1259,26 +1257,6 @@ static enum blk_eh_timer_return ublk_timeout(struct request *rq)
>  		return BLK_EH_DONE;
>  	}
>  
> -	if (!ubq_daemon_is_dying(ubq))
> -		return BLK_EH_RESET_TIMER;
> -
> -	for (i = 0; i < ubq->q_depth; i++) {
> -		struct ublk_io *io = &ubq->ios[i];
> -
> -		if (!(io->flags & UBLK_IO_FLAG_ACTIVE))
> -			nr_inflight++;
> -	}
> -
> -	/* cancelable uring_cmd can't help us if all commands are in-flight */
> -	if (nr_inflight == ubq->q_depth) {
> -		struct ublk_device *ub = ubq->dev;
> -
> -		if (ublk_abort_requests(ub, ubq)) {
> -			schedule_work(&ub->nosrv_work);
> -		}
> -		return BLK_EH_DONE;
> -	}
> -
>  	return BLK_EH_RESET_TIMER;
>  }
>  
> @@ -1351,6 +1329,24 @@ static int ublk_ch_open(struct inode *inode, struct file *filp)
>  static int ublk_ch_release(struct inode *inode, struct file *filp)
>  {
>  	struct ublk_device *ub = filp->private_data;
> +	bool need_schedule = false;
> +	int i;
> +
> +	/*
> +	 * Error out any requests outstanding to the ublk server. This
> +	 * may have happened already (via uring_cmd cancellation), in
> +	 * which case it is not harmful to repeat. But uring_cmd
> +	 * cancellation does not handle queues which are fully saturated
> +	 * (all requests in ublk server), because from the kernel's POV,
> +	 * there are no outstanding uring_cmds to cancel. This code
> +	 * handles such queues.
> +	 */
> +
> +	for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
> +		need_schedule |= ublk_abort_requests(ub, ublk_get_queue(ub, i));
> +
> +	if (need_schedule)
> +		schedule_work(&ub->nosrv_work);

ublk_abort_requests() should be called only in case of queue dying,
since ublk server may open & close the char device multiple times.

To know whether the queue is dying, ->ubq_daemon needs to be checked;
however, it may not be set yet, and the current context is not the same as
the ubq_daemon context, so I feel it is a bit fragile to bring a queue
reference into the ->release() callback.

Many libublksrv tests fail with this patch, or the kernel panics, even
with the above check added:

        make test T=generic
>  
>  	clear_bit(UB_STATE_OPEN, &ub->state);
>  	return 0;
> diff --git a/tools/testing/selftests/ublk/Makefile b/tools/testing/selftests/ublk/Makefile
> index 7817afe290053853ce31d28a8f4bbca570c3046c..dcc514b6d8f6e485597320636ab111a17b7e5448 100644
> --- a/tools/testing/selftests/ublk/Makefile
> +++ b/tools/testing/selftests/ublk/Makefile
> @@ -4,6 +4,7 @@ CFLAGS += -O3 -Wl,-no-as-needed -Wall -I $(top_srcdir)
>  LDLIBS += -lpthread -lm -luring
>  
>  TEST_PROGS := test_generic_01.sh
> +TEST_PROGS := test_generic_02.sh
>  
>  TEST_PROGS += test_null_01.sh
>  TEST_PROGS += test_null_02.sh
> diff --git a/tools/testing/selftests/ublk/kublk.c b/tools/testing/selftests/ublk/kublk.c
> index 064a5bb6f12f35892065b8dfacb6f57f6fc16aee..e883cd0f9e330eb15da5a00f6085343333a9355d 100644
> --- a/tools/testing/selftests/ublk/kublk.c
> +++ b/tools/testing/selftests/ublk/kublk.c
> @@ -1065,6 +1065,7 @@ int main(int argc, char *argv[])
>  		{ "zero_copy",          0,      NULL, 'z' },
>  		{ "foreground",		0,	NULL,  0  },
>  		{ "chunk_size", 	1,	NULL,  0  },
> +		{ "delay_us",		1,	NULL,  0  },
>  		{ 0, 0, 0, 0 }
>  	};
>  	int option_idx, opt;
> @@ -1113,6 +1114,8 @@ int main(int argc, char *argv[])
>  				ctx.fg = 1;
>  			if (!strcmp(longopts[option_idx].name, "chunk_size"))
>  				ctx.chunk_size = strtol(optarg, NULL, 10);
> +			if (!strcmp(longopts[option_idx].name, "delay_us"))
> +				ctx.delay_us = strtoul(optarg, NULL, 10);
>  		}
>  	}
>  
> diff --git a/tools/testing/selftests/ublk/kublk.h b/tools/testing/selftests/ublk/kublk.h
> index f31a5c4d4143e28f13d4cd98d611e37f93b0c25a..6414d482ea3986a9d1973f04a1832d6fe16231bf 100644
> --- a/tools/testing/selftests/ublk/kublk.h
> +++ b/tools/testing/selftests/ublk/kublk.h
> @@ -67,6 +67,9 @@ struct dev_ctx {
>  	unsigned int	all:1;
>  	unsigned int	fg:1;
>  
> +	/* null */
> +	unsigned long	delay_us;
> +
>  	/* stripe */
>  	unsigned int    chunk_size;
>  
> diff --git a/tools/testing/selftests/ublk/null.c b/tools/testing/selftests/ublk/null.c
> index 899875ff50feadbd734fbbf1f8fad1f19abd1e8f..8bf58e540f1bffc8361450484a6dc484e815378c 100644
> --- a/tools/testing/selftests/ublk/null.c
> +++ b/tools/testing/selftests/ublk/null.c
> @@ -30,6 +30,8 @@ static int ublk_null_tgt_init(const struct dev_ctx *ctx, struct ublk_dev *dev)
>  
>  	if (info->flags & UBLK_F_SUPPORT_ZERO_COPY)
>  		dev->tgt.sq_depth = dev->tgt.cq_depth = 2 * info->queue_depth;
> +
> +	dev->private_data = (void *)ctx->delay_us;
>  	return 0;
>  }
>  
> @@ -88,6 +90,8 @@ static int ublk_null_queue_io(struct ublk_queue *q, int tag)
>  	int zc = ublk_queue_use_zc(q);
>  	int queued;
>  
> +	usleep((unsigned long)q->dev->private_data);

We usually use ublk/null for evaluating communication cost and for
benchmarking, so it isn't good to add more stuff to the I/O path.

I'd suggest adding a 'delay' target, which could be useful in the future for
simulating any kind of delay behavior for test purposes at least, and
io_uring's IORING_OP_TIMEOUT can be used here.
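
For illustration, a minimal standalone example of arming such a delay with
liburing's timeout support (this is not the proposed selftest target; the
names and the 2s value are only illustrative):

#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	/* 2s delay, matching the --delay_us 2000000 used by the new test */
	struct __kernel_timespec ts = { .tv_sec = 2, .tv_nsec = 0 };

	if (io_uring_queue_init(8, &ring, 0))
		return 1;

	sqe = io_uring_get_sqe(&ring);
	/* arm an asynchronous delay instead of blocking in usleep() */
	io_uring_prep_timeout(sqe, &ts, 0, 0);
	io_uring_sqe_set_data(sqe, NULL);
	io_uring_submit(&ring);

	/* the timeout CQE (res == -ETIME) arrives ~2s later */
	if (!io_uring_wait_cqe(&ring, &cqe))
		io_uring_cqe_seen(&ring, cqe);

	io_uring_queue_exit(&ring);
	return 0;
}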


Thanks, 
Ming
Uday Shankar March 26, 2025, 5:54 p.m. UTC | #2
On Wed, Mar 26, 2025 at 01:38:35PM +0800, Ming Lei wrote:
> On Tue, Mar 25, 2025 at 04:19:34PM -0600, Uday Shankar wrote:
> > There are currently two ways in which ublk server exit is detected by
> > ublk_drv:
> > 
> > 1. uring_cmd cancellation. If there are any outstanding uring_cmds which
> >    have not been completed to the ublk server when it exits, io_uring
> >    calls the uring_cmd callback with a special cancellation flag as the
> >    issuing task is exiting.
> > 2. I/O timeout. This is needed in addition to the above to handle the
> >    "saturated queue" case, when all I/Os for a given queue are in the
> >    ublk server, and therefore there are no outstanding uring_cmds to
> >    cancel when the ublk server exits.
> > 
> > The second method detects ublk server exit only after a long delay
> > (~30s, the default timeout assigned by the block layer). Any
> > applications using the ublk device will be left hanging for these 30s
> > before seeing an error/knowing anything went wrong. This problem is
> > illustrated by running the new test_generic_02 against a ublk_drv which
> > doesn't have the fix:
> > 
> > selftests: ublk: test_generic_02.sh
> > dev id is 0
> > dd: error writing '/dev/ublkb0': Input/output error
> > 1+0 records in
> > 0+0 records out
> > 0 bytes copied, 30.0611 s, 0.0 kB/s
> > DEAD
> > dd took 31 seconds to exit (>= 5s tolerance)!
> > generic_02 : [FAIL]
> > 
> > Fix this by instead handling the saturated queue case in the ublk
> > character file release callback. This happens during ublk server exit
> > and handles the issue much more quickly than an I/O timeout:
> 
> Another solution is to override the default 30sec 'timeout'.

Yes, but that still will introduce unnecessary delays, since it is a
polling-based solution (very similar to monitor_work we used to have).
Also it will add complexity to the unprivileged case, since that
actually cares about timeout and we will have to track the "real"
timeout separately.

> 
> > 
> > selftests: ublk: test_generic_02.sh
> > dev id is 0
> > dd: error writing '/dev/ublkb0': Input/output error
> > 1+0 records in
> > 0+0 records out
> > 0 bytes copied, 0.0376731 s, 0.0 kB/s
> > DEAD
> > generic_02 : [PASS]
> > 
> > Signed-off-by: Uday Shankar <ushankar@purestorage.com>
> > ---
> >  drivers/block/ublk_drv.c                        | 40 +++++++++++------------
> >  tools/testing/selftests/ublk/Makefile           |  1 +
> >  tools/testing/selftests/ublk/kublk.c            |  3 ++
> >  tools/testing/selftests/ublk/kublk.h            |  3 ++
> >  tools/testing/selftests/ublk/null.c             |  4 +++
> >  tools/testing/selftests/ublk/test_generic_02.sh | 43 +++++++++++++++++++++++++
> >  6 files changed, 72 insertions(+), 22 deletions(-)
> > 
> > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > index c060da409ed8a888b7e414c9065efd2cbd6d57d7..1816b2cac01056dc9d01455759594af43c5f78d6 100644
> > --- a/drivers/block/ublk_drv.c
> > +++ b/drivers/block/ublk_drv.c
> > @@ -1247,8 +1247,6 @@ static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq)
> >  static enum blk_eh_timer_return ublk_timeout(struct request *rq)
> >  {
> >  	struct ublk_queue *ubq = rq->mq_hctx->driver_data;
> > -	unsigned int nr_inflight = 0;
> > -	int i;
> >  
> >  	if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) {
> >  		if (!ubq->timeout) {
> > @@ -1259,26 +1257,6 @@ static enum blk_eh_timer_return ublk_timeout(struct request *rq)
> >  		return BLK_EH_DONE;
> >  	}
> >  
> > -	if (!ubq_daemon_is_dying(ubq))
> > -		return BLK_EH_RESET_TIMER;
> > -
> > -	for (i = 0; i < ubq->q_depth; i++) {
> > -		struct ublk_io *io = &ubq->ios[i];
> > -
> > -		if (!(io->flags & UBLK_IO_FLAG_ACTIVE))
> > -			nr_inflight++;
> > -	}
> > -
> > -	/* cancelable uring_cmd can't help us if all commands are in-flight */
> > -	if (nr_inflight == ubq->q_depth) {
> > -		struct ublk_device *ub = ubq->dev;
> > -
> > -		if (ublk_abort_requests(ub, ubq)) {
> > -			schedule_work(&ub->nosrv_work);
> > -		}
> > -		return BLK_EH_DONE;
> > -	}
> > -
> >  	return BLK_EH_RESET_TIMER;
> >  }
> >  
> > @@ -1351,6 +1329,24 @@ static int ublk_ch_open(struct inode *inode, struct file *filp)
> >  static int ublk_ch_release(struct inode *inode, struct file *filp)
> >  {
> >  	struct ublk_device *ub = filp->private_data;
> > +	bool need_schedule = false;
> > +	int i;
> > +
> > +	/*
> > +	 * Error out any requests outstanding to the ublk server. This
> > +	 * may have happened already (via uring_cmd cancellation), in
> > +	 * which case it is not harmful to repeat. But uring_cmd
> > +	 * cancellation does not handle queues which are fully saturated
> > +	 * (all requests in ublk server), because from the kernel's POV,
> > +	 * there are no outstanding uring_cmds to cancel. This code
> > +	 * handles such queues.
> > +	 */
> > +
> > +	for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
> > +		need_schedule |= ublk_abort_requests(ub, ublk_get_queue(ub, i));
> > +
> > +	if (need_schedule)
> > +		schedule_work(&ub->nosrv_work);
> 
> ublk_abort_requests() should be called only in case of queue dying,
> since ublk server may open & close the char device multiple times.

Sure that is technically possible, however is any real ublk server doing
this? Seems like a strange thing to do, and seems reasonable for the
driver to transition the device to the nosrv state (dead or recovery,
depending on flags) when the char device is closed, since in this case,
no one can be handling I/O anymore.

In general I feel like char device close is a nice place to centralize
the transition to the nosrv state. It has a few nice properties:
- Because all file references are released at this point, we're
  guaranteed that all file-related activity (uring_cmds, pread/pwrite)
  is quiesced.
- This one place can handle both saturated and unsaturated queues.
- It is "event-driven," i.e. our callback gets called when a certain
  condition is met, instead of having to poll for a condition (like the
  old monitor_work, or the timeout now)
- It looks like we can sleep in the char device close context, so we
  could inline nosrv_work.

This also is a step in the right direction IMO for resurrecting this old
work to get rid of 1:1 ublk server thread to hctx restriction

https://lore.kernel.org/linux-block/20241002224437.3088981-1-ushankar@purestorage.com/T/#u

> To know whether the queue is dying, ->ubq_daemon needs to be checked;
> however, it may not be set yet, and the current context is not the same as
> the ubq_daemon context, so I feel it is a bit fragile to bring a queue
> reference into the ->release() callback.
> 
> Many libublksrv tests fail with this patch, or the kernel panics, even
> with the above check added:
> 
>         make test T=generic

Thanks, I will look at and address these failures.

Is there any plan to bring these tests into the new ublk selftests
framework?
Uday Shankar March 26, 2025, 6:56 p.m. UTC | #3
On Wed, Mar 26, 2025 at 11:54:16AM -0600, Uday Shankar wrote:
> > ublk_abort_requests() should be called only in case of queue dying,
> > since ublk server may open & close the char device multiple times.
> 
> Sure that is technically possible, however is any real ublk server doing
> this? Seems like a strange thing to do, and seems reasonable for the
> driver to transition the device to the nosrv state (dead or recovery,
> depending on flags) when the char device is closed, since in this case,
> no one can be handling I/O anymore.

I see ublksrv itself is doing this :(

/* Wait until ublk device is setup by udev */
static void ublksrv_check_dev(const struct ublksrv_ctrl_dev_info *info)
{
	unsigned int max_time = 1000000, wait = 0;
	char buf[64];

	snprintf(buf, 64, "%s%d", "/dev/ublkc", info->dev_id);

	while (wait < max_time) {
		int fd = open(buf, O_RDWR);

		if (fd > 0) {
			close(fd);
			break;
		}

		usleep(100000);
		wait += 100000;
	}
}

This seems related to some failures in ublksrv tests
Uday Shankar March 26, 2025, 11:08 p.m. UTC | #4
On Wed, Mar 26, 2025 at 12:56:56PM -0600, Uday Shankar wrote:
> On Wed, Mar 26, 2025 at 11:54:16AM -0600, Uday Shankar wrote:
> > > ublk_abort_requests() should be called only in case of queue dying,
> > > since ublk server may open & close the char device multiple times.
> > 
> > Sure that is technically possible, however is any real ublk server doing
> > this? Seems like a strange thing to do, and seems reasonable for the
> > driver to transition the device to the nosrv state (dead or recovery,
> > depending on flags) when the char device is closed, since in this case,
> > no one can be handling I/O anymore.
> 
> I see ublksrv itself is doing this :(
> 
> /* Wait until ublk device is setup by udev */
> static void ublksrv_check_dev(const struct ublksrv_ctrl_dev_info *info)
> {
> 	unsigned int max_time = 1000000, wait = 0;
> 	char buf[64];
> 
> 	snprintf(buf, 64, "%s%d", "/dev/ublkc", info->dev_id);
> 
> 	while (wait < max_time) {
> 		int fd = open(buf, O_RDWR);
> 
> 		if (fd > 0) {
> 			close(fd);
> 			break;
> 		}
> 
> 		usleep(100000);
> 		wait += 100000;
> 	}
> }
> 
> This seems related to some failures in ublksrv tests

Actually this is the only issue I'm seeing - after patching this up in
ublksrv, make T=generic test appears to pass - I don't see any logs
indicating failures, and no kernel panics.

So the question is, does this patch break existing ublk servers? It does
break ublksrv as shown above, but I think one could argue that the above
code is just testing for file existence, and it's a bit weird to do that
by opening and closing the file (especially given that it's a device
special file). It can be patched to just use access or something
instead.
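
For example, a sketch of such a change (untested, assuming only <unistd.h>
and the existing ublksrv headers):

/* untested sketch: poll for existence with access() instead of open()/close() */
static void ublksrv_check_dev(const struct ublksrv_ctrl_dev_info *info)
{
	unsigned int max_time = 1000000, wait = 0;
	char buf[64];

	snprintf(buf, 64, "%s%d", "/dev/ublkc", info->dev_id);

	while (wait < max_time) {
		/* F_OK checks existence only; no fd is ever opened on the device */
		if (access(buf, F_OK) == 0)
			break;

		usleep(100000);
		wait += 100000;
	}
}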
Ming Lei March 27, 2025, 1:23 a.m. UTC | #5
On Wed, Mar 26, 2025 at 11:54:16AM -0600, Uday Shankar wrote:
> On Wed, Mar 26, 2025 at 01:38:35PM +0800, Ming Lei wrote:
> > On Tue, Mar 25, 2025 at 04:19:34PM -0600, Uday Shankar wrote:
> > > There are currently two ways in which ublk server exit is detected by
> > > ublk_drv:
> > > 
> > > 1. uring_cmd cancellation. If there are any outstanding uring_cmds which
> > >    have not been completed to the ublk server when it exits, io_uring
> > >    calls the uring_cmd callback with a special cancellation flag as the
> > >    issuing task is exiting.
> > > 2. I/O timeout. This is needed in addition to the above to handle the
> > >    "saturated queue" case, when all I/Os for a given queue are in the
> > >    ublk server, and therefore there are no outstanding uring_cmds to
> > >    cancel when the ublk server exits.
> > > 
> > > The second method detects ublk server exit only after a long delay
> > > (~30s, the default timeout assigned by the block layer). Any
> > > applications using the ublk device will be left hanging for these 30s
> > > before seeing an error/knowing anything went wrong. This problem is
> > > illustrated by running the new test_generic_02 against a ublk_drv which
> > > doesn't have the fix:
> > > 
> > > selftests: ublk: test_generic_02.sh
> > > dev id is 0
> > > dd: error writing '/dev/ublkb0': Input/output error
> > > 1+0 records in
> > > 0+0 records out
> > > 0 bytes copied, 30.0611 s, 0.0 kB/s
> > > DEAD
> > > dd took 31 seconds to exit (>= 5s tolerance)!
> > > generic_02 : [FAIL]
> > > 
> > > Fix this by instead handling the saturated queue case in the ublk
> > > character file release callback. This happens during ublk server exit
> > > and handles the issue much more quickly than an I/O timeout:
> > 
> > Another solution is to override the default 30sec 'timeout'.
> 
> Yes, but that still will introduce unnecessary delays, since it is a
> polling-based solution (very similar to monitor_work we used to have).
> Also it will add complexity to the unprivileged case, since that
> actually cares about timeout and we will have to track the "real"
> timeout separately.
> 
> > 
> > > 
> > > selftests: ublk: test_generic_02.sh
> > > dev id is 0
> > > dd: error writing '/dev/ublkb0': Input/output error
> > > 1+0 records in
> > > 0+0 records out
> > > 0 bytes copied, 0.0376731 s, 0.0 kB/s
> > > DEAD
> > > generic_02 : [PASS]
> > > 
> > > Signed-off-by: Uday Shankar <ushankar@purestorage.com>
> > > ---
> > >  drivers/block/ublk_drv.c                        | 40 +++++++++++------------
> > >  tools/testing/selftests/ublk/Makefile           |  1 +
> > >  tools/testing/selftests/ublk/kublk.c            |  3 ++
> > >  tools/testing/selftests/ublk/kublk.h            |  3 ++
> > >  tools/testing/selftests/ublk/null.c             |  4 +++
> > >  tools/testing/selftests/ublk/test_generic_02.sh | 43 +++++++++++++++++++++++++
> > >  6 files changed, 72 insertions(+), 22 deletions(-)
> > > 
> > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > index c060da409ed8a888b7e414c9065efd2cbd6d57d7..1816b2cac01056dc9d01455759594af43c5f78d6 100644
> > > --- a/drivers/block/ublk_drv.c
> > > +++ b/drivers/block/ublk_drv.c
> > > @@ -1247,8 +1247,6 @@ static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq)
> > >  static enum blk_eh_timer_return ublk_timeout(struct request *rq)
> > >  {
> > >  	struct ublk_queue *ubq = rq->mq_hctx->driver_data;
> > > -	unsigned int nr_inflight = 0;
> > > -	int i;
> > >  
> > >  	if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) {
> > >  		if (!ubq->timeout) {
> > > @@ -1259,26 +1257,6 @@ static enum blk_eh_timer_return ublk_timeout(struct request *rq)
> > >  		return BLK_EH_DONE;
> > >  	}
> > >  
> > > -	if (!ubq_daemon_is_dying(ubq))
> > > -		return BLK_EH_RESET_TIMER;
> > > -
> > > -	for (i = 0; i < ubq->q_depth; i++) {
> > > -		struct ublk_io *io = &ubq->ios[i];
> > > -
> > > -		if (!(io->flags & UBLK_IO_FLAG_ACTIVE))
> > > -			nr_inflight++;
> > > -	}
> > > -
> > > -	/* cancelable uring_cmd can't help us if all commands are in-flight */
> > > -	if (nr_inflight == ubq->q_depth) {
> > > -		struct ublk_device *ub = ubq->dev;
> > > -
> > > -		if (ublk_abort_requests(ub, ubq)) {
> > > -			schedule_work(&ub->nosrv_work);
> > > -		}
> > > -		return BLK_EH_DONE;
> > > -	}
> > > -
> > >  	return BLK_EH_RESET_TIMER;
> > >  }
> > >  
> > > @@ -1351,6 +1329,24 @@ static int ublk_ch_open(struct inode *inode, struct file *filp)
> > >  static int ublk_ch_release(struct inode *inode, struct file *filp)
> > >  {
> > >  	struct ublk_device *ub = filp->private_data;
> > > +	bool need_schedule = false;
> > > +	int i;
> > > +
> > > +	/*
> > > +	 * Error out any requests outstanding to the ublk server. This
> > > +	 * may have happened already (via uring_cmd cancellation), in
> > > +	 * which case it is not harmful to repeat. But uring_cmd
> > > +	 * cancellation does not handle queues which are fully saturated
> > > +	 * (all requests in ublk server), because from the kernel's POV,
> > > +	 * there are no outstanding uring_cmds to cancel. This code
> > > +	 * handles such queues.
> > > +	 */
> > > +
> > > +	for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
> > > +		need_schedule |= ublk_abort_requests(ub, ublk_get_queue(ub, i));
> > > +
> > > +	if (need_schedule)
> > > +		schedule_work(&ub->nosrv_work);
> > 
> > ublk_abort_requests() should be called only in case of queue dying,
> > since ublk server may open & close the char device multiple times.
> 
> Sure that is technically possible, however is any real ublk server doing
> this? Seems like a strange thing to do, and seems reasonable for the
> driver to transition the device to the nosrv state (dead or recovery,
> depending on flags) when the char device is closed, since in this case,
> no one can be handling I/O anymore.

The ublk server should be free to open & close the char device multiple times,
but your patch limits the ublk server to opening & closing the char device
just once.

That limit looks too strong...

> 
> In general I feel like char device close is a nice place to centralize
> the transition to the nosrv state. It has a few nice properties:
> - Because all file references are released at this point, we're
>   guaranteed that all file-related activity (uring_cmds, pread/pwrite)
>   is quiesced.
> - This one place can handle both saturated and unsaturated queues.
> - It is "event-driven," i.e. our callback gets called when a certain
>   condition is met, instead of having to poll for a condition (like the
>   old monitor_work, or the timeout now)
> - It looks like we can sleep in the char device close context, so we
>   could inline nosrv_work.

I agree with all of the above, except that:

1) open()/close() needs to be allowed multiple times

2) to deal with 1), you may have to check whether the queue is dying, which
likely means using ->ubq_daemon; that is set when starting ublk and cleared
when freeing the ublk char device, so a race is added here which needs to be
addressed.

> 
> This also is a step in the right direction IMO for resurrecting this old
> work to get rid of 1:1 ublk server thread to hctx restriction
> 
> https://lore.kernel.org/linux-block/20241002224437.3088981-1-ushankar@purestorage.com/T/#u

That is definitely one good direction.

> 
> > To know whether the queue is dying, ->ubq_daemon needs to be checked;
> > however, it may not be set yet, and the current context is not the same as
> > the ubq_daemon context, so I feel it is a bit fragile to bring a queue
> > reference into the ->release() callback.
> > 
> > Many libublksrv tests fail with this patch, or the kernel panics, even
> > with the above check added:
> > 
> >         make test T=generic
> 
> Thanks, I will look at and address these failures.
> 
> Is there any plan to bring these tests into the new ublk selftests
> framework?

The two stress tests should be very similar to ublksrv's; just MQ isn't enabled.

I will enable them later.

The other big part is the recovery test, which may take some time. I have been
a little busy recently, so it would be great if anyone would like to pull the
recovery test in. Otherwise, it may take a while.



thanks,
Ming
Ming Lei March 27, 2025, 1:38 a.m. UTC | #6
On Wed, Mar 26, 2025 at 05:08:19PM -0600, Uday Shankar wrote:
> On Wed, Mar 26, 2025 at 12:56:56PM -0600, Uday Shankar wrote:
> > On Wed, Mar 26, 2025 at 11:54:16AM -0600, Uday Shankar wrote:
> > > > ublk_abort_requests() should be called only in case of queue dying,
> > > > since ublk server may open & close the char device multiple times.
> > > 
> > > Sure that is technically possible, however is any real ublk server doing
> > > this? Seems like a strange thing to do, and seems reasonable for the
> > > driver to transition the device to the nosrv state (dead or recovery,
> > > depending on flags) when the char device is closed, since in this case,
> > > no one can be handling I/O anymore.
> > 
> > I see ublksrv itself is doing this :(
> > 
> > /* Wait until ublk device is setup by udev */
> > static void ublksrv_check_dev(const struct ublksrv_ctrl_dev_info *info)
> > {
> > 	unsigned int max_time = 1000000, wait = 0;
> > 	char buf[64];
> > 
> > 	snprintf(buf, 64, "%s%d", "/dev/ublkc", info->dev_id);
> > 
> > 	while (wait < max_time) {
> > 		int fd = open(buf, O_RDWR);
> > 
> > 		if (fd > 0) {
> > 			close(fd);
> > 			break;
> > 		}
> > 
> > 		usleep(100000);
> > 		wait += 100000;
> > 	}
> > }
> > 
> > This seems related to some failures in ublksrv tests
> 
> Actually this is the only issue I'm seeing - after patching this up in
> ublksrv, make T=generic test appears to pass - I don't see any logs
> indicating failures, and no kernel panics.

Yes, that is one example.

Your patch breaks any application which opens the ublk char device more than
once. And it is common practice to allow a device to be opened multiple
times.

Maybe some utility opens the char device unexpectedly; systemd usually opens &
reads block devices, though I am not sure whether it opens the char device.

I tried adding a change on top of your patch to abort requests only in
->release() when the queue becomes dying, and the added check triggers a
kernel panic.

> 
> So the question is, does this patch break existing ublk servers? It does

It should break any application which depends on libublksrv.

> break ublksrv as shown above, but I think one could argue that the above
> code is just testing for file existence, and it's a bit weird to do that
> by opening and closing the file (especially given that it's a device
> special file). It can be patched to just use access or something
> instead.

Even if libublksrv is the only one with such usage, it is impossible to
force users to upgrade the library, but users may still upgrade to the
latest kernel...


Thanks,
Ming
Ming Lei March 27, 2025, 2:06 a.m. UTC | #7
On Tue, Mar 25, 2025 at 04:19:34PM -0600, Uday Shankar wrote:
> There are currently two ways in which ublk server exit is detected by
> ublk_drv:
> 
> 1. uring_cmd cancellation. If there are any outstanding uring_cmds which
>    have not been completed to the ublk server when it exits, io_uring
>    calls the uring_cmd callback with a special cancellation flag as the
>    issuing task is exiting.
> 2. I/O timeout. This is needed in addition to the above to handle the
>    "saturated queue" case, when all I/Os for a given queue are in the
>    ublk server, and therefore there are no outstanding uring_cmds to
>    cancel when the ublk server exits.
> 
> The second method detects ublk server exit only after a long delay
> (~30s, the default timeout assigned by the block layer). Any
> applications using the ublk device will be left hanging for these 30s
> before seeing an error/knowing anything went wrong. This problem is
> illustrated by running the new test_generic_02 against a ublk_drv which
> doesn't have the fix:
> 
> selftests: ublk: test_generic_02.sh
> dev id is 0
> dd: error writing '/dev/ublkb0': Input/output error
> 1+0 records in
> 0+0 records out
> 0 bytes copied, 30.0611 s, 0.0 kB/s
> DEAD
> dd took 31 seconds to exit (>= 5s tolerance)!
> generic_02 : [FAIL]
> 
> Fix this by instead handling the saturated queue case in the ublk
> character file release callback. This happens during ublk server exit
> and handles the issue much more quickly than an I/O timeout:
> 
> selftests: ublk: test_generic_02.sh
> dev id is 0
> dd: error writing '/dev/ublkb0': Input/output error
> 1+0 records in
> 0+0 records out
> 0 bytes copied, 0.0376731 s, 0.0 kB/s
> DEAD
> generic_02 : [PASS]
> 
> Signed-off-by: Uday Shankar <ushankar@purestorage.com>
> ---
>  drivers/block/ublk_drv.c                        | 40 +++++++++++------------
>  tools/testing/selftests/ublk/Makefile           |  1 +
>  tools/testing/selftests/ublk/kublk.c            |  3 ++
>  tools/testing/selftests/ublk/kublk.h            |  3 ++
>  tools/testing/selftests/ublk/null.c             |  4 +++
>  tools/testing/selftests/ublk/test_generic_02.sh | 43 +++++++++++++++++++++++++
>  6 files changed, 72 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index c060da409ed8a888b7e414c9065efd2cbd6d57d7..1816b2cac01056dc9d01455759594af43c5f78d6 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -1247,8 +1247,6 @@ static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq)
>  static enum blk_eh_timer_return ublk_timeout(struct request *rq)
>  {
>  	struct ublk_queue *ubq = rq->mq_hctx->driver_data;
> -	unsigned int nr_inflight = 0;
> -	int i;
>  
>  	if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) {
>  		if (!ubq->timeout) {
> @@ -1259,26 +1257,6 @@ static enum blk_eh_timer_return ublk_timeout(struct request *rq)
>  		return BLK_EH_DONE;
>  	}
>  
> -	if (!ubq_daemon_is_dying(ubq))
> -		return BLK_EH_RESET_TIMER;
> -
> -	for (i = 0; i < ubq->q_depth; i++) {
> -		struct ublk_io *io = &ubq->ios[i];
> -
> -		if (!(io->flags & UBLK_IO_FLAG_ACTIVE))
> -			nr_inflight++;
> -	}
> -
> -	/* cancelable uring_cmd can't help us if all commands are in-flight */
> -	if (nr_inflight == ubq->q_depth) {
> -		struct ublk_device *ub = ubq->dev;
> -
> -		if (ublk_abort_requests(ub, ubq)) {
> -			schedule_work(&ub->nosrv_work);
> -		}
> -		return BLK_EH_DONE;
> -	}
> -
>  	return BLK_EH_RESET_TIMER;
>  }
>  
> @@ -1351,6 +1329,24 @@ static int ublk_ch_open(struct inode *inode, struct file *filp)
>  static int ublk_ch_release(struct inode *inode, struct file *filp)
>  {
>  	struct ublk_device *ub = filp->private_data;
> +	bool need_schedule = false;
> +	int i;
> +
> +	/*
> +	 * Error out any requests outstanding to the ublk server. This
> +	 * may have happened already (via uring_cmd cancellation), in
> +	 * which case it is not harmful to repeat. But uring_cmd
> +	 * cancellation does not handle queues which are fully saturated
> +	 * (all requests in ublk server), because from the kernel's POV,
> +	 * there are no outstanding uring_cmds to cancel. This code
> +	 * handles such queues.
> +	 */
> +
> +	for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
> +		need_schedule |= ublk_abort_requests(ub, ublk_get_queue(ub, i));
> +
> +	if (need_schedule)
> +		schedule_work(&ub->nosrv_work);

Thinking about it further, you needn't call ublk_abort_requests() and schedule
the stop work to remove the disk here.

It can be the following way:

1) do nothing if 'ubq->canceling' is set, and it is safe to check this flag
because all uring commands are done now

2) otherwise, only abort the in-flight requests, by calling
ublk_abort_requests() and skipping the get & set ->canceling part.

which should avoid the 30sec delay, right?

Thanks, 
Ming
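
For reference, a loose sketch of the release path described above (assuming
the ubq->canceling flag and the existing ublk_get_queue()/ublk_abort_requests()
helpers; the variant of ublk_abort_requests() that skips the get & set of
->canceling is not spelled out here):

static int ublk_ch_release(struct inode *inode, struct file *filp)
{
	struct ublk_device *ub = filp->private_data;
	int i;

	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
		struct ublk_queue *ubq = ublk_get_queue(ub, i);

		/* uring_cmd cancellation already handled this queue */
		if (ubq->canceling)
			continue;

		/* saturated queue: error out its in-flight requests */
		ublk_abort_requests(ub, ubq);
	}

	clear_bit(UB_STATE_OPEN, &ub->state);
	return 0;
}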

Patch

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index c060da409ed8a888b7e414c9065efd2cbd6d57d7..1816b2cac01056dc9d01455759594af43c5f78d6 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1247,8 +1247,6 @@  static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq)
 static enum blk_eh_timer_return ublk_timeout(struct request *rq)
 {
 	struct ublk_queue *ubq = rq->mq_hctx->driver_data;
-	unsigned int nr_inflight = 0;
-	int i;
 
 	if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) {
 		if (!ubq->timeout) {
@@ -1259,26 +1257,6 @@  static enum blk_eh_timer_return ublk_timeout(struct request *rq)
 		return BLK_EH_DONE;
 	}
 
-	if (!ubq_daemon_is_dying(ubq))
-		return BLK_EH_RESET_TIMER;
-
-	for (i = 0; i < ubq->q_depth; i++) {
-		struct ublk_io *io = &ubq->ios[i];
-
-		if (!(io->flags & UBLK_IO_FLAG_ACTIVE))
-			nr_inflight++;
-	}
-
-	/* cancelable uring_cmd can't help us if all commands are in-flight */
-	if (nr_inflight == ubq->q_depth) {
-		struct ublk_device *ub = ubq->dev;
-
-		if (ublk_abort_requests(ub, ubq)) {
-			schedule_work(&ub->nosrv_work);
-		}
-		return BLK_EH_DONE;
-	}
-
 	return BLK_EH_RESET_TIMER;
 }
 
@@ -1351,6 +1329,24 @@  static int ublk_ch_open(struct inode *inode, struct file *filp)
 static int ublk_ch_release(struct inode *inode, struct file *filp)
 {
 	struct ublk_device *ub = filp->private_data;
+	bool need_schedule = false;
+	int i;
+
+	/*
+	 * Error out any requests outstanding to the ublk server. This
+	 * may have happened already (via uring_cmd cancellation), in
+	 * which case it is not harmful to repeat. But uring_cmd
+	 * cancellation does not handle queues which are fully saturated
+	 * (all requests in ublk server), because from the kernel's POV,
+	 * there are no outstanding uring_cmds to cancel. This code
+	 * handles such queues.
+	 */
+
+	for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
+		need_schedule |= ublk_abort_requests(ub, ublk_get_queue(ub, i));
+
+	if (need_schedule)
+		schedule_work(&ub->nosrv_work);
 
 	clear_bit(UB_STATE_OPEN, &ub->state);
 	return 0;
diff --git a/tools/testing/selftests/ublk/Makefile b/tools/testing/selftests/ublk/Makefile
index 7817afe290053853ce31d28a8f4bbca570c3046c..dcc514b6d8f6e485597320636ab111a17b7e5448 100644
--- a/tools/testing/selftests/ublk/Makefile
+++ b/tools/testing/selftests/ublk/Makefile
@@ -4,6 +4,7 @@  CFLAGS += -O3 -Wl,-no-as-needed -Wall -I $(top_srcdir)
 LDLIBS += -lpthread -lm -luring
 
 TEST_PROGS := test_generic_01.sh
+TEST_PROGS := test_generic_02.sh
 
 TEST_PROGS += test_null_01.sh
 TEST_PROGS += test_null_02.sh
diff --git a/tools/testing/selftests/ublk/kublk.c b/tools/testing/selftests/ublk/kublk.c
index 064a5bb6f12f35892065b8dfacb6f57f6fc16aee..e883cd0f9e330eb15da5a00f6085343333a9355d 100644
--- a/tools/testing/selftests/ublk/kublk.c
+++ b/tools/testing/selftests/ublk/kublk.c
@@ -1065,6 +1065,7 @@  int main(int argc, char *argv[])
 		{ "zero_copy",          0,      NULL, 'z' },
 		{ "foreground",		0,	NULL,  0  },
 		{ "chunk_size", 	1,	NULL,  0  },
+		{ "delay_us",		1,	NULL,  0  },
 		{ 0, 0, 0, 0 }
 	};
 	int option_idx, opt;
@@ -1113,6 +1114,8 @@  int main(int argc, char *argv[])
 				ctx.fg = 1;
 			if (!strcmp(longopts[option_idx].name, "chunk_size"))
 				ctx.chunk_size = strtol(optarg, NULL, 10);
+			if (!strcmp(longopts[option_idx].name, "delay_us"))
+				ctx.delay_us = strtoul(optarg, NULL, 10);
 		}
 	}
 
diff --git a/tools/testing/selftests/ublk/kublk.h b/tools/testing/selftests/ublk/kublk.h
index f31a5c4d4143e28f13d4cd98d611e37f93b0c25a..6414d482ea3986a9d1973f04a1832d6fe16231bf 100644
--- a/tools/testing/selftests/ublk/kublk.h
+++ b/tools/testing/selftests/ublk/kublk.h
@@ -67,6 +67,9 @@  struct dev_ctx {
 	unsigned int	all:1;
 	unsigned int	fg:1;
 
+	/* null */
+	unsigned long	delay_us;
+
 	/* stripe */
 	unsigned int    chunk_size;
 
diff --git a/tools/testing/selftests/ublk/null.c b/tools/testing/selftests/ublk/null.c
index 899875ff50feadbd734fbbf1f8fad1f19abd1e8f..8bf58e540f1bffc8361450484a6dc484e815378c 100644
--- a/tools/testing/selftests/ublk/null.c
+++ b/tools/testing/selftests/ublk/null.c
@@ -30,6 +30,8 @@  static int ublk_null_tgt_init(const struct dev_ctx *ctx, struct ublk_dev *dev)
 
 	if (info->flags & UBLK_F_SUPPORT_ZERO_COPY)
 		dev->tgt.sq_depth = dev->tgt.cq_depth = 2 * info->queue_depth;
+
+	dev->private_data = (void *)ctx->delay_us;
 	return 0;
 }
 
@@ -88,6 +90,8 @@  static int ublk_null_queue_io(struct ublk_queue *q, int tag)
 	int zc = ublk_queue_use_zc(q);
 	int queued;
 
+	usleep((unsigned long)q->dev->private_data);
+
 	if (!zc) {
 		ublk_complete_io(q, tag, iod->nr_sectors << 9);
 		return 0;
diff --git a/tools/testing/selftests/ublk/test_generic_02.sh b/tools/testing/selftests/ublk/test_generic_02.sh
new file mode 100755
index 0000000000000000000000000000000000000000..bc73a17923517ace9590698d82b084fd8d885371
--- /dev/null
+++ b/tools/testing/selftests/ublk/test_generic_02.sh
@@ -0,0 +1,43 @@ 
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+. "$(cd "$(dirname "$0")" && pwd)"/test_common.sh
+
+TID="generic_02"
+ERR_CODE=0
+
+_prep_test "null" "fast cleanup when all I/Os of one hctx are in server"
+
+# configure ublk server to sleep 2s before completing each I/O
+dev_id=$(_add_ublk_dev -t null -q 1 -d 1 --delay_us 2000000)
+_check_add_dev $TID $?
+
+echo "dev id is ${dev_id}"
+
+STARTTIME=${SECONDS}
+
+dd if=/dev/urandom of=/dev/ublkb${dev_id} oflag=direct bs=4k count=1 &
+dd_pid=$!
+
+__ublk_kill_daemon ${dev_id} "DEAD"
+
+wait $dd_pid
+dd_exitcode=$?
+
+ENDTIME=${SECONDS}
+ELAPSED=$(($ENDTIME - $STARTTIME))
+
+# assert that dd sees an error and exits quickly after ublk server is
+# killed. previously this relied on seeing an I/O timeout and so would
+# take ~30s
+if [ $dd_exitcode -eq 0 ]; then
+        echo "dd unexpectedly exited successfully!"
+        ERR_CODE=255
+fi
+if [ $ELAPSED -ge 5 ]; then
+        echo "dd took $ELAPSED seconds to exit (>= 5s tolerance)!"
+        ERR_CODE=255
+fi
+
+_cleanup_test "null"
+_show_result $TID $ERR_CODE