Message ID | 20241002224437.3088981-1-ushankar@purestorage.com (mailing list archive)
---|---
State | New, archived
Series | ublk: decouple hctx and ublk server threads
On Wed, Oct 02, 2024 at 04:44:37PM -0600, Uday Shankar wrote:
> @@ -1036,9 +1028,13 @@ static inline void __ublk_complete_rq(struct request *req)
>  	else
>  		__blk_mq_end_request(req, BLK_STS_OK);
> 
> -	return;
> +	goto put_task;
>  exit:
>  	blk_mq_end_request(req, res);
> +put_task:
> +	WARN_ON_ONCE(!data->task);
> +	put_task_struct(data->task);
> +	data->task = NULL;
>  }

I suppose this is a bug, since the request could be reused once we call
blk_mq_end_request, so we need to drop the task reference before that
happens. Will fix this in v2, along with any other comments I receive on
this.
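For reference, the reordering described above might look roughly like this in
v2 (a sketch only, reusing the names introduced by this patch and collapsing
the separate success/error completion paths into a single call):

	struct ublk_rq_data *data = blk_mq_rq_to_pdu(req);

	/*
	 * Drop the task reference while the driver still owns the request;
	 * once the request is ended, it (and its pdu, including data->task)
	 * may be recycled for a new I/O.
	 */
	WARN_ON_ONCE(!data->task);
	put_task_struct(data->task);
	data->task = NULL;

	blk_mq_end_request(req, res);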
Hi Uday,
kernel test robot noticed the following build warnings:
[auto build test WARNING on ccb1526dda70e0c1bae7fe27aa3b0c96e6e2d0cd]
url: https://github.com/intel-lab-lkp/linux/commits/Uday-Shankar/ublk-decouple-hctx-and-ublk-server-threads/20241003-064609
base: ccb1526dda70e0c1bae7fe27aa3b0c96e6e2d0cd
patch link: https://lore.kernel.org/r/20241002224437.3088981-1-ushankar%40purestorage.com
patch subject: [PATCH] ublk: decouple hctx and ublk server threads
config: i386-allyesconfig (https://download.01.org/0day-ci/archive/20241006/202410060518.wRBlN5hY-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241006/202410060518.wRBlN5hY-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202410060518.wRBlN5hY-lkp@intel.com/
All warnings (new ones prefixed by >>):
drivers/block/ublk_drv.c: In function 'ublk_uring_cmd_cancel_fn':
>> drivers/block/ublk_drv.c:1446:29: warning: variable 'task' set but not used [-Wunused-but-set-variable]
1446 | struct task_struct *task;
| ^~~~
Kconfig warnings: (for reference only)
WARNING: unmet direct dependencies detected for MODVERSIONS
Depends on [n]: MODULES [=y] && !COMPILE_TEST [=y]
Selected by [y]:
- RANDSTRUCT_FULL [=y] && (CC_HAS_RANDSTRUCT [=n] || GCC_PLUGINS [=y]) && MODULES [=y]
WARNING: unmet direct dependencies detected for GET_FREE_REGION
Depends on [n]: SPARSEMEM [=n]
Selected by [y]:
- RESOURCE_KUNIT_TEST [=y] && RUNTIME_TESTING_MENU [=y] && KUNIT [=y]
vim +/task +1446 drivers/block/ublk_drv.c
216c8f5ef0f209 Ming Lei 2023-10-09 1436
216c8f5ef0f209 Ming Lei 2023-10-09 1437 /*
216c8f5ef0f209 Ming Lei 2023-10-09 1438 * The ublk char device won't be closed when calling cancel fn, so both
216c8f5ef0f209 Ming Lei 2023-10-09 1439 * ublk device and queue are guaranteed to be live
216c8f5ef0f209 Ming Lei 2023-10-09 1440 */
216c8f5ef0f209 Ming Lei 2023-10-09 1441 static void ublk_uring_cmd_cancel_fn(struct io_uring_cmd *cmd,
216c8f5ef0f209 Ming Lei 2023-10-09 1442 unsigned int issue_flags)
216c8f5ef0f209 Ming Lei 2023-10-09 1443 {
216c8f5ef0f209 Ming Lei 2023-10-09 1444 struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
216c8f5ef0f209 Ming Lei 2023-10-09 1445 struct ublk_queue *ubq = pdu->ubq;
216c8f5ef0f209 Ming Lei 2023-10-09 @1446 struct task_struct *task;
216c8f5ef0f209 Ming Lei 2023-10-09 1447 struct ublk_device *ub;
216c8f5ef0f209 Ming Lei 2023-10-09 1448 bool need_schedule;
216c8f5ef0f209 Ming Lei 2023-10-09 1449 struct ublk_io *io;
216c8f5ef0f209 Ming Lei 2023-10-09 1450
216c8f5ef0f209 Ming Lei 2023-10-09 1451 if (WARN_ON_ONCE(!ubq))
216c8f5ef0f209 Ming Lei 2023-10-09 1452 return;
216c8f5ef0f209 Ming Lei 2023-10-09 1453
216c8f5ef0f209 Ming Lei 2023-10-09 1454 if (WARN_ON_ONCE(pdu->tag >= ubq->q_depth))
bd23f6c2c2d005 Ming Lei 2023-10-09 1455 return;
bd23f6c2c2d005 Ming Lei 2023-10-09 1456
216c8f5ef0f209 Ming Lei 2023-10-09 1457 task = io_uring_cmd_get_task(cmd);
bd23f6c2c2d005 Ming Lei 2023-10-09 1458
216c8f5ef0f209 Ming Lei 2023-10-09 1459 ub = ubq->dev;
216c8f5ef0f209 Ming Lei 2023-10-09 1460 need_schedule = ublk_abort_requests(ub, ubq);
216c8f5ef0f209 Ming Lei 2023-10-09 1461
216c8f5ef0f209 Ming Lei 2023-10-09 1462 io = &ubq->ios[pdu->tag];
216c8f5ef0f209 Ming Lei 2023-10-09 1463 WARN_ON_ONCE(io->cmd != cmd);
b4e1353f465147 Ming Lei 2023-10-09 1464 ublk_cancel_cmd(ubq, io, issue_flags);
216c8f5ef0f209 Ming Lei 2023-10-09 1465
216c8f5ef0f209 Ming Lei 2023-10-09 1466 if (need_schedule) {
bd23f6c2c2d005 Ming Lei 2023-10-09 1467 if (ublk_can_use_recovery(ub))
bbae8d1f526b56 ZiyangZhang 2022-09-23 1468 schedule_work(&ub->quiesce_work);
bbae8d1f526b56 ZiyangZhang 2022-09-23 1469 else
71f28f3136aff5 Ming Lei 2022-07-13 1470 schedule_work(&ub->stop_work);
216c8f5ef0f209 Ming Lei 2023-10-09 1471 }
71f28f3136aff5 Ming Lei 2022-07-13 1472 }
71f28f3136aff5 Ming Lei 2022-07-13 1473
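The warning is straightforward: the patch removes the only user of 'task' (the
task != ubq->ubq_daemon check in the cancel path) but keeps the local variable.
A likely cleanup for v2 (shown here only as an illustration, not taken from any
posted patch) is to drop the variable together with the now-unneeded
io_uring_cmd_get_task() call:

--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ static void ublk_uring_cmd_cancel_fn(struct io_uring_cmd *cmd,
 	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
 	struct ublk_queue *ubq = pdu->ubq;
-	struct task_struct *task;
 	struct ublk_device *ub;
 	bool need_schedule;
 	struct ublk_io *io;
@@
 	if (WARN_ON_ONCE(pdu->tag >= ubq->q_depth))
 		return;
 
-	task = io_uring_cmd_get_task(cmd);
-
 	ub = ubq->dev;
 	need_schedule = ublk_abort_requests(ub, ubq);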
On Wed, Oct 02, 2024 at 04:44:37PM -0600, Uday Shankar wrote: > Currently, ublk_drv associates to each hardware queue (hctx) a unique > task (called the queue's ubq_daemon) which is allowed to issue > COMMIT_AND_FETCH commands against the hctx. If any other task attempts > to do so, the command fails immediately with EINVAL. When considered > together with the block layer architecture, the result is that for each > CPU C on the system, there is a unique ublk server thread which is > allowed to handle I/O submitted on CPU C. This can lead to suboptimal > performance under imbalanced load generation. For an extreme example, > suppose all the load is generated on CPUs mapping to a single ublk > server thread. Then that thread may be fully utilized and become the > bottleneck in the system, while other ublk server threads are totally > idle. I am wondering why the problem can't be avoided by setting ublk server's thread affinity manually. > > This issue can also be addressed directly in the ublk server without > kernel support by having threads dequeue I/Os and pass them around to > ensure even load. But this solution involves moving I/Os between threads > several times. This is a bad pattern for performance, and we observed a > significant regression in peak bandwidth with this solution. > > Therefore, address this issue by removing the association of a unique > ubq_daemon to each hctx. This association is fairly artificial anyways, > and removing it results in simpler driver code. Imbalanced load can then I did consider to remove the association from simplifying design & implementation viewpoint, but there are some problems which are hard to solve. > be balanced across all ublk server threads by having the threads fetch > I/Os for the same QID in a round robin manner. For example, in a system > with 4 ublk server threads, 2 hctxs, and a queue depth of 4, the threads > could issue fetch requests as follows (where each entry is of the form > qid, tag): > > poller thread: T0 T1 T2 T3 > 0,0 0,1 0,2 0,3 > 1,3 1,0 1,1 1,2 How many ublk devices there are? If it is 1, just wondering why you use 4 threads? Usually one thread is enough to drive one queue, and the actually io command handling can be moved to new work thread if the queue thread is saturated. > > Since tags appear to be allocated in sequential chunks, this provides a > rough approximation to distributing I/Os round-robin across all ublk > server threads, while letting I/Os stay fully thread-local. 
> > Signed-off-by: Uday Shankar <ushankar@purestorage.com> > --- > drivers/block/ublk_drv.c | 105 ++++++++++++--------------------------- > 1 file changed, 33 insertions(+), 72 deletions(-) > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c > index a6c8e5cc6051..7e0ce35dbc6f 100644 > --- a/drivers/block/ublk_drv.c > +++ b/drivers/block/ublk_drv.c > @@ -68,12 +68,12 @@ > UBLK_PARAM_TYPE_DEVT | UBLK_PARAM_TYPE_ZONED) > > struct ublk_rq_data { > - struct llist_node node; > - > + struct task_struct *task; > struct kref ref; > }; > > struct ublk_uring_cmd_pdu { > + struct request *req; > struct ublk_queue *ubq; > u16 tag; > }; > @@ -133,15 +133,11 @@ struct ublk_queue { > int q_depth; > > unsigned long flags; > - struct task_struct *ubq_daemon; > char *io_cmd_buf; > > - struct llist_head io_cmds; > - > unsigned long io_addr; /* mapped vm address */ > unsigned int max_io_sz; > bool force_abort; > - bool timeout; > bool canceling; > unsigned short nr_io_ready; /* how many ios setup */ > spinlock_t cancel_lock; > @@ -982,16 +978,12 @@ static inline struct ublk_uring_cmd_pdu *ublk_get_uring_cmd_pdu( > return (struct ublk_uring_cmd_pdu *)&ioucmd->pdu; > } > > -static inline bool ubq_daemon_is_dying(struct ublk_queue *ubq) > -{ > - return ubq->ubq_daemon->flags & PF_EXITING; > -} > - > /* todo: handle partial completion */ > static inline void __ublk_complete_rq(struct request *req) > { > struct ublk_queue *ubq = req->mq_hctx->driver_data; > struct ublk_io *io = &ubq->ios[req->tag]; > + struct ublk_rq_data *data = blk_mq_rq_to_pdu(req); > unsigned int unmapped_bytes; > blk_status_t res = BLK_STS_OK; > > @@ -1036,9 +1028,13 @@ static inline void __ublk_complete_rq(struct request *req) > else > __blk_mq_end_request(req, BLK_STS_OK); > > - return; > + goto put_task; > exit: > blk_mq_end_request(req, res); > +put_task: > + WARN_ON_ONCE(!data->task); > + put_task_struct(data->task); > + data->task = NULL; > } > > static void ublk_complete_rq(struct kref *ref) > @@ -1097,13 +1093,16 @@ static inline void __ublk_abort_rq(struct ublk_queue *ubq, > blk_mq_end_request(rq, BLK_STS_IOERR); > } > > -static inline void __ublk_rq_task_work(struct request *req, > +static inline void __ublk_rq_task_work(struct io_uring_cmd *cmd, > unsigned issue_flags) > { > + struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd); > + struct request *req = pdu->req; > struct ublk_queue *ubq = req->mq_hctx->driver_data; > int tag = req->tag; > struct ublk_io *io = &ubq->ios[tag]; > unsigned int mapped_bytes; > + struct ublk_rq_data *data = blk_mq_rq_to_pdu(req); > > pr_devel("%s: complete: op %d, qid %d tag %d io_flags %x addr %llx\n", > __func__, io->cmd->cmd_op, ubq->q_id, req->tag, io->flags, > @@ -1112,13 +1111,14 @@ static inline void __ublk_rq_task_work(struct request *req, > /* > * Task is exiting if either: > * > - * (1) current != ubq_daemon. > + * (1) current != io_uring_get_cmd_task(io->cmd). > * io_uring_cmd_complete_in_task() tries to run task_work > - * in a workqueue if ubq_daemon(cmd's task) is PF_EXITING. > + * in a workqueue if cmd's task is PF_EXITING. > * > * (2) current->flags & PF_EXITING. 
> */ > - if (unlikely(current != ubq->ubq_daemon || current->flags & PF_EXITING)) { > + if (unlikely(current != io_uring_cmd_get_task(io->cmd) || > + current->flags & PF_EXITING)) { > __ublk_abort_rq(ubq, req); > return; > } > @@ -1173,55 +1173,32 @@ static inline void __ublk_rq_task_work(struct request *req, > } > > ublk_init_req_ref(ubq, req); > + WARN_ON_ONCE(data->task); > + data->task = get_task_struct(current); If queue/task association is killed, it doesn't make sense to record the ->task any more, cause this io command can be handled by another thread in userspace, then UBLK_IO_COMMIT_AND_FETCH_REQ for this io request may be issued from task which isn't same with data->task. The main trouble is that error handling can't be done easily. > ubq_complete_io_cmd(io, UBLK_IO_RES_OK, issue_flags); > } > > -static inline void ublk_forward_io_cmds(struct ublk_queue *ubq, > - unsigned issue_flags) > -{ > - struct llist_node *io_cmds = llist_del_all(&ubq->io_cmds); > - struct ublk_rq_data *data, *tmp; > - > - io_cmds = llist_reverse_order(io_cmds); > - llist_for_each_entry_safe(data, tmp, io_cmds, node) > - __ublk_rq_task_work(blk_mq_rq_from_pdu(data), issue_flags); > -} > - > -static void ublk_rq_task_work_cb(struct io_uring_cmd *cmd, unsigned issue_flags) > -{ > - struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd); > - struct ublk_queue *ubq = pdu->ubq; > - > - ublk_forward_io_cmds(ubq, issue_flags); > -} > - > static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq) > { > - struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq); > - > - if (llist_add(&data->node, &ubq->io_cmds)) { > - struct ublk_io *io = &ubq->ios[rq->tag]; > - > - io_uring_cmd_complete_in_task(io->cmd, ublk_rq_task_work_cb); > - } > + struct ublk_io *io = &ubq->ios[rq->tag]; > + struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(io->cmd); > + pdu->req = rq; > + io_uring_cmd_complete_in_task(io->cmd, __ublk_rq_task_work); > } It should be fine to convert to io_uring_cmd_complete_in_task() since the callback list is re-ordered in io_uring. > > static enum blk_eh_timer_return ublk_timeout(struct request *rq) > { > struct ublk_queue *ubq = rq->mq_hctx->driver_data; > + struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq); > unsigned int nr_inflight = 0; > int i; > > if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) { > - if (!ubq->timeout) { > - send_sig(SIGKILL, ubq->ubq_daemon, 0); > - ubq->timeout = true; > - } > - > + send_sig(SIGKILL, data->task, 0); > return BLK_EH_DONE; > } > > - if (!ubq_daemon_is_dying(ubq)) > + if (!(data->task->flags & PF_EXITING)) > return BLK_EH_RESET_TIMER; ->task is only for error handling, but it may not work any more since who knows which task is for handling the io command actually. Thanks, Ming
On Sun, Oct 06, 2024 at 05:20:05PM +0800, Ming Lei wrote: > On Wed, Oct 02, 2024 at 04:44:37PM -0600, Uday Shankar wrote: > > Currently, ublk_drv associates to each hardware queue (hctx) a unique > > task (called the queue's ubq_daemon) which is allowed to issue > > COMMIT_AND_FETCH commands against the hctx. If any other task attempts > > to do so, the command fails immediately with EINVAL. When considered > > together with the block layer architecture, the result is that for each > > CPU C on the system, there is a unique ublk server thread which is > > allowed to handle I/O submitted on CPU C. This can lead to suboptimal > > performance under imbalanced load generation. For an extreme example, > > suppose all the load is generated on CPUs mapping to a single ublk > > server thread. Then that thread may be fully utilized and become the > > bottleneck in the system, while other ublk server threads are totally > > idle. > > I am wondering why the problem can't be avoided by setting ublk server's > thread affinity manually. I don't think the ublk server thread CPU affinity has any effect here. Assuming that the ublk server threads do not pass I/Os between themselves to balance the load, each ublk server thread must handle all the I/O issued to its associated hctx, and each thread is limited by how much CPU it can get. Since threads are the unit of parallelism, one thread can make use of at most one CPU, regardless of the affinity of the thread. And this can become a bottleneck. > > be balanced across all ublk server threads by having the threads fetch > > I/Os for the same QID in a round robin manner. For example, in a system > > with 4 ublk server threads, 2 hctxs, and a queue depth of 4, the threads > > could issue fetch requests as follows (where each entry is of the form > > qid, tag): > > > > poller thread: T0 T1 T2 T3 > > 0,0 0,1 0,2 0,3 > > 1,3 1,0 1,1 1,2 > > How many ublk devices there are? If it is 1, just wondering why you use > 4 threads? Usually one thread is enough to drive one queue, and the > actually io command handling can be moved to new work thread if the queue > thread is saturated. This is just a small example to demonstrate the idea, not necessarily a realistic one. > > -static inline void ublk_forward_io_cmds(struct ublk_queue *ubq, > > - unsigned issue_flags) > > -{ > > - struct llist_node *io_cmds = llist_del_all(&ubq->io_cmds); > > - struct ublk_rq_data *data, *tmp; > > - > > - io_cmds = llist_reverse_order(io_cmds); > > - llist_for_each_entry_safe(data, tmp, io_cmds, node) > > - __ublk_rq_task_work(blk_mq_rq_from_pdu(data), issue_flags); > > -} > > - > > -static void ublk_rq_task_work_cb(struct io_uring_cmd *cmd, unsigned issue_flags) > > -{ > > - struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd); > > - struct ublk_queue *ubq = pdu->ubq; > > - > > - ublk_forward_io_cmds(ubq, issue_flags); > > -} > > - > > static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq) > > { > > - struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq); > > - > > - if (llist_add(&data->node, &ubq->io_cmds)) { > > - struct ublk_io *io = &ubq->ios[rq->tag]; > > - > > - io_uring_cmd_complete_in_task(io->cmd, ublk_rq_task_work_cb); > > - } > > + struct ublk_io *io = &ubq->ios[rq->tag]; > > + struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(io->cmd); > > + pdu->req = rq; > > + io_uring_cmd_complete_in_task(io->cmd, __ublk_rq_task_work); > > } > > It should be fine to convert to io_uring_cmd_complete_in_task() since > the callback list is re-ordered in io_uring. 
Yes, I noticed that task_work has (lockless) internal queueing, so there shouldn't be a need to maintain our own queue of commands in ublk_drv. I can factor this change out into its own patch if that is useful. > > > > > static enum blk_eh_timer_return ublk_timeout(struct request *rq) > > { > > struct ublk_queue *ubq = rq->mq_hctx->driver_data; > > + struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq); > > unsigned int nr_inflight = 0; > > int i; > > > > if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) { > > - if (!ubq->timeout) { > > - send_sig(SIGKILL, ubq->ubq_daemon, 0); > > - ubq->timeout = true; > > - } > > - > > + send_sig(SIGKILL, data->task, 0); > > return BLK_EH_DONE; > > } > > > > - if (!ubq_daemon_is_dying(ubq)) > > + if (!(data->task->flags & PF_EXITING)) > > return BLK_EH_RESET_TIMER; > > ->task is only for error handling, but it may not work any more since > who knows which task is for handling the io command actually. Yes, you are right - this part right here is the only reason we need to save/take a reference to the task. I have a couple alternative ideas: 1. Don't kill anything if a timeout happens. Instead, synchronize against the "normal" completion path (i.e. commit_and_fetch), and if timeout happens first, normal completion gets an error. If normal completion happens first, timeout does nothing. 2. Require that all threads handling I/O are threads of the same process (in the kernel, I think this means their task_struct::group_leader is the same?) In the normal completion path, we replace the check that exists today (check equality with ubq_daemon) with ensuring that the current task is within the process. In the timeout path, we send SIGKILL to the top-level process, which should propagate to the threads as well. Does either of those sound okay?
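To make idea #2 a bit more concrete, the completion-path check could become
something like the sketch below (hypothetical helper name, and it assumes
io->cmd is stable at this point; same_thread_group() comes from
<linux/sched/signal.h>):

	/*
	 * Idea #2: any thread of the ublk server process may complete the io,
	 * not just one fixed ubq_daemon.  io_uring_cmd_get_task() returns the
	 * task that issued the FETCH uring_cmd; requiring only that the
	 * committing task shares that thread group keeps per-io flexibility
	 * while still pinning the device to a single process.
	 */
	static bool ublk_current_may_complete_io(struct ublk_io *io)
	{
		struct task_struct *owner = io_uring_cmd_get_task(io->cmd);

		return owner && same_thread_group(current, owner);
	}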
On Mon, Oct 07, 2024 at 01:50:09PM -0600, Uday Shankar wrote: > On Sun, Oct 06, 2024 at 05:20:05PM +0800, Ming Lei wrote: > > On Wed, Oct 02, 2024 at 04:44:37PM -0600, Uday Shankar wrote: > > > Currently, ublk_drv associates to each hardware queue (hctx) a unique > > > task (called the queue's ubq_daemon) which is allowed to issue > > > COMMIT_AND_FETCH commands against the hctx. If any other task attempts > > > to do so, the command fails immediately with EINVAL. When considered > > > together with the block layer architecture, the result is that for each > > > CPU C on the system, there is a unique ublk server thread which is > > > allowed to handle I/O submitted on CPU C. This can lead to suboptimal > > > performance under imbalanced load generation. For an extreme example, > > > suppose all the load is generated on CPUs mapping to a single ublk > > > server thread. Then that thread may be fully utilized and become the > > > bottleneck in the system, while other ublk server threads are totally > > > idle. > > > > I am wondering why the problem can't be avoided by setting ublk server's > > thread affinity manually. > > I don't think the ublk server thread CPU affinity has any effect here. > Assuming that the ublk server threads do not pass I/Os between > themselves to balance the load, each ublk server thread must handle all > the I/O issued to its associated hctx, and each thread is limited by how > much CPU it can get. Since threads are the unit of parallelism, one > thread can make use of at most one CPU, regardless of the affinity of > the thread. And this can become a bottleneck. If ublk server may be saturated, there is at least two choices: - increase nr_hw_queues, so each ublk server thread can handle IOs from less CPUs - let ublk server focus on submitting UBLK_IO_COMMIT_AND_FETCH_REQ uring_cmd, and moving actual IO handling into new worker thread if ublk server becomes saturated, and the communication can be done with eventfd, please see example in: https://github.com/ublk-org/ublksrv/blob/master/demo_event.c > > > > be balanced across all ublk server threads by having the threads fetch > > > I/Os for the same QID in a round robin manner. For example, in a system > > > with 4 ublk server threads, 2 hctxs, and a queue depth of 4, the threads > > > could issue fetch requests as follows (where each entry is of the form > > > qid, tag): > > > > > > poller thread: T0 T1 T2 T3 > > > 0,0 0,1 0,2 0,3 > > > 1,3 1,0 1,1 1,2 > > > > How many ublk devices there are? If it is 1, just wondering why you use > > 4 threads? Usually one thread is enough to drive one queue, and the > > actually io command handling can be moved to new work thread if the queue > > thread is saturated. > > This is just a small example to demonstrate the idea, not necessarily a > realistic one. OK, but I'd suggest to share examples closing to reality, then we can just focus on problems in real cases. 
> > > > -static inline void ublk_forward_io_cmds(struct ublk_queue *ubq, > > > - unsigned issue_flags) > > > -{ > > > - struct llist_node *io_cmds = llist_del_all(&ubq->io_cmds); > > > - struct ublk_rq_data *data, *tmp; > > > - > > > - io_cmds = llist_reverse_order(io_cmds); > > > - llist_for_each_entry_safe(data, tmp, io_cmds, node) > > > - __ublk_rq_task_work(blk_mq_rq_from_pdu(data), issue_flags); > > > -} > > > - > > > -static void ublk_rq_task_work_cb(struct io_uring_cmd *cmd, unsigned issue_flags) > > > -{ > > > - struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd); > > > - struct ublk_queue *ubq = pdu->ubq; > > > - > > > - ublk_forward_io_cmds(ubq, issue_flags); > > > -} > > > - > > > static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq) > > > { > > > - struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq); > > > - > > > - if (llist_add(&data->node, &ubq->io_cmds)) { > > > - struct ublk_io *io = &ubq->ios[rq->tag]; > > > - > > > - io_uring_cmd_complete_in_task(io->cmd, ublk_rq_task_work_cb); > > > - } > > > + struct ublk_io *io = &ubq->ios[rq->tag]; > > > + struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(io->cmd); > > > + pdu->req = rq; > > > + io_uring_cmd_complete_in_task(io->cmd, __ublk_rq_task_work); > > > } > > > > It should be fine to convert to io_uring_cmd_complete_in_task() since > > the callback list is re-ordered in io_uring. > > Yes, I noticed that task_work has (lockless) internal queueing, so > there shouldn't be a need to maintain our own queue of commands in > ublk_drv. I can factor this change out into its own patch if that is > useful. Yeah, please go ahead, since it does simplify things. > > > > > > > > > static enum blk_eh_timer_return ublk_timeout(struct request *rq) > > > { > > > struct ublk_queue *ubq = rq->mq_hctx->driver_data; > > > + struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq); > > > unsigned int nr_inflight = 0; > > > int i; > > > > > > if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) { > > > - if (!ubq->timeout) { > > > - send_sig(SIGKILL, ubq->ubq_daemon, 0); > > > - ubq->timeout = true; > > > - } > > > - > > > + send_sig(SIGKILL, data->task, 0); > > > return BLK_EH_DONE; > > > } > > > > > > - if (!ubq_daemon_is_dying(ubq)) > > > + if (!(data->task->flags & PF_EXITING)) > > > return BLK_EH_RESET_TIMER; > > > > ->task is only for error handling, but it may not work any more since > > who knows which task is for handling the io command actually. > > Yes, you are right - this part right here is the only reason we need to > save/take a reference to the task. I have a couple alternative ideas: > > 1. Don't kill anything if a timeout happens. Instead, synchronize > against the "normal" completion path (i.e. commit_and_fetch), and if > timeout happens first, normal completion gets an error. If normal > completion happens first, timeout does nothing. But how to synchronize? Looks the only weapon could be RCU. Also one server thread may have bug and run into dead loop. > 2. Require that all threads handling I/O are threads of the same process > (in the kernel, I think this means their task_struct::group_leader is > the same?) So far we only allow single process to open /dev/ublkcN, so all threads have to belong to same process. And that can be thought as another limit of ublk implementation. > In the normal completion path, we replace the check that > exists today (check equality with ubq_daemon) with ensuring that the > current task is within the process. 
In the timeout path, we send > SIGKILL to the top-level process, which should propagate to the > threads as well. It should be enough to kill the only process which opens '/dev/ublkcN'. > > Does either of those sound okay? Looks #2 is more doable. Thanks, Ming
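As a rough illustration of the timeout path under idea #2 (hypothetical, and it
glosses over the io->cmd synchronization questions raised later in the thread):

	static enum blk_eh_timer_return ublk_timeout(struct request *rq)
	{
		struct ublk_queue *ubq = rq->mq_hctx->driver_data;
		struct ublk_io *io = &ubq->ios[rq->tag];
		struct task_struct *task = io_uring_cmd_get_task(io->cmd);

		if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) {
			/*
			 * SIGKILL cannot be caught and is delivered to the
			 * whole thread group, so signalling whichever server
			 * thread currently owns the command takes down the
			 * entire ublk server process, as suggested above.
			 */
			if (task)
				send_sig(SIGKILL, task, 0);
			return BLK_EH_DONE;
		}

		/* ... existing inflight accounting and abort handling elided ... */
		return BLK_EH_RESET_TIMER;
	}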
On Tue, Oct 08, 2024 at 09:47:39AM +0800, Ming Lei wrote: > On Mon, Oct 07, 2024 at 01:50:09PM -0600, Uday Shankar wrote: > > On Sun, Oct 06, 2024 at 05:20:05PM +0800, Ming Lei wrote: > > > On Wed, Oct 02, 2024 at 04:44:37PM -0600, Uday Shankar wrote: > > > > Currently, ublk_drv associates to each hardware queue (hctx) a unique > > > > task (called the queue's ubq_daemon) which is allowed to issue > > > > COMMIT_AND_FETCH commands against the hctx. If any other task attempts > > > > to do so, the command fails immediately with EINVAL. When considered > > > > together with the block layer architecture, the result is that for each > > > > CPU C on the system, there is a unique ublk server thread which is > > > > allowed to handle I/O submitted on CPU C. This can lead to suboptimal > > > > performance under imbalanced load generation. For an extreme example, > > > > suppose all the load is generated on CPUs mapping to a single ublk > > > > server thread. Then that thread may be fully utilized and become the > > > > bottleneck in the system, while other ublk server threads are totally > > > > idle. > > > > > > I am wondering why the problem can't be avoided by setting ublk server's > > > thread affinity manually. > > > > I don't think the ublk server thread CPU affinity has any effect here. > > Assuming that the ublk server threads do not pass I/Os between > > themselves to balance the load, each ublk server thread must handle all > > the I/O issued to its associated hctx, and each thread is limited by how > > much CPU it can get. Since threads are the unit of parallelism, one > > thread can make use of at most one CPU, regardless of the affinity of > > the thread. And this can become a bottleneck. > > If ublk server may be saturated, there is at least two choices: > > - increase nr_hw_queues, so each ublk server thread can handle IOs from > less CPUs > > - let ublk server focus on submitting UBLK_IO_COMMIT_AND_FETCH_REQ > uring_cmd, and moving actual IO handling into new worker thread if ublk > server becomes saturated, and the communication can be done with eventfd, > please see example in: > > https://github.com/ublk-org/ublksrv/blob/master/demo_event.c > > > > > > > be balanced across all ublk server threads by having the threads fetch > > > > I/Os for the same QID in a round robin manner. For example, in a system > > > > with 4 ublk server threads, 2 hctxs, and a queue depth of 4, the threads > > > > could issue fetch requests as follows (where each entry is of the form > > > > qid, tag): > > > > > > > > poller thread: T0 T1 T2 T3 > > > > 0,0 0,1 0,2 0,3 > > > > 1,3 1,0 1,1 1,2 > > > > > > How many ublk devices there are? If it is 1, just wondering why you use > > > 4 threads? Usually one thread is enough to drive one queue, and the > > > actually io command handling can be moved to new work thread if the queue > > > thread is saturated. > > > > This is just a small example to demonstrate the idea, not necessarily a > > realistic one. > > OK, but I'd suggest to share examples closing to reality, then we can > just focus on problems in real cases. 
> > > > > > > -static inline void ublk_forward_io_cmds(struct ublk_queue *ubq, > > > > - unsigned issue_flags) > > > > -{ > > > > - struct llist_node *io_cmds = llist_del_all(&ubq->io_cmds); > > > > - struct ublk_rq_data *data, *tmp; > > > > - > > > > - io_cmds = llist_reverse_order(io_cmds); > > > > - llist_for_each_entry_safe(data, tmp, io_cmds, node) > > > > - __ublk_rq_task_work(blk_mq_rq_from_pdu(data), issue_flags); > > > > -} > > > > - > > > > -static void ublk_rq_task_work_cb(struct io_uring_cmd *cmd, unsigned issue_flags) > > > > -{ > > > > - struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd); > > > > - struct ublk_queue *ubq = pdu->ubq; > > > > - > > > > - ublk_forward_io_cmds(ubq, issue_flags); > > > > -} > > > > - > > > > static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq) > > > > { > > > > - struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq); > > > > - > > > > - if (llist_add(&data->node, &ubq->io_cmds)) { > > > > - struct ublk_io *io = &ubq->ios[rq->tag]; > > > > - > > > > - io_uring_cmd_complete_in_task(io->cmd, ublk_rq_task_work_cb); > > > > - } > > > > + struct ublk_io *io = &ubq->ios[rq->tag]; > > > > + struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(io->cmd); > > > > + pdu->req = rq; > > > > + io_uring_cmd_complete_in_task(io->cmd, __ublk_rq_task_work); > > > > } > > > > > > It should be fine to convert to io_uring_cmd_complete_in_task() since > > > the callback list is re-ordered in io_uring. > > > > Yes, I noticed that task_work has (lockless) internal queueing, so > > there shouldn't be a need to maintain our own queue of commands in > > ublk_drv. I can factor this change out into its own patch if that is > > useful. > > Yeah, please go ahead, since it does simplify things. > > > > > > > > > > > > > > static enum blk_eh_timer_return ublk_timeout(struct request *rq) > > > > { > > > > struct ublk_queue *ubq = rq->mq_hctx->driver_data; > > > > + struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq); > > > > unsigned int nr_inflight = 0; > > > > int i; > > > > > > > > if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) { > > > > - if (!ubq->timeout) { > > > > - send_sig(SIGKILL, ubq->ubq_daemon, 0); > > > > - ubq->timeout = true; > > > > - } > > > > - > > > > + send_sig(SIGKILL, data->task, 0); > > > > return BLK_EH_DONE; > > > > } > > > > > > > > - if (!ubq_daemon_is_dying(ubq)) > > > > + if (!(data->task->flags & PF_EXITING)) > > > > return BLK_EH_RESET_TIMER; > > > > > > ->task is only for error handling, but it may not work any more since > > > who knows which task is for handling the io command actually. > > > > Yes, you are right - this part right here is the only reason we need to > > save/take a reference to the task. I have a couple alternative ideas: > > > > 1. Don't kill anything if a timeout happens. Instead, synchronize > > against the "normal" completion path (i.e. commit_and_fetch), and if > > timeout happens first, normal completion gets an error. If normal > > completion happens first, timeout does nothing. > > But how to synchronize? Looks the only weapon could be RCU. > > Also one server thread may have bug and run into dead loop. > > > 2. Require that all threads handling I/O are threads of the same process > > (in the kernel, I think this means their task_struct::group_leader is > > the same?) > > So far we only allow single process to open /dev/ublkcN, so all threads > have to belong to same process. > > And that can be thought as another limit of ublk implementation. 
> > > In the normal completion path, we replace the check that > > exists today (check equality with ubq_daemon) with ensuring that the > > current task is within the process. In the timeout path, we send > > SIGKILL to the top-level process, which should propagate to the > > threads as well. > > It should be enough to kill the only process which opens '/dev/ublkcN'. > > > > > Does either of those sound okay? > > Looks #2 is more doable. Forget to mention, `struct ublk_queue` and `struct ublk_io` are operated lockless now, since we suppose both two are read/write in single pthread. If we kill this limit, READ/WRITE on the two structures have to protected, which may add extra cost, :-( Thanks, Ming
On Tue, Oct 08, 2024 at 10:11:35AM +0800, Ming Lei wrote:
> Forget to mention, `struct ublk_queue` and `struct ublk_io` are operated
> lockless now, since we suppose both two are read/write in single pthread.
> 
> If we kill this limit, READ/WRITE on the two structures have to
> protected, which may add extra cost, :-(

Argh, yes you're right. Fortunately, ublk_queue looks like it is
read-only in the completion path, so we should be able to get away
without synchronization there. ublk_io will need some synchronization,
but with a correctly written ublk server, the synchronization should
almost always be uncontended and therefore, hopefully, unnoticeable.
I'll see what I can come up with in v2.
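A minimal sketch of the kind of per-io synchronization being discussed
(illustrative only; the field list is abbreviated from the current driver and
the lock itself is hypothetical):

	struct ublk_io {
		/* userspace buffer address from io cmd */
		__u64 addr;
		unsigned int flags;
		int res;

		struct io_uring_cmd *cmd;

		/*
		 * Serializes the commit/completion path against timeout and
		 * abort once any server thread may touch this io.  With a
		 * well-behaved server, only one thread handles a given tag at
		 * a time, so the lock stays effectively uncontended.
		 */
		spinlock_t lock;
	};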
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c index a6c8e5cc6051..7e0ce35dbc6f 100644 --- a/drivers/block/ublk_drv.c +++ b/drivers/block/ublk_drv.c @@ -68,12 +68,12 @@ UBLK_PARAM_TYPE_DEVT | UBLK_PARAM_TYPE_ZONED) struct ublk_rq_data { - struct llist_node node; - + struct task_struct *task; struct kref ref; }; struct ublk_uring_cmd_pdu { + struct request *req; struct ublk_queue *ubq; u16 tag; }; @@ -133,15 +133,11 @@ struct ublk_queue { int q_depth; unsigned long flags; - struct task_struct *ubq_daemon; char *io_cmd_buf; - struct llist_head io_cmds; - unsigned long io_addr; /* mapped vm address */ unsigned int max_io_sz; bool force_abort; - bool timeout; bool canceling; unsigned short nr_io_ready; /* how many ios setup */ spinlock_t cancel_lock; @@ -982,16 +978,12 @@ static inline struct ublk_uring_cmd_pdu *ublk_get_uring_cmd_pdu( return (struct ublk_uring_cmd_pdu *)&ioucmd->pdu; } -static inline bool ubq_daemon_is_dying(struct ublk_queue *ubq) -{ - return ubq->ubq_daemon->flags & PF_EXITING; -} - /* todo: handle partial completion */ static inline void __ublk_complete_rq(struct request *req) { struct ublk_queue *ubq = req->mq_hctx->driver_data; struct ublk_io *io = &ubq->ios[req->tag]; + struct ublk_rq_data *data = blk_mq_rq_to_pdu(req); unsigned int unmapped_bytes; blk_status_t res = BLK_STS_OK; @@ -1036,9 +1028,13 @@ static inline void __ublk_complete_rq(struct request *req) else __blk_mq_end_request(req, BLK_STS_OK); - return; + goto put_task; exit: blk_mq_end_request(req, res); +put_task: + WARN_ON_ONCE(!data->task); + put_task_struct(data->task); + data->task = NULL; } static void ublk_complete_rq(struct kref *ref) @@ -1097,13 +1093,16 @@ static inline void __ublk_abort_rq(struct ublk_queue *ubq, blk_mq_end_request(rq, BLK_STS_IOERR); } -static inline void __ublk_rq_task_work(struct request *req, +static inline void __ublk_rq_task_work(struct io_uring_cmd *cmd, unsigned issue_flags) { + struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd); + struct request *req = pdu->req; struct ublk_queue *ubq = req->mq_hctx->driver_data; int tag = req->tag; struct ublk_io *io = &ubq->ios[tag]; unsigned int mapped_bytes; + struct ublk_rq_data *data = blk_mq_rq_to_pdu(req); pr_devel("%s: complete: op %d, qid %d tag %d io_flags %x addr %llx\n", __func__, io->cmd->cmd_op, ubq->q_id, req->tag, io->flags, @@ -1112,13 +1111,14 @@ static inline void __ublk_rq_task_work(struct request *req, /* * Task is exiting if either: * - * (1) current != ubq_daemon. + * (1) current != io_uring_get_cmd_task(io->cmd). * io_uring_cmd_complete_in_task() tries to run task_work - * in a workqueue if ubq_daemon(cmd's task) is PF_EXITING. + * in a workqueue if cmd's task is PF_EXITING. * * (2) current->flags & PF_EXITING. 
*/ - if (unlikely(current != ubq->ubq_daemon || current->flags & PF_EXITING)) { + if (unlikely(current != io_uring_cmd_get_task(io->cmd) || + current->flags & PF_EXITING)) { __ublk_abort_rq(ubq, req); return; } @@ -1173,55 +1173,32 @@ static inline void __ublk_rq_task_work(struct request *req, } ublk_init_req_ref(ubq, req); + WARN_ON_ONCE(data->task); + data->task = get_task_struct(current); ubq_complete_io_cmd(io, UBLK_IO_RES_OK, issue_flags); } -static inline void ublk_forward_io_cmds(struct ublk_queue *ubq, - unsigned issue_flags) -{ - struct llist_node *io_cmds = llist_del_all(&ubq->io_cmds); - struct ublk_rq_data *data, *tmp; - - io_cmds = llist_reverse_order(io_cmds); - llist_for_each_entry_safe(data, tmp, io_cmds, node) - __ublk_rq_task_work(blk_mq_rq_from_pdu(data), issue_flags); -} - -static void ublk_rq_task_work_cb(struct io_uring_cmd *cmd, unsigned issue_flags) -{ - struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd); - struct ublk_queue *ubq = pdu->ubq; - - ublk_forward_io_cmds(ubq, issue_flags); -} - static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq) { - struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq); - - if (llist_add(&data->node, &ubq->io_cmds)) { - struct ublk_io *io = &ubq->ios[rq->tag]; - - io_uring_cmd_complete_in_task(io->cmd, ublk_rq_task_work_cb); - } + struct ublk_io *io = &ubq->ios[rq->tag]; + struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(io->cmd); + pdu->req = rq; + io_uring_cmd_complete_in_task(io->cmd, __ublk_rq_task_work); } static enum blk_eh_timer_return ublk_timeout(struct request *rq) { struct ublk_queue *ubq = rq->mq_hctx->driver_data; + struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq); unsigned int nr_inflight = 0; int i; if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) { - if (!ubq->timeout) { - send_sig(SIGKILL, ubq->ubq_daemon, 0); - ubq->timeout = true; - } - + send_sig(SIGKILL, data->task, 0); return BLK_EH_DONE; } - if (!ubq_daemon_is_dying(ubq)) + if (!(data->task->flags & PF_EXITING)) return BLK_EH_RESET_TIMER; for (i = 0; i < ubq->q_depth; i++) { @@ -1380,8 +1357,8 @@ static void ublk_commit_completion(struct ublk_device *ub, } /* - * Called from ubq_daemon context via cancel fn, meantime quiesce ublk - * blk-mq queue, so we are called exclusively with blk-mq and ubq_daemon + * Called from cmd task context via cancel fn, meantime quiesce ublk + * blk-mq queue, so we are called exclusively with blk-mq and cmd task * context, so everything is serialized. 
*/ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq) @@ -1478,8 +1455,6 @@ static void ublk_uring_cmd_cancel_fn(struct io_uring_cmd *cmd, return; task = io_uring_cmd_get_task(cmd); - if (WARN_ON_ONCE(task && task != ubq->ubq_daemon)) - return; ub = ubq->dev; need_schedule = ublk_abort_requests(ub, ubq); @@ -1623,8 +1598,6 @@ static void ublk_mark_io_ready(struct ublk_device *ub, struct ublk_queue *ubq) mutex_lock(&ub->mutex); ubq->nr_io_ready++; if (ublk_queue_ready(ubq)) { - ubq->ubq_daemon = current; - get_task_struct(ubq->ubq_daemon); ub->nr_queues_ready++; if (capable(CAP_SYS_ADMIN)) @@ -1703,9 +1676,6 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd, if (!ubq || ub_cmd->q_id != ubq->q_id) goto out; - if (ubq->ubq_daemon && ubq->ubq_daemon != current) - goto out; - if (tag >= ubq->q_depth) goto out; @@ -1994,8 +1964,6 @@ static void ublk_deinit_queue(struct ublk_device *ub, int q_id) int size = ublk_queue_cmd_buf_size(ub, q_id); struct ublk_queue *ubq = ublk_get_queue(ub, q_id); - if (ubq->ubq_daemon) - put_task_struct(ubq->ubq_daemon); if (ubq->io_cmd_buf) free_pages((unsigned long)ubq->io_cmd_buf, get_order(size)); } @@ -2661,15 +2629,8 @@ static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq) { int i; - WARN_ON_ONCE(!(ubq->ubq_daemon && ubq_daemon_is_dying(ubq))); - /* All old ioucmds have to be completed */ ubq->nr_io_ready = 0; - /* old daemon is PF_EXITING, put it now */ - put_task_struct(ubq->ubq_daemon); - /* We have to reset it to NULL, otherwise ub won't accept new FETCH_REQ */ - ubq->ubq_daemon = NULL; - ubq->timeout = false; ubq->canceling = false; for (i = 0; i < ubq->q_depth; i++) { @@ -2715,7 +2676,7 @@ static int ublk_ctrl_start_recovery(struct ublk_device *ub, pr_devel("%s: start recovery for dev id %d.\n", __func__, header->dev_id); for (i = 0; i < ub->dev_info.nr_hw_queues; i++) ublk_queue_reinit(ub, ublk_get_queue(ub, i)); - /* set to NULL, otherwise new ubq_daemon cannot mmap the io_cmd_buf */ + /* set to NULL, otherwise new tasks cannot mmap the io_cmd_buf */ ub->mm = NULL; ub->nr_queues_ready = 0; ub->nr_privileged_daemon = 0; @@ -2733,14 +2694,14 @@ static int ublk_ctrl_end_recovery(struct ublk_device *ub, int ublksrv_pid = (int)header->data[0]; int ret = -EINVAL; - pr_devel("%s: Waiting for new ubq_daemons(nr: %d) are ready, dev id %d...\n", - __func__, ub->dev_info.nr_hw_queues, header->dev_id); - /* wait until new ubq_daemon sending all FETCH_REQ */ + pr_devel("%s: Waiting for all FETCH_REQs, dev id %d...\n", __func__, + header->dev_id); + if (wait_for_completion_interruptible(&ub->completion)) return -EINTR; - pr_devel("%s: All new ubq_daemons(nr: %d) are ready, dev id %d\n", - __func__, ub->dev_info.nr_hw_queues, header->dev_id); + pr_devel("%s: All FETCH_REQs received, dev id %d\n", __func__, + header->dev_id); mutex_lock(&ub->mutex); if (!ublk_can_use_recovery(ub))
Currently, ublk_drv associates to each hardware queue (hctx) a unique
task (called the queue's ubq_daemon) which is allowed to issue
COMMIT_AND_FETCH commands against the hctx. If any other task attempts
to do so, the command fails immediately with EINVAL. When considered
together with the block layer architecture, the result is that for each
CPU C on the system, there is a unique ublk server thread which is
allowed to handle I/O submitted on CPU C. This can lead to suboptimal
performance under imbalanced load generation. For an extreme example,
suppose all the load is generated on CPUs mapping to a single ublk
server thread. Then that thread may be fully utilized and become the
bottleneck in the system, while other ublk server threads are totally
idle.

This issue can also be addressed directly in the ublk server without
kernel support by having threads dequeue I/Os and pass them around to
ensure even load. But this solution involves moving I/Os between threads
several times. This is a bad pattern for performance, and we observed a
significant regression in peak bandwidth with this solution.

Therefore, address this issue by removing the association of a unique
ubq_daemon to each hctx. This association is fairly artificial anyways,
and removing it results in simpler driver code. Imbalanced load can then
be balanced across all ublk server threads by having the threads fetch
I/Os for the same QID in a round robin manner. For example, in a system
with 4 ublk server threads, 2 hctxs, and a queue depth of 4, the threads
could issue fetch requests as follows (where each entry is of the form
qid, tag):

poller thread:	T0	T1	T2	T3
		0,0	0,1	0,2	0,3
		1,3	1,0	1,1	1,2

Since tags appear to be allocated in sequential chunks, this provides a
rough approximation to distributing I/Os round-robin across all ublk
server threads, while letting I/Os stay fully thread-local.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
---
 drivers/block/ublk_drv.c | 105 ++++++++++++---------------------------
 1 file changed, 33 insertions(+), 72 deletions(-)

base-commit: ccb1526dda70e0c1bae7fe27aa3b0c96e6e2d0cd
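For illustration, the distribution in the example above corresponds to
assigning (qid, tag) to thread (qid + tag) % nr_threads. A ublk server could
issue its initial FETCH_REQs along these lines (hypothetical helper name,
error handling omitted):

	/* submit_fetch() stands in for queueing a UBLK_IO_FETCH_REQ uring_cmd */
	static void issue_fetch_reqs(int thread_idx, int nr_threads,
				     int nr_queues, int queue_depth)
	{
		for (int qid = 0; qid < nr_queues; qid++)
			for (int tag = 0; tag < queue_depth; tag++)
				if ((qid + tag) % nr_threads == thread_idx)
					submit_fetch(qid, tag);
	}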