Message ID | 20241002224437.3088981-1-ushankar@purestorage.com (mailing list archive)
---|---
State | New, archived
Series | ublk: decouple hctx and ublk server threads
On Wed, Oct 02, 2024 at 04:44:37PM -0600, Uday Shankar wrote:
> @@ -1036,9 +1028,13 @@ static inline void __ublk_complete_rq(struct request *req)
>  	else
>  		__blk_mq_end_request(req, BLK_STS_OK);
> 
> -	return;
> +	goto put_task;
>  exit:
>  	blk_mq_end_request(req, res);
> +put_task:
> +	WARN_ON_ONCE(!data->task);
> +	put_task_struct(data->task);
> +	data->task = NULL;
>  }

I suppose this is a bug, since the request could be reused once we call
blk_mq_end_request, so we need to drop the task reference before that
happens. Will fix this in v2, along with any other comments I receive on
this.
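For reference, the reordering described above might look roughly like this in
v2 (a sketch only, reusing the names introduced by this patch and collapsing
the separate success/error completion paths into a single call):

	struct ublk_rq_data *data = blk_mq_rq_to_pdu(req);

	/*
	 * Drop the task reference while the driver still owns the request;
	 * once the request is ended, it (and its pdu, including data->task)
	 * may be recycled for a new I/O.
	 */
	WARN_ON_ONCE(!data->task);
	put_task_struct(data->task);
	data->task = NULL;

	blk_mq_end_request(req, res);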
Hi Uday,
kernel test robot noticed the following build warnings:
[auto build test WARNING on ccb1526dda70e0c1bae7fe27aa3b0c96e6e2d0cd]
url: https://github.com/intel-lab-lkp/linux/commits/Uday-Shankar/ublk-decouple-hctx-and-ublk-server-threads/20241003-064609
base: ccb1526dda70e0c1bae7fe27aa3b0c96e6e2d0cd
patch link: https://lore.kernel.org/r/20241002224437.3088981-1-ushankar%40purestorage.com
patch subject: [PATCH] ublk: decouple hctx and ublk server threads
config: i386-allyesconfig (https://download.01.org/0day-ci/archive/20241006/202410060518.wRBlN5hY-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241006/202410060518.wRBlN5hY-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202410060518.wRBlN5hY-lkp@intel.com/
All warnings (new ones prefixed by >>):
drivers/block/ublk_drv.c: In function 'ublk_uring_cmd_cancel_fn':
>> drivers/block/ublk_drv.c:1446:29: warning: variable 'task' set but not used [-Wunused-but-set-variable]
1446 | struct task_struct *task;
| ^~~~
Kconfig warnings: (for reference only)
WARNING: unmet direct dependencies detected for MODVERSIONS
Depends on [n]: MODULES [=y] && !COMPILE_TEST [=y]
Selected by [y]:
- RANDSTRUCT_FULL [=y] && (CC_HAS_RANDSTRUCT [=n] || GCC_PLUGINS [=y]) && MODULES [=y]
WARNING: unmet direct dependencies detected for GET_FREE_REGION
Depends on [n]: SPARSEMEM [=n]
Selected by [y]:
- RESOURCE_KUNIT_TEST [=y] && RUNTIME_TESTING_MENU [=y] && KUNIT [=y]
vim +/task +1446 drivers/block/ublk_drv.c
216c8f5ef0f209 Ming Lei 2023-10-09 1436
216c8f5ef0f209 Ming Lei 2023-10-09 1437 /*
216c8f5ef0f209 Ming Lei 2023-10-09 1438 * The ublk char device won't be closed when calling cancel fn, so both
216c8f5ef0f209 Ming Lei 2023-10-09 1439 * ublk device and queue are guaranteed to be live
216c8f5ef0f209 Ming Lei 2023-10-09 1440 */
216c8f5ef0f209 Ming Lei 2023-10-09 1441 static void ublk_uring_cmd_cancel_fn(struct io_uring_cmd *cmd,
216c8f5ef0f209 Ming Lei 2023-10-09 1442 unsigned int issue_flags)
216c8f5ef0f209 Ming Lei 2023-10-09 1443 {
216c8f5ef0f209 Ming Lei 2023-10-09 1444 struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
216c8f5ef0f209 Ming Lei 2023-10-09 1445 struct ublk_queue *ubq = pdu->ubq;
216c8f5ef0f209 Ming Lei 2023-10-09 @1446 struct task_struct *task;
216c8f5ef0f209 Ming Lei 2023-10-09 1447 struct ublk_device *ub;
216c8f5ef0f209 Ming Lei 2023-10-09 1448 bool need_schedule;
216c8f5ef0f209 Ming Lei 2023-10-09 1449 struct ublk_io *io;
216c8f5ef0f209 Ming Lei 2023-10-09 1450
216c8f5ef0f209 Ming Lei 2023-10-09 1451 if (WARN_ON_ONCE(!ubq))
216c8f5ef0f209 Ming Lei 2023-10-09 1452 return;
216c8f5ef0f209 Ming Lei 2023-10-09 1453
216c8f5ef0f209 Ming Lei 2023-10-09 1454 if (WARN_ON_ONCE(pdu->tag >= ubq->q_depth))
bd23f6c2c2d005 Ming Lei 2023-10-09 1455 return;
bd23f6c2c2d005 Ming Lei 2023-10-09 1456
216c8f5ef0f209 Ming Lei 2023-10-09 1457 task = io_uring_cmd_get_task(cmd);
bd23f6c2c2d005 Ming Lei 2023-10-09 1458
216c8f5ef0f209 Ming Lei 2023-10-09 1459 ub = ubq->dev;
216c8f5ef0f209 Ming Lei 2023-10-09 1460 need_schedule = ublk_abort_requests(ub, ubq);
216c8f5ef0f209 Ming Lei 2023-10-09 1461
216c8f5ef0f209 Ming Lei 2023-10-09 1462 io = &ubq->ios[pdu->tag];
216c8f5ef0f209 Ming Lei 2023-10-09 1463 WARN_ON_ONCE(io->cmd != cmd);
b4e1353f465147 Ming Lei 2023-10-09 1464 ublk_cancel_cmd(ubq, io, issue_flags);
216c8f5ef0f209 Ming Lei 2023-10-09 1465
216c8f5ef0f209 Ming Lei 2023-10-09 1466 if (need_schedule) {
bd23f6c2c2d005 Ming Lei 2023-10-09 1467 if (ublk_can_use_recovery(ub))
bbae8d1f526b56 ZiyangZhang 2022-09-23 1468 schedule_work(&ub->quiesce_work);
bbae8d1f526b56 ZiyangZhang 2022-09-23 1469 else
71f28f3136aff5 Ming Lei 2022-07-13 1470 schedule_work(&ub->stop_work);
216c8f5ef0f209 Ming Lei 2023-10-09 1471 }
71f28f3136aff5 Ming Lei 2022-07-13 1472 }
71f28f3136aff5 Ming Lei 2022-07-13 1473
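The warning is straightforward: the patch removes the only user of 'task' (the
task != ubq->ubq_daemon check in the cancel path) but keeps the local variable.
A likely cleanup for v2 (shown here only as an illustration, not taken from any
posted patch) is to drop the variable together with the now-unneeded
io_uring_cmd_get_task() call:

--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ static void ublk_uring_cmd_cancel_fn(struct io_uring_cmd *cmd,
 	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
 	struct ublk_queue *ubq = pdu->ubq;
-	struct task_struct *task;
 	struct ublk_device *ub;
 	bool need_schedule;
 	struct ublk_io *io;
@@
 	if (WARN_ON_ONCE(pdu->tag >= ubq->q_depth))
 		return;
 
-	task = io_uring_cmd_get_task(cmd);
-
 	ub = ubq->dev;
 	need_schedule = ublk_abort_requests(ub, ubq);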
On Wed, Oct 02, 2024 at 04:44:37PM -0600, Uday Shankar wrote: > Currently, ublk_drv associates to each hardware queue (hctx) a unique > task (called the queue's ubq_daemon) which is allowed to issue > COMMIT_AND_FETCH commands against the hctx. If any other task attempts > to do so, the command fails immediately with EINVAL. When considered > together with the block layer architecture, the result is that for each > CPU C on the system, there is a unique ublk server thread which is > allowed to handle I/O submitted on CPU C. This can lead to suboptimal > performance under imbalanced load generation. For an extreme example, > suppose all the load is generated on CPUs mapping to a single ublk > server thread. Then that thread may be fully utilized and become the > bottleneck in the system, while other ublk server threads are totally > idle. I am wondering why the problem can't be avoided by setting ublk server's thread affinity manually. > > This issue can also be addressed directly in the ublk server without > kernel support by having threads dequeue I/Os and pass them around to > ensure even load. But this solution involves moving I/Os between threads > several times. This is a bad pattern for performance, and we observed a > significant regression in peak bandwidth with this solution. > > Therefore, address this issue by removing the association of a unique > ubq_daemon to each hctx. This association is fairly artificial anyways, > and removing it results in simpler driver code. Imbalanced load can then I did consider to remove the association from simplifying design & implementation viewpoint, but there are some problems which are hard to solve. > be balanced across all ublk server threads by having the threads fetch > I/Os for the same QID in a round robin manner. For example, in a system > with 4 ublk server threads, 2 hctxs, and a queue depth of 4, the threads > could issue fetch requests as follows (where each entry is of the form > qid, tag): > > poller thread: T0 T1 T2 T3 > 0,0 0,1 0,2 0,3 > 1,3 1,0 1,1 1,2 How many ublk devices there are? If it is 1, just wondering why you use 4 threads? Usually one thread is enough to drive one queue, and the actually io command handling can be moved to new work thread if the queue thread is saturated. > > Since tags appear to be allocated in sequential chunks, this provides a > rough approximation to distributing I/Os round-robin across all ublk > server threads, while letting I/Os stay fully thread-local. 
> > Signed-off-by: Uday Shankar <ushankar@purestorage.com> > --- > drivers/block/ublk_drv.c | 105 ++++++++++++--------------------------- > 1 file changed, 33 insertions(+), 72 deletions(-) > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c > index a6c8e5cc6051..7e0ce35dbc6f 100644 > --- a/drivers/block/ublk_drv.c > +++ b/drivers/block/ublk_drv.c > @@ -68,12 +68,12 @@ > UBLK_PARAM_TYPE_DEVT | UBLK_PARAM_TYPE_ZONED) > > struct ublk_rq_data { > - struct llist_node node; > - > + struct task_struct *task; > struct kref ref; > }; > > struct ublk_uring_cmd_pdu { > + struct request *req; > struct ublk_queue *ubq; > u16 tag; > }; > @@ -133,15 +133,11 @@ struct ublk_queue { > int q_depth; > > unsigned long flags; > - struct task_struct *ubq_daemon; > char *io_cmd_buf; > > - struct llist_head io_cmds; > - > unsigned long io_addr; /* mapped vm address */ > unsigned int max_io_sz; > bool force_abort; > - bool timeout; > bool canceling; > unsigned short nr_io_ready; /* how many ios setup */ > spinlock_t cancel_lock; > @@ -982,16 +978,12 @@ static inline struct ublk_uring_cmd_pdu *ublk_get_uring_cmd_pdu( > return (struct ublk_uring_cmd_pdu *)&ioucmd->pdu; > } > > -static inline bool ubq_daemon_is_dying(struct ublk_queue *ubq) > -{ > - return ubq->ubq_daemon->flags & PF_EXITING; > -} > - > /* todo: handle partial completion */ > static inline void __ublk_complete_rq(struct request *req) > { > struct ublk_queue *ubq = req->mq_hctx->driver_data; > struct ublk_io *io = &ubq->ios[req->tag]; > + struct ublk_rq_data *data = blk_mq_rq_to_pdu(req); > unsigned int unmapped_bytes; > blk_status_t res = BLK_STS_OK; > > @@ -1036,9 +1028,13 @@ static inline void __ublk_complete_rq(struct request *req) > else > __blk_mq_end_request(req, BLK_STS_OK); > > - return; > + goto put_task; > exit: > blk_mq_end_request(req, res); > +put_task: > + WARN_ON_ONCE(!data->task); > + put_task_struct(data->task); > + data->task = NULL; > } > > static void ublk_complete_rq(struct kref *ref) > @@ -1097,13 +1093,16 @@ static inline void __ublk_abort_rq(struct ublk_queue *ubq, > blk_mq_end_request(rq, BLK_STS_IOERR); > } > > -static inline void __ublk_rq_task_work(struct request *req, > +static inline void __ublk_rq_task_work(struct io_uring_cmd *cmd, > unsigned issue_flags) > { > + struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd); > + struct request *req = pdu->req; > struct ublk_queue *ubq = req->mq_hctx->driver_data; > int tag = req->tag; > struct ublk_io *io = &ubq->ios[tag]; > unsigned int mapped_bytes; > + struct ublk_rq_data *data = blk_mq_rq_to_pdu(req); > > pr_devel("%s: complete: op %d, qid %d tag %d io_flags %x addr %llx\n", > __func__, io->cmd->cmd_op, ubq->q_id, req->tag, io->flags, > @@ -1112,13 +1111,14 @@ static inline void __ublk_rq_task_work(struct request *req, > /* > * Task is exiting if either: > * > - * (1) current != ubq_daemon. > + * (1) current != io_uring_get_cmd_task(io->cmd). > * io_uring_cmd_complete_in_task() tries to run task_work > - * in a workqueue if ubq_daemon(cmd's task) is PF_EXITING. > + * in a workqueue if cmd's task is PF_EXITING. > * > * (2) current->flags & PF_EXITING. 
> */ > - if (unlikely(current != ubq->ubq_daemon || current->flags & PF_EXITING)) { > + if (unlikely(current != io_uring_cmd_get_task(io->cmd) || > + current->flags & PF_EXITING)) { > __ublk_abort_rq(ubq, req); > return; > } > @@ -1173,55 +1173,32 @@ static inline void __ublk_rq_task_work(struct request *req, > } > > ublk_init_req_ref(ubq, req); > + WARN_ON_ONCE(data->task); > + data->task = get_task_struct(current); If queue/task association is killed, it doesn't make sense to record the ->task any more, cause this io command can be handled by another thread in userspace, then UBLK_IO_COMMIT_AND_FETCH_REQ for this io request may be issued from task which isn't same with data->task. The main trouble is that error handling can't be done easily. > ubq_complete_io_cmd(io, UBLK_IO_RES_OK, issue_flags); > } > > -static inline void ublk_forward_io_cmds(struct ublk_queue *ubq, > - unsigned issue_flags) > -{ > - struct llist_node *io_cmds = llist_del_all(&ubq->io_cmds); > - struct ublk_rq_data *data, *tmp; > - > - io_cmds = llist_reverse_order(io_cmds); > - llist_for_each_entry_safe(data, tmp, io_cmds, node) > - __ublk_rq_task_work(blk_mq_rq_from_pdu(data), issue_flags); > -} > - > -static void ublk_rq_task_work_cb(struct io_uring_cmd *cmd, unsigned issue_flags) > -{ > - struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd); > - struct ublk_queue *ubq = pdu->ubq; > - > - ublk_forward_io_cmds(ubq, issue_flags); > -} > - > static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq) > { > - struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq); > - > - if (llist_add(&data->node, &ubq->io_cmds)) { > - struct ublk_io *io = &ubq->ios[rq->tag]; > - > - io_uring_cmd_complete_in_task(io->cmd, ublk_rq_task_work_cb); > - } > + struct ublk_io *io = &ubq->ios[rq->tag]; > + struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(io->cmd); > + pdu->req = rq; > + io_uring_cmd_complete_in_task(io->cmd, __ublk_rq_task_work); > } It should be fine to convert to io_uring_cmd_complete_in_task() since the callback list is re-ordered in io_uring. > > static enum blk_eh_timer_return ublk_timeout(struct request *rq) > { > struct ublk_queue *ubq = rq->mq_hctx->driver_data; > + struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq); > unsigned int nr_inflight = 0; > int i; > > if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) { > - if (!ubq->timeout) { > - send_sig(SIGKILL, ubq->ubq_daemon, 0); > - ubq->timeout = true; > - } > - > + send_sig(SIGKILL, data->task, 0); > return BLK_EH_DONE; > } > > - if (!ubq_daemon_is_dying(ubq)) > + if (!(data->task->flags & PF_EXITING)) > return BLK_EH_RESET_TIMER; ->task is only for error handling, but it may not work any more since who knows which task is for handling the io command actually. Thanks, Ming
On Sun, Oct 06, 2024 at 05:20:05PM +0800, Ming Lei wrote: > On Wed, Oct 02, 2024 at 04:44:37PM -0600, Uday Shankar wrote: > > Currently, ublk_drv associates to each hardware queue (hctx) a unique > > task (called the queue's ubq_daemon) which is allowed to issue > > COMMIT_AND_FETCH commands against the hctx. If any other task attempts > > to do so, the command fails immediately with EINVAL. When considered > > together with the block layer architecture, the result is that for each > > CPU C on the system, there is a unique ublk server thread which is > > allowed to handle I/O submitted on CPU C. This can lead to suboptimal > > performance under imbalanced load generation. For an extreme example, > > suppose all the load is generated on CPUs mapping to a single ublk > > server thread. Then that thread may be fully utilized and become the > > bottleneck in the system, while other ublk server threads are totally > > idle. > > I am wondering why the problem can't be avoided by setting ublk server's > thread affinity manually. I don't think the ublk server thread CPU affinity has any effect here. Assuming that the ublk server threads do not pass I/Os between themselves to balance the load, each ublk server thread must handle all the I/O issued to its associated hctx, and each thread is limited by how much CPU it can get. Since threads are the unit of parallelism, one thread can make use of at most one CPU, regardless of the affinity of the thread. And this can become a bottleneck. > > be balanced across all ublk server threads by having the threads fetch > > I/Os for the same QID in a round robin manner. For example, in a system > > with 4 ublk server threads, 2 hctxs, and a queue depth of 4, the threads > > could issue fetch requests as follows (where each entry is of the form > > qid, tag): > > > > poller thread: T0 T1 T2 T3 > > 0,0 0,1 0,2 0,3 > > 1,3 1,0 1,1 1,2 > > How many ublk devices there are? If it is 1, just wondering why you use > 4 threads? Usually one thread is enough to drive one queue, and the > actually io command handling can be moved to new work thread if the queue > thread is saturated. This is just a small example to demonstrate the idea, not necessarily a realistic one. > > -static inline void ublk_forward_io_cmds(struct ublk_queue *ubq, > > - unsigned issue_flags) > > -{ > > - struct llist_node *io_cmds = llist_del_all(&ubq->io_cmds); > > - struct ublk_rq_data *data, *tmp; > > - > > - io_cmds = llist_reverse_order(io_cmds); > > - llist_for_each_entry_safe(data, tmp, io_cmds, node) > > - __ublk_rq_task_work(blk_mq_rq_from_pdu(data), issue_flags); > > -} > > - > > -static void ublk_rq_task_work_cb(struct io_uring_cmd *cmd, unsigned issue_flags) > > -{ > > - struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd); > > - struct ublk_queue *ubq = pdu->ubq; > > - > > - ublk_forward_io_cmds(ubq, issue_flags); > > -} > > - > > static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq) > > { > > - struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq); > > - > > - if (llist_add(&data->node, &ubq->io_cmds)) { > > - struct ublk_io *io = &ubq->ios[rq->tag]; > > - > > - io_uring_cmd_complete_in_task(io->cmd, ublk_rq_task_work_cb); > > - } > > + struct ublk_io *io = &ubq->ios[rq->tag]; > > + struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(io->cmd); > > + pdu->req = rq; > > + io_uring_cmd_complete_in_task(io->cmd, __ublk_rq_task_work); > > } > > It should be fine to convert to io_uring_cmd_complete_in_task() since > the callback list is re-ordered in io_uring. 
Yes, I noticed that task_work has (lockless) internal queueing, so there shouldn't be a need to maintain our own queue of commands in ublk_drv. I can factor this change out into its own patch if that is useful. > > > > > static enum blk_eh_timer_return ublk_timeout(struct request *rq) > > { > > struct ublk_queue *ubq = rq->mq_hctx->driver_data; > > + struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq); > > unsigned int nr_inflight = 0; > > int i; > > > > if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) { > > - if (!ubq->timeout) { > > - send_sig(SIGKILL, ubq->ubq_daemon, 0); > > - ubq->timeout = true; > > - } > > - > > + send_sig(SIGKILL, data->task, 0); > > return BLK_EH_DONE; > > } > > > > - if (!ubq_daemon_is_dying(ubq)) > > + if (!(data->task->flags & PF_EXITING)) > > return BLK_EH_RESET_TIMER; > > ->task is only for error handling, but it may not work any more since > who knows which task is for handling the io command actually. Yes, you are right - this part right here is the only reason we need to save/take a reference to the task. I have a couple alternative ideas: 1. Don't kill anything if a timeout happens. Instead, synchronize against the "normal" completion path (i.e. commit_and_fetch), and if timeout happens first, normal completion gets an error. If normal completion happens first, timeout does nothing. 2. Require that all threads handling I/O are threads of the same process (in the kernel, I think this means their task_struct::group_leader is the same?) In the normal completion path, we replace the check that exists today (check equality with ubq_daemon) with ensuring that the current task is within the process. In the timeout path, we send SIGKILL to the top-level process, which should propagate to the threads as well. Does either of those sound okay?
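To make idea #2 a bit more concrete, the completion-path check could become
something like the sketch below (hypothetical helper name, and it assumes
io->cmd is stable at this point; same_thread_group() comes from
<linux/sched/signal.h>):

	/*
	 * Idea #2: any thread of the ublk server process may complete the io,
	 * not just one fixed ubq_daemon.  io_uring_cmd_get_task() returns the
	 * task that issued the FETCH uring_cmd; requiring only that the
	 * committing task shares that thread group keeps per-io flexibility
	 * while still pinning the device to a single process.
	 */
	static bool ublk_current_may_complete_io(struct ublk_io *io)
	{
		struct task_struct *owner = io_uring_cmd_get_task(io->cmd);

		return owner && same_thread_group(current, owner);
	}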
On Mon, Oct 07, 2024 at 01:50:09PM -0600, Uday Shankar wrote: > On Sun, Oct 06, 2024 at 05:20:05PM +0800, Ming Lei wrote: > > On Wed, Oct 02, 2024 at 04:44:37PM -0600, Uday Shankar wrote: > > > Currently, ublk_drv associates to each hardware queue (hctx) a unique > > > task (called the queue's ubq_daemon) which is allowed to issue > > > COMMIT_AND_FETCH commands against the hctx. If any other task attempts > > > to do so, the command fails immediately with EINVAL. When considered > > > together with the block layer architecture, the result is that for each > > > CPU C on the system, there is a unique ublk server thread which is > > > allowed to handle I/O submitted on CPU C. This can lead to suboptimal > > > performance under imbalanced load generation. For an extreme example, > > > suppose all the load is generated on CPUs mapping to a single ublk > > > server thread. Then that thread may be fully utilized and become the > > > bottleneck in the system, while other ublk server threads are totally > > > idle. > > > > I am wondering why the problem can't be avoided by setting ublk server's > > thread affinity manually. > > I don't think the ublk server thread CPU affinity has any effect here. > Assuming that the ublk server threads do not pass I/Os between > themselves to balance the load, each ublk server thread must handle all > the I/O issued to its associated hctx, and each thread is limited by how > much CPU it can get. Since threads are the unit of parallelism, one > thread can make use of at most one CPU, regardless of the affinity of > the thread. And this can become a bottleneck. If ublk server may be saturated, there is at least two choices: - increase nr_hw_queues, so each ublk server thread can handle IOs from less CPUs - let ublk server focus on submitting UBLK_IO_COMMIT_AND_FETCH_REQ uring_cmd, and moving actual IO handling into new worker thread if ublk server becomes saturated, and the communication can be done with eventfd, please see example in: https://github.com/ublk-org/ublksrv/blob/master/demo_event.c > > > > be balanced across all ublk server threads by having the threads fetch > > > I/Os for the same QID in a round robin manner. For example, in a system > > > with 4 ublk server threads, 2 hctxs, and a queue depth of 4, the threads > > > could issue fetch requests as follows (where each entry is of the form > > > qid, tag): > > > > > > poller thread: T0 T1 T2 T3 > > > 0,0 0,1 0,2 0,3 > > > 1,3 1,0 1,1 1,2 > > > > How many ublk devices there are? If it is 1, just wondering why you use > > 4 threads? Usually one thread is enough to drive one queue, and the > > actually io command handling can be moved to new work thread if the queue > > thread is saturated. > > This is just a small example to demonstrate the idea, not necessarily a > realistic one. OK, but I'd suggest to share examples closing to reality, then we can just focus on problems in real cases. 
> > > > -static inline void ublk_forward_io_cmds(struct ublk_queue *ubq, > > > - unsigned issue_flags) > > > -{ > > > - struct llist_node *io_cmds = llist_del_all(&ubq->io_cmds); > > > - struct ublk_rq_data *data, *tmp; > > > - > > > - io_cmds = llist_reverse_order(io_cmds); > > > - llist_for_each_entry_safe(data, tmp, io_cmds, node) > > > - __ublk_rq_task_work(blk_mq_rq_from_pdu(data), issue_flags); > > > -} > > > - > > > -static void ublk_rq_task_work_cb(struct io_uring_cmd *cmd, unsigned issue_flags) > > > -{ > > > - struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd); > > > - struct ublk_queue *ubq = pdu->ubq; > > > - > > > - ublk_forward_io_cmds(ubq, issue_flags); > > > -} > > > - > > > static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq) > > > { > > > - struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq); > > > - > > > - if (llist_add(&data->node, &ubq->io_cmds)) { > > > - struct ublk_io *io = &ubq->ios[rq->tag]; > > > - > > > - io_uring_cmd_complete_in_task(io->cmd, ublk_rq_task_work_cb); > > > - } > > > + struct ublk_io *io = &ubq->ios[rq->tag]; > > > + struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(io->cmd); > > > + pdu->req = rq; > > > + io_uring_cmd_complete_in_task(io->cmd, __ublk_rq_task_work); > > > } > > > > It should be fine to convert to io_uring_cmd_complete_in_task() since > > the callback list is re-ordered in io_uring. > > Yes, I noticed that task_work has (lockless) internal queueing, so > there shouldn't be a need to maintain our own queue of commands in > ublk_drv. I can factor this change out into its own patch if that is > useful. Yeah, please go ahead, since it does simplify things. > > > > > > > > > static enum blk_eh_timer_return ublk_timeout(struct request *rq) > > > { > > > struct ublk_queue *ubq = rq->mq_hctx->driver_data; > > > + struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq); > > > unsigned int nr_inflight = 0; > > > int i; > > > > > > if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) { > > > - if (!ubq->timeout) { > > > - send_sig(SIGKILL, ubq->ubq_daemon, 0); > > > - ubq->timeout = true; > > > - } > > > - > > > + send_sig(SIGKILL, data->task, 0); > > > return BLK_EH_DONE; > > > } > > > > > > - if (!ubq_daemon_is_dying(ubq)) > > > + if (!(data->task->flags & PF_EXITING)) > > > return BLK_EH_RESET_TIMER; > > > > ->task is only for error handling, but it may not work any more since > > who knows which task is for handling the io command actually. > > Yes, you are right - this part right here is the only reason we need to > save/take a reference to the task. I have a couple alternative ideas: > > 1. Don't kill anything if a timeout happens. Instead, synchronize > against the "normal" completion path (i.e. commit_and_fetch), and if > timeout happens first, normal completion gets an error. If normal > completion happens first, timeout does nothing. But how to synchronize? Looks the only weapon could be RCU. Also one server thread may have bug and run into dead loop. > 2. Require that all threads handling I/O are threads of the same process > (in the kernel, I think this means their task_struct::group_leader is > the same?) So far we only allow single process to open /dev/ublkcN, so all threads have to belong to same process. And that can be thought as another limit of ublk implementation. > In the normal completion path, we replace the check that > exists today (check equality with ubq_daemon) with ensuring that the > current task is within the process. 
In the timeout path, we send > SIGKILL to the top-level process, which should propagate to the > threads as well. It should be enough to kill the only process which opens '/dev/ublkcN'. > > Does either of those sound okay? Looks #2 is more doable. Thanks, Ming
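As a rough illustration of the timeout path under idea #2 (hypothetical, and it
glosses over the io->cmd synchronization questions raised later in the thread):

	static enum blk_eh_timer_return ublk_timeout(struct request *rq)
	{
		struct ublk_queue *ubq = rq->mq_hctx->driver_data;
		struct ublk_io *io = &ubq->ios[rq->tag];
		struct task_struct *task = io_uring_cmd_get_task(io->cmd);

		if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) {
			/*
			 * SIGKILL cannot be caught and is delivered to the
			 * whole thread group, so signalling whichever server
			 * thread currently owns the command takes down the
			 * entire ublk server process, as suggested above.
			 */
			if (task)
				send_sig(SIGKILL, task, 0);
			return BLK_EH_DONE;
		}

		/* ... existing inflight accounting and abort handling elided ... */
		return BLK_EH_RESET_TIMER;
	}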
On Tue, Oct 08, 2024 at 09:47:39AM +0800, Ming Lei wrote: > On Mon, Oct 07, 2024 at 01:50:09PM -0600, Uday Shankar wrote: > > On Sun, Oct 06, 2024 at 05:20:05PM +0800, Ming Lei wrote: > > > On Wed, Oct 02, 2024 at 04:44:37PM -0600, Uday Shankar wrote: > > > > Currently, ublk_drv associates to each hardware queue (hctx) a unique > > > > task (called the queue's ubq_daemon) which is allowed to issue > > > > COMMIT_AND_FETCH commands against the hctx. If any other task attempts > > > > to do so, the command fails immediately with EINVAL. When considered > > > > together with the block layer architecture, the result is that for each > > > > CPU C on the system, there is a unique ublk server thread which is > > > > allowed to handle I/O submitted on CPU C. This can lead to suboptimal > > > > performance under imbalanced load generation. For an extreme example, > > > > suppose all the load is generated on CPUs mapping to a single ublk > > > > server thread. Then that thread may be fully utilized and become the > > > > bottleneck in the system, while other ublk server threads are totally > > > > idle. > > > > > > I am wondering why the problem can't be avoided by setting ublk server's > > > thread affinity manually. > > > > I don't think the ublk server thread CPU affinity has any effect here. > > Assuming that the ublk server threads do not pass I/Os between > > themselves to balance the load, each ublk server thread must handle all > > the I/O issued to its associated hctx, and each thread is limited by how > > much CPU it can get. Since threads are the unit of parallelism, one > > thread can make use of at most one CPU, regardless of the affinity of > > the thread. And this can become a bottleneck. > > If ublk server may be saturated, there is at least two choices: > > - increase nr_hw_queues, so each ublk server thread can handle IOs from > less CPUs > > - let ublk server focus on submitting UBLK_IO_COMMIT_AND_FETCH_REQ > uring_cmd, and moving actual IO handling into new worker thread if ublk > server becomes saturated, and the communication can be done with eventfd, > please see example in: > > https://github.com/ublk-org/ublksrv/blob/master/demo_event.c > > > > > > > be balanced across all ublk server threads by having the threads fetch > > > > I/Os for the same QID in a round robin manner. For example, in a system > > > > with 4 ublk server threads, 2 hctxs, and a queue depth of 4, the threads > > > > could issue fetch requests as follows (where each entry is of the form > > > > qid, tag): > > > > > > > > poller thread: T0 T1 T2 T3 > > > > 0,0 0,1 0,2 0,3 > > > > 1,3 1,0 1,1 1,2 > > > > > > How many ublk devices there are? If it is 1, just wondering why you use > > > 4 threads? Usually one thread is enough to drive one queue, and the > > > actually io command handling can be moved to new work thread if the queue > > > thread is saturated. > > > > This is just a small example to demonstrate the idea, not necessarily a > > realistic one. > > OK, but I'd suggest to share examples closing to reality, then we can > just focus on problems in real cases. 
> > > > > > > -static inline void ublk_forward_io_cmds(struct ublk_queue *ubq, > > > > - unsigned issue_flags) > > > > -{ > > > > - struct llist_node *io_cmds = llist_del_all(&ubq->io_cmds); > > > > - struct ublk_rq_data *data, *tmp; > > > > - > > > > - io_cmds = llist_reverse_order(io_cmds); > > > > - llist_for_each_entry_safe(data, tmp, io_cmds, node) > > > > - __ublk_rq_task_work(blk_mq_rq_from_pdu(data), issue_flags); > > > > -} > > > > - > > > > -static void ublk_rq_task_work_cb(struct io_uring_cmd *cmd, unsigned issue_flags) > > > > -{ > > > > - struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd); > > > > - struct ublk_queue *ubq = pdu->ubq; > > > > - > > > > - ublk_forward_io_cmds(ubq, issue_flags); > > > > -} > > > > - > > > > static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq) > > > > { > > > > - struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq); > > > > - > > > > - if (llist_add(&data->node, &ubq->io_cmds)) { > > > > - struct ublk_io *io = &ubq->ios[rq->tag]; > > > > - > > > > - io_uring_cmd_complete_in_task(io->cmd, ublk_rq_task_work_cb); > > > > - } > > > > + struct ublk_io *io = &ubq->ios[rq->tag]; > > > > + struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(io->cmd); > > > > + pdu->req = rq; > > > > + io_uring_cmd_complete_in_task(io->cmd, __ublk_rq_task_work); > > > > } > > > > > > It should be fine to convert to io_uring_cmd_complete_in_task() since > > > the callback list is re-ordered in io_uring. > > > > Yes, I noticed that task_work has (lockless) internal queueing, so > > there shouldn't be a need to maintain our own queue of commands in > > ublk_drv. I can factor this change out into its own patch if that is > > useful. > > Yeah, please go ahead, since it does simplify things. > > > > > > > > > > > > > > static enum blk_eh_timer_return ublk_timeout(struct request *rq) > > > > { > > > > struct ublk_queue *ubq = rq->mq_hctx->driver_data; > > > > + struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq); > > > > unsigned int nr_inflight = 0; > > > > int i; > > > > > > > > if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) { > > > > - if (!ubq->timeout) { > > > > - send_sig(SIGKILL, ubq->ubq_daemon, 0); > > > > - ubq->timeout = true; > > > > - } > > > > - > > > > + send_sig(SIGKILL, data->task, 0); > > > > return BLK_EH_DONE; > > > > } > > > > > > > > - if (!ubq_daemon_is_dying(ubq)) > > > > + if (!(data->task->flags & PF_EXITING)) > > > > return BLK_EH_RESET_TIMER; > > > > > > ->task is only for error handling, but it may not work any more since > > > who knows which task is for handling the io command actually. > > > > Yes, you are right - this part right here is the only reason we need to > > save/take a reference to the task. I have a couple alternative ideas: > > > > 1. Don't kill anything if a timeout happens. Instead, synchronize > > against the "normal" completion path (i.e. commit_and_fetch), and if > > timeout happens first, normal completion gets an error. If normal > > completion happens first, timeout does nothing. > > But how to synchronize? Looks the only weapon could be RCU. > > Also one server thread may have bug and run into dead loop. > > > 2. Require that all threads handling I/O are threads of the same process > > (in the kernel, I think this means their task_struct::group_leader is > > the same?) > > So far we only allow single process to open /dev/ublkcN, so all threads > have to belong to same process. > > And that can be thought as another limit of ublk implementation. 
> > > In the normal completion path, we replace the check that > > exists today (check equality with ubq_daemon) with ensuring that the > > current task is within the process. In the timeout path, we send > > SIGKILL to the top-level process, which should propagate to the > > threads as well. > > It should be enough to kill the only process which opens '/dev/ublkcN'. > > > > > Does either of those sound okay? > > Looks #2 is more doable. Forget to mention, `struct ublk_queue` and `struct ublk_io` are operated lockless now, since we suppose both two are read/write in single pthread. If we kill this limit, READ/WRITE on the two structures have to protected, which may add extra cost, :-( Thanks, Ming
On Tue, Oct 08, 2024 at 10:11:35AM +0800, Ming Lei wrote:
> Forget to mention, `struct ublk_queue` and `struct ublk_io` are operated
> lockless now, since we suppose both two are read/write in single pthread.
> 
> If we kill this limit, READ/WRITE on the two structures have to
> protected, which may add extra cost, :-(

Argh, yes you're right. Fortunately, ublk_queue looks like it is
read-only in the completion path, so we should be able to get away
without synchronization there. ublk_io will need some synchronization,
but with a correctly written ublk server, the synchronization should
almost always be uncontended and therefore, hopefully, unnoticeable.
I'll see what I can come up with in v2.
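A minimal sketch of the kind of per-io synchronization being discussed
(illustrative only; the field list is abbreviated from the current driver and
the lock itself is hypothetical):

	struct ublk_io {
		/* userspace buffer address from io cmd */
		__u64 addr;
		unsigned int flags;
		int res;

		struct io_uring_cmd *cmd;

		/*
		 * Serializes the commit/completion path against timeout and
		 * abort once any server thread may touch this io.  With a
		 * well-behaved server, only one thread handles a given tag at
		 * a time, so the lock stays effectively uncontended.
		 */
		spinlock_t lock;
	};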
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c index a6c8e5cc6051..7e0ce35dbc6f 100644 --- a/drivers/block/ublk_drv.c +++ b/drivers/block/ublk_drv.c @@ -68,12 +68,12 @@ UBLK_PARAM_TYPE_DEVT | UBLK_PARAM_TYPE_ZONED) struct ublk_rq_data { - struct llist_node node; - + struct task_struct *task; struct kref ref; }; struct ublk_uring_cmd_pdu { + struct request *req; struct ublk_queue *ubq; u16 tag; }; @@ -133,15 +133,11 @@ struct ublk_queue { int q_depth; unsigned long flags; - struct task_struct *ubq_daemon; char *io_cmd_buf; - struct llist_head io_cmds; - unsigned long io_addr; /* mapped vm address */ unsigned int max_io_sz; bool force_abort; - bool timeout; bool canceling; unsigned short nr_io_ready; /* how many ios setup */ spinlock_t cancel_lock; @@ -982,16 +978,12 @@ static inline struct ublk_uring_cmd_pdu *ublk_get_uring_cmd_pdu( return (struct ublk_uring_cmd_pdu *)&ioucmd->pdu; } -static inline bool ubq_daemon_is_dying(struct ublk_queue *ubq) -{ - return ubq->ubq_daemon->flags & PF_EXITING; -} - /* todo: handle partial completion */ static inline void __ublk_complete_rq(struct request *req) { struct ublk_queue *ubq = req->mq_hctx->driver_data; struct ublk_io *io = &ubq->ios[req->tag]; + struct ublk_rq_data *data = blk_mq_rq_to_pdu(req); unsigned int unmapped_bytes; blk_status_t res = BLK_STS_OK; @@ -1036,9 +1028,13 @@ static inline void __ublk_complete_rq(struct request *req) else __blk_mq_end_request(req, BLK_STS_OK); - return; + goto put_task; exit: blk_mq_end_request(req, res); +put_task: + WARN_ON_ONCE(!data->task); + put_task_struct(data->task); + data->task = NULL; } static void ublk_complete_rq(struct kref *ref) @@ -1097,13 +1093,16 @@ static inline void __ublk_abort_rq(struct ublk_queue *ubq, blk_mq_end_request(rq, BLK_STS_IOERR); } -static inline void __ublk_rq_task_work(struct request *req, +static inline void __ublk_rq_task_work(struct io_uring_cmd *cmd, unsigned issue_flags) { + struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd); + struct request *req = pdu->req; struct ublk_queue *ubq = req->mq_hctx->driver_data; int tag = req->tag; struct ublk_io *io = &ubq->ios[tag]; unsigned int mapped_bytes; + struct ublk_rq_data *data = blk_mq_rq_to_pdu(req); pr_devel("%s: complete: op %d, qid %d tag %d io_flags %x addr %llx\n", __func__, io->cmd->cmd_op, ubq->q_id, req->tag, io->flags, @@ -1112,13 +1111,14 @@ static inline void __ublk_rq_task_work(struct request *req, /* * Task is exiting if either: * - * (1) current != ubq_daemon. + * (1) current != io_uring_get_cmd_task(io->cmd). * io_uring_cmd_complete_in_task() tries to run task_work - * in a workqueue if ubq_daemon(cmd's task) is PF_EXITING. + * in a workqueue if cmd's task is PF_EXITING. * * (2) current->flags & PF_EXITING. 
*/ - if (unlikely(current != ubq->ubq_daemon || current->flags & PF_EXITING)) { + if (unlikely(current != io_uring_cmd_get_task(io->cmd) || + current->flags & PF_EXITING)) { __ublk_abort_rq(ubq, req); return; } @@ -1173,55 +1173,32 @@ static inline void __ublk_rq_task_work(struct request *req, } ublk_init_req_ref(ubq, req); + WARN_ON_ONCE(data->task); + data->task = get_task_struct(current); ubq_complete_io_cmd(io, UBLK_IO_RES_OK, issue_flags); } -static inline void ublk_forward_io_cmds(struct ublk_queue *ubq, - unsigned issue_flags) -{ - struct llist_node *io_cmds = llist_del_all(&ubq->io_cmds); - struct ublk_rq_data *data, *tmp; - - io_cmds = llist_reverse_order(io_cmds); - llist_for_each_entry_safe(data, tmp, io_cmds, node) - __ublk_rq_task_work(blk_mq_rq_from_pdu(data), issue_flags); -} - -static void ublk_rq_task_work_cb(struct io_uring_cmd *cmd, unsigned issue_flags) -{ - struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd); - struct ublk_queue *ubq = pdu->ubq; - - ublk_forward_io_cmds(ubq, issue_flags); -} - static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq) { - struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq); - - if (llist_add(&data->node, &ubq->io_cmds)) { - struct ublk_io *io = &ubq->ios[rq->tag]; - - io_uring_cmd_complete_in_task(io->cmd, ublk_rq_task_work_cb); - } + struct ublk_io *io = &ubq->ios[rq->tag]; + struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(io->cmd); + pdu->req = rq; + io_uring_cmd_complete_in_task(io->cmd, __ublk_rq_task_work); } static enum blk_eh_timer_return ublk_timeout(struct request *rq) { struct ublk_queue *ubq = rq->mq_hctx->driver_data; + struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq); unsigned int nr_inflight = 0; int i; if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) { - if (!ubq->timeout) { - send_sig(SIGKILL, ubq->ubq_daemon, 0); - ubq->timeout = true; - } - + send_sig(SIGKILL, data->task, 0); return BLK_EH_DONE; } - if (!ubq_daemon_is_dying(ubq)) + if (!(data->task->flags & PF_EXITING)) return BLK_EH_RESET_TIMER; for (i = 0; i < ubq->q_depth; i++) { @@ -1380,8 +1357,8 @@ static void ublk_commit_completion(struct ublk_device *ub, } /* - * Called from ubq_daemon context via cancel fn, meantime quiesce ublk - * blk-mq queue, so we are called exclusively with blk-mq and ubq_daemon + * Called from cmd task context via cancel fn, meantime quiesce ublk + * blk-mq queue, so we are called exclusively with blk-mq and cmd task * context, so everything is serialized. 
*/ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq) @@ -1478,8 +1455,6 @@ static void ublk_uring_cmd_cancel_fn(struct io_uring_cmd *cmd, return; task = io_uring_cmd_get_task(cmd); - if (WARN_ON_ONCE(task && task != ubq->ubq_daemon)) - return; ub = ubq->dev; need_schedule = ublk_abort_requests(ub, ubq); @@ -1623,8 +1598,6 @@ static void ublk_mark_io_ready(struct ublk_device *ub, struct ublk_queue *ubq) mutex_lock(&ub->mutex); ubq->nr_io_ready++; if (ublk_queue_ready(ubq)) { - ubq->ubq_daemon = current; - get_task_struct(ubq->ubq_daemon); ub->nr_queues_ready++; if (capable(CAP_SYS_ADMIN)) @@ -1703,9 +1676,6 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd, if (!ubq || ub_cmd->q_id != ubq->q_id) goto out; - if (ubq->ubq_daemon && ubq->ubq_daemon != current) - goto out; - if (tag >= ubq->q_depth) goto out; @@ -1994,8 +1964,6 @@ static void ublk_deinit_queue(struct ublk_device *ub, int q_id) int size = ublk_queue_cmd_buf_size(ub, q_id); struct ublk_queue *ubq = ublk_get_queue(ub, q_id); - if (ubq->ubq_daemon) - put_task_struct(ubq->ubq_daemon); if (ubq->io_cmd_buf) free_pages((unsigned long)ubq->io_cmd_buf, get_order(size)); } @@ -2661,15 +2629,8 @@ static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq) { int i; - WARN_ON_ONCE(!(ubq->ubq_daemon && ubq_daemon_is_dying(ubq))); - /* All old ioucmds have to be completed */ ubq->nr_io_ready = 0; - /* old daemon is PF_EXITING, put it now */ - put_task_struct(ubq->ubq_daemon); - /* We have to reset it to NULL, otherwise ub won't accept new FETCH_REQ */ - ubq->ubq_daemon = NULL; - ubq->timeout = false; ubq->canceling = false; for (i = 0; i < ubq->q_depth; i++) { @@ -2715,7 +2676,7 @@ static int ublk_ctrl_start_recovery(struct ublk_device *ub, pr_devel("%s: start recovery for dev id %d.\n", __func__, header->dev_id); for (i = 0; i < ub->dev_info.nr_hw_queues; i++) ublk_queue_reinit(ub, ublk_get_queue(ub, i)); - /* set to NULL, otherwise new ubq_daemon cannot mmap the io_cmd_buf */ + /* set to NULL, otherwise new tasks cannot mmap the io_cmd_buf */ ub->mm = NULL; ub->nr_queues_ready = 0; ub->nr_privileged_daemon = 0; @@ -2733,14 +2694,14 @@ static int ublk_ctrl_end_recovery(struct ublk_device *ub, int ublksrv_pid = (int)header->data[0]; int ret = -EINVAL; - pr_devel("%s: Waiting for new ubq_daemons(nr: %d) are ready, dev id %d...\n", - __func__, ub->dev_info.nr_hw_queues, header->dev_id); - /* wait until new ubq_daemon sending all FETCH_REQ */ + pr_devel("%s: Waiting for all FETCH_REQs, dev id %d...\n", __func__, + header->dev_id); + if (wait_for_completion_interruptible(&ub->completion)) return -EINTR; - pr_devel("%s: All new ubq_daemons(nr: %d) are ready, dev id %d\n", - __func__, ub->dev_info.nr_hw_queues, header->dev_id); + pr_devel("%s: All FETCH_REQs received, dev id %d\n", __func__, + header->dev_id); mutex_lock(&ub->mutex); if (!ublk_can_use_recovery(ub))
Currently, ublk_drv associates to each hardware queue (hctx) a unique
task (called the queue's ubq_daemon) which is allowed to issue
COMMIT_AND_FETCH commands against the hctx. If any other task attempts
to do so, the command fails immediately with EINVAL. When considered
together with the block layer architecture, the result is that for each
CPU C on the system, there is a unique ublk server thread which is
allowed to handle I/O submitted on CPU C. This can lead to suboptimal
performance under imbalanced load generation. For an extreme example,
suppose all the load is generated on CPUs mapping to a single ublk
server thread. Then that thread may be fully utilized and become the
bottleneck in the system, while other ublk server threads are totally
idle.

This issue can also be addressed directly in the ublk server without
kernel support by having threads dequeue I/Os and pass them around to
ensure even load. But this solution involves moving I/Os between threads
several times. This is a bad pattern for performance, and we observed a
significant regression in peak bandwidth with this solution.

Therefore, address this issue by removing the association of a unique
ubq_daemon to each hctx. This association is fairly artificial anyways,
and removing it results in simpler driver code. Imbalanced load can then
be balanced across all ublk server threads by having the threads fetch
I/Os for the same QID in a round robin manner. For example, in a system
with 4 ublk server threads, 2 hctxs, and a queue depth of 4, the threads
could issue fetch requests as follows (where each entry is of the form
qid, tag):

poller thread:	T0	T1	T2	T3
		0,0	0,1	0,2	0,3
		1,3	1,0	1,1	1,2

Since tags appear to be allocated in sequential chunks, this provides a
rough approximation to distributing I/Os round-robin across all ublk
server threads, while letting I/Os stay fully thread-local.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
---
 drivers/block/ublk_drv.c | 105 ++++++++++++---------------------------
 1 file changed, 33 insertions(+), 72 deletions(-)

base-commit: ccb1526dda70e0c1bae7fe27aa3b0c96e6e2d0cd
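For illustration, the distribution in the example above corresponds to
assigning (qid, tag) to thread (qid + tag) % nr_threads. A ublk server could
issue its initial FETCH_REQs along these lines (hypothetical helper name,
error handling omitted):

	/* submit_fetch() stands in for queueing a UBLK_IO_FETCH_REQ uring_cmd */
	static void issue_fetch_reqs(int thread_idx, int nr_threads,
				     int nr_queues, int queue_depth)
	{
		for (int qid = 0; qid < nr_queues; qid++)
			for (int tag = 0; tag < queue_depth; tag++)
				if ((qid + tag) % nr_threads == thread_idx)
					submit_fetch(qid, tag);
	}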