From patchwork Sun Nov 10 14:56:20 2024
From: Pavel Begunkov
To: io-uring@vger.kernel.org
Cc: asml.silence@gmail.com
Subject: [RFC 1/3] io_uring: introduce request parameter sets
Date: Sun, 10 Nov 2024 14:56:20 +0000
Message-ID: <877a43b660a5fec4d658007a8c77bf73471b0b64.1731205010.git.asml.silence@gmail.com>

There are lots of parameters we might want to additionally pass to a
request, but the SQE has limited space, and extra fields may require
additional parsing and checking in the hot path. Instead, introduce
registered parameter sets; requests then carry an index specifying which
parameter set to use. The benefit for the kernel is that we can put any
number of arguments in a set and do pre-processing at registration time,
like renumbering flags and enabling static keys for performance-deprecated
features. The obvious downside is that the user can't use the entire
parameter space, as there can only be a limited number of sets.

The main target here is tuning the waiting loop with finer grained
control over when we should wake the task and return to the user.

The current implementation is crude: it needs a SETUP flag disabling
creds/personalities, and it is limited to a single registration of at
most 16 sets. It could be made to co-exist with creds and be registered
and expanded more flexibly.
Signed-off-by: Pavel Begunkov
---
 include/linux/io_uring_types.h |  8 ++++++
 include/uapi/linux/io_uring.h  |  9 ++++++
 io_uring/io_uring.c            | 36 ++++++++++++++++--------
 io_uring/msg_ring.c            |  1 +
 io_uring/net.c                 |  1 +
 io_uring/register.c            | 51 ++++++++++++++++++++++++++++++++++
 6 files changed, 94 insertions(+), 12 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index ad5001102c86..79f38c07642d 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -75,6 +75,10 @@ struct io_hash_table {
 	unsigned		hash_bits;
 };
 
+struct io_set {
+	u32			flags;
+};
+
 /*
  * Arbitrary limit, can be raised if need be
  */
@@ -268,6 +272,9 @@ struct io_ring_ctx {
 		unsigned		cached_sq_head;
 		unsigned		sq_entries;
 
+		struct io_set		iosets[16];
+		unsigned int		nr_iosets;
+
 		/*
 		 * Fixed resources fast path, should be accessed only under
 		 * uring_lock, and updated through io_uring_register(2)
@@ -635,6 +642,7 @@ struct io_kiocb {
 
 	struct io_ring_ctx		*ctx;
 	struct io_uring_task		*tctx;
+	struct io_set			*ioset;
 
 	union {
 		/* stores selected buf, valid IFF REQ_F_BUFFER_SELECTED is set */
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index ba373deb8406..6a432383e7c3 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -158,6 +158,8 @@ enum io_uring_sqe_flags_bit {
 #define IORING_SETUP_ATTACH_WQ	(1U << 5)	/* attach to existing wq */
 #define IORING_SETUP_R_DISABLED	(1U << 6)	/* start with ring disabled */
 #define IORING_SETUP_SUBMIT_ALL	(1U << 7)	/* continue submit on error */
+#define IORING_SETUP_IOSET	(1U << 8)
+
 /*
  * Cooperative task running. When requests complete, they often require
  * forcing the submitter to transition to the kernel to complete. If this
@@ -634,6 +636,8 @@ enum io_uring_register_op {
 	/* register fixed io_uring_reg_wait arguments */
 	IORING_REGISTER_CQWAIT_REG		= 34,
 
+	IORING_REGISTER_IOSETS			= 35,
+
 	/* this goes last */
 	IORING_REGISTER_LAST,
 
@@ -895,6 +899,11 @@ struct io_uring_recvmsg_out {
 	__u32 flags;
 };
 
+struct io_uring_ioset_reg {
+	__u64 flags;
+	__u64 __resv[3];
+};
+
 /*
  * Argument for IORING_OP_URING_CMD when file is a socket
  */
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index f34fa1ead2cf..cf688a9ff737 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2156,6 +2156,7 @@ static void io_init_req_drain(struct io_kiocb *req)
 
 static __cold int io_init_fail_req(struct io_kiocb *req, int err)
 {
+	req->ioset = &req->ctx->iosets[0];
 	/* ensure per-opcode data is cleared if we fail before prep */
 	memset(&req->cmd.data, 0, sizeof(req->cmd.data));
 	return err;
@@ -2238,19 +2239,27 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	}
 
 	personality = READ_ONCE(sqe->personality);
-	if (personality) {
-		int ret;
-
-		req->creds = xa_load(&ctx->personalities, personality);
-		if (!req->creds)
+	if (ctx->flags & IORING_SETUP_IOSET) {
+		if (unlikely(personality >= ctx->nr_iosets))
 			return io_init_fail_req(req, -EINVAL);
-		get_cred(req->creds);
-		ret = security_uring_override_creds(req->creds);
-		if (ret) {
-			put_cred(req->creds);
-			return io_init_fail_req(req, ret);
+		personality = array_index_nospec(personality, ctx->nr_iosets);
+		req->ioset = &ctx->iosets[personality];
+	} else {
+		if (personality) {
+			int ret;
+
+			req->creds = xa_load(&ctx->personalities, personality);
+			if (!req->creds)
+				return io_init_fail_req(req, -EINVAL);
+			get_cred(req->creds);
+			ret = security_uring_override_creds(req->creds);
+			if (ret) {
+				put_cred(req->creds);
+				return io_init_fail_req(req, ret);
+			}
+			req->flags |= REQ_F_CREDS;
 		}
-		req->flags |= REQ_F_CREDS;
+		req->ioset = &ctx->iosets[0];
 	}
 
 	return def->prep(req, sqe);
@@ -3909,6 +3918,8 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 	if (!ctx)
 		return -ENOMEM;
 
+	ctx->nr_iosets = 0;
+
 	ctx->clockid = CLOCK_MONOTONIC;
 	ctx->clock_offset = 0;
 
@@ -4076,7 +4087,8 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 			IORING_SETUP_SQE128 | IORING_SETUP_CQE32 |
 			IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN |
 			IORING_SETUP_NO_MMAP | IORING_SETUP_REGISTERED_FD_ONLY |
-			IORING_SETUP_NO_SQARRAY | IORING_SETUP_HYBRID_IOPOLL))
+			IORING_SETUP_NO_SQARRAY | IORING_SETUP_HYBRID_IOPOLL |
+			IORING_SETUP_IOSET))
 		return -EINVAL;
 
 	return io_uring_create(entries, &p, params);
diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
index e63af34004b7..f5a747aa255c 100644
--- a/io_uring/msg_ring.c
+++ b/io_uring/msg_ring.c
@@ -98,6 +98,7 @@ static int io_msg_remote_post(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	io_req_set_res(req, res, cflags);
 	percpu_ref_get(&ctx->refs);
 	req->ctx = ctx;
+	req->ioset = &ctx->iosets[0];
 	req->io_task_work.func = io_msg_tw_complete;
 	io_req_task_work_add_remote(req, ctx, IOU_F_TWQ_LAZY_WAKE);
 	return 0;
diff --git a/io_uring/net.c b/io_uring/net.c
index 2ccc2b409431..785987bf9e6a 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -1242,6 +1242,7 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	notif = zc->notif = io_alloc_notif(ctx);
 	if (!notif)
 		return -ENOMEM;
+	notif->ioset = req->ioset;
 	notif->cqe.user_data = req->cqe.user_data;
 	notif->cqe.res = 0;
 	notif->cqe.flags = IORING_CQE_F_NOTIF;
diff --git a/io_uring/register.c b/io_uring/register.c
index 45edfc57963a..e7571dc46da5 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -86,6 +86,48 @@ int io_unregister_personality(struct io_ring_ctx *ctx, unsigned id)
 	return -EINVAL;
 }
 
+static int io_update_ioset(struct io_ring_ctx *ctx,
+			   const struct io_uring_ioset_reg *reg,
+			   struct io_set *set)
+{
+	if (!(ctx->flags & IORING_SETUP_IOSET))
+		return -EINVAL;
+	if (reg->flags)
+		return -EINVAL;
+	if (reg->__resv[0] || reg->__resv[1] || reg->__resv[2])
+		return -EINVAL;
+
+	set->flags = reg->flags;
+	return 0;
+}
+
+static int io_register_iosets(struct io_ring_ctx *ctx,
+			      void __user *arg, unsigned int nr_args)
+{
+	struct io_uring_ioset_reg __user *uptr = arg;
+	struct io_uring_ioset_reg reg[16];
+	int i, ret;
+
+	/* TODO: one time setup, max 16 entries, should be made more dynamic */
+	if (ctx->nr_iosets)
+		return -EINVAL;
+	if (nr_args >= ARRAY_SIZE(ctx->iosets))
+		return -EINVAL;
+
+	if (copy_from_user(reg, uptr, sizeof(reg[0]) * nr_args))
+		return -EFAULT;
+
+	for (i = 0; i < nr_args; i++) {
+		ret = io_update_ioset(ctx, &reg[i], &ctx->iosets[i]);
+		if (ret) {
+			memset(&ctx->iosets[0], 0, sizeof(ctx->iosets[0]));
+			return ret;
+		}
+	}
+
+	ctx->nr_iosets = nr_args;
+	return 0;
+}
 
 static int io_register_personality(struct io_ring_ctx *ctx)
 {
@@ -93,6 +135,9 @@ static int io_register_personality(struct io_ring_ctx *ctx)
 	u32 id;
 	int ret;
 
+	if (ctx->flags & IORING_SETUP_IOSET)
+		return -EINVAL;
+
 	creds = get_current_cred();
 
 	ret = xa_alloc_cyclic(&ctx->personalities, &id, (void *)creds,
@@ -846,6 +891,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			break;
 		ret = io_register_cqwait_reg(ctx, arg);
 		break;
+	case IORING_REGISTER_IOSETS:
+		ret = -EINVAL;
+		if (!arg)
+			break;
+		ret = io_register_iosets(ctx, arg, nr_args);
+		break;
 	default:
 		ret = -EINVAL;
 		break;

From patchwork Sun Nov 10 14:56:21 2024
From: Pavel Begunkov
To: io-uring@vger.kernel.org
Cc: asml.silence@gmail.com, Jens Axboe
Subject: [RFC 2/3] io_uring: add support for ignoring inline completions for waits
Date: Sun, 10 Nov 2024 14:56:21 +0000
Message-ID: <90bc3070b66b2a9f832716fd149184309ea6277d.1731205010.git.asml.silence@gmail.com>

From: Jens Axboe

io_uring treats all completions the same - they post a completion event,
or more, and anyone waiting on event completions will see each event as
it gets posted. However, some events may be more interesting than others.
For a request and response type model, it's not uncommon to have
send/write events that are submitted with a recv/read type of request.
While the app does want to see a successful send/write completion
eventually, it need not handle it upfront as it would want to do with a
recv/read, as it isn't time sensitive. Generally, a send/write completion
will just mean that a buffer can get recycled/reused, whereas a recv/read
completion needs acting upon (and a response sent).

This can be somewhat tricky to handle if many requests and responses are
in flight, and the app generally needs to track the number of pending
sends/writes to be able to sanely wait on just new incoming recv/read
requests. And even with that, an application would still like to see a
completion for a short/failed send/write immediately.

Add infrastructure to account inline completions, such that they can be
deducted from the 'wait_nr' being passed in via a submit_and_wait() type
of situation. Inline completions are ones that complete directly inline
from submission, such as a send to a socket where there's enough space to
accommodate the data being sent.
Signed-off-by: Jens Axboe
[pavel: rebased onto iosets]
Signed-off-by: Pavel Begunkov
---
 include/linux/io_uring_types.h |  1 +
 include/uapi/linux/io_uring.h  |  4 ++++
 io_uring/io_uring.c            | 12 ++++++++++--
 io_uring/register.c            |  2 +-
 4 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 79f38c07642d..f04444f9356a 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -213,6 +213,7 @@ struct io_submit_state {
 	bool			need_plug;
 	bool			cq_flush;
 	unsigned short		submit_nr;
+	unsigned short		inline_completions;
 	struct blk_plug		plug;
 };
 
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 6a432383e7c3..e6d10fba8ae2 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -899,6 +899,10 @@ struct io_uring_recvmsg_out {
 	__u32 flags;
 };
 
+enum {
+	IOSQE_SET_F_HINT_IGNORE_INLINE	= 1,
+};
+
 struct io_uring_ioset_reg {
 	__u64 flags;
 	__u64 __resv[3];
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index cf688a9ff737..6e89435c243d 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1575,6 +1575,9 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
 		struct io_kiocb *req = container_of(node, struct io_kiocb,
 						    comp_list);
 
+		if (req->ioset->flags & IOSQE_SET_F_HINT_IGNORE_INLINE)
+			state->inline_completions++;
+
 		if (unlikely(req->flags & (REQ_F_CQE_SKIP | REQ_F_GROUP))) {
 			if (req->flags & REQ_F_GROUP) {
 				io_complete_group_req(req);
@@ -2511,6 +2514,7 @@ static void io_submit_state_start(struct io_submit_state *state,
 	state->plug_started = false;
 	state->need_plug = max_ios > 2;
 	state->submit_nr = max_ios;
+	state->inline_completions = 0;
 	/* set only head, no need to init link_last in advance */
 	state->link.head = NULL;
 	state->group.head = NULL;
@@ -3611,6 +3615,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 		size_t, argsz)
 {
 	struct io_ring_ctx *ctx;
+	int inline_complete = 0;
 	struct file *file;
 	long ret;
 
@@ -3676,6 +3681,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 			mutex_unlock(&ctx->uring_lock);
 			goto out;
 		}
+		inline_complete = ctx->submit_state.inline_completions;
 		if (flags & IORING_ENTER_GETEVENTS) {
 			if (ctx->syscall_iopoll)
 				goto iopoll_locked;
@@ -3713,8 +3719,10 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 
 			ret2 = io_get_ext_arg(ctx, flags, argp, &ext_arg);
 			if (likely(!ret2)) {
-				min_complete = min(min_complete,
-						   ctx->cq_entries);
+				if (min_complete > ctx->cq_entries)
+					min_complete = ctx->cq_entries;
+				else
+					min_complete += inline_complete;
 				ret2 = io_cqring_wait(ctx, min_complete, flags,
 						      &ext_arg);
 			}
diff --git a/io_uring/register.c b/io_uring/register.c
index e7571dc46da5..f87ec7b773bd 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -92,7 +92,7 @@ static int io_update_ioset(struct io_ring_ctx *ctx,
 {
 	if (!(ctx->flags & IORING_SETUP_IOSET))
 		return -EINVAL;
-	if (reg->flags)
+	if (reg->flags & ~IOSQE_SET_F_HINT_IGNORE_INLINE)
 		return -EINVAL;
 	if (reg->__resv[0] || reg->__resv[1] || reg->__resv[2])
 		return -EINVAL;

From patchwork Sun Nov 10 14:56:22 2024
From: Pavel Begunkov
To: io-uring@vger.kernel.org
Cc: asml.silence@gmail.com
Subject: [RFC 3/3] io_uring: allow waiting loop to ignore some CQEs
Date: Sun, 10 Nov 2024 14:56:22 +0000

The user might not care about getting the results of certain requests,
but completing them will still wake up the task (i.e. task_work) and
cause the waiting loop to terminate. IOSQE_SET_F_HINT_SILENT attempts to
de-prioritise such completions. The completion will eventually be
posted, but the execution of the request can, and likely will, be
delayed to batch it with other requests.
This is an incomplete prototype: it only works with DEFER_TASKRUN, it
fails to apply the optimisation to task_work queued before the waiting
loop starts, and its interaction with IOSQE_SET_F_HINT_IGNORE_INLINE is
likely broken.

Signed-off-by: Pavel Begunkov
---
 include/uapi/linux/io_uring.h |  1 +
 io_uring/io_uring.c           | 43 +++++++++++++++++++++++------------
 io_uring/register.c           |  3 ++-
 3 files changed, 31 insertions(+), 16 deletions(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index e6d10fba8ae2..6dff0ee4e20c 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -901,6 +901,7 @@ struct io_uring_recvmsg_out {
 
 enum {
 	IOSQE_SET_F_HINT_IGNORE_INLINE	= 1,
+	IOSQE_SET_F_HINT_SILENT		= 2,
 };
 
 struct io_uring_ioset_reg {
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 6e89435c243d..2e1af10fd4f2 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1270,6 +1270,7 @@ static inline void io_req_local_work_add(struct io_kiocb *req,
 {
 	unsigned nr_wait, nr_tw, nr_tw_prev;
 	struct llist_node *head;
+	bool ignore = req->ioset->flags & IOSQE_SET_F_HINT_SILENT;
 
 	/* See comment above IO_CQ_WAKE_INIT */
 	BUILD_BUG_ON(IO_CQ_WAKE_FORCE <= IORING_MAX_CQ_ENTRIES);
@@ -1297,13 +1298,17 @@ static inline void io_req_local_work_add(struct io_kiocb *req,
 		nr_tw_prev = READ_ONCE(first_req->nr_tw);
 	}
 
-	/*
-	 * Theoretically, it can overflow, but that's fine as one of
-	 * previous adds should've tried to wake the task.
-	 */
-	nr_tw = nr_tw_prev + 1;
-	if (!(flags & IOU_F_TWQ_LAZY_WAKE))
-		nr_tw = IO_CQ_WAKE_FORCE;
+	nr_tw = nr_tw_prev;
+
+	if (!ignore) {
+		/*
+		 * Theoretically, it can overflow, but that's fine as
+		 * one of previous adds should've tried to wake the task.
+		 */
+		nr_tw += 1;
+		if (!(flags & IOU_F_TWQ_LAZY_WAKE))
+			nr_tw = IO_CQ_WAKE_FORCE;
+	}
 
 	req->nr_tw = nr_tw;
 	req->io_task_work.node.next = head;
@@ -1325,6 +1330,9 @@ static inline void io_req_local_work_add(struct io_kiocb *req,
 		io_eventfd_signal(ctx);
 	}
 
+	if (ignore)
+		return;
+
 	nr_wait = atomic_read(&ctx->cq_wait_nr);
 	/* not enough or no one is waiting */
 	if (nr_tw < nr_wait)
@@ -1405,7 +1413,7 @@ static bool io_run_local_work_continue(struct io_ring_ctx *ctx, int events,
 }
 
 static int __io_run_local_work(struct io_ring_ctx *ctx, struct io_tw_state *ts,
-			       int min_events)
+			       int min_events, struct io_wait_queue *waitq)
 {
 	struct llist_node *node;
 	unsigned int loops = 0;
@@ -1425,6 +1433,10 @@ static int __io_run_local_work(struct io_ring_ctx *ctx, struct io_tw_state *ts,
 		struct llist_node *next = node->next;
 		struct io_kiocb *req = container_of(node, struct io_kiocb,
 						    io_task_work.node);
+
+		if (req->ioset->flags & IOSQE_SET_F_HINT_SILENT)
+			waitq->cq_tail++;
+
 		INDIRECT_CALL_2(req->io_task_work.func,
 				io_poll_task_func, io_req_rw_complete,
 				req, ts);
@@ -1450,16 +1462,17 @@ static inline int io_run_local_work_locked(struct io_ring_ctx *ctx,
 
 	if (llist_empty(&ctx->work_llist))
 		return 0;
-	return __io_run_local_work(ctx, &ts, min_events);
+	return __io_run_local_work(ctx, &ts, min_events, NULL);
 }
 
-static int io_run_local_work(struct io_ring_ctx *ctx, int min_events)
+static int io_run_local_work(struct io_ring_ctx *ctx, int min_events,
+			     struct io_wait_queue *waitq)
 {
 	struct io_tw_state ts = {};
 	int ret;
 
 	mutex_lock(&ctx->uring_lock);
-	ret = __io_run_local_work(ctx, &ts, min_events);
+	ret = __io_run_local_work(ctx, &ts, min_events, waitq);
 	mutex_unlock(&ctx->uring_lock);
 	return ret;
 }
@@ -2643,7 +2656,7 @@ int io_run_task_work_sig(struct io_ring_ctx *ctx)
 {
 	if (!llist_empty(&ctx->work_llist)) {
 		__set_current_state(TASK_RUNNING);
-		if (io_run_local_work(ctx, INT_MAX) > 0)
+		if (io_run_local_work(ctx, INT_MAX, NULL) > 0)
 			return 0;
 	}
 	if (io_run_task_work() > 0)
@@ -2806,7 +2819,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags,
 	if (!io_allowed_run_tw(ctx))
 		return -EEXIST;
 	if (!llist_empty(&ctx->work_llist))
-		io_run_local_work(ctx, min_events);
+		io_run_local_work(ctx, min_events, NULL);
 	io_run_task_work();
 
 	if (unlikely(test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq)))
@@ -2877,7 +2890,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags,
 		 * now rather than let the caller do another wait loop.
 		 */
 		if (!llist_empty(&ctx->work_llist))
-			io_run_local_work(ctx, nr_wait);
+			io_run_local_work(ctx, nr_wait, &iowq);
 		io_run_task_work();
 
 		/*
@@ -3389,7 +3402,7 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 
 	if ((ctx->flags & IORING_SETUP_DEFER_TASKRUN) &&
 	    io_allowed_defer_tw_run(ctx))
-		ret |= io_run_local_work(ctx, INT_MAX) > 0;
+		ret |= io_run_local_work(ctx, INT_MAX, NULL) > 0;
 	ret |= io_cancel_defer_files(ctx, tctx, cancel_all);
 	mutex_lock(&ctx->uring_lock);
 	ret |= io_poll_remove_all(ctx, tctx, cancel_all);
diff --git a/io_uring/register.c b/io_uring/register.c
index f87ec7b773bd..5462c49bebd3 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -92,7 +92,8 @@ static int io_update_ioset(struct io_ring_ctx *ctx,
 {
 	if (!(ctx->flags & IORING_SETUP_IOSET))
 		return -EINVAL;
-	if (reg->flags & ~IOSQE_SET_F_HINT_IGNORE_INLINE)
+	if (reg->flags & ~(IOSQE_SET_F_HINT_IGNORE_INLINE |
+			   IOSQE_SET_F_HINT_SILENT))
 		return -EINVAL;
 	if (reg->__resv[0] || reg->__resv[1] || reg->__resv[2])
 		return -EINVAL;