From patchwork Wed Dec 1 04:23:24 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kumar Kartikeya Dwivedi
X-Patchwork-Id: 12649003
X-Patchwork-Delegate: bpf@iogearbox.net
From: Kumar Kartikeya Dwivedi
To: bpf@vger.kernel.org
Cc: Jens Axboe, Pavel Begunkov, io-uring@vger.kernel.org,
 Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Yonghong Song,
 Pavel Emelyanov, Alexander Mikhalitsyn, Andrei Vagin, criu@openvz.org,
 linux-fsdevel@vger.kernel.org
Subject: [PATCH bpf-next v3 01/10] io_uring: Implement eBPF iterator for
 registered buffers
Date: Wed, 1 Dec 2021 09:53:24 +0530
Message-Id: <20211201042333.2035153-2-memxor@gmail.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20211201042333.2035153-1-memxor@gmail.com>
References: <20211201042333.2035153-1-memxor@gmail.com>
Precedence: bulk
X-Mailing-List: bpf@vger.kernel.org

This change adds an eBPF iterator for buffers registered in an io_uring
ctx. It gives access to the ctx, the index of the registered buffer, and
a pointer to the io_mapped_ubuf itself. This allows the iterator to save
information about buffers added to an io_uring instance that is not easy
to export using the fdinfo interface (such as the exact struct pages
composing the registered buffer).

The primary use case this enables is checkpoint/restore support.

Note that we need to use mutex_trylock when the file is read from, in
the seq_start functions, because the lock order is the opposite of what
it would be when an io_uring operation reads the same file. We take
seq_file->lock, then ctx->uring_lock, while io_uring would first take
ctx->uring_lock and then seq_file->lock for the same ctx.

This can lead to the deadlock scenario described below:

The sequence on CPU 0 is for a normal read(2) on the iterator. On CPU 1,
an io_uring instance is trying to do the same on an iterator attached to
itself.

So CPU 0 does

sys_read
vfs_read
 bpf_seq_read
 mutex_lock(&seq_file->lock)	# A
  bpf_io_uring_buf_seq_start
  mutex_lock(&ctx->uring_lock)	# B

and CPU 1 does

io_uring_enter
mutex_lock(&ctx->uring_lock)	# B
 io_read
  bpf_seq_read
  mutex_lock(&seq_file->lock)	# A
  ...

Since the two paths take the locks in opposite order, they can deadlock.
So we switch the mutex_lock in bpf_io_uring_buf_seq_start to a trylock,
so that it returns an error in this case; the reader then releases
seq_file->lock and CPU 1 makes progress.

The trylock also protects the case where io_uring tries to read from an
iterator attached to itself (same ctx), where the order of locks would
be:

io_uring_enter
mutex_lock(&ctx->uring_lock) <------------.
 io_read                                   \
  seq_read                                  \
   mutex_lock(&seq_file->lock)              /
   mutex_lock(&ctx->uring_lock) # deadlock-`

In both these cases (recursive read and contended uring_lock), -EDEADLK
is returned to userspace.

In the future, this iterator will be extended to directly support
iteration of the bvec Flexible Array Member, so that when there is no
corresponding VMA that maps to the registered buffer (e.g. if the VMA is
destroyed after pinning pages), we are able to reconstruct the
registration on restore by dumping the page contents and then replaying
them into a temporary mapping used for registration later. All of this
is out of scope for the current series, but it builds upon this
iterator.
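
As an illustration only (not part of this patch), the lock-order rule
described above can be modelled in userspace. The sketch below assumes a
toy lock built on C11 atomic_flag; toy_mutex, toy_trylock, toy_unlock
and toy_iter_read are hypothetical names, and the kernel's struct mutex
differs in many respects (it sleeps, tracks an owner, and so on). It
only shows the shape of the fix: the reader takes lock A, merely tries
lock B, and backs out with -EDEADLK instead of waiting.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a kernel mutex; not the kernel's struct mutex. */
struct toy_mutex {
	atomic_flag held;
};

/* Returns true if the lock was acquired, mirroring mutex_trylock(). */
static bool toy_trylock(struct toy_mutex *m)
{
	return !atomic_flag_test_and_set(&m->held);
}

static void toy_unlock(struct toy_mutex *m)
{
	atomic_flag_clear(&m->held);
}

static struct toy_mutex seq_lock = { ATOMIC_FLAG_INIT };   /* lock A */
static struct toy_mutex uring_lock = { ATOMIC_FLAG_INIT }; /* lock B */

/* Reader path: always takes A first, then only *tries* B. */
static int toy_iter_read(void)
{
	int ret = 0;

	/* A real reader would sleep on A; the toy simply spins. */
	while (!toy_trylock(&seq_lock))
		;

	if (!toy_trylock(&uring_lock)) {
		/* B is held elsewhere: report it instead of waiting. */
		ret = -EDEADLK;
	} else {
		/* ... walk the registered buffers here ... */
		toy_unlock(&uring_lock);
	}

	toy_unlock(&seq_lock);
	return ret;
}
```

Because the reader never waits on B while holding A, the A-then-B and
B-then-A paths cannot block each other forever; the cost is that the
caller has to cope with a spurious -EDEADLK.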

Cc: Jens Axboe
Cc: Pavel Begunkov
Cc: io-uring@vger.kernel.org
Signed-off-by: Kumar Kartikeya Dwivedi
---
 fs/io_uring.c                  | 203 +++++++++++++++++++++++++++++++++
 include/linux/bpf.h            |  12 ++
 include/uapi/linux/bpf.h       |   6 +
 tools/include/uapi/linux/bpf.h |   6 +
 4 files changed, 227 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index b07196b4511c..02e628448ebd 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -81,6 +81,7 @@
 #include
 #include
 #include
+#include
 
 #define CREATE_TRACE_POINTS
 #include
@@ -11125,3 +11126,205 @@ static int __init io_uring_init(void)
 	return 0;
 };
 __initcall(io_uring_init);
+
+#ifdef CONFIG_BPF_SYSCALL
+
+BTF_ID_LIST(btf_io_uring_ids)
+BTF_ID(struct, io_ring_ctx)
+BTF_ID(struct, io_mapped_ubuf)
+
+struct bpf_io_uring_seq_info {
+	struct io_ring_ctx *ctx;
+	u64 index;
+};
+
+static int bpf_io_uring_init_seq(void *priv_data, struct bpf_iter_aux_info *aux)
+{
+	struct bpf_io_uring_seq_info *info = priv_data;
+	struct io_ring_ctx *ctx = aux->io_uring.ctx;
+
+	info->ctx = ctx;
+	return 0;
+}
+
+static int bpf_io_uring_iter_attach(struct bpf_prog *prog,
+				    union bpf_iter_link_info *linfo,
+				    struct bpf_iter_aux_info *aux)
+{
+	struct io_ring_ctx *ctx;
+	struct fd f;
+	int ret;
+
+	f = fdget(linfo->io_uring.io_uring_fd);
+	if (unlikely(!f.file))
+		return -EBADF;
+
+	ret = -EOPNOTSUPP;
+	if (unlikely(f.file->f_op != &io_uring_fops))
+		goto out_fput;
+
+	ret = -ENXIO;
+	ctx = f.file->private_data;
+	if (unlikely(!percpu_ref_tryget(&ctx->refs)))
+		goto out_fput;
+
+	ret = 0;
+	aux->io_uring.ctx = ctx;
+	/* each io_uring file's inode is unique, since it uses
+	 * anon_inode_getfile_secure, which can be used to search
+	 * through files and map link fd back to the io_uring.
+	 */
+	aux->io_uring.inode = f.file->f_inode->i_ino;
+
+out_fput:
+	fdput(f);
+	return ret;
+}
+
+static void bpf_io_uring_iter_detach(struct bpf_iter_aux_info *aux)
+{
+	percpu_ref_put(&aux->io_uring.ctx->refs);
+}
+
+#ifdef CONFIG_PROC_FS
+static void bpf_io_uring_iter_show_fdinfo(const struct bpf_iter_aux_info *aux,
+					  struct seq_file *seq)
+{
+	seq_printf(seq, "io_uring_inode:\t%lu\n", aux->io_uring.inode);
+}
+#endif
+
+static int bpf_io_uring_iter_fill_link_info(const struct bpf_iter_aux_info *aux,
+					    struct bpf_link_info *info)
+{
+	info->iter.io_uring.inode = aux->io_uring.inode;
+	return 0;
+}
+
+/* io_uring iterator for registered buffers */
+
+struct bpf_iter__io_uring_buf {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	__bpf_md_ptr(struct io_ring_ctx *, ctx);
+	__bpf_md_ptr(struct io_mapped_ubuf *, ubuf);
+	u64 index;
+};
+
+static void *__bpf_io_uring_buf_seq_get_next(struct bpf_io_uring_seq_info *info)
+{
+	if (info->index < info->ctx->nr_user_bufs)
+		return info->ctx->user_bufs[info->index++];
+	return NULL;
+}
+
+static void *bpf_io_uring_buf_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct bpf_io_uring_seq_info *info = seq->private;
+	struct io_mapped_ubuf *ubuf;
+
+	/* Indicate to userspace that the uring lock is contended */
+	if (!mutex_trylock(&info->ctx->uring_lock))
+		return ERR_PTR(-EDEADLK);
+
+	ubuf = __bpf_io_uring_buf_seq_get_next(info);
+	if (!ubuf)
+		return NULL;
+
+	if (*pos == 0)
+		++*pos;
+	return ubuf;
+}
+
+static void *bpf_io_uring_buf_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct bpf_io_uring_seq_info *info = seq->private;
+
+	++*pos;
+	return __bpf_io_uring_buf_seq_get_next(info);
+}
+
+DEFINE_BPF_ITER_FUNC(io_uring_buf, struct bpf_iter_meta *meta,
+		     struct io_ring_ctx *ctx, struct io_mapped_ubuf *ubuf,
+		     u64 index)
+
+static int __bpf_io_uring_buf_seq_show(struct seq_file *seq, void *v, bool in_stop)
+{
+	struct bpf_io_uring_seq_info *info = seq->private;
+	struct bpf_iter__io_uring_buf ctx;
+	struct bpf_iter_meta meta;
+	struct bpf_prog *prog;
+
+	meta.seq = seq;
+	prog = bpf_iter_get_info(&meta, in_stop);
+	if (!prog)
+		return 0;
+
+	ctx.meta = &meta;
+	ctx.ctx = info->ctx;
+	ctx.ubuf = v;
+	ctx.index = info->index ? info->index - !in_stop : 0;
+
+	return bpf_iter_run_prog(prog, &ctx);
+}
+
+static int bpf_io_uring_buf_seq_show(struct seq_file *seq, void *v)
+{
+	return __bpf_io_uring_buf_seq_show(seq, v, false);
+}
+
+static void bpf_io_uring_buf_seq_stop(struct seq_file *seq, void *v)
+{
+	struct bpf_io_uring_seq_info *info = seq->private;
+
+	/* If IS_ERR(v) is true, then ctx->uring_lock wasn't taken */
+	if (IS_ERR(v))
+		return;
+	if (!v)
+		__bpf_io_uring_buf_seq_show(seq, v, true);
+	else if (info->index) /* restart from index */
+		info->index--;
+	mutex_unlock(&info->ctx->uring_lock);
+}
+
+static const struct seq_operations bpf_io_uring_buf_seq_ops = {
+	.start = bpf_io_uring_buf_seq_start,
+	.next = bpf_io_uring_buf_seq_next,
+	.stop = bpf_io_uring_buf_seq_stop,
+	.show = bpf_io_uring_buf_seq_show,
+};
+
+static const struct bpf_iter_seq_info bpf_io_uring_buf_seq_info = {
+	.seq_ops = &bpf_io_uring_buf_seq_ops,
+	.init_seq_private = bpf_io_uring_init_seq,
+	.fini_seq_private = NULL,
+	.seq_priv_size = sizeof(struct bpf_io_uring_seq_info),
+};
+
+static struct bpf_iter_reg io_uring_buf_reg_info = {
+	.target = "io_uring_buf",
+	.feature = BPF_ITER_RESCHED,
+	.attach_target = bpf_io_uring_iter_attach,
+	.detach_target = bpf_io_uring_iter_detach,
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo = bpf_io_uring_iter_show_fdinfo,
+#endif
+	.fill_link_info = bpf_io_uring_iter_fill_link_info,
+	.ctx_arg_info_size = 2,
+	.ctx_arg_info = {
+		{ offsetof(struct bpf_iter__io_uring_buf, ctx),
+		  PTR_TO_BTF_ID },
+		{ offsetof(struct bpf_iter__io_uring_buf, ubuf),
+		  PTR_TO_BTF_ID_OR_NULL },
+	},
+	.seq_info = &bpf_io_uring_buf_seq_info,
+};
+
+static int __init io_uring_iter_init(void)
+{
+	io_uring_buf_reg_info.ctx_arg_info[0].btf_id = btf_io_uring_ids[0];
+	io_uring_buf_reg_info.ctx_arg_info[1].btf_id = btf_io_uring_ids[1];
+	return bpf_iter_reg_target(&io_uring_buf_reg_info);
+}
+late_initcall(io_uring_iter_init);
+
+#endif
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index cc7a0c36e7df..967842881024 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1509,8 +1509,20 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
 	extern int bpf_iter_ ## target(args); \
 	int __init bpf_iter_ ## target(args) { return 0; }
 
+struct io_ring_ctx;
+
 struct bpf_iter_aux_info {
+	/* Map member must not alias any other members, due to the check in
+	 * bpf_trace.c:__get_seq_info, since in case of map the seq_ops for
+	 * iterator is different from others. The seq_ops is not from main
+	 * iter registration but from map_ops. Nullability of 'map' allows
+	 * to skip this check for non-map iterator cheaply.
+	 */
 	struct bpf_map *map;
+	struct {
+		struct io_ring_ctx *ctx;
+		ino_t inode;
+	} io_uring;
 };
 
 typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a69e4b04ffeb..1ad1ae85743c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -91,6 +91,9 @@ union bpf_iter_link_info {
 	struct {
 		__u32 map_fd;
 	} map;
+	struct {
+		__u32 io_uring_fd;
+	} io_uring;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for more details. */
@@ -5720,6 +5723,9 @@ struct bpf_link_info {
 				struct {
 					__u32 map_id;
 				} map;
+				struct {
+					__u64 inode;
+				} io_uring;
 			};
 		} iter;
 		struct {
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index a69e4b04ffeb..1ad1ae85743c 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -91,6 +91,9 @@ union bpf_iter_link_info {
 	struct {
 		__u32 map_fd;
 	} map;
+	struct {
+		__u32 io_uring_fd;
+	} io_uring;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for more details.
 */
@@ -5720,6 +5723,9 @@ struct bpf_link_info {
 				struct {
 					__u32 map_id;
 				} map;
+				struct {
+					__u64 inode;
+				} io_uring;
 			};
 		} iter;
 		struct {

From patchwork Wed Dec 1 04:23:25 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kumar Kartikeya Dwivedi
X-Patchwork-Id: 12649005
X-Patchwork-Delegate: bpf@iogearbox.net
From: Kumar Kartikeya Dwivedi
To: bpf@vger.kernel.org
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Yonghong Song,
 Pavel Emelyanov, Alexander Mikhalitsyn, Andrei Vagin, criu@openvz.org,
 io-uring@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: [PATCH bpf-next v3 02/10] bpf: Add bpf_page_to_pfn helper
Date: Wed, 1 Dec 2021 09:53:25 +0530
Message-Id: <20211201042333.2035153-3-memxor@gmail.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20211201042333.2035153-1-memxor@gmail.com>
References: <20211201042333.2035153-1-memxor@gmail.com>
Precedence: bulk
X-Mailing-List: bpf@vger.kernel.org

In CRIU, we need to determine whether a page pinned by io_uring is still
present in the same range in the process VMA. /proc/<pid>/pagemap gives
us the PFN, so with this helper we can establish this mapping easily
from the iterator side.

It is a simple wrapper over the in-kernel page_to_pfn macro, and ensures
that the pointer passed in is a struct page PTR_TO_BTF_ID. For the CRIU
use case, this pointer is obtained from the bvec of io_mapped_ubuf.
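
For the userspace half of that comparison, a pagemap entry has to be
decoded before its PFN can be matched against the value the helper
returns. The sketch below is an illustration, not part of this patch. It
assumes the documented pagemap layout (one 64-bit entry per virtual
page, PFN in bits 0-54, "present" in bit 63); PM_PFN_MASK, PM_PRESENT,
pagemap_entry_pfn and pagemap_offset are hypothetical names chosen here.
Note that the kernel reports the PFN as zero to readers without
CAP_SYS_ADMIN, so a tool doing this needs that capability.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed pagemap entry layout: PFN in bits 0-54, present flag in bit 63. */
#define PM_PFN_MASK ((UINT64_C(1) << 55) - 1)
#define PM_PRESENT  (UINT64_C(1) << 63)

/* Byte offset of the pagemap entry that describes vaddr. */
static uint64_t pagemap_offset(uint64_t vaddr, uint64_t page_size)
{
	assert(page_size != 0);
	return (vaddr / page_size) * sizeof(uint64_t);
}

/* Extracts the PFN; returns false if the page is not present in RAM. */
static bool pagemap_entry_pfn(uint64_t entry, uint64_t *pfn)
{
	if (!(entry & PM_PRESENT))
		return false;
	*pfn = entry & PM_PFN_MASK;
	return true;
}
```

A checkpoint tool would read eight bytes at pagemap_offset() for each
page of the candidate range and compare the decoded PFN with the one
emitted by the iterator program.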

Signed-off-by: Kumar Kartikeya Dwivedi
---
 include/linux/bpf.h            |  1 +
 include/uapi/linux/bpf.h       |  9 +++++++++
 kernel/trace/bpf_trace.c       | 19 +++++++++++++++++++
 scripts/bpf_doc.py             |  2 ++
 tools/include/uapi/linux/bpf.h |  9 +++++++++
 5 files changed, 40 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 967842881024..e44503158d76 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2176,6 +2176,7 @@ extern const struct bpf_func_proto bpf_sk_setsockopt_proto;
 extern const struct bpf_func_proto bpf_sk_getsockopt_proto;
 extern const struct bpf_func_proto bpf_kallsyms_lookup_name_proto;
 extern const struct bpf_func_proto bpf_find_vma_proto;
+extern const struct bpf_func_proto bpf_page_to_pfn_proto;
 
 const struct bpf_func_proto *tracing_prog_func_proto(
   enum bpf_func_id func_id, const struct bpf_prog *prog);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1ad1ae85743c..885d9293c147 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4960,6 +4960,14 @@ union bpf_attr {
  *		**-ENOENT** if *task->mm* is NULL, or no vma contains *addr*.
  *		**-EBUSY** if failed to try lock mmap_lock.
  *		**-EINVAL** for invalid **flags**.
+ *
+ * long bpf_page_to_pfn(struct page *page)
+ *	Description
+ *		Obtain the page frame number (PFN) for the given *struct page*
+ *		pointer.
+ *	Return
+ *		Page Frame Number corresponding to the page pointed to by the
+ *		*struct page* pointer, or U64_MAX if pointer is NULL.
  */
 #define __BPF_FUNC_MAPPER(FN) \
 	FN(unspec), \
@@ -5143,6 +5151,7 @@ union bpf_attr {
 	FN(skc_to_unix_sock), \
 	FN(kallsyms_lookup_name), \
 	FN(find_vma), \
+	FN(page_to_pfn), \
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 25ea521fb8f1..2a6488f14e58 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1091,6 +1091,23 @@ static const struct bpf_func_proto bpf_get_branch_snapshot_proto = {
 	.arg2_type = ARG_CONST_SIZE_OR_ZERO,
 };
 
+BPF_CALL_1(bpf_page_to_pfn, struct page *, page)
+{
+	/* PTR_TO_BTF_ID can be NULL */
+	if (!page)
+		return U64_MAX;
+	return page_to_pfn(page);
+}
+
+BTF_ID_LIST_SINGLE(btf_page_to_pfn_ids, struct, page)
+
+const struct bpf_func_proto bpf_page_to_pfn_proto = {
+	.func = bpf_page_to_pfn,
+	.ret_type = RET_INTEGER,
+	.arg1_type = ARG_PTR_TO_BTF_ID,
+	.arg1_btf_id = &btf_page_to_pfn_ids[0],
+};
+
 static const struct bpf_func_proto *
 bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -1212,6 +1229,8 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_find_vma_proto;
 	case BPF_FUNC_trace_vprintk:
 		return bpf_get_trace_vprintk_proto();
+	case BPF_FUNC_page_to_pfn:
+		return &bpf_page_to_pfn_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
index a6403ddf5de7..ae68ca794980 100755
--- a/scripts/bpf_doc.py
+++ b/scripts/bpf_doc.py
@@ -549,6 +549,7 @@ class PrinterHelpers(Printer):
             'struct socket',
             'struct file',
             'struct bpf_timer',
+            'struct page',
     ]
     known_types = {
             '...',
@@ -598,6 +599,7 @@ class PrinterHelpers(Printer):
             'struct socket',
             'struct file',
             'struct bpf_timer',
+            'struct page',
     }
     mapped_types = {
             'u8': '__u8',
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 1ad1ae85743c..885d9293c147 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4960,6 +4960,14 @@ union bpf_attr {
  *		**-ENOENT** if *task->mm* is NULL, or no vma contains *addr*.
  *		**-EBUSY** if failed to try lock mmap_lock.
  *		**-EINVAL** for invalid **flags**.
+ *
+ * long bpf_page_to_pfn(struct page *page)
+ *	Description
+ *		Obtain the page frame number (PFN) for the given *struct page*
+ *		pointer.
+ *	Return
+ *		Page Frame Number corresponding to the page pointed to by the
+ *		*struct page* pointer, or U64_MAX if pointer is NULL.
  */
 #define __BPF_FUNC_MAPPER(FN) \
 	FN(unspec), \
@@ -5143,6 +5151,7 @@ union bpf_attr {
 	FN(skc_to_unix_sock), \
 	FN(kallsyms_lookup_name), \
 	FN(find_vma), \
+	FN(page_to_pfn), \
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper

From patchwork Wed Dec 1 04:23:26 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kumar Kartikeya Dwivedi
X-Patchwork-Id: 12649007
X-Patchwork-Delegate: bpf@iogearbox.net
From: Kumar Kartikeya Dwivedi
To: bpf@vger.kernel.org
Cc: Jens Axboe, Pavel Begunkov, io-uring@vger.kernel.org,
 Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Yonghong Song,
 Pavel Emelyanov, Alexander Mikhalitsyn, Andrei Vagin, criu@openvz.org,
 linux-fsdevel@vger.kernel.org
Subject: [PATCH bpf-next v3 03/10] io_uring: Implement eBPF iterator for
 registered files
Date: Wed, 1 Dec 2021 09:53:26 +0530
Message-Id: <20211201042333.2035153-4-memxor@gmail.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20211201042333.2035153-1-memxor@gmail.com>
References: <20211201042333.2035153-1-memxor@gmail.com>
Precedence: bulk
X-Mailing-List: bpf@vger.kernel.org

This change adds an eBPF iterator for files registered in an io_uring
ctx. It gives access to the ctx, the index of the registered file, and a
pointer to the struct file itself. This allows the iterator to save
information about files added to an io_uring instance that is not easy
to export using the fdinfo interface (such as being able to match
registered files to a task's file set).
Getting access to the underlying struct file allows deduplication and
efficient pairing with the task file set (obtained using the task_file
iterator).

The primary use case this enables is checkpoint/restore support.

Note that we need to use mutex_trylock when the file is read from, in
the seq_start functions, because the lock order is the opposite of what
it would be when an io_uring operation reads the same file. We take
seq_file->lock, then ctx->uring_lock, while io_uring would first take
ctx->uring_lock and then seq_file->lock for the same ctx.

This can lead to the deadlock scenario described below:

The sequence on CPU 0 is for a normal read(2) on the iterator. On CPU 1,
an io_uring instance is trying to do the same on an iterator attached to
itself.

So CPU 0 does

sys_read
vfs_read
 bpf_seq_read
 mutex_lock(&seq_file->lock)	# A
  bpf_io_uring_file_seq_start
  mutex_lock(&ctx->uring_lock)	# B

and CPU 1 does

io_uring_enter
mutex_lock(&ctx->uring_lock)	# B
 io_read
  bpf_seq_read
  mutex_lock(&seq_file->lock)	# A
  ...

Since the two paths take the locks in opposite order, they can deadlock.
So we switch the mutex_lock in bpf_io_uring_file_seq_start to a trylock,
so that it returns an error in this case; the reader then releases
seq_file->lock and CPU 1 makes progress.

The trylock also protects the case where io_uring tries to read from an
iterator attached to itself (same ctx), where the order of locks would
be:

io_uring_enter
mutex_lock(&ctx->uring_lock) <------------.
 io_read                                   \
  seq_read                                  \
   mutex_lock(&seq_file->lock)              /
   mutex_lock(&ctx->uring_lock) # deadlock-`

In both these cases (recursive read and contended uring_lock), -EDEADLK
is returned to userspace.

With the advent of descriptorless files supported by io_uring, this
iterator provides the visibility into an io_uring instance that is
required for dumping and restoring it.
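
Since both the recursive and the contended case surface as -EDEADLK, a
userspace reader of the iterator (for example a checkpoint tool) may
want to retry when the uring_lock was merely busy. A minimal sketch,
assuming a bounded retry with a yield between attempts is acceptable;
read_retry_edeadlk and read_fn_t are hypothetical names, not part of
this series. A retry cannot help in the recursive case, so the bound
matters.

```c
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Same signature as read(2), so callers can pass read or a test double. */
typedef ssize_t (*read_fn_t)(int fd, void *buf, size_t len);

/*
 * Calls do_read() up to max_tries times, retrying only while it fails
 * with EDEADLK. Any other result is returned to the caller unchanged.
 */
static ssize_t read_retry_edeadlk(read_fn_t do_read, int fd, void *buf,
				  size_t len, int max_tries)
{
	ssize_t n = -1;

	assert(do_read != NULL);
	if (max_tries <= 0) {
		errno = EINVAL;
		return -1;
	}

	for (int i = 0; i < max_tries; i++) {
		n = do_read(fd, buf, len);
		if (n >= 0 || errno != EDEADLK)
			return n;
		/* Give the current lock holder a chance to finish. */
		sched_yield();
	}
	return n;
}
```

Passing read itself as do_read gives the normal behaviour; the function
pointer is there only so the retry logic can be exercised without a real
iterator fd.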

In the future, this iterator will be extended to support direct
inspection of more file state, in order to dump the state of these
hidden files (currently, descriptorless files are obtained using openat2
and socket). Later, we can explore filling in the gaps for dumping file
state for more file types (those not hidden in an io_uring ctx). All of
this is out of scope for the current series, but it builds upon this
iterator.

Cc: Jens Axboe
Cc: Pavel Begunkov
Cc: io-uring@vger.kernel.org
Signed-off-by: Kumar Kartikeya Dwivedi
---
 fs/io_uring.c | 144 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 143 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 02e628448ebd..28348fce81dc 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -11132,6 +11132,7 @@ __initcall(io_uring_init);
 BTF_ID_LIST(btf_io_uring_ids)
 BTF_ID(struct, io_ring_ctx)
 BTF_ID(struct, io_mapped_ubuf)
+BTF_ID(struct, file)
 
 struct bpf_io_uring_seq_info {
 	struct io_ring_ctx *ctx;
@@ -11319,11 +11320,152 @@ static struct bpf_iter_reg io_uring_buf_reg_info = {
 	.seq_info = &bpf_io_uring_buf_seq_info,
 };
 
+/* io_uring iterator for registered files */
+
+struct bpf_iter__io_uring_file {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	__bpf_md_ptr(struct io_ring_ctx *, ctx);
+	__bpf_md_ptr(struct file *, file);
+	u64 index;
+};
+
+static void *__bpf_io_uring_file_seq_get_next(struct bpf_io_uring_seq_info *info)
+{
+	struct file *file = NULL;
+
+	if (info->index < info->ctx->nr_user_files) {
+		/* file set can be sparse */
+		file = io_file_from_index(info->ctx, info->index++);
+		/* use info as a distinct pointer to distinguish between empty
+		 * slot and valid file, since we cannot return NULL for this
+		 * case if we want iter prog to still be invoked with file ==
+		 * NULL.
+ */ + if (!file) + return info; + } + + return file; +} + +static void *bpf_io_uring_file_seq_start(struct seq_file *seq, loff_t *pos) +{ + struct bpf_io_uring_seq_info *info = seq->private; + struct file *file; + + /* Indicate to userspace that the uring lock is contended */ + if (!mutex_trylock(&info->ctx->uring_lock)) + return ERR_PTR(-EDEADLK); + + file = __bpf_io_uring_file_seq_get_next(info); + if (!file) + return NULL; + + if (*pos == 0) + ++*pos; + return file; +} + +static void *bpf_io_uring_file_seq_next(struct seq_file *seq, void *v, loff_t *pos) +{ + struct bpf_io_uring_seq_info *info = seq->private; + + ++*pos; + return __bpf_io_uring_file_seq_get_next(info); +} + +DEFINE_BPF_ITER_FUNC(io_uring_file, struct bpf_iter_meta *meta, + struct io_ring_ctx *ctx, struct file *file, + u64 index) + +static int __bpf_io_uring_file_seq_show(struct seq_file *seq, void *v, bool in_stop) +{ + struct bpf_io_uring_seq_info *info = seq->private; + struct bpf_iter__io_uring_file ctx; + struct bpf_iter_meta meta; + struct bpf_prog *prog; + + meta.seq = seq; + prog = bpf_iter_get_info(&meta, in_stop); + if (!prog) + return 0; + + ctx.meta = &meta; + ctx.ctx = info->ctx; + /* when we encounter empty slot, v will point to info */ + ctx.file = v == info ? NULL : v; + ctx.index = info->index ? 
info->index - !in_stop : 0; + + return bpf_iter_run_prog(prog, &ctx); +} + +static int bpf_io_uring_file_seq_show(struct seq_file *seq, void *v) +{ + return __bpf_io_uring_file_seq_show(seq, v, false); +} + +static void bpf_io_uring_file_seq_stop(struct seq_file *seq, void *v) +{ + struct bpf_io_uring_seq_info *info = seq->private; + + /* If IS_ERR(v) is true, then ctx->uring_lock wasn't taken */ + if (IS_ERR(v)) + return; + if (!v) + __bpf_io_uring_file_seq_show(seq, v, true); + else if (info->index) /* restart from index */ + info->index--; + mutex_unlock(&info->ctx->uring_lock); +} + +static const struct seq_operations bpf_io_uring_file_seq_ops = { + .start = bpf_io_uring_file_seq_start, + .next = bpf_io_uring_file_seq_next, + .stop = bpf_io_uring_file_seq_stop, + .show = bpf_io_uring_file_seq_show, +}; + +static const struct bpf_iter_seq_info bpf_io_uring_file_seq_info = { + .seq_ops = &bpf_io_uring_file_seq_ops, + .init_seq_private = bpf_io_uring_init_seq, + .fini_seq_private = NULL, + .seq_priv_size = sizeof(struct bpf_io_uring_seq_info), +}; + +static struct bpf_iter_reg io_uring_file_reg_info = { + .target = "io_uring_file", + .feature = BPF_ITER_RESCHED, + .attach_target = bpf_io_uring_iter_attach, + .detach_target = bpf_io_uring_iter_detach, +#ifdef CONFIG_PROC_FS + .show_fdinfo = bpf_io_uring_iter_show_fdinfo, +#endif + .fill_link_info = bpf_io_uring_iter_fill_link_info, + .ctx_arg_info_size = 2, + .ctx_arg_info = { + { offsetof(struct bpf_iter__io_uring_file, ctx), + PTR_TO_BTF_ID }, + { offsetof(struct bpf_iter__io_uring_file, file), + PTR_TO_BTF_ID_OR_NULL }, + }, + .seq_info = &bpf_io_uring_file_seq_info, +}; + static int __init io_uring_iter_init(void) { + int ret; + io_uring_buf_reg_info.ctx_arg_info[0].btf_id = btf_io_uring_ids[0]; io_uring_buf_reg_info.ctx_arg_info[1].btf_id = btf_io_uring_ids[1]; - return bpf_iter_reg_target(&io_uring_buf_reg_info); + io_uring_file_reg_info.ctx_arg_info[0].btf_id = btf_io_uring_ids[0]; + 
io_uring_file_reg_info.ctx_arg_info[1].btf_id = btf_io_uring_ids[2]; + ret = bpf_iter_reg_target(&io_uring_buf_reg_info); + if (ret) + return ret; + ret = bpf_iter_reg_target(&io_uring_file_reg_info); + if (ret) + bpf_iter_unreg_target(&io_uring_buf_reg_info); + return ret; } late_initcall(io_uring_iter_init); From patchwork Wed Dec 1 04:23:27 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kumar Kartikeya Dwivedi X-Patchwork-Id: 12649017 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2E82DC433F5 for ; Wed, 1 Dec 2021 04:24:52 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1346614AbhLAE16 (ORCPT ); Tue, 30 Nov 2021 23:27:58 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49450 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1346574AbhLAE1L (ORCPT ); Tue, 30 Nov 2021 23:27:11 -0500 Received: from mail-pg1-x542.google.com (mail-pg1-x542.google.com [IPv6:2607:f8b0:4864:20::542]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 09A3AC061746; Tue, 30 Nov 2021 20:23:50 -0800 (PST) Received: by mail-pg1-x542.google.com with SMTP id 133so4068258pgc.12; Tue, 30 Nov 2021 20:23:50 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=ZcDIyVgD0V4zxLYUzV02cdFgM6F9geNX6kVfcxOJQGc=; b=J6kafzB86ed2SgDRMaCwfooMkatzrpQn93Z/8ri4MqxCmfp7k3oHRtwIx04U+C/NnV x/0nemByOdVDex0nQKZcpfr7HjLlCWJwBF0J5rJ29bUwF0g+ONnroQnUQ/KHr5tkAXcU tJiU2Q772Y8ylTCSiyPjcS1DKa3cSe++Zu6FkqmS1Gvgbp75kv5aGRwNU7Moar1/YcGv 
0zZiXgFCtPE8Nc5TwgyDOyalCvHwYe9Y+93BdEE4jpC4N4gpZ9850rKdj2IiuFBg3pGx iML+qS+xRQ3AEHZSLx9gBE1CarXilO9qxQBVBZas0F1H+JEv3GC+fOlrGmDOhAfT5NQv PTxg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=ZcDIyVgD0V4zxLYUzV02cdFgM6F9geNX6kVfcxOJQGc=; b=xcHmv1wbgMAzi+W2j0tlLzjm1BIxtv/r4tsxWwqWJTwR45X2I/4fXrT/+egkFXQUOt JZhSkyw+A+Xj+j/rlst5dH4nhBBSR0340aNB5wAnANfdKsvzU8hgBJsmpAhg6agxn0V2 pL6gDv49zu7G5VQIdD1vxiSdBiVIc7FSQDIe1LCN02bsZAALBrMbieKcx5YPtnSFdVG2 +M5XqA9Qa4HSBqaZnPhetbPlQw+oTcx4ioFtcdOkgM2LdEP8o4uebyPbArU1FFWwjYPk ag3LV+NpKZzScrj84IVjrFnXeH2BS/3QbMSyyZ+wonQO86+zF4D7sLsxzRMFHrLwEjPC rFrQ== X-Gm-Message-State: AOAM532q4h4mJ1usBHM/6WlQQfCU3SSppkN/D6HhaP53g4zui7L7c6kO RnldFub5oTJOlCS/5k8g2pwb/qd9hhc= X-Google-Smtp-Source: ABdhPJx3GylUCsKJ8L4T333v7+S3a5KuALzB3z4C3uwUbINhSxuhR0WIsuJ/SVKlOJnrAIGzJJgUPQ== X-Received: by 2002:a63:dd10:: with SMTP id t16mr2906931pgg.318.1638332629147; Tue, 30 Nov 2021 20:23:49 -0800 (PST) Received: from localhost ([2405:201:6014:d064:3d4e:6265:800c:dc84]) by smtp.gmail.com with ESMTPSA id j34sm16002062pgj.42.2021.11.30.20.23.48 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 30 Nov 2021 20:23:48 -0800 (PST) From: Kumar Kartikeya Dwivedi To: bpf@vger.kernel.org Cc: Alexander Viro , linux-fsdevel@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Yonghong Song , Pavel Emelyanov , Alexander Mikhalitsyn , Andrei Vagin , criu@openvz.org, io-uring@vger.kernel.org Subject: [PATCH bpf-next v3 04/10] epoll: Implement eBPF iterator for registered items Date: Wed, 1 Dec 2021 09:53:27 +0530 Message-Id: <20211201042333.2035153-5-memxor@gmail.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20211201042333.2035153-1-memxor@gmail.com> References: <20211201042333.2035153-1-memxor@gmail.com> MIME-Version: 1.0 X-Developer-Signature: v=1; a=openpgp-sha256; 
l=10997; h=from:subject; bh=FF86W5j4+tn80bzEFwJZI9lmub+R7F6lxJCqnNdc8mM=; b=owEBbQKS/ZANAwAIAUzgyIZIvxHKAcsmYgBhpvYxgWU/izzAus4xi5Iq9iukKMItxsmiwJbWU635 dH1rzfqJAjMEAAEIAB0WIQRLvip+Buz51YI8YRFM4MiGSL8RygUCYab2MQAKCRBM4MiGSL8Ryo9kEA CEPsAVxFxoP7wTicF2bAEueVFN0PLkFi0DGkhlXaHvPxn4oX8hFI9gWAgRx4TcfwN50QWwzErR1wKz 4kd8HdYISWrqxNTLa70PK5fO8+TD1lPd/UisCzdLrrQ4+iQQuJql/FqMEcO7WRynqo9vtuaYkN4EW5 ut2TxtykFRWeay6t7CpBH8+0RxgqHvY4SttknN4ZWgZHqDBWFEYqYZ85mf2BYCc6vjHp+pMK7x0yoR QdWxmd1wbCxvZsYgu6ffRX3UhN2tId1pYT0n7ypt9DIP/0yX0VwziMCfmabsQxkJvoim7m32N2qs8G MeV+jQ/MgSIK9I+OqJ0BnawEFqST04eCKpVd61gMbDxlKBOfC3zQTV2XRCpIKEib8Ng2vqQNy+hYlX OA/8M7ZHqHeP6DnftMOQ7OMDZQrMO4s2gNgIrkk+nim3CQnrKthku4Hx4LCs/zWGSbfzhZosN2Mo0+ qMgC0ORlT+3vYrQAFfa3Y/BMHxl2+SlzBc09/bfkp0HnaMneIyOfeSWHpBrJ3lzQsXdL0U7ackl+H7 HEC1orUmnArQtTUua+ErijTQKih+mkWTL3gCoEPiM2fhPoikSsDu5+zJwsZhpqCsNuiwOQNCCt6mr6 LQ2hZyeeXlPz4MkCRmM4fzOwlTkvVBxMx7lCoglU2OIiJH6dsr5cvtwlsB8g== X-Developer-Key: i=memxor@gmail.com; a=openpgp; fpr=4BBE2A7E06ECF9D5823C61114CE0C88648BF11CA Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net This patch adds eBPF iterator for epoll items (epitems) registered in an epoll instance. It gives access to the eventpoll ctx, and the registered epoll item (struct epitem). This allows the iterator to inspect the registered file and be able to use others iterators to associate it with a task's fdtable. The primary usecase this is enabling is expediting existing eventpoll checkpoint/restore support in the CRIU project. This iterator allows us to switch from a worst case O(n^2) algorithm to a single O(n) pass over task and epoll registered descriptors. We also make sure we're iterating over a live file, one that is not going away. The case we're concerned about is a file that has its f_count as zero, but is waiting for iterator bpf_seq_read to release ep->mtx, so that it can remove its epitem. 
Since such a file will disappear once iteration is done, and it is being
destructed, we use get_file_rcu to ensure it is alive when invoking the
BPF program. Getting access to a file that is going to disappear after
iteration is not useful anyway. This does have a performance overhead
however (since the file reference will be raised and dropped for each
file).

The rcu_read_lock around get_file_rcu isn't strictly required for
lifetime management since the fput path is serialized on ep->mtx to call
ep_remove, hence the epi->ffd.file pointer remains stable during our
seq_start/seq_stop bracketing.

To be able to continue from the position we were iterating at, we store
epi->ffd.fd and use ep_find_tfd to find the target file again. It would
be more appropriate to use both the struct file pointer and the fd number
to find the last file, but see below for why that cannot be done.

Taking a reference to struct file and walking the RB-Tree to find it
again will lead to a reference cycle issue if the iterator, after a
partial read, takes a reference to a socket which is later used in
creating a descriptor cycle using SCM_RIGHTS. An example that was
encountered when working on this is mentioned below.

Let there be Unix sockets SK1, SK2, epoll fd EP, and epoll iterator ITER.
Let SK1 be registered in EP, then on a partial read it is possible that
ITER returns from read and takes a reference to SK1 to be able to find it
later in the RB-Tree and continue the iteration. If SK1 sends ITER over to
SK2 using SCM_RIGHTS, and SK2 sends SK2 over to SK1 using SCM_RIGHTS, and
both fds are not consumed on the corresponding receive ends, a cycle is
created. When all of SK1, SK2, EP, and ITER are closed, SK1's receive
queue holds a reference to SK2, and SK2's receive queue holds a reference
to ITER, which holds a reference to SK1. All file descriptors except EP
leak.

To resolve it, we would need to hook into the Unix Socket GC mechanism,
but the alternative of using ep_find_tfd is much simpler.
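The idea of resuming by fd number instead of holding a reference can be
sketched in plain userspace C. This is only a model under stated
assumptions: the set is a small array rather than an RB-Tree, fd numbers
are assumed unique within it, and struct item, struct set, struct cursor,
set_find and cursor_read are hypothetical names for this example. Only the
sentinel values -1 and -2 are taken from the patch:

```c
#include <stddef.h>

enum { ITER_DONE = -2, ITER_INIT = -1 };	/* same values as the patch */

struct item { int fd; };

struct set {
	struct item *items;	/* stand-in for the epoll RB-Tree */
	size_t n;
};

struct cursor {
	int tfd;	/* fd to resume from, or one of the sentinels above */
};

/* Find the slot holding fd, or return n when it is no longer registered. */
static size_t set_find(const struct set *s, int fd)
{
	for (size_t i = 0; i < s->n; i++)
		if (s->items[i].fd == fd)
			return i;
	return s->n;
}

/*
 * Copy up to max fds into out[], resuming from the cursor. Between calls
 * only an fd number is remembered, so no reference to an item is held.
 */
static size_t cursor_read(const struct set *s, struct cursor *c,
			  int *out, size_t max)
{
	size_t i, n = 0;

	if (c->tfd == ITER_DONE)
		return 0;
	i = (c->tfd == ITER_INIT) ? 0 : set_find(s, c->tfd);
	while (i < s->n && n < max)
		out[n++] = s->items[i++].fd;
	/* remember where to restart, or mark the walk as finished */
	c->tfd = (i < s->n) ? s->items[i].fd : ITER_DONE;
	return n;
}
```

If the remembered fd has been removed between two reads, set_find() fails
and the walk simply ends, which mirrors the early NULL return in
bpf_epoll_seq_start when ep_find_tfd finds nothing.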
The finding of the last position in face of concurrent modification of the
epoll set is at best an approximation anyway. For the case of CRIU, the
epoll set remains stable.

Cc: Alexander Viro
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Kumar Kartikeya Dwivedi
---
 fs/eventpoll.c                 | 201 ++++++++++++++++++++++++++++++++-
 include/linux/bpf.h            |  11 +-
 include/uapi/linux/bpf.h       |   3 +
 tools/include/uapi/linux/bpf.h |   3 +
 4 files changed, 213 insertions(+), 5 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 06f4c5ae1451..fb4e58857baa 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -37,6 +37,7 @@
 #include
 #include
 #include
+#include
 #include
 
 /*
@@ -985,7 +986,6 @@ static struct epitem *ep_find(struct eventpoll *ep, struct file *file, int fd)
 	return epir;
 }
 
-#ifdef CONFIG_KCMP
 static struct epitem *ep_find_tfd(struct eventpoll *ep, int tfd, unsigned long toff)
 {
 	struct rb_node *rbp;
@@ -1005,6 +1005,7 @@ static struct epitem *ep_find_tfd(struct eventpoll *ep, int tfd, unsigned long t
 	return NULL;
 }
 
+#ifdef CONFIG_KCMP
 struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd,
 				     unsigned long toff)
 {
@@ -2385,3 +2386,201 @@ static int __init eventpoll_init(void)
 	return 0;
 }
 fs_initcall(eventpoll_init);
+
+#ifdef CONFIG_BPF_SYSCALL
+
+enum epoll_iter_state {
+	EP_ITER_DONE = -2,
+	EP_ITER_INIT = -1,
+};
+
+BTF_ID_LIST(btf_epoll_ids)
+BTF_ID(struct, eventpoll)
+BTF_ID(struct, epitem)
+
+struct bpf_epoll_iter_seq_info {
+	struct eventpoll *ep;
+	struct rb_node *rbp;
+	int tfd;
+};
+
+static int bpf_epoll_init_seq(void *priv_data, struct bpf_iter_aux_info *aux)
+{
+	struct bpf_epoll_iter_seq_info *info = priv_data;
+
+	info->ep = aux->ep->private_data;
+	info->tfd = EP_ITER_INIT;
+	return 0;
+}
+
+static int bpf_epoll_iter_attach(struct bpf_prog *prog,
+				 union bpf_iter_link_info *linfo,
+				 struct bpf_iter_aux_info *aux)
+{
+	struct file *file;
+	int ret;
+
+	file = fget(linfo->epoll.epoll_fd);
+	if (!file)
+		return -EBADF;
+
+	ret = -EOPNOTSUPP;
+	if (unlikely(!is_file_epoll(file)))
+		goto out_fput;
+
+	aux->ep = file;
+	return 0;
+out_fput:
+	fput(file);
+	return ret;
+}
+
+static void bpf_epoll_iter_detach(struct bpf_iter_aux_info *aux)
+{
+	fput(aux->ep);
+}
+
+struct bpf_iter__epoll {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	__bpf_md_ptr(struct eventpoll *, ep);
+	__bpf_md_ptr(struct epitem *, epi);
+};
+
+static void *bpf_epoll_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct bpf_epoll_iter_seq_info *info = seq->private;
+	struct epitem *epi;
+
+	mutex_lock(&info->ep->mtx);
+	/* already iterated? */
+	if (info->tfd == EP_ITER_DONE)
+		return NULL;
+	/* partially iterated? find position to restart */
+	if (info->tfd >= 0) {
+		epi = ep_find_tfd(info->ep, info->tfd, 0);
+		if (!epi)
+			return NULL;
+		info->rbp = &epi->rbn;
+		return epi;
+	}
+	WARN_ON(info->tfd != EP_ITER_INIT);
+	/* first iteration */
+	info->rbp = rb_first_cached(&info->ep->rbr);
+	if (!info->rbp)
+		return NULL;
+	if (*pos == 0)
+		++*pos;
+	return rb_entry(info->rbp, struct epitem, rbn);
+}
+
+static void *bpf_epoll_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct bpf_epoll_iter_seq_info *info = seq->private;
+
+	++*pos;
+	info->rbp = rb_next(info->rbp);
+	return info->rbp ? rb_entry(info->rbp, struct epitem, rbn) : NULL;
+}
+
+DEFINE_BPF_ITER_FUNC(epoll, struct bpf_iter_meta *meta, struct eventpoll *ep,
+		     struct epitem *epi)
+
+static int __bpf_epoll_seq_show(struct seq_file *seq, void *v, bool in_stop)
+{
+	struct bpf_epoll_iter_seq_info *info = seq->private;
+	struct bpf_iter__epoll ctx;
+	struct bpf_iter_meta meta;
+	struct bpf_prog *prog;
+	int ret;
+
+	meta.seq = seq;
+	prog = bpf_iter_get_info(&meta, in_stop);
+	if (!prog)
+		return 0;
+
+	ctx.meta = &meta;
+	ctx.ep = info->ep;
+	ctx.epi = v;
+	if (ctx.epi) {
+		/* The file we are going to pass to prog may already have its
+		 * f_count as 0, hence before invoking the prog, we always try
+		 * to get the reference if it isn't zero, failing which we skip
+		 * the file. This is usually the case for files that are closed
+		 * before calling EPOLL_CTL_DEL for them, which would wait for
+		 * us to release ep->mtx before doing ep_remove.
+		 */
+		rcu_read_lock();
+		ret = get_file_rcu(ctx.epi->ffd.file);
+		rcu_read_unlock();
+		if (!ret)
+			return 0;
+	}
+	ret = bpf_iter_run_prog(prog, &ctx);
+	/* fput queues work asynchronously, so in our case, either task_work for
+	 * non-exiting task, and otherwise delayed_fput, so holding ep->mtx and
+	 * calling fput (which will take the same lock) in this context will not
+	 * deadlock us, in case f_count is 1 at this point.
+	 */
+	if (ctx.epi)
+		fput(ctx.epi->ffd.file);
+	return ret;
+}
+
+static int bpf_epoll_seq_show(struct seq_file *seq, void *v)
+{
+	return __bpf_epoll_seq_show(seq, v, false);
+}
+
+static void bpf_epoll_seq_stop(struct seq_file *seq, void *v)
+{
+	struct bpf_epoll_iter_seq_info *info = seq->private;
+	struct epitem *epi;
+
+	if (!v) {
+		__bpf_epoll_seq_show(seq, v, true);
+		/* done iterating */
+		info->tfd = EP_ITER_DONE;
+	} else {
+		epi = rb_entry(info->rbp, struct epitem, rbn);
+		info->tfd = epi->ffd.fd;
+	}
+	mutex_unlock(&info->ep->mtx);
+}
+
+static const struct seq_operations bpf_epoll_seq_ops = {
+	.start = bpf_epoll_seq_start,
+	.next = bpf_epoll_seq_next,
+	.stop = bpf_epoll_seq_stop,
+	.show = bpf_epoll_seq_show,
+};
+
+static const struct bpf_iter_seq_info bpf_epoll_seq_info = {
+	.seq_ops = &bpf_epoll_seq_ops,
+	.init_seq_private = bpf_epoll_init_seq,
+	.seq_priv_size = sizeof(struct bpf_epoll_iter_seq_info),
+};
+
+static struct bpf_iter_reg epoll_reg_info = {
+	.target = "epoll",
+	.feature = BPF_ITER_RESCHED,
+	.attach_target = bpf_epoll_iter_attach,
+	.detach_target = bpf_epoll_iter_detach,
+	.ctx_arg_info_size = 2,
+	.ctx_arg_info = {
+		{ offsetof(struct bpf_iter__epoll, ep),
+		  PTR_TO_BTF_ID },
+		{ offsetof(struct bpf_iter__epoll, epi),
+		  PTR_TO_BTF_ID_OR_NULL },
+	},
+	.seq_info = &bpf_epoll_seq_info,
+};
+
+static int __init epoll_iter_init(void)
+{
+	epoll_reg_info.ctx_arg_info[0].btf_id = btf_epoll_ids[0];
+	epoll_reg_info.ctx_arg_info[1].btf_id = btf_epoll_ids[1];
+	return bpf_iter_reg_target(&epoll_reg_info);
+}
+late_initcall(epoll_iter_init);
+
+#endif
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e44503158d76..d7e3e9c59b68 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1519,10 +1519,13 @@ struct bpf_iter_aux_info {
 	 * to skip this check for non-map iterator cheaply.
 	 */
 	struct bpf_map *map;
-	struct {
-		struct io_ring_ctx *ctx;
-		ino_t inode;
-	} io_uring;
+	union {
+		struct {
+			struct io_ring_ctx *ctx;
+			ino_t inode;
+		} io_uring;
+		struct file *ep;
+	};
 };
 
 typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 885d9293c147..b82b11d72520 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -94,6 +94,9 @@ union bpf_iter_link_info {
 	struct {
 		__u32 io_uring_fd;
 	} io_uring;
+	struct {
+		__u32 epoll_fd;
+	} epoll;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for more details. */
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 885d9293c147..b82b11d72520 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -94,6 +94,9 @@ union bpf_iter_link_info {
 	struct {
 		__u32 io_uring_fd;
 	} io_uring;
+	struct {
+		__u32 epoll_fd;
+	} epoll;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for more details. */

From patchwork Wed Dec 1 04:23:28 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kumar Kartikeya Dwivedi
X-Patchwork-Id: 12649009
X-Patchwork-Delegate: bpf@iogearbox.net
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id B556DC433EF for ; Wed, 1 Dec 2021 04:24:30 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1346570AbhLAE1p (ORCPT ); Tue, 30 Nov 2021 23:27:45 -0500
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49546 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1346513AbhLAE1g (ORCPT ); Tue, 30 Nov 2021 23:27:36 -0500
Received: from mail-pf1-x442.google.com (mail-pf1-x442.google.com [IPv6:2607:f8b0:4864:20::442]) by
lindbergh.monkeyblade.net (Postfix) with ESMTPS id D96C5C061757; Tue, 30 Nov 2021 20:23:52 -0800 (PST) Received: by mail-pf1-x442.google.com with SMTP id r130so23054259pfc.1; Tue, 30 Nov 2021 20:23:52 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=FHgCXMZqFYvkD6qdznJ6BR7nNTjcADokUMa/fFZF/nk=; b=N4eUQuHve5bweBa+tX73OvM7XorYiCyjlF4pFKjJ4Nu8Qh7IOBSQ6Gl6SaXVElnK1s 18Q82hLsGIYIgo5KTsrFxY13Slh3iSIV2g8DMirh6AzPUITHRevAbRDczaKuEULjF/Ju jzsnmQaqbsn1FQ5DlHEFVEsTTj/6E4Gi+VJzN3ibzB2eGM1Kyn+/kwdnHRcE8mv1ayPB VcJzxvHfXCgHCVD2fizFhUb4bkEm3wbQX2W+ZZ/c//QareiCRvBTwGmjL2Q6cSVC8re+ s4197kdgoTJYK/g6JX17kcjWTLdsjJ/UXOJu96GNzKoVGVNJfqeojfjDXnA4e7j9kDyM mQVA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=FHgCXMZqFYvkD6qdznJ6BR7nNTjcADokUMa/fFZF/nk=; b=3A/tSP5lVoFut+KJv8b+aE9I1qPDH5SfMoEkRqRSTRyLUGFEkf7sbiDC1WQD6kjwzx 05u4m/aWz4rSNSAVKi8oElPUyj6t81VbtCb3XEx0FGy4kXfL7ABRxvaiwho9Aagz/uPu L9j84KwBOAxwBf6LqTzwBLrX7XZeVtjCctXrUtr1I7censiaGQHIlfrmuUy11BUe4ixr n6AEZy0Y0+Z4IBPBU9rvUUZflOPUNQwPy1Ivy7Agsp9esG/L8PGxSyuk2/ew9pofWpe/ /4T9Hs3JZT24VYgVRrYWutWwB3+fJkM4UK3SrtzXPSiza3W3QG2Z9jceBd/4bR9sVeS9 sluQ== X-Gm-Message-State: AOAM530Drwf3IEh2BtgLa1K8o20yhLohHurfi/BbPPyv60HcWlWadDMM xYfFvoIXa/VbOPQwI0bbZgIQ6zW0PWo= X-Google-Smtp-Source: ABdhPJyGdJTP/hIderRqDZG9P+XcHQipQ3pQqC8+czstFQ5AbMAZpqLzDJ4MWFfh4ZCxNxXS4HJTrw== X-Received: by 2002:a05:6a00:807:b0:49f:d6ab:590c with SMTP id m7-20020a056a00080700b0049fd6ab590cmr3666943pfk.32.1638332632300; Tue, 30 Nov 2021 20:23:52 -0800 (PST) Received: from localhost ([2405:201:6014:d064:3d4e:6265:800c:dc84]) by smtp.gmail.com with ESMTPSA id v1sm21695285pfg.169.2021.11.30.20.23.51 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 30 Nov 
2021 20:23:52 -0800 (PST) From: Kumar Kartikeya Dwivedi To: bpf@vger.kernel.org Cc: Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Yonghong Song , Pavel Emelyanov , Alexander Mikhalitsyn , Andrei Vagin , criu@openvz.org, io-uring@vger.kernel.org, linux-fsdevel@vger.kernel.org Subject: [PATCH bpf-next v3 05/10] bpftool: Output io_uring iterator info Date: Wed, 1 Dec 2021 09:53:28 +0530 Message-Id: <20211201042333.2035153-6-memxor@gmail.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20211201042333.2035153-1-memxor@gmail.com> References: <20211201042333.2035153-1-memxor@gmail.com> MIME-Version: 1.0 X-Developer-Signature: v=1; a=openpgp-sha256; l=2337; h=from:subject; bh=AzWo3KQKcucubukC3jB1Ua1DZJ24Akt6XiNikymWxlw=; b=owEBbQKS/ZANAwAIAUzgyIZIvxHKAcsmYgBhpvYyf9f2STSREJ6G62ED68G1TqCyrjJoGgtD4FQh b+FkM2eJAjMEAAEIAB0WIQRLvip+Buz51YI8YRFM4MiGSL8RygUCYab2MgAKCRBM4MiGSL8RylWxD/ 4oDbn6W5hPzuxfCu2kQx0N50BQLbixvZ3qThi1mL/zQks+MRERR93GlZuDhIeNphmafsJl7frKuf9R YqpJbt0Msvmkz9pZKbZxLIK0F6RyXfByPkwovjMxvZnkK0hXGFBtr22M8ikpvJ9k7Au8oLtfPhTCIZ BY3QIKFndECrluPHiJ4jshF5qR+11L4lYKyBtFn1KOvwQZ6znF9/eoQwVuSguoAB2NNFLFaB8d7+to /tJ/BTBNcrikz5n/GWkkzwJibBEEX2M9Xrne9iwBGLKYx/dRLXUvsOx4XzIqSsKhfoJvyX3Xtc7Gbk vH4Hkj3hyCarM0aXoFMTP52Npoeoe0KkbmbGW+44oEwrk7/tQ/Phjw3mg7oIsm15uZ182iSaxKzQKp cEdaKe3j52pQGO8Y8j+Cl0knyinszSiV3bIQrpGrF2Pyq9/yt45udgCXTZitHWe3FF8K3q96Ake/ia 1jLK5ZwK/YZszQqAW4B4d4MfCsidvR5KGRu42VSPqtEGPZk/1LNA+9yyX6ayGOcR0twrRhW95Hj4rv mo3eg/I0uPM1ipaVNTtyarOnwWNIFCONAs0xijq1s4FOOp49byoQ6NLpN9aqrHcs1JOXXLAr22h+u2 gJuWCOLyWhlHAQoyQd5JZJ9Kdxr1eX0SrQdyoT1TM5QU/2Ynz5U+vta1a1rA== X-Developer-Key: i=memxor@gmail.com; a=openpgp; fpr=4BBE2A7E06ECF9D5823C61114CE0C88648BF11CA Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net Output the sole field related to io_uring iterator (inode of attached io_uring) so that it can be useful in informational and also debugging cases (trying to find actual io_uring fd attached to the iterator). 
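To go from the reported io_uring_inode back to a descriptor, one option is
to compare it against st_ino from fstat(2), which is what the selftest in
this series does for a single fd. The helper names fd_has_inode and
find_fd_by_inode below are hypothetical, and the scan assumes the inode
number alone is enough to identify the file (it ignores st_dev), so treat
it as a sketch rather than a robust tool:

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Return true when fd is open and its inode number equals ino. */
static bool fd_has_inode(int fd, ino_t ino)
{
	struct stat st;

	return fstat(fd, &st) == 0 && st.st_ino == ino;
}

/*
 * Scan descriptors [0, max_fd) of the calling process and return the
 * first one whose inode number matches, or -1 if none does.
 */
static int find_fd_by_inode(ino_t ino, int max_fd)
{
	for (int fd = 0; fd < max_fd; fd++)
		if (fd_has_inode(fd, ino))
			return fd;
	return -1;
}
```

A debugging tool would run such a scan inside (or on behalf of) the
process that owns the ring, passing the inode value printed by bpftool.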
Output:

89: iter prog 262 target_name io_uring_file io_uring_inode 16764
	pids test_progs(384)

[
    {
        "id": 123,
        "type": "iter",
        "prog_id": 463,
        "target_name": "io_uring_buf",
        "io_uring_inode": 16871,
        "pids": [
            {
                "pid": 443,
                "comm": "test_progs"
            }
        ]
    }
]

[
    {
        "id": 126,
        "type": "iter",
        "prog_id": 483,
        "target_name": "io_uring_file",
        "io_uring_inode": 16887,
        "pids": [
            {
                "pid": 448,
                "comm": "test_progs"
            }
        ]
    }
]

Signed-off-by: Kumar Kartikeya Dwivedi
---
 tools/bpf/bpftool/link.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/tools/bpf/bpftool/link.c b/tools/bpf/bpftool/link.c
index 2c258db0d352..409ae861b839 100644
--- a/tools/bpf/bpftool/link.c
+++ b/tools/bpf/bpftool/link.c
@@ -86,6 +86,12 @@ static bool is_iter_map_target(const char *target_name)
 	       strcmp(target_name, "bpf_sk_storage_map") == 0;
 }
 
+static bool is_iter_io_uring_target(const char *target_name)
+{
+	return strcmp(target_name, "io_uring_file") == 0 ||
+	       strcmp(target_name, "io_uring_buf") == 0;
+}
+
 static void show_iter_json(struct bpf_link_info *info, json_writer_t *wtr)
 {
 	const char *target_name = u64_to_ptr(info->iter.target_name);
@@ -94,6 +100,8 @@ static void show_iter_json(struct bpf_link_info *info, json_writer_t *wtr)
 
 	if (is_iter_map_target(target_name))
 		jsonw_uint_field(wtr, "map_id", info->iter.map.map_id);
+	else if (is_iter_io_uring_target(target_name))
+		jsonw_uint_field(wtr, "io_uring_inode", info->iter.io_uring.inode);
 }
 
 static int get_prog_info(int prog_id, struct bpf_prog_info *info)
@@ -204,6 +212,8 @@ static void show_iter_plain(struct bpf_link_info *info)
 
 	if (is_iter_map_target(target_name))
 		printf("map_id %u ", info->iter.map.map_id);
+	else if (is_iter_io_uring_target(target_name))
+		printf("io_uring_inode %llu ", info->iter.io_uring.inode);
 }
 
 static int show_link_close_plain(int fd, struct bpf_link_info *info)

From patchwork Wed Dec 1 04:23:29 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kumar
Kartikeya Dwivedi X-Patchwork-Id: 12649011 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id EEAA2C433F5 for ; Wed, 1 Dec 2021 04:24:34 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1346586AbhLAE1s (ORCPT ); Tue, 30 Nov 2021 23:27:48 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49548 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1346599AbhLAE1g (ORCPT ); Tue, 30 Nov 2021 23:27:36 -0500 Received: from mail-pl1-x641.google.com (mail-pl1-x641.google.com [IPv6:2607:f8b0:4864:20::641]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E8F4BC061758; Tue, 30 Nov 2021 20:23:55 -0800 (PST) Received: by mail-pl1-x641.google.com with SMTP id u17so16655820plg.9; Tue, 30 Nov 2021 20:23:55 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=BPINuSWw8XwQJNNy+A8cP32Sw8dHWxp27eHdH02KeB8=; b=H6O8aRKnFk4tvf5DthNGiuPPKjnjMEHYhL1cp0Qg/+rcZdRMLDIvFHs1kuNLw2ETVs LVCehE+1LY0H/Zk+ZgoPUfXE4fvZEqWQmvUNgpt6ejD6dBjrGq+l5vrVcWvRCLJVamfw 6RlJifBIUn19TQ0d74b7xArewF+rJwMu923EPt9af1vnoRwyf0cw15NN8J/iPNbzuayw eAQLUt9TFJkXiZ9CzakwO3uvKDliCE9iU99ipGakUVMNvwY5tucjwm72eutupRHnYzqc feucD3dUnVROY9nhRz/N0xormKt28djS8af7hcAwtsmpcT48iLQb8AusBxig8kggtX4U eSQQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=BPINuSWw8XwQJNNy+A8cP32Sw8dHWxp27eHdH02KeB8=; b=m3LIcvFv6iJ1GLNqJDBHePn7kvAEVmLc0PnR+FYJbF7sBym8G9POLf1WAzRkFG0cZ4 
sg+MDBb8Ap03WmTVRN2+WbNGpJeJ+BJMvqe/5D1pbyc5J+6NTKxDs3sroyNNkBCDO2sO tLcVDeEGTTmsIEPec7M0kMJ8V3v1FFR0dr3alhiKf+fPgUgwXgI5RIAPLy+M+X3fItbw tEX2JEFmCn99aRsEG+YgCCvNW6pWyzvfwL+OWm16HWs2rEFuiEnAJkQSKQ0Ar/8Pgl0c RAEZfyD2jmi1ULsS8Lpl3OiBvr56NmeF8FEX5SgX9X5yqL6m4xpnnXTI7vBwWWHfnuZ9 kQWw== X-Gm-Message-State: AOAM531A1U+6L8RS/gAtf6ERuBDRG3mCIkXnHJudNgTqy12vJEG4h+41 1zkZCMOsIaC/AjZfpO/kRafQNSTiyf0= X-Google-Smtp-Source: ABdhPJylH+iW8g36xiCBTu6Df0Xjr9sKmFW1PrwLfjitnvgexbtrTIRVk3JrEcG6dSoICO0yk7NFaw== X-Received: by 2002:a17:90a:590d:: with SMTP id k13mr4366795pji.184.1638332635298; Tue, 30 Nov 2021 20:23:55 -0800 (PST) Received: from localhost ([2405:201:6014:d064:3d4e:6265:800c:dc84]) by smtp.gmail.com with ESMTPSA id k19sm23026041pff.20.2021.11.30.20.23.54 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 30 Nov 2021 20:23:54 -0800 (PST) From: Kumar Kartikeya Dwivedi To: bpf@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , io-uring@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Yonghong Song , Pavel Emelyanov , Alexander Mikhalitsyn , Andrei Vagin , criu@openvz.org, linux-fsdevel@vger.kernel.org Subject: [PATCH bpf-next v3 06/10] selftests/bpf: Add test for io_uring BPF iterators Date: Wed, 1 Dec 2021 09:53:29 +0530 Message-Id: <20211201042333.2035153-7-memxor@gmail.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20211201042333.2035153-1-memxor@gmail.com> References: <20211201042333.2035153-1-memxor@gmail.com> MIME-Version: 1.0 X-Developer-Signature: v=1; a=openpgp-sha256; l=10827; h=from:subject; bh=YuAjHrYkj+avbRxVKQ4VqH3aJv0q5aj/W1czAZzAxeo=; b=owEBbQKS/ZANAwAIAUzgyIZIvxHKAcsmYgBhpvYyFe/5dhJG5sB7V+gcndKtyVzXtiVJi6mJq6Bd Y4mVOwKJAjMEAAEIAB0WIQRLvip+Buz51YI8YRFM4MiGSL8RygUCYab2MgAKCRBM4MiGSL8Ryj6xD/ 0ZZlyfC/U/TufnEmQ3ez4CczbrqQEkA77MVw6CrI553kUl9B3Khg1/kRnZ1+CA7Vv4EmnqNl3qXh/0 VJbXFOckYB14UXEPkX0RG8PDuGqC+hiP76tUCWrP/UbYvd02lcMuXaDEczvRB8DL8Ptu8jKMPR7C2d 
uKbzVymW3MnTrDzG39stty/Zs53LNpaOFUWBUTAObsshjoIx9hxn6u5TZ9vU+RpxSc7wAPYLh+AoDD 7xhuroDeeDoftrclgNgHpG4CZrhaGTpBjSBhntn3EvMZhJzq3O6N+oPCxOBadEnKungwW/dJ/e/fXb rd9dhXhMqL8Bcx/SUy/rZfXtW2N4d6W0VzoDMOXaana5p05xggUPEVRDhkeuEGC+09rWlxxBUGO19Q l+6bEMUkGmNpb2F7uU1oF4rndu+hhqIiMRVJe3nriQ8VWWlMJODg1GHWiQhwiYIVSefFWYJ+BRCaLl SdkfZKW5VhSYL5n4iIx1uYCY/IGznlX/Hrnq2Oatg1nRN9TponvuSyBL3FOE0NzEoPBSBmPYCxj0uI NjJenQlQHAB9PMOPjNciJp4t4b5vr8rkllOs+YEATLxfWVlp3KgJJzm/IzEjnU8ZPFUosiVOtY2ePU cyBljxM2iM/zHDpu8/WUawuw/6ZQ9KytWMGPkzu2cMN6u1fAW/73EHkGPmzQ==
X-Developer-Key: i=memxor@gmail.com; a=openpgp; fpr=4BBE2A7E06ECF9D5823C61114CE0C88648BF11CA
Precedence: bulk
List-ID:
X-Mailing-List: bpf@vger.kernel.org
X-Patchwork-Delegate: bpf@iogearbox.net

This exercises the io_uring_buf and io_uring_file iterators, and tests
sparse file sets as well.

Cc: Jens Axboe
Cc: Pavel Begunkov
Cc: io-uring@vger.kernel.org
Signed-off-by: Kumar Kartikeya Dwivedi
---
 .../selftests/bpf/prog_tests/bpf_iter.c       | 251 ++++++++++++++++++
 .../selftests/bpf/progs/bpf_iter_io_uring.c   |  50 ++++
 2 files changed, 301 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_io_uring.c

diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
index 0b996be923b5..13ea2eaed032 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
@@ -1,6 +1,10 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright (c) 2020 Facebook */
+#include
+#include
 #include
+#include
+
 #include "bpf_iter_ipv6_route.skel.h"
 #include "bpf_iter_netlink.skel.h"
 #include "bpf_iter_bpf_map.skel.h"
@@ -26,6 +30,7 @@
 #include "bpf_iter_bpf_sk_storage_map.skel.h"
 #include "bpf_iter_test_kern5.skel.h"
 #include "bpf_iter_test_kern6.skel.h"
+#include "bpf_iter_io_uring.skel.h"
 
 static int duration;
 
@@ -1239,6 +1244,248 @@ static void test_task_vma(void)
 	bpf_iter_task_vma__destroy(skel);
 }
 
+static int sys_io_uring_setup(u32 entries, struct io_uring_params *p)
+{
+	return syscall(__NR_io_uring_setup, entries, p);
+}
+
+static int io_uring_register_bufs(int io_uring_fd, struct iovec *iovs, unsigned int nr)
+{
+	return syscall(__NR_io_uring_register, io_uring_fd,
+		       IORING_REGISTER_BUFFERS, iovs, nr);
+}
+
+static int io_uring_register_files(int io_uring_fd, int *fds, unsigned int nr)
+{
+	return syscall(__NR_io_uring_register, io_uring_fd,
+		       IORING_REGISTER_FILES, fds, nr);
+}
+
+static unsigned long long page_addr_to_pfn(unsigned long addr)
+{
+	int page_size = sysconf(_SC_PAGE_SIZE), fd, ret;
+	unsigned long long pfn;
+
+	if (page_size < 0)
+		return 0;
+	fd = open("/proc/self/pagemap", O_RDONLY);
+	if (fd < 0)
+		return 0;
+
+	ret = pread(fd, &pfn, sizeof(pfn), (addr / page_size) * 8);
+	close(fd);
+	if (ret < 0)
+		return 0;
+	/* Bits 0-54 have PFN for non-swapped page */
+	return pfn & 0x7fffffffffffff;
+}
+
+static int io_uring_inode_match(int link_fd, int io_uring_fd)
+{
+	struct bpf_link_info linfo = {};
+	__u32 info_len = sizeof(linfo);
+	struct stat st;
+	int ret;
+
+	ret = fstat(io_uring_fd, &st);
+	if (ret < 0)
+		return -errno;
+
+	ret = bpf_obj_get_info_by_fd(link_fd, &linfo, &info_len);
+	if (ret < 0)
+		return -errno;
+
+	ASSERT_EQ(st.st_ino, linfo.iter.io_uring.inode, "io_uring inode matches");
+	return 0;
+}
+
+void test_io_uring_buf(void)
+{
+	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+	char rbuf[4096], buf[4096] = "B\n";
+	union bpf_iter_link_info linfo;
+	struct bpf_iter_io_uring *skel;
+	int ret, fd, i, len = 128;
+	struct io_uring_params p;
+	struct iovec iovs[8];
+	int iter_fd;
+	char *str;
+
+	opts.link_info = &linfo;
+	opts.link_info_len = sizeof(linfo);
+
+	skel = bpf_iter_io_uring__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "bpf_iter_io_uring__open_and_load"))
+		return;
+
+	for (i = 0; i < ARRAY_SIZE(iovs); i++) {
+		iovs[i].iov_len = len;
+		iovs[i].iov_base = mmap(NULL, len, PROT_READ | PROT_WRITE,
+					MAP_ANONYMOUS | MAP_SHARED, -1, 0);
+		if (iovs[i].iov_base == MAP_FAILED)
+			goto end;
+		len *= 2;
+	}
+
+	memset(&p, 0, sizeof(p));
+	fd = sys_io_uring_setup(1, &p);
+	if (!ASSERT_GE(fd, 0, "io_uring_setup"))
+		goto end;
+
+	linfo.io_uring.io_uring_fd = fd;
+	skel->links.dump_io_uring_buf = bpf_program__attach_iter(skel->progs.dump_io_uring_buf,
+								 &opts);
+	if (!ASSERT_OK_PTR(skel->links.dump_io_uring_buf, "bpf_program__attach_iter"))
+		goto end_close_fd;
+
+	if (!ASSERT_OK(io_uring_inode_match(bpf_link__fd(skel->links.dump_io_uring_buf), fd), "inode match"))
+		goto end_close_fd;
+
+	ret = io_uring_register_bufs(fd, iovs, ARRAY_SIZE(iovs));
+	if (!ASSERT_OK(ret, "io_uring_register_bufs"))
+		goto end_close_fd;
+
+	/* "B\n" */
+	len = 2;
+	str = buf + len;
+	for (int j = 0; j < ARRAY_SIZE(iovs); j++) {
+		ret = snprintf(str, sizeof(buf) - len, "%d:0x%lx:%zu\n", j,
+			       (unsigned long)iovs[j].iov_base,
+			       iovs[j].iov_len);
+		if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf"))
+			goto end_close_fd;
+		len += ret;
+		str += ret;
+
+		ret = snprintf(str, sizeof(buf) - len, "`-PFN for bvec[0]=%llu\n",
+			       page_addr_to_pfn((unsigned long)iovs[j].iov_base));
+		if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf"))
+			goto end_close_fd;
+		len += ret;
+		str += ret;
+	}
+
+	ret = snprintf(str, sizeof(buf) - len, "E:%zu\n", ARRAY_SIZE(iovs));
+	if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf"))
+		goto end_close_fd;
+
+	iter_fd = bpf_iter_create(bpf_link__fd(skel->links.dump_io_uring_buf));
+	if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create"))
+		goto end_close_fd;
+
+	ret = read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf));
+	if (!ASSERT_GT(ret, 0, "read_fd_into_buffer"))
+		goto end_close_iter;
+
+	if (!ASSERT_OK(strcmp(rbuf, buf), "compare iterator output")) {
+		puts("=== Expected Output ===");
+		printf("%s", buf);
+		puts("==== Actual Output ====");
+		printf("%s", rbuf);
+		puts("=======================");
+	}
+end_close_iter:
+	close(iter_fd);
+end_close_fd:
+ close(fd); +end: + while (i--) + munmap(iovs[i].iov_base, iovs[i].iov_len); + bpf_iter_io_uring__destroy(skel); +} + +void test_io_uring_file(void) +{ + int reg_files[] = { [0 ... 7] = -1 }; + DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts); + char buf[4096] = "B\n", rbuf[4096] = {}, *str; + union bpf_iter_link_info linfo = {}; + struct bpf_iter_io_uring *skel; + int iter_fd, fd, len = 0, ret; + struct io_uring_params p; + + opts.link_info = &linfo; + opts.link_info_len = sizeof(linfo); + + skel = bpf_iter_io_uring__open_and_load(); + if (!ASSERT_OK_PTR(skel, "bpf_iter_io_uring__open_and_load")) + return; + + /* "B\n" */ + len = 2; + str = buf + len; + ret = snprintf(str, sizeof(buf) - len, "B\n"); + for (int i = 0; i < ARRAY_SIZE(reg_files); i++) { + char templ[] = "/tmp/io_uringXXXXXX"; + const char *name, *def = ""; + + /* create sparse set */ + if (i & 1) { + name = def; + } else { + reg_files[i] = mkstemp(templ); + if (!ASSERT_GE(reg_files[i], 0, templ)) + goto end_close_reg_files; + name = templ; + ASSERT_OK(unlink(name), "unlink"); + } + ret = snprintf(str, sizeof(buf) - len, "%d:%s%s\n", i, name, name != def ? 
" (deleted)" : ""); + if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf")) + goto end_close_reg_files; + len += ret; + str += ret; + } + + ret = snprintf(str, sizeof(buf) - len, "E:%zu\n", ARRAY_SIZE(reg_files)); + if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf")) + goto end_close_reg_files; + + memset(&p, 0, sizeof(p)); + fd = sys_io_uring_setup(1, &p); + if (!ASSERT_GE(fd, 0, "io_uring_setup")) + goto end_close_reg_files; + + linfo.io_uring.io_uring_fd = fd; + skel->links.dump_io_uring_file = bpf_program__attach_iter(skel->progs.dump_io_uring_file, + &opts); + if (!ASSERT_OK_PTR(skel->links.dump_io_uring_file, "bpf_program__attach_iter")) + goto end_close_fd; + + if (!ASSERT_OK(io_uring_inode_match(bpf_link__fd(skel->links.dump_io_uring_file), fd), "inode match")) + goto end_close_fd; + + iter_fd = bpf_iter_create(bpf_link__fd(skel->links.dump_io_uring_file)); + if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create")) + goto end_close_fd; + + ret = io_uring_register_files(fd, reg_files, ARRAY_SIZE(reg_files)); + if (!ASSERT_OK(ret, "io_uring_register_files")) + goto end_iter_fd; + + ret = read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf)); + if (!ASSERT_GT(ret, 0, "read_fd_into_buffer(iterator_fd, buf)")) + goto end_iter_fd; + + if (!ASSERT_OK(strcmp(rbuf, buf), "compare iterator output")) { + puts("=== Expected Output ==="); + printf("%s", buf); + puts("==== Actual Output ===="); + printf("%s", rbuf); + puts("======================="); + } +end_iter_fd: + close(iter_fd); +end_close_fd: + close(fd); +end_close_reg_files: + for (int i = 0; i < ARRAY_SIZE(reg_files); i++) { + if (reg_files[i] != -1) + close(reg_files[i]); + } +end: + bpf_iter_io_uring__destroy(skel); +} + void test_bpf_iter(void) { if (test__start_subtest("btf_id_or_null")) @@ -1299,4 +1546,8 @@ void test_bpf_iter(void) test_rdonly_buf_out_of_bound(); if (test__start_subtest("buf-neg-offset")) test_buf_neg_offset(); + if
(test__start_subtest("io_uring_buf")) + test_io_uring_buf(); + if (test__start_subtest("io_uring_file")) + test_io_uring_file(); } diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_io_uring.c b/tools/testing/selftests/bpf/progs/bpf_iter_io_uring.c new file mode 100644 index 000000000000..caf8bd0bf8d4 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/bpf_iter_io_uring.c @@ -0,0 +1,50 @@ +// SPDX-License-Identifier: GPL-2.0 +#include "bpf_iter.h" +#include + +SEC("iter/io_uring_buf") +int dump_io_uring_buf(struct bpf_iter__io_uring_buf *ctx) +{ + struct io_mapped_ubuf *ubuf = ctx->ubuf; + struct seq_file *seq = ctx->meta->seq; + unsigned int index = ctx->index; + + if (!ctx->meta->seq_num) + BPF_SEQ_PRINTF(seq, "B\n"); + + if (ubuf) { + BPF_SEQ_PRINTF(seq, "%u:0x%lx:%lu\n", index, (unsigned long)ubuf->ubuf, + (unsigned long)ubuf->ubuf_end - ubuf->ubuf); + BPF_SEQ_PRINTF(seq, "`-PFN for bvec[0]=%lu\n", + (unsigned long)bpf_page_to_pfn(ubuf->bvec[0].bv_page)); + } else { + BPF_SEQ_PRINTF(seq, "E:%u\n", index); + } + return 0; +} + +SEC("iter/io_uring_file") +int dump_io_uring_file(struct bpf_iter__io_uring_file *ctx) +{ + struct seq_file *seq = ctx->meta->seq; + unsigned int index = ctx->index; + struct file *file = ctx->file; + char buf[256] = ""; + + if (!ctx->meta->seq_num) + BPF_SEQ_PRINTF(seq, "B\n"); + /* for io_uring_file iterator, this is the terminating condition */ + if (ctx->ctx->nr_user_files == index) { + BPF_SEQ_PRINTF(seq, "E:%u\n", index); + return 0; + } + if (file) { + bpf_d_path(&file->f_path, buf, sizeof(buf)); + BPF_SEQ_PRINTF(seq, "%u:%s\n", index, buf); + } else { + BPF_SEQ_PRINTF(seq, "%u:\n", index); + } + return 0; +} + +char _license[] SEC("license") = "GPL"; From patchwork Wed Dec 1 04:23:30 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kumar Kartikeya Dwivedi X-Patchwork-Id: 12649013 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: 
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 56673C433F5 for ; Wed, 1 Dec 2021 04:24:38 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1346577AbhLAE1v (ORCPT ); Tue, 30 Nov 2021 23:27:51 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49556 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1346605AbhLAE1i (ORCPT ); Tue, 30 Nov 2021 23:27:38 -0500 Received: from mail-pj1-x1043.google.com (mail-pj1-x1043.google.com [IPv6:2607:f8b0:4864:20::1043]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id CABF0C061759; Tue, 30 Nov 2021 20:23:58 -0800 (PST) Received: by mail-pj1-x1043.google.com with SMTP id j5-20020a17090a318500b001a6c749e697so209569pjb.1; Tue, 30 Nov 2021 20:23:58 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=KFbBr90gefR+1Pl+HlupC3oRKy8nqU/RjzeqFiPeQ3o=; b=e+RfFZyT2Z95/cByl+9jVpDoznekanJq8V2Enw878GsMgRfoHFuwCwIWAt1jJTEPZA 0QWTz/exUlzr6bDyRd/R0xuOT9EQ/52Q8FdhVHFgKCsPLlirxP9lwJWF2wY/9+lSffX0 9bRTaRhg7X+J5bZvZLJ/PbUiWq1EiQzblRPoW8h95zCB1+/9aZ1mohzFdnusSn8pP4qQ mWKDhP2QAWQ6TWiqeVTDTqcB70bfwe+d3izE7WGHGb08hiOyzf0yMJPxBYKR5IPXGUqi r4xrwK34avLwijDa2V6z4DiPFfr1GD28B98sVPV1UPL1gEbVJ5wsICkUnU59rfUPsWZe 7Bqw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=KFbBr90gefR+1Pl+HlupC3oRKy8nqU/RjzeqFiPeQ3o=; b=77WQigumU7npmPlhfxFoEnXn+2WVrXBoAfBB6uA6dacOamyip4GHeBukpBL7fT6dGv 2S5rsmtYSGAlAPtjI75tG6t2K5+m7ze9n+d4hcEso3ByoDCJTT/PIQy6/uQS5cz5bHj5 
ejfm3mLeMNG2cEugwNq2U3KRZdz3DKO1pq2wt22kb/zv5SMxc4VtjSLOTDw1M1iNAyW+ eO4FVLrDgkVQK4Or91g9jtlIGsaGEPQntGFbjZyRHYfpFGYI6YNBnfCGwFAZ3dE/J0Fw fe6JSzPK3gGjiT9Xxq9NR3mYK5f4MGZKSPEIveypn/ee+Sn9j6lC02CO8mGOwIVdJ88E 5n4g== X-Gm-Message-State: AOAM532+7suhDLcMtWmi+K9cjPgfcvNWA6NW2ERIthOQ8mOFTVxPBhKt IqFpmkH5oNBe4MN8nPREh4a8KLwpDLY= X-Google-Smtp-Source: ABdhPJzID7l6m8Fx9/aG/fU+oOSFrCd+5fwfQFqWvtei+GtvUD1cazlNjaFE8U3c5NvHThysAURm2A== X-Received: by 2002:a17:90b:4c03:: with SMTP id na3mr4453829pjb.62.1638332638147; Tue, 30 Nov 2021 20:23:58 -0800 (PST) Received: from localhost ([2405:201:6014:d064:3d4e:6265:800c:dc84]) by smtp.gmail.com with ESMTPSA id n71sm22891407pfd.50.2021.11.30.20.23.57 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 30 Nov 2021 20:23:57 -0800 (PST) From: Kumar Kartikeya Dwivedi To: bpf@vger.kernel.org Cc: Alexander Viro , linux-fsdevel@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Yonghong Song , Pavel Emelyanov , Alexander Mikhalitsyn , Andrei Vagin , criu@openvz.org, io-uring@vger.kernel.org Subject: [PATCH bpf-next v3 07/10] selftests/bpf: Add test for epoll BPF iterator Date: Wed, 1 Dec 2021 09:53:30 +0530 Message-Id: <20211201042333.2035153-8-memxor@gmail.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20211201042333.2035153-1-memxor@gmail.com> References: <20211201042333.2035153-1-memxor@gmail.com> MIME-Version: 1.0 X-Developer-Signature: v=1; a=openpgp-sha256; l=5763; h=from:subject; bh=Eob8InaX8B/ObiQ95zseeNHtxkxkQ3RBO6zFLubC6DQ=; b=owEBbQKS/ZANAwAIAUzgyIZIvxHKAcsmYgBhpvYy03VQ7tWIC+yeXaQVoRmDO93wh9RiLBW4dXQc h8xiWuGJAjMEAAEIAB0WIQRLvip+Buz51YI8YRFM4MiGSL8RygUCYab2MgAKCRBM4MiGSL8RyvwHD/ sH3KfyWnQIr0qyiTdcwU7gQSuemMMK2ff/t/3zEY4kJgspMBM5/pvRkLCeUjgCyHIX61gdjrhl/V9o AEQajN5ims0gbtAHUCs2je0xaqOahxLBMPScNKUvUn/ouX+e3sC1L5FtarUScae5ZgzN0UCc2sUDxS 7NN9FCpujgWvBmSk0dprEGGizTNF/e8HRPCVjXX/pNA1Gk+H9Ld9l/tWuGU64jeNaW2i0xeDj5/PIX nLp7750Bs2sfh7+JnOGcgaE/km4RurnhG2dzbi5AMISIFLO3/TtHBa2es7mu6U2UcpuV0viuMTopb1 
s8A+qMyQN9MrYF0MhQWdohJ9aMDvxGMGZhSr8OEaru4E+c45be0go+3dV5JT+CvHjjQm8hsxnnfFKL QyCN2VnOudggdvbCNeYGk7BoY042EWFD4qb0VtYjuHzx+nrMT9B8Ouu47QXAaBOJm8WdFXSwe6Q7Jz FuyOsfp6rQ0/ahv7o7hEoaKeRFJ86PaA0IqEbzADLVvsqhiM8qUnWFU9Q+CFlRy6rHNQvu6DPgluMK Udc8+/RbYccQX5KurMC4AoEIrb8lvL20x2RBALRdmIlsajCTIjTwnkTkRa/cXG+/jpz2MiPyI56Ll6 WqvQ+P0N8hQoB1PWWYhCGpG5aXGQANcSuiLlRuicGVxG7bHGt0IawNq6vg7w== X-Developer-Key: i=memxor@gmail.com; a=openpgp; fpr=4BBE2A7E06ECF9D5823C61114CE0C88648BF11CA Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net This tests the epoll iterator, including peeking into the epitem to inspect the registered file and fd number, and verifying that in userspace. Cc: Alexander Viro Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Kumar Kartikeya Dwivedi --- .../selftests/bpf/prog_tests/bpf_iter.c | 121 ++++++++++++++++++ .../selftests/bpf/progs/bpf_iter_epoll.c | 33 +++++ 2 files changed, 154 insertions(+) create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_epoll.c diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c index 13ea2eaed032..cc0555c5b373 100644 --- a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c +++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c @@ -2,6 +2,7 @@ /* Copyright (c) 2020 Facebook */ #include #include +#include #include #include @@ -31,6 +32,7 @@ #include "bpf_iter_test_kern5.skel.h" #include "bpf_iter_test_kern6.skel.h" #include "bpf_iter_io_uring.skel.h" +#include "bpf_iter_epoll.skel.h" static int duration; @@ -1486,6 +1488,123 @@ void test_io_uring_file(void) bpf_iter_io_uring__destroy(skel); } +void test_epoll(void) +{ + const char *fmt = "B\npipe:%d\nsocket:%d\npipe:%d\nsocket:%d\nE\n"; + DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts); + char buf[4096] = {}, rbuf[4096] = {}; + union bpf_iter_link_info linfo; + int fds[2], sk[2], epfd, ret; + struct bpf_iter_epoll *skel; + struct epoll_event ev = {}; + int 
iter_fd, set[4] = {}; + char *s, *t; + + opts.link_info = &linfo; + opts.link_info_len = sizeof(linfo); + + skel = bpf_iter_epoll__open_and_load(); + if (!ASSERT_OK_PTR(skel, "bpf_iter_epoll__open_and_load")) + return; + + epfd = epoll_create1(EPOLL_CLOEXEC); + if (!ASSERT_GE(epfd, 0, "epoll_create1")) + goto end; + + ret = pipe(fds); + if (!ASSERT_OK(ret, "pipe(fds)")) + goto end_epfd; + + ret = socketpair(AF_UNIX, SOCK_STREAM, 0, sk); + if (!ASSERT_OK(ret, "socketpair")) + goto end_pipe; + + ev.events = EPOLLIN; + + ret = epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev); + if (!ASSERT_OK(ret, "epoll_ctl")) + goto end_sk; + + ret = epoll_ctl(epfd, EPOLL_CTL_ADD, sk[0], &ev); + if (!ASSERT_OK(ret, "epoll_ctl")) + goto end_sk; + + ret = epoll_ctl(epfd, EPOLL_CTL_ADD, fds[1], &ev); + if (!ASSERT_OK(ret, "epoll_ctl")) + goto end_sk; + + ret = epoll_ctl(epfd, EPOLL_CTL_ADD, sk[1], &ev); + if (!ASSERT_OK(ret, "epoll_ctl")) + goto end_sk; + + linfo.epoll.epoll_fd = epfd; + skel->links.dump_epoll = bpf_program__attach_iter(skel->progs.dump_epoll, &opts); + if (!ASSERT_OK_PTR(skel->links.dump_epoll, "bpf_program__attach_iter")) + goto end_sk; + + iter_fd = bpf_iter_create(bpf_link__fd(skel->links.dump_epoll)); + if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create")) + goto end_sk; + + ret = epoll_ctl(epfd, EPOLL_CTL_ADD, iter_fd, &ev); + if (!ASSERT_EQ(ret, -1, "epoll_ctl add for iter_fd")) + goto end_iter_fd; + + ret = snprintf(buf, sizeof(buf), fmt, fds[0], sk[0], fds[1], sk[1]); + if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf), "snprintf")) + goto end_iter_fd; + + ret = read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf)); + if (!ASSERT_GT(ret, 0, "read_fd_into_buffer")) + goto end_iter_fd; + + puts("=== Expected Output ==="); + printf("%s", buf); + puts("==== Actual Output ===="); + printf("%s", rbuf); + puts("======================="); + + s = rbuf; + while ((s = strtok_r(s, "\n", &t))) { + int fd = -1; + + if (s[0] == 'B' || s[0] == 'E') + goto next; + 
ASSERT_EQ(sscanf(s, s[0] == 'p' ? "pipe:%d" : "socket:%d", &fd), 1, s); + if (fd == fds[0]) { + ASSERT_NEQ(set[0], 1, "pipe[0]"); + set[0] = 1; + } else if (fd == fds[1]) { + ASSERT_NEQ(set[1], 1, "pipe[1]"); + set[1] = 1; + } else if (fd == sk[0]) { + ASSERT_NEQ(set[2], 1, "sk[0]"); + set[2] = 1; + } else if (fd == sk[1]) { + ASSERT_NEQ(set[3], 1, "sk[1]"); + set[3] = 1; + } else { + ASSERT_TRUE(0, "Incorrect fd in iterator output"); + } +next: + s = NULL; + } + for (int i = 0; i < ARRAY_SIZE(set); i++) + ASSERT_EQ(set[i], 1, "fd found"); +end_iter_fd: + close(iter_fd); +end_sk: + close(sk[1]); + close(sk[0]); +end_pipe: + close(fds[1]); + close(fds[0]); +end_epfd: + close(epfd); +end: + bpf_iter_epoll__destroy(skel); +} + void test_bpf_iter(void) { if (test__start_subtest("btf_id_or_null")) @@ -1550,4 +1669,6 @@ void test_bpf_iter(void) test_io_uring_buf(); if (test__start_subtest("io_uring_file")) test_io_uring_file(); + if (test__start_subtest("epoll")) + test_epoll(); } diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_epoll.c b/tools/testing/selftests/bpf/progs/bpf_iter_epoll.c new file mode 100644 index 000000000000..0afc74d154a1 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/bpf_iter_epoll.c @@ -0,0 +1,33 @@ +// SPDX-License-Identifier: GPL-2.0 +#include "bpf_iter.h" +#include + +extern void pipefifo_fops __ksym; + +SEC("iter/epoll") +int dump_epoll(struct bpf_iter__epoll *ctx) +{ + struct seq_file *seq = ctx->meta->seq; + struct epitem *epi = ctx->epi; + char sstr[] = "socket"; + char pstr[] = "pipe"; + + if (!ctx->meta->seq_num) { + BPF_SEQ_PRINTF(seq, "B\n"); + } + if (epi) { + struct file *f = epi->ffd.file; + char *str; + + if (f->f_op == &pipefifo_fops) + str = pstr; + else + str = sstr; + BPF_SEQ_PRINTF(seq, "%s:%d\n", str, epi->ffd.fd); + } else { + BPF_SEQ_PRINTF(seq, "E\n"); + } + return 0; +} + +char _license[] SEC("license") = "GPL"; From patchwork Wed Dec 1 04:23:31 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 
Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kumar Kartikeya Dwivedi X-Patchwork-Id: 12649015 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 84012C433EF for ; Wed, 1 Dec 2021 04:24:46 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1346611AbhLAE1z (ORCPT ); Tue, 30 Nov 2021 23:27:55 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49558 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1346606AbhLAE1i (ORCPT ); Tue, 30 Nov 2021 23:27:38 -0500 Received: from mail-pg1-x541.google.com (mail-pg1-x541.google.com [IPv6:2607:f8b0:4864:20::541]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id AECFFC06175A; Tue, 30 Nov 2021 20:24:01 -0800 (PST) Received: by mail-pg1-x541.google.com with SMTP id s37so12537553pga.9; Tue, 30 Nov 2021 20:24:01 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=/GG26fljeFe9x6wzQv8DNbXQNyp/pOqmJex7xSjYwQY=; b=SnJBUkTHcfWk6urUJMXJzXclIQBnC03ctgh2DcPP4EvvRRWbxLtOPIQbY9ehhmUIo6 JIofsap0RTaSSRNEsvdf2drwr4IWqIGCeahfgCWOUMtu2UtFEumIr8mqIfHa0sXyHG6O iTULkA/H3DFGKAFuGkLGGh/g+5qWBTPRGUQznCERwhz7hdQO6fS7B0Pc8+R3suWySvXR PniLlgrUKlAUNz0LQTFf+N9BthY80KchPRjsmcTz18Yu9vo/jOFtpJ5+SGf8RG4C+oqy f6F8RUbfBljHq1tLfqn/JbnUF+zoN+1KfFZYuVptQ8SpHG50P8EIKCw4qjOdRAIQuByY DCBA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=/GG26fljeFe9x6wzQv8DNbXQNyp/pOqmJex7xSjYwQY=; b=1qihPktxkVCuF8NURhMgof9KWIQOm7ChqFINhHIsODpuvDRWUcsmALz45GmDfIrE3X 
G4e33Uwn0ZHVnmT0cqIzokKoGg+mDM9X+0/uuLrl0yxgLaOMht5gu36F7Aqe3duNbHFD k7yQslxdw+eH0VZw0XmS5aYvIzzkCeFfWbTmZhfUO7zf/1wJhJA708hyy4rp3yIMHvsF wLproyKjOxVDbqZVG+L/d0YF8KM+832trSxfMj7OzKc6yasq1hyRoTIehuKrOHp0XzV4 QdA0JLVsBUrV0ciBntIXja1NODvqaEtzD/YXd1C8msTwnLvrJolyar2KZZgVYumFc1W3 p2bQ== X-Gm-Message-State: AOAM531tg1/zTIquPFRXZy9ZvFwYEuugnHbqoDQ2YSemm4+jE8f6adxt WEqTbi6TFxW8njmeGLiX3xweXpHKaCY= X-Google-Smtp-Source: ABdhPJzzZpX3Lrd931VgUe8gGDP/OUZQWzFevWu/CavZhJvUCF5YDd7DWx9Ma0rJZRvjQxIJhFjMUA== X-Received: by 2002:a65:4bc6:: with SMTP id p6mr2812040pgr.544.1638332641147; Tue, 30 Nov 2021 20:24:01 -0800 (PST) Received: from localhost ([2405:201:6014:d064:3d4e:6265:800c:dc84]) by smtp.gmail.com with ESMTPSA id qe2sm4538986pjb.42.2021.11.30.20.24.00 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 30 Nov 2021 20:24:00 -0800 (PST) From: Kumar Kartikeya Dwivedi To: bpf@vger.kernel.org Cc: Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Yonghong Song , Pavel Emelyanov , Alexander Mikhalitsyn , Andrei Vagin , criu@openvz.org, io-uring@vger.kernel.org, linux-fsdevel@vger.kernel.org Subject: [PATCH bpf-next v3 08/10] selftests/bpf: Test partial reads for io_uring, epoll iterators Date: Wed, 1 Dec 2021 09:53:31 +0530 Message-Id: <20211201042333.2035153-9-memxor@gmail.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20211201042333.2035153-1-memxor@gmail.com> References: <20211201042333.2035153-1-memxor@gmail.com> MIME-Version: 1.0 X-Developer-Signature: v=1; a=openpgp-sha256; l=4120; h=from:subject; bh=aaB79qIuMMM7CbUZq9r3qLVaIyomUA09O47gys8VzJY=; b=owEBbQKS/ZANAwAIAUzgyIZIvxHKAcsmYgBhpvYyIvWjAWHOYOzMoD2ZREvPrPxvUqFa24X82mna LbuaSzaJAjMEAAEIAB0WIQRLvip+Buz51YI8YRFM4MiGSL8RygUCYab2MgAKCRBM4MiGSL8RypZIEA CR3mW3w4TBLpLOUaxrQ2T2UfBGffRRTxgV3Vo4MoGuYCOpdoYiSKh0FAQd38pZZ9i7K/X0T1+uBC0q 14ezpGDLl1SbiXNe4KZzqBQ6UWN1Q3G7kgOSdwYtir9Z9uDTTZ+GNCVlyhaB6Ut1eZmsvQhYXA/VMf XVXsgGaJfcjhg1NrUZqDD90sfaPLAH085IavERpfNOK3FUw2Vis8SDpGHtVLj+dS9lboxdXpehGk2K 
GGqvdEews0VF9uRivLIz8rd/w6pactdirOTZjQqTn0GbkIU84sDLAMUeXcyt9IJl9BTG9BBmiJsFTc TbqTjJts04aUEZ/3JSpCv5ULYsrnQ89nHedL91pvqCAnyqtBBky4KgWb/bn0HaJqyAFC9RnkcCFjLG ogBePSCceufxFl2joHm36v5o4JuL8RzRCeccDFocvcOTEPHhto4LuGSm7Eeql1+TGTln5QeO+xqgID 7J3/io4Sq+j5Xfd+36jLo1AL/SKWH7V5stF8uxWT0+vsCMcuHezb4FhoVETOuwvpQthyDWRClMAiRt 5uVTcfGvdoXv1WkgKput9St+Ozk6cjPkHLCUKu4IfM9DmevmZhRfH2opJYcAX+pfJSdHzMXb336Lcr E+RgbOPWq66u43PvyHRL5WP8xt4ULQAuQctpFKNTjHRIFEvqE5DQ023hRKaQ== X-Developer-Key: i=memxor@gmail.com; a=openpgp; fpr=4BBE2A7E06ECF9D5823C61114CE0C88648BF11CA Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net Ensure that the output is consistent in the face of partial reads that return to userspace and resume later. To this end, we do reads in 1-byte chunks, which is a bit stupid in real life, but works well to simulate interrupted iteration. This also tests the case where the seq_file buffer is consumed (after seq_printf) on an interrupted read before the iterator invokes the BPF prog again. Signed-off-by: Kumar Kartikeya Dwivedi --- .../selftests/bpf/prog_tests/bpf_iter.c | 33 ++++++++++++------- 1 file changed, 22 insertions(+), 11 deletions(-) diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c index cc0555c5b373..3a07fdf31874 100644 --- a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c +++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c @@ -73,13 +73,13 @@ static void do_dummy_read(struct bpf_program *prog) bpf_link__destroy(link); } -static int read_fd_into_buffer(int fd, char *buf, int size) +static int __read_fd_into_buffer(int fd, char *buf, int size, size_t chunks) { int bufleft = size; int len; do { - len = read(fd, buf, bufleft); + len = read(fd, buf, chunks ?: bufleft); if (len > 0) { buf += len; bufleft -= len; @@ -89,6 +89,11 @@ static int read_fd_into_buffer(int fd, char *buf, int size) return len < 0 ? 
len : size - bufleft; } +static int read_fd_into_buffer(int fd, char *buf, int size) +{ + return __read_fd_into_buffer(fd, buf, size, 0); +} + static void test_ipv6_route(void) { struct bpf_iter_ipv6_route *skel; @@ -1301,7 +1306,7 @@ static int io_uring_inode_match(int link_fd, int io_uring_fd) return 0; } -void test_io_uring_buf(void) +void test_io_uring_buf(bool partial) { DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts); char rbuf[4096], buf[4096] = "B\n"; @@ -1375,7 +1380,7 @@ void test_io_uring_buf(void) if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create")) goto end_close_fd; - ret = read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf)); + ret = __read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf), partial); if (!ASSERT_GT(ret, 0, "read_fd_into_buffer")) goto end_close_iter; @@ -1396,7 +1401,7 @@ void test_io_uring_buf(void) bpf_iter_io_uring__destroy(skel); } -void test_io_uring_file(void) +void test_io_uring_file(bool partial) { int reg_files[] = { [0 ... 7] = -1 }; DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts); @@ -1464,7 +1469,7 @@ void test_io_uring_file(void) if (!ASSERT_OK(ret, "io_uring_register_files")) goto end_iter_fd; - ret = read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf)); + ret = __read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf), partial); if (!ASSERT_GT(ret, 0, "read_fd_into_buffer(iterator_fd, buf)")) goto end_iter_fd; @@ -1488,7 +1493,7 @@ void test_io_uring_file(void) bpf_iter_io_uring__destroy(skel); } -void test_epoll(void) +void test_epoll(bool partial) { const char *fmt = "B\npipe:%d\nsocket:%d\npipe:%d\nsocket:%d\nE\n"; DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts); @@ -1554,7 +1559,7 @@ void test_epoll(void) if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf), "snprintf")) goto end_iter_fd; - ret = read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf)); + ret = __read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf), partial); if (!ASSERT_GT(ret, 0, "read_fd_into_buffer")) goto end_iter_fd; @@ -1666,9 +1671,15 @@ void test_bpf_iter(void) if 
(test__start_subtest("buf-neg-offset")) test_buf_neg_offset(); if (test__start_subtest("io_uring_buf")) - test_io_uring_buf(); + test_io_uring_buf(false); if (test__start_subtest("io_uring_file")) - test_io_uring_file(); + test_io_uring_file(false); if (test__start_subtest("epoll")) - test_epoll(); + test_epoll(false); + if (test__start_subtest("io_uring_buf-partial")) + test_io_uring_buf(true); + if (test__start_subtest("io_uring_file-partial")) + test_io_uring_file(true); + if (test__start_subtest("epoll-partial")) + test_epoll(true); } From patchwork Wed Dec 1 04:23:32 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kumar Kartikeya Dwivedi X-Patchwork-Id: 12649021 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C51CAC433EF for ; Wed, 1 Dec 2021 04:24:58 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S241794AbhLAE2L (ORCPT ); Tue, 30 Nov 2021 23:28:11 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49564 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1346608AbhLAE1i (ORCPT ); Tue, 30 Nov 2021 23:27:38 -0500 Received: from mail-pf1-x444.google.com (mail-pf1-x444.google.com [IPv6:2607:f8b0:4864:20::444]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9C9FDC06175B; Tue, 30 Nov 2021 20:24:04 -0800 (PST) Received: by mail-pf1-x444.google.com with SMTP id n26so23004876pff.3; Tue, 30 Nov 2021 20:24:04 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=y7IcXh7hkKTNMPxt6YZo1bjq2Oa0ASN2LOQPZecX9IA=; 
b=G+voL6wJw3c+eMoG3p9xj5cgifda2ejkqCJqCQLLeSaGHJ+JKnfVBotRvVX/0sOx7I 6C7aicypaCP+nfwUqw+9GgnLaR2+E0Z9BaOMfv5bxQ/+EiZP9sEE2pDKFhmbp50KMs1i WCLE0WBK5FO9w9UtFchHY1Sf2OIuxQao92ENqR5m8o70J7D/TD5/ffzmblQ6eg3adjbx kx7AGLbanbtH/jgwLvctaEdQlBzpXpf0aufSTxSocbY6r2+CE+20IiWSqy1KK3Cr4SSe DX1ODKLbUZL45dcKPm3Sxm7cmEQfhNT3qin/o1eFpYV2+Pl3PaZ9kcURRsH2PeAJepX/ DI9w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=y7IcXh7hkKTNMPxt6YZo1bjq2Oa0ASN2LOQPZecX9IA=; b=Ae+ZirFiA8s+tt0Sl9LaLA16zBKM19rg3lOBCEslue1UdYr24K1ofI6ATFiJSHHBqL b06GEodYBrjrFhCqyQG5kjWFN6OwYNwC+6IZy9sryXTZORuUzaa9hYO5VfHWZ8b4sx5l BS71DleUw2fMO2Zm5CEFJtHsiN3LOL9qZxWabTE++XVpXp086DCY60++ZGADRs5uahxa CnhYrsY3nJQW7F5jx2DAgth8IJFpJA7SzpFEXUhihUSUvl1jE5xDRnWMUQE4GzcRFYV0 yo8wSU3lj4huP1B0VF2oBg16jHyrAZqlNvM61yvlLOl7Y8VW5tK40yhx1XKInPHvRuku cn0w== X-Gm-Message-State: AOAM532zWHr+Y0TOkwpszeRXEZrgu4WAT+5sB9U6eL2P8QtSeZahn5CK 3R5IpdExtliTIZym3reak1VSIcIv29s= X-Google-Smtp-Source: ABdhPJxLaJnjlMXWkC24b57twGFlBGMgy582WfgYrQLbB/lD8ZTWNvk/AyBOttqbm/xjQsyogkyLLw== X-Received: by 2002:a05:6a00:234a:b0:49f:c0f7:f474 with SMTP id j10-20020a056a00234a00b0049fc0f7f474mr3616842pfj.64.1638332644067; Tue, 30 Nov 2021 20:24:04 -0800 (PST) Received: from localhost ([2405:201:6014:d064:3d4e:6265:800c:dc84]) by smtp.gmail.com with ESMTPSA id g1sm8435444pgm.23.2021.11.30.20.24.03 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 30 Nov 2021 20:24:03 -0800 (PST) From: Kumar Kartikeya Dwivedi To: bpf@vger.kernel.org Cc: Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Yonghong Song , Pavel Emelyanov , Alexander Mikhalitsyn , Andrei Vagin , criu@openvz.org, io-uring@vger.kernel.org, linux-fsdevel@vger.kernel.org Subject: [PATCH bpf-next v3 09/10] selftests/bpf: Fix btf_dump test for bpf_iter_link_info Date: Wed, 1 Dec 2021 09:53:32 +0530 Message-Id: 
<20211201042333.2035153-10-memxor@gmail.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20211201042333.2035153-1-memxor@gmail.com> References: <20211201042333.2035153-1-memxor@gmail.com> MIME-Version: 1.0 X-Developer-Signature: v=1; a=openpgp-sha256; l=1199; h=from:subject; bh=QE4SyOAFzzmkq+EmLnqIhi3J/rXlE8epdKkKQ0ddjk0=; b=owEBbQKS/ZANAwAIAUzgyIZIvxHKAcsmYgBhpvYyHodypS5dcJxgGWq6Gcm01FTC9AoUN7RCDrNE /m9lpO2JAjMEAAEIAB0WIQRLvip+Buz51YI8YRFM4MiGSL8RygUCYab2MgAKCRBM4MiGSL8RysTEEA DCmZCmJmDZDxqlorY0oGjDyVOz8xicGLWGmSH307rEylOEUw85KOGpd3eIMfiK/Nnqg9rK5s2wHPyc /ARNL1eox1+ygPgQQUQ0haEENVe149mNIY/bB5clUNBw+8dhw26Jt+Vy2LepNOEYDIAsIC+y/s374r qh6XZm2I8MsXRuSTgWWdV71OsJ221i9gMWFW+H/ReIFwLFg1Nh5n6suRxc+bv6FdTUpn7zERR7kxxO v4c6C8k3f2fuhfQ9JMcnWu4QSiJsYfGQWVShUA/alh0W6ah9MvRYJvyfpUUL+v7ObgbCYbicBIb/pl RC4aSphiHmGciptihLOImE6FmlEw3HAdz/kIpPtLarR6fEXgAIGfeClA8LIu5YtJRj/v6eh3WK4vI0 J8nawV0icVnf9wjjMvRdKF2Ge8cYv8lmHVYnGGiLK61qeGrq0BtCBxBycYrFRmS5Kv07SSlfwqAnHm I0T+D5giFEj0fLONEz1X8U0h1tHGpnrH4Ha/YSt0LDHHQ0ZPN6uWHJhAQwls+gcoVHrc9QFxrt/amO hNfMhKbhGZuCEkRNYHpVDJFOf7452KtXXmLPHWomWisxX7E/c0Kha4lYlLFFJlH6wg1bBnRHVYyadm 8CK3YEOupz+hdjA4nV9A70shbJW6vHjAjpWhf9B1PR22DK1pa0fRWf4Rd+Bw== X-Developer-Key: i=memxor@gmail.com; a=openpgp; fpr=4BBE2A7E06ECF9D5823C61114CE0C88648BF11CA Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net Since we changed the definition while adding io_uring and epoll iterator support, adjust the selftest to check against the updated definition. 
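For illustration, a minimal sketch of why the expected dump string now lists every member with the value 1: all members of a union overlay the same storage, so initializing only .map.map_fd makes the other members read back the same value. The union demo_iter_link_info and both helper functions below are hypothetical stand-ins, assumed to mirror only the three members named in the expected string; they are not the kernel's definition.

```c
#include <stdint.h>

/* Hypothetical stand-in for union bpf_iter_link_info, reduced to the
 * three members that appear in the expected dump string. */
union demo_iter_link_info {
	struct { uint32_t map_fd; } map;
	struct { uint32_t io_uring_fd; } io_uring;
	struct { uint32_t epoll_fd; } epoll;
};

/* Initialize only .map.map_fd, as the selftest does, then read the
 * same bytes back through the io_uring member. */
static uint32_t demo_io_uring_view(uint32_t map_fd)
{
	union demo_iter_link_info info = { .map = { .map_fd = map_fd } };

	return info.io_uring.io_uring_fd;
}

/* Same as above, but read back through the epoll member. */
static uint32_t demo_epoll_view(uint32_t map_fd)
{
	union demo_iter_link_info info = { .map = { .map_fd = map_fd } };

	return info.epoll.epoll_fd;
}
```

Under these assumptions, demo_io_uring_view(1) and demo_epoll_view(1) both return 1, which is the aliasing the updated expected string accounts for.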
Signed-off-by: Kumar Kartikeya Dwivedi --- tools/testing/selftests/bpf/prog_tests/btf_dump.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/bpf/prog_tests/btf_dump.c b/tools/testing/selftests/bpf/prog_tests/btf_dump.c index 9e26903f9170..1678b2c49f78 100644 --- a/tools/testing/selftests/bpf/prog_tests/btf_dump.c +++ b/tools/testing/selftests/bpf/prog_tests/btf_dump.c @@ -736,7 +736,9 @@ static void test_btf_dump_struct_data(struct btf *btf, struct btf_dump *d, /* union with nested struct */ TEST_BTF_DUMP_DATA(btf, d, "union", str, union bpf_iter_link_info, BTF_F_COMPACT, - "(union bpf_iter_link_info){.map = (struct){.map_fd = (__u32)1,},}", + "(union bpf_iter_link_info){.map = (struct){.map_fd = (__u32)1,}," + ".io_uring = (struct){.io_uring_fd = (__u32)1,}," + ".epoll = (struct){.epoll_fd = (__u32)1,},}", { .map = { .map_fd = 1 }}); /* struct skb with nested structs/unions; because type output is so From patchwork Wed Dec 1 04:23:33 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kumar Kartikeya Dwivedi X-Patchwork-Id: 12649019 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C0DCCC433EF for ; Wed, 1 Dec 2021 04:24:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1346623AbhLAE2G (ORCPT ); Tue, 30 Nov 2021 23:28:06 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49566 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1346610AbhLAE1i (ORCPT ); Tue, 30 Nov 2021 23:27:38 -0500 Received: from mail-pl1-x644.google.com (mail-pl1-x644.google.com [IPv6:2607:f8b0:4864:20::644]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6760EC06175C; 
Tue, 30 Nov 2021 20:24:08 -0800 (PST) Received: by mail-pl1-x644.google.com with SMTP id u17so16656138plg.9; Tue, 30 Nov 2021 20:24:08 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=1lDChbfVnceuDgQ7cJNX6E8lOTvvhMw1Gu6MIe1nrOU=; b=NHiLByohLdUF11afUMUCHNCn7WyVm/j8em+dDiurLHK1MKycadT2vGprXhyQsq+CvW 3EanLf6L/ygWHmNWEQ8myGME5p/RfZiNWEftJdCNN0fYNRXKw/7mypeplnxVepteYOHo VlA8x2DlGT0YsYWIE45YHY3JX0rgK9lJywqyjBl6MYhiLLDtlNufP7acvKpfc/uynb3V HAR3o7/bZfONBdCjXOmD7EzMYePftsWHoQzsAjuBypGQwNIZlrZm6JQnSefsNTN3+kv9 TWtC0sWhasKlx3TQYe1Zl3BydVRyIwVB+vLlFCwkmWfbRVqyD0W/2I2to+EDPu9rkPcx McHA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=1lDChbfVnceuDgQ7cJNX6E8lOTvvhMw1Gu6MIe1nrOU=; b=fLNeMJdgbFgjF3akR8qWvxSS4iNMUhJBEuO2uMc5p2rsRIocRwIVUt5Jy021Bwzgs3 HTahr0v294DCplLedlohKkcySWcLE5KfLdYZWDkYpvB4fJStP9y/K8xL0Rz1EYW7X0rK LoisO3giQ4Jps++H2fYNZg2OIX9Yxq43E3ItJD+/4/6Gn07B0rnL/YVnA7a6oNXnZuIV hcglfal4BFsuCTdKqO18HLOIXKjKHKJIICsJ0MPwJDxsHtx+l/cPJ2MWrjWUvojJCvLN 2bxMxDpMWIP3HL2QXvhs1sU40wjL8gn8lz5FfdiuTg3jCmJAdD5TKgyVEqfy8VxiWnW8 xsuw== X-Gm-Message-State: AOAM532kaK3lk0EXKEasxggrDbxYh6lTrzWYxbGZPiq7c2ejDuVkOvJP Jb1Jkd07hJqG1kMQ4Po1eH6a1Le3YxI= X-Google-Smtp-Source: ABdhPJw5EJgVk2HRCX88RRZr84GNhhr5pqFSzCe/81C2ww8Im7NPC3xi/6au5pe0cCq0+fbEu0Hy9Q== X-Received: by 2002:a17:90b:1d09:: with SMTP id on9mr4323427pjb.191.1638332647162; Tue, 30 Nov 2021 20:24:07 -0800 (PST) Received: from localhost ([2405:201:6014:d064:3d4e:6265:800c:dc84]) by smtp.gmail.com with ESMTPSA id lt5sm3985662pjb.43.2021.11.30.20.24.06 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 30 Nov 2021 20:24:06 -0800 (PST) From: Kumar Kartikeya Dwivedi To: bpf@vger.kernel.org Cc: Alexei Starovoitov , Daniel 
Borkmann , Andrii Nakryiko , Yonghong Song , Pavel Emelyanov , Alexander Mikhalitsyn , Andrei Vagin , criu@openvz.org, io-uring@vger.kernel.org, linux-fsdevel@vger.kernel.org Subject: [PATCH RFC bpf-next v3 10/10] samples/bpf: Add example to checkpoint/restore io_uring Date: Wed, 1 Dec 2021 09:53:33 +0530 Message-Id: <20211201042333.2035153-11-memxor@gmail.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20211201042333.2035153-1-memxor@gmail.com> References: <20211201042333.2035153-1-memxor@gmail.com> MIME-Version: 1.0 X-Developer-Signature: v=1; a=openpgp-sha256; l=35678; h=from:subject; bh=iNAt8VQjuJBYGUNqQ67EOuHYsc8ms6HOM1LXury7neo=; b=owEBbQKS/ZANAwAIAUzgyIZIvxHKAcsmYgBhpvYydW38QmNI2no2Qf/JXHCwEpIIyZqKtCLX+sbP qHmscZaJAjMEAAEIAB0WIQRLvip+Buz51YI8YRFM4MiGSL8RygUCYab2MgAKCRBM4MiGSL8RyqbUD/ 965Ad+jU/WtRUTN+XQzI5mVIDQ1/kCptBipLpb6oQEaslbUmM5tOmrCbuwf55TR9FT/1bXZ5D5f9sm YXPsm6Iu6/BpkJq9saxo/AV0QqHkgqZ6oT1uqWenD8/ZwDVESyefkKUd6nUlwRjo6Zx54UfclmAK8i gf2bD5S/EA6w9v3H8K08sFOgXIlN2nRKjWZyn0SPbKlVtvFWT9vuWKC6KOJGzyQynRwUBmkFinAwzj 3JVivsbATksNsS+qWiRwmxYwwCZFz5nPb4eWOXaVFkPfz0PnloP8aSQaOmWymOCXK0ZIF3b0Ur4mKf 35Hq7gY0QAyIDObtLPDWTKYz/FppkxA9Y7TOANiZDneqe1bmWgmsG6IKe4Ndaky0lnEZkB7cozvDHq 2tiIRfZ0AurCqAF3i8z2bA7xVC7vyNKYsmNgs20dJjZ+HHgYgt5dDmytmAk6Y0QDUwBc4mQVhv2wXv /iFYPSBvkp+HDVABBTsnprTV4P282jqibVVCtwk0al62DzDMERbAnSCyAlFNxZbM95BThBMTa4TR2Z 0/VmPfWd1dQEQOi/xJMhZTxJIfxyIv0y8mi/uJ9vEdmgEP4OW+hOjUxX6/8sfO33DL99JUXTIKyvZf eLdKrEqUtHdYYHv/DsJRUU+oDhKlYIBpZI9/VTyfkGidqBExSDE68HLDcs9g== X-Developer-Key: i=memxor@gmail.com; a=openpgp; fpr=4BBE2A7E06ECF9D5823C61114CE0C88648BF11CA Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC The sample demonstrates how BPF iterators for task and io_uring can be used to checkpoint the state of an io_uring instance and then recreate it using that information, as a working example of how the iterator will be utilized for the same by userspace projects like CRIU. 
In principle this is very similar to how CRIU actually works: all data is written to protobuf images on dump, and these are read back during restore to reconstruct the task and its resources. Here we use a custom binary format and pipe the io_uring "image(s)" (there will be multiple images when wq_fd is set) to the restorer, which consumes this information to form a total ordering of the restore actions it has to execute to reach the same state. The sample restores all features that currently cannot be restored without BPF iterators, and is hence a good demonstration of what we would like to achieve using these new facilities. As is evident, a single iteration pass in each iterator is enough to obtain all the information we require. io_uring ring buffer restoration is orthogonal and not specific to iterators, so it has been left out. Our example app also shares the workqueue with a parent io_uring; the dumper tool detects this and dumps the parent io_uring first. io_uring doesn't allow creating cycles in this case, so in practice the chain always ends. For now only a single parent is supported, but it is easy to extend this to chains of arbitrary length (by recursing with a limit in do_dump_parent after detecting wq_fd > 0). The epoll iterator use case is similar to what we do in dump_io_uring_file, and would significantly simplify the current implementation [0].
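The total ordering of restore actions can be sketched roughly as follows. This is a minimal illustration and not the sample's code: struct desc and the D_* enumerators are hypothetical, simplified stand-ins for struct io_uring_dump and its type enum. Descriptors are kept as an array of pointers and sorted by (io_uring_fd, type), so that for each ring the setup action precedes the eventfd, file and buffer registrations.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for struct io_uring_dump. */
enum desc_type { D_SETUP, D_EVENTFD, D_REG_FD, D_REG_BUF };

struct desc {
	enum desc_type type;
	int32_t io_uring_fd;
};

/*
 * qsort() hands the comparator pointers to the array elements. The array
 * holds pointers, so each argument is a pointer to a pointer and has to be
 * dereferenced once before the descriptor fields are read.
 */
static int desc_cmp(const void *a, const void *b)
{
	const struct desc *da = *(const struct desc * const *)a;
	const struct desc *db = *(const struct desc * const *)b;

	if (da->io_uring_fd != db->io_uring_fd)
		return da->io_uring_fd < db->io_uring_fd ? -1 : 1;
	if (da->type != db->type)
		return da->type < db->type ? -1 : 1;
	return 0;
}

/* Sort descriptors in place: by ring fd first, then by action type. */
static void order_restore_actions(struct desc **descs, size_t n)
{
	assert(descs != NULL || n == 0);
	qsort(descs, n, sizeof(descs[0]), desc_cmp);
}
```

Like the sample's comparator, this keys on the fd number first, so a parent ring is restored before its dependent ring only when the parent's fd is the lower of the two. That holds for the example app, but it is an assumption in general.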
[0]: https://github.com/checkpoint-restore/criu/blob/criu-dev/criu/eventpoll.c The dry-run mode of bpf_cr tool prints the dump image: $ ./bpf_cr app & PID: 318, Parent io_uring: 3, Dependent io_uring: 4 $ ./bpf_cr dump 318 4 | ./bpf_cr restore --dry-run DUMP_SETUP: io_uring_fd: 3 end: true flags: 14 sq_entries: 2 cq_entries: 4 sq_thread_cpu: 0 sq_thread_idle: 1500 wq_fd: 0 DUMP_SETUP: io_uring_fd: 4 end: false flags: 46 sq_entries: 2 cq_entries: 4 sq_thread_cpu: 0 sq_thread_idle: 1500 wq_fd: 3 DUMP_EVENTFD: io_uring_fd: 4 end: false eventfd: 5 async: true DUMP_REG_FD: io_uring_fd: 4 end: false reg_fd: 0 index: 0 DUMP_REG_FD: io_uring_fd: 4 end: false reg_fd: 0 index: 2 DUMP_REG_FD: io_uring_fd: 4 end: false reg_fd: 0 index: 4 DUMP_REG_BUF: io_uring_fd: 4 end: false addr: 0 len: 0 index: 0 DUMP_REG_BUF: io_uring_fd: 4 end: true addr: 140721288339216 len: 120 index: 1 Nothing to do, exiting... ====== The trace is as follows: // We can shift fd number around randomly, it doesn't impact C/R $ exec 3<> /dev/urandom $ exec 4<> /dev/random $ exec 5<> /dev/null $ strace ./bpf_cr app & ... io_uring_setup(2, {flags=IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF|IORING_SETUP_CQSIZE, sq_thread_cpu=0, sq_thread_idle=1500, sq_entries=2, cq_entries=4, features=IORING_FEAT_SINGLE_MMAP|IORING_FEAT_NODROP|IORING_FEAT_SUBMIT_STABLE|IORING_FEAT_RW_CUR_POS|IORING_FEAT_CUR_PERSONALITY|IORING_FEAT_FAST_POLL|IORING_FEAT_POLL_32BITS|IORING_FEAT_SQPOLL_NONFIXED|IORING_FEAT_EXT_ARG|IORING_FEAT_NATIVE_WORKERS|IORING_FEAT_RSRC_TAGS, sq_off={head=0, tail=64, ring_mask=256, ring_entries=264, flags=276, dropped=272, array=384}, cq_off={head=128, tail=192, ring_mask=260, ring_entries=268, overflow=284, cqes=320, flags=0x118 /* IORING_CQ_??? */}}) = 6 getpid() = 324 ... 
io_uring_setup(2, {flags=IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF|IORING_SETUP_CQSIZE|IORING_SETUP_ATTACH_WQ, sq_thread_cpu=0, sq_thread_idle=1500, wq_fd=6, sq_entries=2, cq_entries=4, features=IORING_FEAT_SINGLE_MMAP|IORING_FEAT_NODROP|IORING_FEAT_SUBMIT_STABLE|IORING_FEAT_RW_CUR_POS|IORING_FEAT_CUR_PERSONALITY|IORING_FEAT_FAST_POLL|IORING_FEAT_POLL_32BITS|IORING_FEAT_SQPOLL_NONFIXED|IORING_FEAT_EXT_ARG|IORING_FEAT_NATIVE_WORKERS|IORING_FEAT_RSRC_TAGS, sq_off={head=0, tail=64, ring_mask=256, ring_entries=264, flags=276, dropped=272, array=384}, cq_off={head=128, tail=192, ring_mask=260, ring_entries=268, overflow=284, cqes=320, flags=0x118 /* IORING_CQ_??? */}}) = 7 ... // PID: 324, Parent io_uring: 6, Dependent io_uring: 7 ... eventfd2(42, 0) = 8 io_uring_register(7, IORING_REGISTER_EVENTFD_ASYNC, [8], 1) = 0 io_uring_register(7, IORING_REGISTER_FILES, [0, -1, 1, -1, 2], 5) = 0 io_uring_register(7, IORING_REGISTER_BUFFERS, [{iov_base=NULL, iov_len=0}, {iov_base=0x7ffdf1a27680, iov_len=120}], 2) = 0 The restore's trace is as follows (which detects the wq_fd on its own) and dumps and restores it as well, before restoring fd 7: $ ./bpf_cr dump 326 7 | strace ./bpf_cr restore ... io_uring_setup(2, {flags=IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF|IORING_SETUP_CQSIZE, sq_thread_cpu=0, sq_thread_idle=1500, sq_entries=2, cq_entries=4, features=IORING_FEAT_SINGLE_MMAP|IORING_FEAT_NODROP|IORING_FEAT_SUBMIT_STABLE|IORING_FEAT_RW_CUR_POS|IORING_FEAT_CUR_PERSONALITY|IORING_FEAT_FAST_POLL|IORING_FEAT_POLL_32BITS|IORING_FEAT_SQPOLL_NONFIXED|IORING_FEAT_EXT_ARG|IORING_FEAT_NATIVE_WORKERS|IORING_FEAT_RSRC_TAGS, sq_off={head=0, tail=64, ring_mask=256, ring_entries=264, flags=276, dropped=272, array=384}, cq_off={head=128, tail=192, ring_mask=260, ring_entries=268, overflow=284, cqes=320, flags=0x118 /* IORING_CQ_??? */}}) = 6 dup2(6, 6) = 6 ... 
io_uring_setup(2, {flags=IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF|IORING_SETUP_CQSIZE|IORING_SETUP_ATTACH_WQ, sq_thread_cpu=0, sq_thread_idle=1500, wq_fd=6, sq_entries=2, cq_entries=4, features=IORING_FEAT_SINGLE_MMAP|IORING_FEAT_NODROP|IORING_FEAT_SUBMIT_STABLE|IORING_FEAT_RW_CUR_POS|IORING_FEAT_CUR_PERSONALITY|IORING_FEAT_FAST_POLL|IORING_FEAT_POLL_32BITS|IORING_FEAT_SQPOLL_NONFIXED|IORING_FEAT_EXT_ARG|IORING_FEAT_NATIVE_WORKERS|IORING_FEAT_RSRC_TAGS, sq_off={head=0, tail=64, ring_mask=256, ring_entries=264, flags=276, dropped=272, array=384}, cq_off={head=128, tail=192, ring_mask=260, ring_entries=268, overflow=284, cqes=320, flags=0x118 /* IORING_CQ_??? */}}) = 7 dup2(7, 7) = 7 ... eventfd2(42, 0) = 8 io_uring_register(7, IORING_REGISTER_EVENTFD_ASYNC, [8], 1) = 0 ... // fd number 0 is same as 1 and 2, hence the lowest one is used during restore, // it doesn't matter as underlying struct file is same... io_uring_register(7, IORING_REGISTER_FILES, [0, -1, 0, -1, 0], 5) = 0 // This step would happen after restoring mm, so it fails for now for second iovec io_uring_register(7, IORING_REGISTER_BUFFERS, [{iov_base=NULL, iov_len=0}, {iov_base=0x7ffdf1a27680, iov_len=120}], 2) = -1 EFAULT (Bad address) ... 
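The sparse IORING_REGISTER_FILES table seen in the restore trace ([0, -1, 0, -1, 0]) can be rebuilt from the dumped (index, fd) pairs along these lines. This is a hedged sketch under stated assumptions: struct reg_fd_entry and build_sparse_fd_table() are hypothetical names that do not appear in the sample, and the entries are assumed to arrive sorted by strictly ascending index.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical (index, fd) pair, mirroring the DUMP_REG_FD fields. */
struct reg_fd_entry {
	uint64_t index;
	int fd;
};

/*
 * Expand entries (sorted by strictly ascending index) into a dense array
 * suitable for IORING_REGISTER_FILES, writing -1 into every slot that had
 * no registered file. Returns the number of slots filled, or -1 if an
 * entry is out of order or does not fit into the table.
 */
static long build_sparse_fd_table(const struct reg_fd_entry *entries, size_t n,
				  int *table, size_t cap)
{
	size_t cnt = 0;

	for (size_t i = 0; i < n; i++) {
		if (entries[i].index < cnt || entries[i].index >= cap)
			return -1;
		/* Fill the gap before this index with empty slots. */
		while (cnt < entries[i].index)
			table[cnt++] = -1;
		table[cnt++] = entries[i].fd;
	}
	return (long)cnt;
}
```

With the three DUMP_REG_FD descriptors from the dry-run output (index 0, 2 and 4, each reporting fd 0), this yields a five-slot table, which matches the array and count passed to io_uring_register in the trace.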
--- samples/bpf/.gitignore | 1 + samples/bpf/Makefile | 8 +- samples/bpf/bpf_cr.bpf.c | 185 +++++++++++ samples/bpf/bpf_cr.c | 688 +++++++++++++++++++++++++++++++++++++++ samples/bpf/bpf_cr.h | 48 +++ samples/bpf/hbm_kern.h | 2 - 6 files changed, 928 insertions(+), 4 deletions(-) create mode 100644 samples/bpf/bpf_cr.bpf.c create mode 100644 samples/bpf/bpf_cr.c create mode 100644 samples/bpf/bpf_cr.h -- 2.34.1 diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore index 0e7bfdbff80a..9c542431ea45 100644 --- a/samples/bpf/.gitignore +++ b/samples/bpf/.gitignore @@ -1,4 +1,5 @@ # SPDX-License-Identifier: GPL-2.0-only +bpf_cr cpustat fds_example hbm diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index a886dff1ba89..a64f2e019bfc 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -53,6 +53,7 @@ tprogs-y += task_fd_query tprogs-y += xdp_sample_pkts tprogs-y += ibumad tprogs-y += hbm +tprogs-y += bpf_cr tprogs-y += xdp_redirect_cpu tprogs-y += xdp_redirect_map_multi @@ -118,6 +119,7 @@ task_fd_query-objs := task_fd_query_user.o $(TRACE_HELPERS) xdp_sample_pkts-objs := xdp_sample_pkts_user.o ibumad-objs := ibumad_user.o hbm-objs := hbm.o $(CGROUP_HELPERS) +bpf_cr-objs := bpf_cr.o xdp_redirect_map_multi-objs := xdp_redirect_map_multi_user.o $(XDP_SAMPLE) xdp_redirect_cpu-objs := xdp_redirect_cpu_user.o $(XDP_SAMPLE) @@ -198,7 +200,7 @@ BPF_EXTRA_CFLAGS += -I$(srctree)/arch/mips/include/asm/mach-generic endif endif -TPROGS_CFLAGS += -Wall -O2 +TPROGS_CFLAGS += -Wall -O2 -g TPROGS_CFLAGS += -Wmissing-prototypes TPROGS_CFLAGS += -Wstrict-prototypes @@ -337,6 +339,7 @@ $(obj)/xdp_redirect_map_multi_user.o: $(obj)/xdp_redirect_map_multi.skel.h $(obj)/xdp_redirect_map_user.o: $(obj)/xdp_redirect_map.skel.h $(obj)/xdp_redirect_user.o: $(obj)/xdp_redirect.skel.h $(obj)/xdp_monitor_user.o: $(obj)/xdp_monitor.skel.h +$(obj)/bpf_cr.o: $(obj)/bpf_cr.skel.h $(obj)/tracex5_kern.o: $(obj)/syscall_nrs.h $(obj)/hbm_out_kern.o: $(src)/hbm.h $(src)/hbm_kern.h @@ 
-392,7 +395,7 @@ $(obj)/%.bpf.o: $(src)/%.bpf.c $(obj)/vmlinux.h $(src)/xdp_sample.bpf.h $(src)/x -I$(LIBBPF_INCLUDE) $(CLANG_SYS_INCLUDES) \ -c $(filter %.bpf.c,$^) -o $@ -LINKED_SKELS := xdp_redirect_cpu.skel.h xdp_redirect_map_multi.skel.h \ +LINKED_SKELS := bpf_cr.skel.h xdp_redirect_cpu.skel.h xdp_redirect_map_multi.skel.h \ xdp_redirect_map.skel.h xdp_redirect.skel.h xdp_monitor.skel.h clean-files += $(LINKED_SKELS) @@ -401,6 +404,7 @@ xdp_redirect_map_multi.skel.h-deps := xdp_redirect_map_multi.bpf.o xdp_sample.bp xdp_redirect_map.skel.h-deps := xdp_redirect_map.bpf.o xdp_sample.bpf.o xdp_redirect.skel.h-deps := xdp_redirect.bpf.o xdp_sample.bpf.o xdp_monitor.skel.h-deps := xdp_monitor.bpf.o xdp_sample.bpf.o +bpf_cr.skel.h-deps := bpf_cr.bpf.o LINKED_BPF_SRCS := $(patsubst %.bpf.o,%.bpf.c,$(foreach skel,$(LINKED_SKELS),$($(skel)-deps))) diff --git a/samples/bpf/bpf_cr.bpf.c b/samples/bpf/bpf_cr.bpf.c new file mode 100644 index 000000000000..6b0bb019f2be --- /dev/null +++ b/samples/bpf/bpf_cr.bpf.c @@ -0,0 +1,185 @@ +// SPDX-License-Identifier: GPL-2.0-only +#include "vmlinux.h" +#include +#include +#include + +#include "bpf_cr.h" + +/* struct file -> int fd */ +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __type(key, __u64); + __type(value, int); + __uint(max_entries, 16); +} fdtable_map SEC(".maps"); + +struct ctx_map_val { + int fd; + bool init; +}; + +/* io_ring_ctx -> int fd */ +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __type(key, __u64); + __type(value, struct ctx_map_val); + __uint(max_entries, 16); +} io_ring_ctx_map SEC(".maps"); + +/* ctx->sq_data -> int fd */ +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __type(key, __u64); + __type(value, int); + __uint(max_entries, 16); +} sq_data_map SEC(".maps"); + +/* eventfd_ctx -> int fd */ +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __type(key, __u64); + __type(value, int); + __uint(max_entries, 16); +} eventfd_ctx_map SEC(".maps"); + +const volatile pid_t tgid = 0; + +extern void eventfd_fops 
__ksym; +extern void io_uring_fops __ksym; + +SEC("iter/task_file") +int dump_task(struct bpf_iter__task_file *ctx) +{ + struct seq_file *seq = ctx->meta->seq; + struct task_struct *task = ctx->task; + struct file *file = ctx->file; + struct ctx_map_val val = {}; + __u64 f_priv; + int fd; + + if (!task) + return 0; + if (task->tgid != tgid) + return 0; + if (!file) + return 0; + + f_priv = (__u64)file->private_data; + fd = ctx->fd; + val.fd = fd; + if (file->f_op == &eventfd_fops) { + bpf_map_update_elem(&eventfd_ctx_map, &f_priv, &fd, 0); + } else if (file->f_op == &io_uring_fops) { + struct io_ring_ctx *ctx; + void *sq_data; + __u64 key; + + bpf_map_update_elem(&io_ring_ctx_map, &f_priv, &val, 0); + ctx = file->private_data; + bpf_probe_read_kernel(&sq_data, sizeof(sq_data), &ctx->sq_data); + key = (__u64)sq_data; + bpf_map_update_elem(&sq_data_map, &key, &fd, BPF_NOEXIST); + } + f_priv = (__u64)file; + bpf_map_update_elem(&fdtable_map, &f_priv, &fd, BPF_NOEXIST); + return 0; +} + +static void dump_io_ring_ctx(struct seq_file *seq, struct io_ring_ctx *ctx, int ring_fd) +{ + struct io_uring_dump dump; + struct ctx_map_val *val; + __u64 key; + int *fd; + + key = (__u64)ctx; + val = bpf_map_lookup_elem(&io_ring_ctx_map, &key); + if (val && val->init) + return; + __builtin_memset(&dump, 0, sizeof(dump)); + if (val) + val->init = true; + dump.type = DUMP_SETUP; + dump.io_uring_fd = ring_fd; + key = (__u64)ctx->sq_data; +#define ATTACH_WQ_FLAG (1 << 5) + if (ctx->flags & ATTACH_WQ_FLAG) { + fd = bpf_map_lookup_elem(&sq_data_map, &key); + if (fd) + dump.desc.setup.wq_fd = *fd; + } + dump.desc.setup.flags = ctx->flags; + dump.desc.setup.sq_entries = ctx->sq_entries; + dump.desc.setup.cq_entries = ctx->cq_entries; + dump.desc.setup.sq_thread_cpu = ctx->sq_data->sq_cpu; + dump.desc.setup.sq_thread_idle = ctx->sq_data->sq_thread_idle; + bpf_seq_write(seq, &dump, sizeof(dump)); + if (ctx->cq_ev_fd) { + dump.type = DUMP_EVENTFD; + key = (__u64)ctx->cq_ev_fd; + fd = 
bpf_map_lookup_elem(&eventfd_ctx_map, &key); + if (fd) + dump.desc.eventfd.eventfd = *fd; + dump.desc.eventfd.async = ctx->eventfd_async; + bpf_seq_write(seq, &dump, sizeof(dump)); + } +} + +SEC("iter/io_uring_buf") +int dump_io_uring_buf(struct bpf_iter__io_uring_buf *ctx) +{ + struct io_mapped_ubuf *ubuf = ctx->ubuf; + struct seq_file *seq = ctx->meta->seq; + struct io_uring_dump dump; + __u64 key; + int *fd; + + __builtin_memset(&dump, 0, sizeof(dump)); + key = (__u64)ctx->ctx; + fd = bpf_map_lookup_elem(&io_ring_ctx_map, &key); + if (!ctx->meta->seq_num) + dump_io_ring_ctx(seq, ctx->ctx, fd ? *fd : 0); + if (!ubuf) + return 0; + dump.type = DUMP_REG_BUF; + if (fd) + dump.io_uring_fd = *fd; + dump.desc.reg_buf.index = ctx->index; + if (ubuf != ctx->ctx->dummy_ubuf) { + dump.desc.reg_buf.addr = ubuf->ubuf; + dump.desc.reg_buf.len = ubuf->ubuf_end - ubuf->ubuf; + } + bpf_seq_write(seq, &dump, sizeof(dump)); + return 0; +} + +SEC("iter/io_uring_file") +int dump_io_uring_file(struct bpf_iter__io_uring_file *ctx) +{ + struct seq_file *seq = ctx->meta->seq; + struct file *file = ctx->file; + struct io_uring_dump dump; + __u64 key; + int *fd; + + __builtin_memset(&dump, 0, sizeof(dump)); + key = (__u64)ctx->ctx; + fd = bpf_map_lookup_elem(&io_ring_ctx_map, &key); + if (!ctx->meta->seq_num) + dump_io_ring_ctx(seq, ctx->ctx, fd ? *fd : 0); + if (!file) + return 0; + dump.type = DUMP_REG_FD; + if (fd) + dump.io_uring_fd = *fd; + dump.desc.reg_fd.index = ctx->index; + key = (__u64)file; + fd = bpf_map_lookup_elem(&fdtable_map, &key); + if (fd) + dump.desc.reg_fd.reg_fd = *fd; + bpf_seq_write(seq, &dump, sizeof(dump)); + return 0; +} + +char _license[] SEC("license") = "GPL"; diff --git a/samples/bpf/bpf_cr.c b/samples/bpf/bpf_cr.c new file mode 100644 index 000000000000..f5e0270af852 --- /dev/null +++ b/samples/bpf/bpf_cr.c @@ -0,0 +1,688 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * BPF C/R + * + * Tool to use BPF iterators to dump process state. 
This currently supports + * dumping io_uring fd state, by taking process PID and fd number pair, then + * dumping to stdout the state as binary struct, which can be passed to the + * tool consuming it, to recreate io_uring. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "bpf_cr.h" +#include "bpf_cr.skel.h" + +/* Approx. 4096/40 */ +#define MAX_DESC 96 +size_t dump_desc_cnt; +size_t reg_fd_cnt; +size_t reg_buf_cnt; +struct io_uring_dump *dump_desc[MAX_DESC]; +int fds[MAX_DESC]; +struct iovec bufs[MAX_DESC]; + +static int sys_pidfd_open(pid_t pid, unsigned int flags) +{ + return syscall(__NR_pidfd_open, pid, flags); +} + +static int sys_pidfd_getfd(int pidfd, int targetfd, unsigned int flags) +{ + return syscall(__NR_pidfd_getfd, pidfd, targetfd, flags); +} + +static int sys_io_uring_setup(uint32_t entries, struct io_uring_params *p) +{ + return syscall(__NR_io_uring_setup, entries, p); +} + +static int sys_io_uring_register(unsigned int fd, unsigned int opcode, + void *arg, unsigned int nr_args) +{ + return syscall(__NR_io_uring_register, fd, opcode, arg, nr_args); +} + +static const char *type2str[__DUMP_MAX] = { + [DUMP_SETUP] = "DUMP_SETUP", + [DUMP_EVENTFD] = "DUMP_EVENTFD", + [DUMP_REG_FD] = "DUMP_REG_FD", + [DUMP_REG_BUF] = "DUMP_REG_BUF", +}; + +static int do_dump_parent(struct bpf_cr *skel, int parent_fd) +{ + DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts); + union bpf_iter_link_info linfo = {}; + int ret = 0, buf_it, file_it; + struct bpf_link *lb, *lf; + char buf[4096]; + + linfo.io_uring.io_uring_fd = parent_fd; + opts.link_info = &linfo; + opts.link_info_len = sizeof(linfo); + + lb = bpf_program__attach_iter(skel->progs.dump_io_uring_buf, &opts); + if (!lb) { + ret = -errno; + fprintf(stderr, "Failed to attach to io_uring_buf: %m\n"); + return ret; + } + + lf = bpf_program__attach_iter(skel->progs.dump_io_uring_file, &opts); + if (!lf) { + ret 
= -errno; + fprintf(stderr, "Failed to attach io_uring_file: %m\n"); + goto end; + } + + buf_it = bpf_iter_create(bpf_link__fd(lb)); + if (buf_it < 0) { + ret = -errno; + fprintf(stderr, "Failed to create io_uring_buf: %m\n"); + goto end_lf; + } + + file_it = bpf_iter_create(bpf_link__fd(lf)); + if (file_it < 0) { + ret = -errno; + fprintf(stderr, "Failed to create io_uring_file: %m\n"); + goto end_buf_it; + } + + ret = read(file_it, buf, sizeof(buf)); + if (ret < 0) { + ret = -errno; + fprintf(stderr, "Failed to read from io_uring_file iterator: %m\n"); + goto end_file_it; + } + + ret = write(STDOUT_FILENO, buf, ret); + if (ret < 0) { + ret = -errno; + fprintf(stderr, "Failed to write to stdout: %m\n"); + goto end_file_it; + } + + ret = read(buf_it, buf, sizeof(buf)); + if (ret < 0) { + ret = -errno; + fprintf(stderr, "Failed to read from io_uring_buf iterator: %m\n"); + goto end_file_it; + } + + ret = write(STDOUT_FILENO, buf, ret); + if (ret < 0) { + ret = -errno; + fprintf(stderr, "Failed to write to stdout: %m\n"); + goto end_file_it; + } + +end_file_it: + close(file_it); +end_buf_it: + close(buf_it); +end_lf: + bpf_link__destroy(lf); +end: + bpf_link__destroy(lb); + return ret; +} + +static int do_dump(pid_t tpid, int tfd) +{ + int pidfd, ret = 0, buf_it, file_it, task_it; + DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts); + union bpf_iter_link_info linfo = {}; + const struct io_uring_dump *d; + struct bpf_cr *skel; + char buf[4096]; + + pidfd = sys_pidfd_open(tpid, 0); + if (pidfd < 0) { + fprintf(stderr, "Failed to open pidfd for PID %d: %m\n", tpid); + return 1; + } + + tfd = sys_pidfd_getfd(pidfd, tfd, 0); + if (tfd < 0) { + fprintf(stderr, "Failed to acquire io_uring fd from PID %d: %m\n", tpid); + ret = 1; + goto end; + } + + skel = bpf_cr__open(); + if (!skel) { + fprintf(stderr, "Failed to open BPF prog: %m\n"); + ret = 1; + goto end_tfd; + } + skel->rodata->tgid = tpid; + + ret = bpf_cr__load(skel); + if (ret < 0) { + fprintf(stderr, "Failed to load 
BPF prog: %m\n"); + ret = 1; + goto end_skel; + } + + skel->links.dump_task = bpf_program__attach_iter(skel->progs.dump_task, NULL); + if (!skel->links.dump_task) { + fprintf(stderr, "Failed to attach task_file iterator: %m\n"); + ret = 1; + goto end_skel; + } + + task_it = bpf_iter_create(bpf_link__fd(skel->links.dump_task)); + if (task_it < 0) { + fprintf(stderr, "Failed to create task_file iterator: %m\n"); + ret = 1; + goto end_skel; + } + + /* Drive task iterator */ + ret = read(task_it, buf, sizeof(buf)); + close(task_it); + if (ret < 0) { + fprintf(stderr, "Failed to read from task_file iterator: %m\n"); + ret = 1; + goto end_skel; + } + + linfo.io_uring.io_uring_fd = tfd; + opts.link_info = &linfo; + opts.link_info_len = sizeof(linfo); + skel->links.dump_io_uring_buf = bpf_program__attach_iter(skel->progs.dump_io_uring_buf, + &opts); + if (!skel->links.dump_io_uring_buf) { + fprintf(stderr, "Failed to attach io_uring_buf iterator: %m\n"); + ret = 1; + goto end_skel; + } + skel->links.dump_io_uring_file = bpf_program__attach_iter(skel->progs.dump_io_uring_file, + &opts); + if (!skel->links.dump_io_uring_file) { + fprintf(stderr, "Failed to attach io_uring_file iterator: %m\n"); + ret = 1; + goto end_skel; + } + + buf_it = bpf_iter_create(bpf_link__fd(skel->links.dump_io_uring_buf)); + if (buf_it < 0) { + fprintf(stderr, "Failed to create io_uring_buf iterator: %m\n"); + ret = 1; + goto end_skel; + } + + file_it = bpf_iter_create(bpf_link__fd(skel->links.dump_io_uring_file)); + if (file_it < 0) { + fprintf(stderr, "Failed to create io_uring_file iterator: %m\n"); + ret = 1; + goto end_buf_it; + } + + ret = read(file_it, buf, sizeof(buf)); + if (ret < 0) { + fprintf(stderr, "Failed to read from io_uring_file iterator: %m\n"); + ret = 1; + goto end_file_it; + } + + /* Check if we have to dump its parent as well, first descriptor will + * always be DUMP_SETUP, if so, recurse and dump it first. 
+ */ + d = (void *)buf; + if (ret >= sizeof(*d) && d->type == DUMP_SETUP && d->desc.setup.wq_fd) { + int r; + + r = sys_pidfd_getfd(pidfd, d->desc.setup.wq_fd, 0); + if (r < 0) { + fprintf(stderr, "Failed to obtain parent io_uring: %m\n"); + ret = 1; + goto end_file_it; + } + r = do_dump_parent(skel, r); + if (r < 0) { + ret = 1; + goto end_file_it; + } + } + + ret = write(STDOUT_FILENO, buf, ret); + if (ret < 0) { + fprintf(stderr, "Failed to write to stdout: %m\n"); + ret = 1; + goto end_file_it; + } + + ret = read(buf_it, buf, sizeof(buf)); + if (ret < 0) { + fprintf(stderr, "Failed to read from io_uring_buf iterator: %m\n"); + ret = 1; + goto end_file_it; + } + + ret = write(STDOUT_FILENO, buf, ret); + if (ret < 0) { + fprintf(stderr, "Failed to write to stdout: %m\n"); + ret = 1; + goto end_file_it; + } + +end_file_it: + close(file_it); +end_buf_it: + close(buf_it); +end_skel: + bpf_cr__destroy(skel); +end_tfd: + close(tfd); +end: + close(pidfd); + return ret; +} + +static int dump_desc_cmp(const void *a, const void *b) +{ + const struct io_uring_dump *da = *(const struct io_uring_dump * const *)a; + const struct io_uring_dump *db = *(const struct io_uring_dump * const *)b; + uint64_t dafd = da->io_uring_fd; + uint64_t dbfd = db->io_uring_fd; + + if (dafd < dbfd) + return -1; + else if (dafd > dbfd) + return 1; + else if (da->type < db->type) + return -1; + else if (da->type > db->type) + return 1; + return 0; +} + +static int do_restore_setup(const struct io_uring_dump *d) +{ + struct io_uring_params p; + int fd, nfd; + + memset(&p, 0, sizeof(p)); + + p.flags = d->desc.setup.flags; + if (p.flags & IORING_SETUP_SQ_AFF) + p.sq_thread_cpu = d->desc.setup.sq_thread_cpu; + if (p.flags & IORING_SETUP_SQPOLL) + p.sq_thread_idle = d->desc.setup.sq_thread_idle; + if (p.flags & IORING_SETUP_ATTACH_WQ) + p.wq_fd = d->desc.setup.wq_fd; + if (p.flags & IORING_SETUP_CQSIZE) + p.cq_entries = d->desc.setup.cq_entries; + + fd = sys_io_uring_setup(d->desc.setup.sq_entries, &p); + if (fd < 0) { + fprintf(stderr, "Failed to restore DUMP_SETUP desc: 
%m\n"); + return -errno; + } + + nfd = dup2(fd, d->io_uring_fd); + if (nfd < 0) { + fprintf(stderr, "Failed to dup io_uring_fd: %m\n"); + close(fd); + return -errno; + } + return 0; +} + +static int do_restore_eventfd(const struct io_uring_dump *d) +{ + int evfd, ret, opcode; + + /* This would require restoring the eventfd first in CRIU, which would + * be found using eventfd_ctx and peeking into struct file guts from + * task_file iterator. Here, we just reopen a normal eventfd and + * register it. The BPF program does have code which does eventfd + * matching to report the fd number. + */ + evfd = eventfd(42, 0); + if (evfd < 0) { + fprintf(stderr, "Failed to open eventfd: %m\n"); + return -errno; + } + + opcode = d->desc.eventfd.async ? IORING_REGISTER_EVENTFD_ASYNC : IORING_REGISTER_EVENTFD; + ret = sys_io_uring_register(d->io_uring_fd, opcode, &evfd, 1); + if (ret < 0) { + ret = -errno; + fprintf(stderr, "Failed to register eventfd: %m\n"); + goto end; + } + + ret = 0; +end: + close(evfd); + return ret; +} + +static void print_desc(const struct io_uring_dump *d) +{ + printf("%s:\n\tio_uring_fd: %d\n\tend: %s\n", + type2str[d->type % __DUMP_MAX], d->io_uring_fd, d->end ? "true" : "false"); + switch (d->type) { + case DUMP_SETUP: + printf("\t\tflags: %u\n\t\tsq_entries: %u\n\t\tcq_entries: %u\n" + "\t\tsq_thread_cpu: %d\n\t\tsq_thread_idle: %d\n\t\twq_fd: %d\n", + d->desc.setup.flags, d->desc.setup.sq_entries, + d->desc.setup.cq_entries, d->desc.setup.sq_thread_cpu, + d->desc.setup.sq_thread_idle, d->desc.setup.wq_fd); + break; + case DUMP_EVENTFD: + printf("\t\teventfd: %d\n\t\tasync: %s\n", + d->desc.eventfd.eventfd, + d->desc.eventfd.async ? 
"true" : "false"); + break; + case DUMP_REG_FD: + printf("\t\treg_fd: %d\n\t\tindex: %lu\n", + d->desc.reg_fd.reg_fd, d->desc.reg_fd.index); + break; + case DUMP_REG_BUF: + printf("\t\taddr: %lu\n\t\tlen: %lu\n\t\tindex: %lu\n", + d->desc.reg_buf.addr, d->desc.reg_buf.len, + d->desc.reg_buf.index); + break; + default: + printf("\t\t{Unknown}\n"); + break; + } +} + +static int do_restore_reg_fd(const struct io_uring_dump *d) +{ + int ret; + + /* In CRIU, we restore the fds to be registered before executing the + * restore action that registers file descriptors to io_uring. + * Our example app would register stdin/stdout/stderr in a sparse + * table, so the test case in the commit works. + */ + if (reg_fd_cnt == MAX_DESC || d->desc.reg_fd.index >= MAX_DESC) { + fprintf(stderr, "Exceeded max fds MAX_DESC (%d)\n", MAX_DESC); + return -EDOM; + } + assert(reg_fd_cnt <= d->desc.reg_fd.index); + /* Fill sparse entries */ + while (reg_fd_cnt < d->desc.reg_fd.index) + fds[reg_fd_cnt++] = -1; + fds[reg_fd_cnt++] = d->desc.reg_fd.reg_fd; + if (d->end) { + ret = sys_io_uring_register(d->io_uring_fd, + IORING_REGISTER_FILES, &fds, + reg_fd_cnt); + if (ret < 0) { + fprintf(stderr, "Failed to register files: %m\n"); + return -errno; + } + } + return 0; +} + +static int do_restore_reg_buf(const struct io_uring_dump *d) +{ + struct iovec *iov; + int ret; + + /* This step in CRIU for buffers with intact source buffers must be + * executed with care. There are primarily three cases (each with corner + * cases excluded for brevity): + * 1. Source VMA is intact ([ubuf->ubuf, ubuf->ubuf_end) is in VMA, base + * page PFN is same) + * 2. Source VMA is split (with multiple pages of ubuf overlaying over + * holes) using munmap(s). + * 3. Source VMA is absent (no VMA or full VMA with incorrect PFN). + * + * PFN remains unique as pages are pinned, hence one with same PFN will + * not be recycled to be part of another mapping by page allocator. 2 + * and 3 required page contents dumping. 
+ * + * VMA with holes (registered before punching holes) also needs partial + * page content dumping to restore without holes, and then punch the + * holes. This can be detected when buffer touches two VMAs with holes, + * and base page PFN matches (split VMA case). + * + * All of this is too complicated to demonstrate here, and is done in + * userspace, hence left out. Future patches will implement the page + * dumping from ubuf iterator part. + * + * In usual cases we might be able to dump page contents from inside + * io_uring that we are dumping, by submitting operations, but we want + * to avoid manipulating the ring while dumping, and opcodes we might + * need for doing that may be restricted, hence preventing dump. + */ + if (reg_buf_cnt == MAX_DESC) { + fprintf(stderr, "Exceeded max buffers MAX_DESC (%d)\n", MAX_DESC); + return -EDOM; + } + assert(d->desc.reg_buf.index == reg_buf_cnt); + iov = &bufs[reg_buf_cnt++]; + iov->iov_base = (void *)d->desc.reg_buf.addr; + iov->iov_len = d->desc.reg_buf.len; + if (d->end) { + if (reg_fd_cnt) { + ret = sys_io_uring_register(d->io_uring_fd, + IORING_REGISTER_FILES, &fds, + reg_fd_cnt); + if (ret < 0) { + fprintf(stderr, "Failed to register files: %m\n"); + return -errno; + } + } + + ret = sys_io_uring_register(d->io_uring_fd, + IORING_REGISTER_BUFFERS, &bufs, + reg_buf_cnt); + if (ret < 0) { + fprintf(stderr, "Failed to register buffers: %m\n"); + return -errno; + } + } + return 0; +} + +static int do_restore_action(const struct io_uring_dump *d, bool dry_run) +{ + int ret; + + print_desc(d); + + if (dry_run) + return 0; + + switch (d->type) { + case DUMP_SETUP: + ret = do_restore_setup(d); + break; + case DUMP_EVENTFD: + ret = do_restore_eventfd(d); + break; + case DUMP_REG_FD: + ret = do_restore_reg_fd(d); + break; + case DUMP_REG_BUF: + ret = do_restore_reg_buf(d); + break; + default: + fprintf(stderr, "Unknown dump descriptor\n"); + return -EDOM; + } + return ret; +} + +static int do_restore(bool dry_run) +{ + 
struct io_uring_dump dump; + int ret, prev_fd = 0; + + while ((ret = read(STDIN_FILENO, &dump, sizeof(dump)))) { + struct io_uring_dump *d; + + if (ret < 0) { + fprintf(stderr, "Failed to read descriptor: %m\n"); + ret = 1; + goto free; + } + + ret = 1; + if (dump_desc_cnt == MAX_DESC) { + fprintf(stderr, "Cannot process more than MAX_DESC (%d) dump descs\n", + MAX_DESC); + goto free; + } + + d = calloc(1, sizeof(*d)); + if (!d) { + fprintf(stderr, "Failed to allocate dump descriptor: %m\n"); + goto free; + } + + *d = dump; + if (!prev_fd) + prev_fd = d->io_uring_fd; + if (prev_fd != d->io_uring_fd) { + dump_desc[dump_desc_cnt - 1]->end = true; + prev_fd = d->io_uring_fd; + } + dump_desc[dump_desc_cnt++] = d; + qsort(dump_desc, dump_desc_cnt, sizeof(dump_desc[0]), dump_desc_cmp); + } + if (dump_desc_cnt) + dump_desc[dump_desc_cnt - 1]->end = true; + + for (size_t i = 0; i < dump_desc_cnt; i++) { + ret = do_restore_action(dump_desc[i], dry_run); + if (ret < 0) { + fprintf(stderr, "Failed to execute restore action\n"); + goto free; + } + } + + if (!dry_run && dump_desc_cnt) + sleep(10000); + else + puts("Nothing to do, exiting..."); + ret = 0; +free: + while (dump_desc_cnt--) + free(dump_desc[dump_desc_cnt]); + return ret; +} + +static int run_app(void) +{ + struct io_uring_params p; + int r, ret, fd, evfd; + + memset(&p, 0, sizeof(p)); + p.flags |= IORING_SETUP_CQSIZE | IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF; + p.sq_thread_idle = 1500; + p.cq_entries = 4; + /* Create a test case with parent io_uring, dependent io_uring, + * registered files, eventfd (async), buffers, etc. 
+	 */
+	fd = sys_io_uring_setup(2, &p);
+	if (fd < 0) {
+		fprintf(stderr, "Failed to create io_uring: %m\n");
+		return 1;
+	}
+
+	r = 1;
+	printf("PID: %d, Parent io_uring: %d, ", getpid(), fd);
+	p.flags |= IORING_SETUP_ATTACH_WQ;
+	p.wq_fd = fd;
+
+	fd = sys_io_uring_setup(2, &p);
+	if (fd < 0) {
+		fprintf(stderr, "\nFailed to create io_uring: %m\n");
+		goto end_wq_fd;
+	}
+
+	printf("Dependent io_uring: %d\n", fd);
+
+	evfd = eventfd(42, 0);
+	if (evfd < 0) {
+		fprintf(stderr, "Failed to create eventfd: %m\n");
+		goto end_fd;
+	}
+
+	ret = sys_io_uring_register(fd, IORING_REGISTER_EVENTFD_ASYNC, &evfd, 1);
+	if (ret < 0) {
+		fprintf(stderr, "Failed to register eventfd (async): %m\n");
+		goto end_evfd;
+	}
+
+	ret = sys_io_uring_register(fd, IORING_REGISTER_FILES, &(int []){0, -1, 1, -1, 2}, 5);
+	if (ret < 0) {
+		fprintf(stderr, "Failed to register files: %m\n");
+		goto end_evfd;
+	}
+
+	/* Register dummy buf as well */
+	ret = sys_io_uring_register(fd, IORING_REGISTER_BUFFERS, &(struct iovec[]){{}, {&p, sizeof(p)}}, 2);
+	if (ret < 0) {
+		fprintf(stderr, "Failed to register buffers: %m\n");
+		goto end_evfd;
+	}
+
+	pause();
+
+	r = 0;
+end_evfd:
+	close(evfd);
+end_fd:
+	close(fd);
+end_wq_fd:
+	close(p.wq_fd);
+	return r;
+}
+
+int main(int argc, char *argv[])
+{
+	if (argc < 2 || argc > 4) {
+usage:
+		fprintf(stderr, "Usage: %s dump PID FD > dump.out\n"
+			"\tcat dump.out | %s restore [--dry-run]\n"
+			"\t%s app\n", argv[0], argv[0], argv[0]);
+		return 1;
+	}
+
+	if (libbpf_set_strict_mode(LIBBPF_STRICT_ALL)) {
+		fprintf(stderr, "Failed to set libbpf strict mode\n");
+		return 1;
+	}
+
+	if (!strcmp(argv[1], "app")) {
+		return run_app();
+	} else if (!strcmp(argv[1], "dump")) {
+		if (argc != 4)
+			goto usage;
+		return do_dump(atoi(argv[2]), atoi(argv[3]));
+	} else if (!strcmp(argv[1], "restore")) {
+		if (argc < 2 || argc > 3)
+			goto usage;
+		if (argc == 3 && strcmp(argv[2], "--dry-run"))
+			goto usage;
+		return do_restore(argc == 3 /* dry_run mode */);
+	}
+	fprintf(stderr, "Unknown argument\n");
+	goto usage;
+}
diff --git a/samples/bpf/bpf_cr.h b/samples/bpf/bpf_cr.h
new file mode 100644
index 000000000000..74d4ca639db5
--- /dev/null
+++ b/samples/bpf/bpf_cr.h
@@ -0,0 +1,48 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#ifndef BPF_CR_H
+#define BPF_CR_H
+
+/* The order of restore actions is in order of declaration for each type,
+ * hence on restore consumed descriptors can be sorted based on their type,
+ * and then each action for the corresponding descriptor can be invoked, to
+ * recreate the io_uring.
+ */
+enum io_uring_state_type {
+	DUMP_SETUP, /* Record setup parameters */
+	DUMP_EVENTFD, /* eventfd registered in io_uring */
+	DUMP_REG_FD, /* fd registered in io_uring */
+	DUMP_REG_BUF, /* buffer registered in io_uring */
+	__DUMP_MAX,
+};
+
+struct io_uring_dump {
+	enum io_uring_state_type type;
+	int32_t io_uring_fd;
+	bool end;
+	union {
+		struct /* DUMP_SETUP */ {
+			uint32_t flags;
+			uint32_t sq_entries;
+			uint32_t cq_entries;
+			int32_t sq_thread_cpu;
+			int32_t sq_thread_idle;
+			uint32_t wq_fd;
+		} setup;
+		struct /* DUMP_EVENTFD */ {
+			uint32_t eventfd;
+			bool async;
+		} eventfd;
+		struct /* DUMP_REG_FD */ {
+			uint32_t reg_fd;
+			uint64_t index;
+		} reg_fd;
+		struct /* DUMP_REG_BUF */ {
+			uint64_t addr;
+			uint64_t len;
+			uint64_t index;
+		} reg_buf;
+	} desc;
+};
+
+#endif
diff --git a/samples/bpf/hbm_kern.h b/samples/bpf/hbm_kern.h
index 722b3fadb467..1752a46a2b05 100644
--- a/samples/bpf/hbm_kern.h
+++ b/samples/bpf/hbm_kern.h
@@ -9,8 +9,6 @@
  * Include file for sample Host Bandwidth Manager (HBM) BPF programs
  */
 #define KBUILD_MODNAME "foo"
-#include
-#include
 #include
 #include
 #include