From patchwork Mon Nov 22 22:53:43 2021
X-Patchwork-Submitter: Kumar Kartikeya Dwivedi
X-Patchwork-Id: 12633067
From: Kumar Kartikeya Dwivedi
To: bpf@vger.kernel.org
Cc: Jens Axboe, Pavel Begunkov, io-uring@vger.kernel.org, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Yonghong Song, Pavel Emelyanov, Alexander Mikhalitsyn, Andrei Vagin, criu@openvz.org, linux-fsdevel@vger.kernel.org
Subject: [PATCH bpf-next v2 01/10] io_uring: Implement eBPF iterator for registered buffers
Date: Tue, 23 Nov 2021 04:23:43 +0530
Message-Id: <20211122225352.618453-2-memxor@gmail.com>
In-Reply-To: <20211122225352.618453-1-memxor@gmail.com>
References: <20211122225352.618453-1-memxor@gmail.com>
X-Mailing-List: linux-fsdevel@vger.kernel.org

This change adds an eBPF iterator for buffers registered in an io_uring ctx. It gives access to the ctx, the index of the registered buffer, and a pointer to the io_mapped_ubuf itself. This allows the iterator to save info related to buffers added to an io_uring instance that isn't easy to export using the fdinfo interface (like the exact struct pages composing the registered buffer).

The primary usecase this enables is checkpoint/restore support.

Note that we need to use mutex_trylock when the file is read from, in the seq_start functions, as the lock ordering is opposite of what it would be when an io_uring operation reads the same file. We take seq_file->lock, then ctx->uring_lock, while io_uring would first take ctx->uring_lock and then seq_file->lock for the same ctx. This can lead to the deadlock scenario described below.

The sequence on CPU 0 is for a normal read(2) on the iterator. For CPU 1, it is an io_uring instance trying to do the same on an iterator attached to itself.
So CPU 0 does

  sys_read
    vfs_read
      bpf_seq_read
        mutex_lock(&seq_file->lock)      # A
        io_uring_buf_seq_start
          mutex_lock(&ctx->uring_lock)   # B

and CPU 1 does

  io_uring_enter
    mutex_lock(&ctx->uring_lock)         # B
    io_read
      bpf_seq_read
        mutex_lock(&seq_file->lock)      # A
        ...

Since the order of locks is opposite, it can deadlock. So we switch the mutex_lock in io_uring_buf_seq_start to trylock, so it can return an error for this case; the read then releases seq_file->lock and CPU 1 can make progress. The trylock also protects the case where io_uring tries to read from an iterator attached to itself (same ctx), where the order of locks would be:

  io_uring_enter
    mutex_lock(&ctx->uring_lock) <------------.
    io_read                                    \
      seq_read                                  \
        mutex_lock(&seq_file->lock)             /
        mutex_lock(&ctx->uring_lock) # deadlock-`

In both these cases (recursive read and contended uring_lock), -EDEADLK is returned to userspace.

In the future, this iterator will be extended to directly support iteration of the bvec flexible array member, so that when there is no corresponding VMA that maps to the registered buffer (e.g. if the VMA is destroyed after pinning pages), we are able to reconstruct the registration on restore by dumping the page contents and then replaying them into a temporary mapping used for registration later. All of this is out of scope for the current series, but builds upon this iterator.
Cc: Jens Axboe
Cc: Pavel Begunkov
Cc: io-uring@vger.kernel.org
Signed-off-by: Kumar Kartikeya Dwivedi
---
 fs/io_uring.c                  | 203 +++++++++++++++++++++++++++++++++
 include/linux/bpf.h            |  12 ++
 include/uapi/linux/bpf.h       |   6 +
 tools/include/uapi/linux/bpf.h |   6 +
 4 files changed, 227 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index b07196b4511c..4f41e9f72b73 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -81,6 +81,7 @@
 #include
 #include
 #include
+#include

 #define CREATE_TRACE_POINTS
 #include
@@ -11125,3 +11126,205 @@ static int __init io_uring_init(void)
 	return 0;
 };
 __initcall(io_uring_init);
+
+#ifdef CONFIG_BPF_SYSCALL
+
+BTF_ID_LIST(btf_io_uring_ids)
+BTF_ID(struct, io_ring_ctx)
+BTF_ID(struct, io_mapped_ubuf)
+
+struct bpf_io_uring_seq_info {
+	struct io_ring_ctx *ctx;
+	u64 index;
+};
+
+static int bpf_io_uring_init_seq(void *priv_data, struct bpf_iter_aux_info *aux)
+{
+	struct bpf_io_uring_seq_info *info = priv_data;
+	struct io_ring_ctx *ctx = aux->io_uring.ctx;
+
+	info->ctx = ctx;
+	return 0;
+}
+
+static int bpf_io_uring_iter_attach(struct bpf_prog *prog,
+				    union bpf_iter_link_info *linfo,
+				    struct bpf_iter_aux_info *aux)
+{
+	struct io_ring_ctx *ctx;
+	struct fd f;
+	int ret;
+
+	f = fdget(linfo->io_uring.io_uring_fd);
+	if (unlikely(!f.file))
+		return -EBADF;
+
+	ret = -EOPNOTSUPP;
+	if (unlikely(f.file->f_op != &io_uring_fops))
+		goto out_fput;
+
+	ret = -ENXIO;
+	ctx = f.file->private_data;
+	if (unlikely(!percpu_ref_tryget(&ctx->refs)))
+		goto out_fput;
+
+	ret = 0;
+	aux->io_uring.ctx = ctx;
+	/* each io_uring file's inode is unique, since it uses
+	 * anon_inode_getfile_secure, which can be used to search
+	 * through files and map link fd back to the io_uring.
+	 */
+	aux->io_uring.inode = f.file->f_inode->i_ino;
+
+out_fput:
+	fdput(f);
+	return ret;
+}
+
+static void bpf_io_uring_iter_detach(struct bpf_iter_aux_info *aux)
+{
+	percpu_ref_put(&aux->io_uring.ctx->refs);
+}
+
+#ifdef CONFIG_PROC_FS
+void bpf_io_uring_iter_show_fdinfo(const struct bpf_iter_aux_info *aux,
+				   struct seq_file *seq)
+{
+	seq_printf(seq, "io_uring_inode:\t%lu\n", aux->io_uring.inode);
+}
+#endif
+
+int bpf_io_uring_iter_fill_link_info(const struct bpf_iter_aux_info *aux,
+				     struct bpf_link_info *info)
+{
+	info->iter.io_uring.inode = aux->io_uring.inode;
+	return 0;
+}
+
+/* io_uring iterator for registered buffers */
+
+struct bpf_iter__io_uring_buf {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	__bpf_md_ptr(struct io_ring_ctx *, ctx);
+	__bpf_md_ptr(struct io_mapped_ubuf *, ubuf);
+	u64 index;
+};
+
+static void *__bpf_io_uring_buf_seq_get_next(struct bpf_io_uring_seq_info *info)
+{
+	if (info->index < info->ctx->nr_user_bufs)
+		return info->ctx->user_bufs[info->index++];
+	return NULL;
+}
+
+static void *bpf_io_uring_buf_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct bpf_io_uring_seq_info *info = seq->private;
+	struct io_mapped_ubuf *ubuf;
+
+	/* Indicate to userspace that the uring lock is contended */
+	if (!mutex_trylock(&info->ctx->uring_lock))
+		return ERR_PTR(-EDEADLK);
+
+	ubuf = __bpf_io_uring_buf_seq_get_next(info);
+	if (!ubuf)
+		return NULL;
+
+	if (*pos == 0)
+		++*pos;
+	return ubuf;
+}
+
+static void *bpf_io_uring_buf_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct bpf_io_uring_seq_info *info = seq->private;
+
+	++*pos;
+	return __bpf_io_uring_buf_seq_get_next(info);
+}
+
+DEFINE_BPF_ITER_FUNC(io_uring_buf, struct bpf_iter_meta *meta,
+		     struct io_ring_ctx *ctx, struct io_mapped_ubuf *ubuf,
+		     u64 index)
+
+static int __bpf_io_uring_buf_seq_show(struct seq_file *seq, void *v, bool in_stop)
+{
+	struct bpf_io_uring_seq_info *info = seq->private;
+	struct bpf_iter__io_uring_buf ctx;
+	struct bpf_iter_meta meta;
+	struct bpf_prog *prog;
+
+	meta.seq = seq;
+	prog = bpf_iter_get_info(&meta, in_stop);
+	if (!prog)
+		return 0;
+
+	ctx.meta = &meta;
+	ctx.ctx = info->ctx;
+	ctx.ubuf = v;
+	ctx.index = info->index ? info->index - !in_stop : 0;
+
+	return bpf_iter_run_prog(prog, &ctx);
+}
+
+static int bpf_io_uring_buf_seq_show(struct seq_file *seq, void *v)
+{
+	return __bpf_io_uring_buf_seq_show(seq, v, false);
+}
+
+static void bpf_io_uring_buf_seq_stop(struct seq_file *seq, void *v)
+{
+	struct bpf_io_uring_seq_info *info = seq->private;
+
+	/* If IS_ERR(v) is true, then ctx->uring_lock wasn't taken */
+	if (IS_ERR(v))
+		return;
+	if (!v)
+		__bpf_io_uring_buf_seq_show(seq, v, true);
+	else if (info->index) /* restart from index */
+		info->index--;
+	mutex_unlock(&info->ctx->uring_lock);
+}
+
+static const struct seq_operations bpf_io_uring_buf_seq_ops = {
+	.start = bpf_io_uring_buf_seq_start,
+	.next = bpf_io_uring_buf_seq_next,
+	.stop = bpf_io_uring_buf_seq_stop,
+	.show = bpf_io_uring_buf_seq_show,
+};
+
+static const struct bpf_iter_seq_info bpf_io_uring_buf_seq_info = {
+	.seq_ops = &bpf_io_uring_buf_seq_ops,
+	.init_seq_private = bpf_io_uring_init_seq,
+	.fini_seq_private = NULL,
+	.seq_priv_size = sizeof(struct bpf_io_uring_seq_info),
+};
+
+static struct bpf_iter_reg io_uring_buf_reg_info = {
+	.target = "io_uring_buf",
+	.feature = BPF_ITER_RESCHED,
+	.attach_target = bpf_io_uring_iter_attach,
+	.detach_target = bpf_io_uring_iter_detach,
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo = bpf_io_uring_iter_show_fdinfo,
+#endif
+	.fill_link_info = bpf_io_uring_iter_fill_link_info,
+	.ctx_arg_info_size = 2,
+	.ctx_arg_info = {
+		{ offsetof(struct bpf_iter__io_uring_buf, ctx),
+		  PTR_TO_BTF_ID },
+		{ offsetof(struct bpf_iter__io_uring_buf, ubuf),
+		  PTR_TO_BTF_ID_OR_NULL },
+	},
+	.seq_info = &bpf_io_uring_buf_seq_info,
};
+
+static int __init io_uring_iter_init(void)
+{
+	io_uring_buf_reg_info.ctx_arg_info[0].btf_id = btf_io_uring_ids[0];
+	io_uring_buf_reg_info.ctx_arg_info[1].btf_id = btf_io_uring_ids[1];
+	return bpf_iter_reg_target(&io_uring_buf_reg_info);
+}
+late_initcall(io_uring_iter_init);
+
+#endif
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index cc7a0c36e7df..967842881024 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1509,8 +1509,20 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
 	extern int bpf_iter_ ## target(args);		\
 	int __init bpf_iter_ ## target(args) { return 0; }

+struct io_ring_ctx;
+
 struct bpf_iter_aux_info {
+	/* Map member must not alias any other members, due to the check in
+	 * bpf_trace.c:__get_seq_info, since in case of map the seq_ops for
+	 * iterator is different from others. The seq_ops is not from main
+	 * iter registration but from map_ops. Nullability of 'map' allows
+	 * to skip this check for non-map iterator cheaply.
+	 */
 	struct bpf_map *map;
+	struct {
+		struct io_ring_ctx *ctx;
+		ino_t inode;
+	} io_uring;
 };

 typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a69e4b04ffeb..1ad1ae85743c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -91,6 +91,9 @@ union bpf_iter_link_info {
 	struct {
 		__u32 map_fd;
 	} map;
+	struct {
+		__u32 io_uring_fd;
+	} io_uring;
 };

 /* BPF syscall commands, see bpf(2) man-page for more details. */
@@ -5720,6 +5723,9 @@ struct bpf_link_info {
 			struct {
 				__u32 map_id;
 			} map;
+			struct {
+				__u64 inode;
+			} io_uring;
 		} iter;
 		struct {
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index a69e4b04ffeb..1ad1ae85743c 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -91,6 +91,9 @@ union bpf_iter_link_info {
 	struct {
 		__u32 map_fd;
 	} map;
+	struct {
+		__u32 io_uring_fd;
+	} io_uring;
 };

 /* BPF syscall commands, see bpf(2) man-page for more details. */
@@ -5720,6 +5723,9 @@ struct bpf_link_info {
 			struct {
 				__u32 map_id;
 			} map;
+			struct {
+				__u64 inode;
+			} io_uring;
 		} iter;
 		struct {

From patchwork Mon Nov 22 22:53:44 2021
X-Patchwork-Submitter: Kumar Kartikeya Dwivedi
X-Patchwork-Id: 12633069
From: Kumar Kartikeya Dwivedi
To: bpf@vger.kernel.org
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Yonghong Song, Pavel Emelyanov, Alexander Mikhalitsyn, Andrei Vagin, criu@openvz.org, io-uring@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: [PATCH bpf-next v2 02/10] bpf: Add bpf_page_to_pfn helper
Date: Tue, 23 Nov 2021 04:23:44 +0530
Message-Id: <20211122225352.618453-3-memxor@gmail.com>
In-Reply-To: <20211122225352.618453-1-memxor@gmail.com>
References: <20211122225352.618453-1-memxor@gmail.com>
In CRIU, we need to be able to determine whether the page pinned by io_uring is still present in the same range in the process VMA. /proc//pagemap gives us the PFN, hence using this helper we can establish this mapping easily from the iterator side.

It is a simple wrapper over the in-kernel page_to_pfn macro, and ensures the passed in pointer is a struct page PTR_TO_BTF_ID. This is obtained from the bvec of io_mapped_ubuf for the CRIU usecase.
Signed-off-by: Kumar Kartikeya Dwivedi
---
 include/linux/bpf.h            |  1 +
 include/uapi/linux/bpf.h       |  9 +++++++++
 kernel/trace/bpf_trace.c       | 19 +++++++++++++++++++
 scripts/bpf_doc.py             |  2 ++
 tools/include/uapi/linux/bpf.h |  9 +++++++++
 5 files changed, 40 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 967842881024..e44503158d76 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2176,6 +2176,7 @@ extern const struct bpf_func_proto bpf_sk_setsockopt_proto;
 extern const struct bpf_func_proto bpf_sk_getsockopt_proto;
 extern const struct bpf_func_proto bpf_kallsyms_lookup_name_proto;
 extern const struct bpf_func_proto bpf_find_vma_proto;
+extern const struct bpf_func_proto bpf_page_to_pfn_proto;

 const struct bpf_func_proto *tracing_prog_func_proto(
   enum bpf_func_id func_id, const struct bpf_prog *prog);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1ad1ae85743c..885d9293c147 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4960,6 +4960,14 @@ union bpf_attr {
 *		**-ENOENT** if *task->mm* is NULL, or no vma contains *addr*.
 *		**-EBUSY** if failed to try lock mmap_lock.
 *		**-EINVAL** for invalid **flags**.
+ *
+ * long bpf_page_to_pfn(struct page *page)
+ *	Description
+ *		Obtain the page frame number (PFN) for the given *struct page*
+ *		pointer.
+ *	Return
+ *		Page Frame Number corresponding to the page pointed to by the
+ *		*struct page* pointer, or U64_MAX if pointer is NULL.
 */
 #define __BPF_FUNC_MAPPER(FN)		\
	FN(unspec),			\
@@ -5143,6 +5151,7 @@ union bpf_attr {
	FN(skc_to_unix_sock),		\
	FN(kallsyms_lookup_name),	\
	FN(find_vma),			\
+	FN(page_to_pfn),		\
	/* */

 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 25ea521fb8f1..2a6488f14e58 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1091,6 +1091,23 @@ static const struct bpf_func_proto bpf_get_branch_snapshot_proto = {
 	.arg2_type = ARG_CONST_SIZE_OR_ZERO,
 };

+BPF_CALL_1(bpf_page_to_pfn, struct page *, page)
+{
+	/* PTR_TO_BTF_ID can be NULL */
+	if (!page)
+		return U64_MAX;
+	return page_to_pfn(page);
+}
+
+BTF_ID_LIST_SINGLE(btf_page_to_pfn_ids, struct, page)
+
+const struct bpf_func_proto bpf_page_to_pfn_proto = {
+	.func = bpf_page_to_pfn,
+	.ret_type = RET_INTEGER,
+	.arg1_type = ARG_PTR_TO_BTF_ID,
+	.arg1_btf_id = &btf_page_to_pfn_ids[0],
+};
+
 static const struct bpf_func_proto *
 bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -1212,6 +1229,8 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_find_vma_proto;
 	case BPF_FUNC_trace_vprintk:
 		return bpf_get_trace_vprintk_proto();
+	case BPF_FUNC_page_to_pfn:
+		return &bpf_page_to_pfn_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
index a6403ddf5de7..ae68ca794980 100755
--- a/scripts/bpf_doc.py
+++ b/scripts/bpf_doc.py
@@ -549,6 +549,7 @@ class PrinterHelpers(Printer):
	    'struct socket',
	    'struct file',
	    'struct bpf_timer',
+	    'struct page',
	]
	known_types = {
	    '...',
@@ -598,6 +599,7 @@ class PrinterHelpers(Printer):
	    'struct socket',
	    'struct file',
	    'struct bpf_timer',
+	    'struct page',
	}
	mapped_types = {
	    'u8': '__u8',
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 1ad1ae85743c..885d9293c147 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4960,6 +4960,14 @@ union bpf_attr {
 *		**-ENOENT** if *task->mm* is NULL, or no vma contains *addr*.
 *		**-EBUSY** if failed to try lock mmap_lock.
 *		**-EINVAL** for invalid **flags**.
+ *
+ * long bpf_page_to_pfn(struct page *page)
+ *	Description
+ *		Obtain the page frame number (PFN) for the given *struct page*
+ *		pointer.
+ *	Return
+ *		Page Frame Number corresponding to the page pointed to by the
+ *		*struct page* pointer, or U64_MAX if pointer is NULL.
 */
 #define __BPF_FUNC_MAPPER(FN)		\
	FN(unspec),			\
@@ -5143,6 +5151,7 @@ union bpf_attr {
	FN(skc_to_unix_sock),		\
	FN(kallsyms_lookup_name),	\
	FN(find_vma),			\
+	FN(page_to_pfn),		\
	/* */

 /* integer value in 'imm' field of BPF_CALL instruction selects which helper

From patchwork Mon Nov 22 22:53:45 2021
X-Patchwork-Submitter: Kumar Kartikeya Dwivedi
X-Patchwork-Id: 12633071
From: Kumar Kartikeya Dwivedi
To: bpf@vger.kernel.org
Cc: Jens Axboe, Pavel Begunkov, io-uring@vger.kernel.org, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Yonghong Song, Pavel Emelyanov, Alexander Mikhalitsyn, Andrei Vagin, criu@openvz.org, linux-fsdevel@vger.kernel.org
Subject: [PATCH bpf-next v2 03/10] io_uring: Implement eBPF iterator for registered files
Date: Tue, 23 Nov 2021 04:23:45 +0530
Message-Id: <20211122225352.618453-4-memxor@gmail.com>
In-Reply-To: <20211122225352.618453-1-memxor@gmail.com>
References: <20211122225352.618453-1-memxor@gmail.com>

This change adds an eBPF iterator for files registered in an io_uring ctx. It gives access to the ctx, the index of the registered file, and a pointer to the struct file itself. This allows the iterator to save info related to files added to an io_uring instance that isn't easy to export using the fdinfo interface (like being able to match registered files to a task's file set).
Getting access to the underlying struct file allows deduplication and efficient pairing with the task file set (obtained using the task_file iterator). The primary usecase this enables is checkpoint/restore support.

Note that we need to use mutex_trylock when the file is read from, in the seq_start functions, as the lock ordering is opposite of what it would be when an io_uring operation reads the same file. We take seq_file->lock, then ctx->uring_lock, while io_uring would first take ctx->uring_lock and then seq_file->lock for the same ctx. This can lead to the deadlock scenario described below.

The sequence on CPU 0 is for a normal read(2) on the iterator. For CPU 1, it is an io_uring instance trying to do the same on an iterator attached to itself.

So CPU 0 does

  sys_read
    vfs_read
      bpf_seq_read
        mutex_lock(&seq_file->lock)      # A
        io_uring_buf_seq_start
          mutex_lock(&ctx->uring_lock)   # B

and CPU 1 does

  io_uring_enter
    mutex_lock(&ctx->uring_lock)         # B
    io_read
      bpf_seq_read
        mutex_lock(&seq_file->lock)      # A
        ...

Since the order of locks is opposite, it can deadlock. So we switch the mutex_lock in io_uring_buf_seq_start to trylock, so it can return an error for this case; the read then releases seq_file->lock and CPU 1 can make progress. The trylock also protects the case where io_uring tries to read from an iterator attached to itself (same ctx), where the order of locks would be:

  io_uring_enter
    mutex_lock(&ctx->uring_lock) <------------.
    io_read                                    \
      seq_read                                  \
        mutex_lock(&seq_file->lock)             /
        mutex_lock(&ctx->uring_lock) # deadlock-`

In both these cases (recursive read and contended uring_lock), -EDEADLK is returned to userspace.

With the advent of descriptorless files supported by io_uring, this iterator provides the required visibility and introspection of the io_uring instance for the purposes of dumping and restoring it.
In the future, this iterator will be extended to support direct inspection of a lot of file state (currently descriptorless files are obtained using openat2 and socket) to dump file state for these hidden files. Later, we can explore filling in the gaps for dumping file state for more file types (those not hidden in the io_uring ctx). All of this is out of scope for the current series, but builds upon this iterator.

Cc: Jens Axboe
Cc: Pavel Begunkov
Cc: io-uring@vger.kernel.org
Signed-off-by: Kumar Kartikeya Dwivedi
---
 fs/io_uring.c | 144 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 143 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 4f41e9f72b73..19f95456b580 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -11132,6 +11132,7 @@ __initcall(io_uring_init);
 BTF_ID_LIST(btf_io_uring_ids)
 BTF_ID(struct, io_ring_ctx)
 BTF_ID(struct, io_mapped_ubuf)
+BTF_ID(struct, file)

 struct bpf_io_uring_seq_info {
 	struct io_ring_ctx *ctx;
@@ -11319,11 +11320,152 @@ static struct bpf_iter_reg io_uring_buf_reg_info = {
 	.seq_info = &bpf_io_uring_buf_seq_info,
 };

+/* io_uring iterator for registered files */
+
+struct bpf_iter__io_uring_file {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	__bpf_md_ptr(struct io_ring_ctx *, ctx);
+	__bpf_md_ptr(struct file *, file);
+	u64 index;
+};
+
+static void *__bpf_io_uring_file_seq_get_next(struct bpf_io_uring_seq_info *info)
+{
+	struct file *file = NULL;
+
+	if (info->index < info->ctx->nr_user_files) {
+		/* file set can be sparse */
+		file = io_file_from_index(info->ctx, info->index++);
+		/* use info as a distinct pointer to distinguish between empty
+		 * slot and valid file, since we cannot return NULL for this
+		 * case if we want iter prog to still be invoked with file ==
+		 * NULL.
+ */ + if (!file) + return info; + } + + return file; +} + +static void *bpf_io_uring_file_seq_start(struct seq_file *seq, loff_t *pos) +{ + struct bpf_io_uring_seq_info *info = seq->private; + struct file *file; + + /* Indicate to userspace that the uring lock is contended */ + if (!mutex_trylock(&info->ctx->uring_lock)) + return ERR_PTR(-EDEADLK); + + file = __bpf_io_uring_file_seq_get_next(info); + if (!file) + return NULL; + + if (*pos == 0) + ++*pos; + return file; +} + +static void *bpf_io_uring_file_seq_next(struct seq_file *seq, void *v, loff_t *pos) +{ + struct bpf_io_uring_seq_info *info = seq->private; + + ++*pos; + return __bpf_io_uring_file_seq_get_next(info); +} + +DEFINE_BPF_ITER_FUNC(io_uring_file, struct bpf_iter_meta *meta, + struct io_ring_ctx *ctx, struct file *file, + u64 index) + +static int __bpf_io_uring_file_seq_show(struct seq_file *seq, void *v, bool in_stop) +{ + struct bpf_io_uring_seq_info *info = seq->private; + struct bpf_iter__io_uring_file ctx; + struct bpf_iter_meta meta; + struct bpf_prog *prog; + + meta.seq = seq; + prog = bpf_iter_get_info(&meta, in_stop); + if (!prog) + return 0; + + ctx.meta = &meta; + ctx.ctx = info->ctx; + /* when we encounter empty slot, v will point to info */ + ctx.file = v == info ? NULL : v; + ctx.index = info->index ? 
info->index - !in_stop : 0; + + return bpf_iter_run_prog(prog, &ctx); +} + +static int bpf_io_uring_file_seq_show(struct seq_file *seq, void *v) +{ + return __bpf_io_uring_file_seq_show(seq, v, false); +} + +static void bpf_io_uring_file_seq_stop(struct seq_file *seq, void *v) +{ + struct bpf_io_uring_seq_info *info = seq->private; + + /* If IS_ERR(v) is true, then ctx->uring_lock wasn't taken */ + if (IS_ERR(v)) + return; + if (!v) + __bpf_io_uring_file_seq_show(seq, v, true); + else if (info->index) /* restart from index */ + info->index--; + mutex_unlock(&info->ctx->uring_lock); +} + +static const struct seq_operations bpf_io_uring_file_seq_ops = { + .start = bpf_io_uring_file_seq_start, + .next = bpf_io_uring_file_seq_next, + .stop = bpf_io_uring_file_seq_stop, + .show = bpf_io_uring_file_seq_show, +}; + +static const struct bpf_iter_seq_info bpf_io_uring_file_seq_info = { + .seq_ops = &bpf_io_uring_file_seq_ops, + .init_seq_private = bpf_io_uring_init_seq, + .fini_seq_private = NULL, + .seq_priv_size = sizeof(struct bpf_io_uring_seq_info), +}; + +static struct bpf_iter_reg io_uring_file_reg_info = { + .target = "io_uring_file", + .feature = BPF_ITER_RESCHED, + .attach_target = bpf_io_uring_iter_attach, + .detach_target = bpf_io_uring_iter_detach, +#ifdef CONFIG_PROC_FS + .show_fdinfo = bpf_io_uring_iter_show_fdinfo, +#endif + .fill_link_info = bpf_io_uring_iter_fill_link_info, + .ctx_arg_info_size = 2, + .ctx_arg_info = { + { offsetof(struct bpf_iter__io_uring_file, ctx), + PTR_TO_BTF_ID }, + { offsetof(struct bpf_iter__io_uring_file, file), + PTR_TO_BTF_ID_OR_NULL }, + }, + .seq_info = &bpf_io_uring_file_seq_info, +}; + static int __init io_uring_iter_init(void) { + int ret; + io_uring_buf_reg_info.ctx_arg_info[0].btf_id = btf_io_uring_ids[0]; io_uring_buf_reg_info.ctx_arg_info[1].btf_id = btf_io_uring_ids[1]; - return bpf_iter_reg_target(&io_uring_buf_reg_info); + io_uring_file_reg_info.ctx_arg_info[0].btf_id = btf_io_uring_ids[0]; + 
io_uring_file_reg_info.ctx_arg_info[1].btf_id = btf_io_uring_ids[2]; + ret = bpf_iter_reg_target(&io_uring_buf_reg_info); + if (ret) + return ret; + ret = bpf_iter_reg_target(&io_uring_file_reg_info); + if (ret) + bpf_iter_unreg_target(&io_uring_buf_reg_info); + return ret; } late_initcall(io_uring_iter_init); From patchwork Mon Nov 22 22:53:46 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kumar Kartikeya Dwivedi X-Patchwork-Id: 12633073 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2A502C433FE for ; Mon, 22 Nov 2021 22:54:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229986AbhKVW5W (ORCPT ); Mon, 22 Nov 2021 17:57:22 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:48690 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230525AbhKVW5O (ORCPT ); Mon, 22 Nov 2021 17:57:14 -0500 Received: from mail-pg1-x541.google.com (mail-pg1-x541.google.com [IPv6:2607:f8b0:4864:20::541]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 41F60C06173E; Mon, 22 Nov 2021 14:54:07 -0800 (PST) Received: by mail-pg1-x541.google.com with SMTP id m15so16514055pgu.11; Mon, 22 Nov 2021 14:54:07 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=y4TCjzqBDqBEHwnZMmRYUEm8G073qVjeTZ9q479fd5w=; b=dBK/SvQ8JjfF6gtcfnxltyrU482tkTjiBNht3pAnunjE4JWkijCRh3hbX9fEGdd90L 8fgaNHzGrF2Qkw3ohFsnFz+V/2UcLz3K5LaDnqs6a56hNU48bmiQHR2slseII0zzT5hR FeYz+AsSKldm5eM1OGsFOCWi1iQ6l5aUhA+VYMEbQaCpObrymNYFzvvLlFAySBlvRoY8 zXzkDbk7k44XqBC48qZKX8CSEOOrGxf5HT36l+jyCYXU1xCM2HOpego1CltG1jOS0A2c 
From: Kumar Kartikeya Dwivedi
To: bpf@vger.kernel.org
Cc: Alexander Viro , linux-fsdevel@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Yonghong Song , Pavel Emelyanov , Alexander Mikhalitsyn , Andrei Vagin , criu@openvz.org, io-uring@vger.kernel.org
Subject: [PATCH bpf-next v2 04/10] epoll: Implement eBPF iterator for registered items
Date: Tue, 23 Nov 2021 04:23:46 +0530
Message-Id: <20211122225352.618453-5-memxor@gmail.com>
In-Reply-To: <20211122225352.618453-1-memxor@gmail.com>
References: <20211122225352.618453-1-memxor@gmail.com>
This patch adds an eBPF iterator for epoll items (epitems) registered in an epoll instance. It gives access to the eventpoll ctx and the registered epoll item (struct epitem). This allows the iterator to inspect the registered file and to use other iterators to associate it with a task's fdtable. The primary use case this enables is expediting existing eventpoll checkpoint/restore support in the CRIU project. This iterator allows us to switch from a worst case O(n^2) algorithm to a single O(n) pass over task and epoll registered descriptors.

We also make sure we're iterating over a live file, one that is not going away. The case we're concerned about is a file that has its f_count as zero, but is waiting for the iterator's bpf_seq_read to release ep->mtx, so that it can remove its epitem.
Since such a file will disappear once iteration is done, and it is being destructed, we use get_file_rcu to ensure it is alive when invoking the BPF program. Getting access to a file that is going to disappear after iteration is not useful anyway. This does have a performance overhead, however, since a file reference is raised and dropped for each file. The rcu_read_lock around get_file_rcu isn't strictly required for lifetime management, since the fput path is serialized on ep->mtx to call ep_remove; hence the epi->ffd.file pointer remains stable during our seq_start/seq_stop bracketing.

To be able to continue from the position we were iterating at, we store epi->ffd.fd and use ep_find_tfd to find the target file again. It would be more appropriate to use both the struct file pointer and the fd number to find the last file, but see below for why that cannot be done.

Taking a reference to the struct file and walking the RB-tree to find it again would lead to a reference cycle if the iterator, after a partial read, takes a reference to a socket which is later used in creating a descriptor cycle using SCM_RIGHTS. An example that was encountered when working on this is mentioned below.

Let there be Unix sockets SK1 and SK2, epoll fd EP, and epoll iterator ITER. Let SK1 be registered in EP; then, on a partial read, it is possible that ITER returns from read and takes a reference to SK1 to be able to find it later in the RB-tree and continue the iteration. If SK1 sends ITER over to SK2 using SCM_RIGHTS, and SK2 sends its own fd (SK2) over to SK1 using SCM_RIGHTS, and both fds are not consumed on the corresponding receive ends, a cycle is created. When all of SK1, SK2, EP, and ITER are closed, SK1's receive queue holds a reference to SK2, and SK2's receive queue holds a reference to ITER, which holds a reference to SK1. All file descriptors except EP leak. To resolve it, we would need to hook into the Unix socket GC mechanism, but the alternative of using ep_find_tfd is much simpler.
The finding of the last position in the face of concurrent modification of the epoll set is at best an approximation anyway. For the case of CRIU, the epoll set remains stable.

Cc: Alexander Viro
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Kumar Kartikeya Dwivedi
---
 fs/eventpoll.c                 | 196 ++++++++++++++++++++++++++++++++-
 include/linux/bpf.h            |  11 +-
 include/uapi/linux/bpf.h       |   3 +
 tools/include/uapi/linux/bpf.h |   3 +
 4 files changed, 208 insertions(+), 5 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 06f4c5ae1451..aa21628b6307 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -37,6 +37,7 @@
 #include
 #include
 #include
+#include
 #include
 
 /*
@@ -985,7 +986,6 @@ static struct epitem *ep_find(struct eventpoll *ep, struct file *file, int fd)
 	return epir;
 }
 
-#ifdef CONFIG_KCMP
 static struct epitem *ep_find_tfd(struct eventpoll *ep, int tfd, unsigned long toff)
 {
 	struct rb_node *rbp;
@@ -1005,6 +1005,7 @@ static struct epitem *ep_find_tfd(struct eventpoll *ep, int tfd, unsigned long t
 	return NULL;
 }
 
+#ifdef CONFIG_KCMP
 struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd,
 				     unsigned long toff)
 {
@@ -2385,3 +2386,196 @@ static int __init eventpoll_init(void)
 	return 0;
 }
 fs_initcall(eventpoll_init);
+
+#ifdef CONFIG_BPF_SYSCALL
+
+BTF_ID_LIST(btf_epoll_ids)
+BTF_ID(struct, eventpoll)
+BTF_ID(struct, epitem)
+
+struct bpf_epoll_iter_seq_info {
+	struct eventpoll *ep;
+	struct rb_node *rbp;
+	int tfd;
+};
+
+static int bpf_epoll_init_seq(void *priv_data, struct bpf_iter_aux_info *aux)
+{
+	struct bpf_epoll_iter_seq_info *info = priv_data;
+
+	info->ep = aux->ep->private_data;
+	info->tfd = -1;
+	return 0;
+}
+
+static int bpf_epoll_iter_attach(struct bpf_prog *prog,
+				 union bpf_iter_link_info *linfo,
+				 struct bpf_iter_aux_info *aux)
+{
+	struct file *file;
+	int ret;
+
+	file = fget(linfo->epoll.epoll_fd);
+	if (!file)
+		return -EBADF;
+
+	ret = -EOPNOTSUPP;
+	if (unlikely(!is_file_epoll(file)))
+		goto out_fput;
+
+	aux->ep = file;
+	return 0;
+out_fput:
+	fput(file);
+	return ret;
+}
+
+static void bpf_epoll_iter_detach(struct bpf_iter_aux_info *aux)
+{
+	fput(aux->ep);
+}
+
+struct bpf_iter__epoll {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	__bpf_md_ptr(struct eventpoll *, ep);
+	__bpf_md_ptr(struct epitem *, epi);
+};
+
+static void *bpf_epoll_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct bpf_epoll_iter_seq_info *info = seq->private;
+	struct epitem *epi;
+
+	mutex_lock(&info->ep->mtx);
+	/* already iterated? */
+	if (info->tfd == -2)
+		return NULL;
+	/* partially iterated */
+	if (info->tfd >= 0) {
+		epi = ep_find_tfd(info->ep, info->tfd, 0);
+		if (!epi)
+			return NULL;
+		info->rbp = &epi->rbn;
+		return epi;
+	}
+	WARN_ON(info->tfd != -1);
+	/* first iteration */
+	info->rbp = rb_first_cached(&info->ep->rbr);
+	if (!info->rbp)
+		return NULL;
+	if (*pos == 0)
+		++*pos;
+	return rb_entry(info->rbp, struct epitem, rbn);
+}
+
+static void *bpf_epoll_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct bpf_epoll_iter_seq_info *info = seq->private;
+
+	++*pos;
+	info->rbp = rb_next(info->rbp);
+	return info->rbp ? rb_entry(info->rbp, struct epitem, rbn) : NULL;
+}
+
+DEFINE_BPF_ITER_FUNC(epoll, struct bpf_iter_meta *meta, struct eventpoll *ep,
+		     struct epitem *epi)
+
+static int __bpf_epoll_seq_show(struct seq_file *seq, void *v, bool in_stop)
+{
+	struct bpf_epoll_iter_seq_info *info = seq->private;
+	struct bpf_iter__epoll ctx;
+	struct bpf_iter_meta meta;
+	struct bpf_prog *prog;
+	int ret;
+
+	meta.seq = seq;
+	prog = bpf_iter_get_info(&meta, in_stop);
+	if (!prog)
+		return 0;
+
+	ctx.meta = &meta;
+	ctx.ep = info->ep;
+	ctx.epi = v;
+	if (ctx.epi) {
+		/* The file we are going to pass to prog may already have its f_count as
+		 * 0, hence before invoking the prog, we always try to get the reference
+		 * if it isn't zero, failing which we skip the file. This is usually the
+		 * case for files that are closed before calling EPOLL_CTL_DEL for them,
+		 * which would wait for us to release ep->mtx before doing ep_remove.
+		 */
+		rcu_read_lock();
+		ret = get_file_rcu(ctx.epi->ffd.file);
+		rcu_read_unlock();
+		if (!ret)
+			return 0;
+	}
+	ret = bpf_iter_run_prog(prog, &ctx);
+	/* fput queues work asynchronously, so in our case, either task_work
+	 * for non-exiting task, and otherwise delayed_fput, so holding
+	 * ep->mtx and calling fput (which will take the same lock) in
+	 * this context will not deadlock us, in case f_count is 1 at this
+	 * point.
+	 */
+	if (ctx.epi)
+		fput(ctx.epi->ffd.file);
+	return ret;
+}
+
+static int bpf_epoll_seq_show(struct seq_file *seq, void *v)
+{
+	return __bpf_epoll_seq_show(seq, v, false);
+}
+
+static void bpf_epoll_seq_stop(struct seq_file *seq, void *v)
+{
+	struct bpf_epoll_iter_seq_info *info = seq->private;
+	struct epitem *epi;
+
+	if (!v) {
+		__bpf_epoll_seq_show(seq, v, true);
+		/* done iterating */
+		info->tfd = -2;
+	} else {
+		epi = rb_entry(info->rbp, struct epitem, rbn);
+		info->tfd = epi->ffd.fd;
+	}
+	mutex_unlock(&info->ep->mtx);
+}
+
+static const struct seq_operations bpf_epoll_seq_ops = {
+	.start = bpf_epoll_seq_start,
+	.next = bpf_epoll_seq_next,
+	.stop = bpf_epoll_seq_stop,
+	.show = bpf_epoll_seq_show,
+};
+
+static const struct bpf_iter_seq_info bpf_epoll_seq_info = {
+	.seq_ops = &bpf_epoll_seq_ops,
+	.init_seq_private = bpf_epoll_init_seq,
+	.seq_priv_size = sizeof(struct bpf_epoll_iter_seq_info),
+};
+
+static struct bpf_iter_reg epoll_reg_info = {
+	.target = "epoll",
+	.feature = BPF_ITER_RESCHED,
+	.attach_target = bpf_epoll_iter_attach,
+	.detach_target = bpf_epoll_iter_detach,
+	.ctx_arg_info_size = 2,
+	.ctx_arg_info = {
+		{ offsetof(struct bpf_iter__epoll, ep),
+		  PTR_TO_BTF_ID },
+		{ offsetof(struct bpf_iter__epoll, epi),
+		  PTR_TO_BTF_ID_OR_NULL },
+	},
+	.seq_info = &bpf_epoll_seq_info,
+};
+
+static int __init epoll_iter_init(void)
+{
+	epoll_reg_info.ctx_arg_info[0].btf_id = btf_epoll_ids[0];
+	epoll_reg_info.ctx_arg_info[1].btf_id = btf_epoll_ids[1];
+	return bpf_iter_reg_target(&epoll_reg_info);
+}
+late_initcall(epoll_iter_init);
+
+#endif
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e44503158d76..d7e3e9c59b68 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1519,10 +1519,13 @@ struct bpf_iter_aux_info {
 	 * to skip this check for non-map iterator cheaply.
 	 */
 	struct bpf_map *map;
-	struct {
-		struct io_ring_ctx *ctx;
-		ino_t inode;
-	} io_uring;
+	union {
+		struct {
+			struct io_ring_ctx *ctx;
+			ino_t inode;
+		} io_uring;
+		struct file *ep;
+	};
 };
 
 typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 885d9293c147..b82b11d72520 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -94,6 +94,9 @@ union bpf_iter_link_info {
 	struct {
 		__u32 io_uring_fd;
 	} io_uring;
+	struct {
+		__u32 epoll_fd;
+	} epoll;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for more details. */
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 885d9293c147..b82b11d72520 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -94,6 +94,9 @@ union bpf_iter_link_info {
 	struct {
 		__u32 io_uring_fd;
 	} io_uring;
+	struct {
+		__u32 epoll_fd;
+	} epoll;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for more details.
*/

From patchwork Mon Nov 22 22:53:47 2021
From: Kumar Kartikeya Dwivedi
To: bpf@vger.kernel.org
Cc: Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Yonghong Song , Pavel Emelyanov , Alexander Mikhalitsyn , Andrei Vagin , criu@openvz.org, io-uring@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: [PATCH bpf-next v2 05/10] bpftool: Output io_uring iterator info
Date: Tue, 23 Nov 2021 04:23:47 +0530
Message-Id: <20211122225352.618453-6-memxor@gmail.com>
In-Reply-To: <20211122225352.618453-1-memxor@gmail.com>
References: <20211122225352.618453-1-memxor@gmail.com>
Output the sole field related to the io_uring iterator (the inode of the attached io_uring) so that it can be useful in informational and also debugging cases (trying to find the actual io_uring fd attached to the iterator).

Output:

89: iter  prog 262  target_name io_uring_file  io_uring_inode 16764
	pids test_progs(384)

[
    {
        "id": 123,
        "type": "iter",
        "prog_id": 463,
        "target_name": "io_uring_buf",
        "io_uring_inode": 16871,
        "pids": [
            {
                "pid": 443,
                "comm": "test_progs"
            }
        ]
    }
]
[
    {
        "id": 126,
        "type": "iter",
        "prog_id": 483,
        "target_name": "io_uring_file",
        "io_uring_inode": 16887,
        "pids": [
            {
                "pid": 448,
                "comm": "test_progs"
            }
        ]
    }
]

Signed-off-by: Kumar Kartikeya Dwivedi
---
 tools/bpf/bpftool/link.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/tools/bpf/bpftool/link.c b/tools/bpf/bpftool/link.c
index 2c258db0d352..409ae861b839 100644
--- a/tools/bpf/bpftool/link.c
+++ b/tools/bpf/bpftool/link.c
@@ -86,6 +86,12 @@ static bool is_iter_map_target(const char *target_name)
 	       strcmp(target_name, "bpf_sk_storage_map") == 0;
 }
 
+static bool is_iter_io_uring_target(const char *target_name)
+{
+	return strcmp(target_name, "io_uring_file") == 0 ||
+	       strcmp(target_name, "io_uring_buf") == 0;
+}
+
 static void show_iter_json(struct bpf_link_info *info, json_writer_t *wtr)
 {
 	const char *target_name = u64_to_ptr(info->iter.target_name);
@@ -94,6 +100,8 @@ static void show_iter_json(struct bpf_link_info *info, json_writer_t *wtr)
 
 	if (is_iter_map_target(target_name))
 		jsonw_uint_field(wtr, "map_id", info->iter.map.map_id);
+	else if (is_iter_io_uring_target(target_name))
+		jsonw_uint_field(wtr, "io_uring_inode", info->iter.io_uring.inode);
 }
 
 static int get_prog_info(int prog_id, struct bpf_prog_info *info)
@@ -204,6 +212,8 @@ static void show_iter_plain(struct bpf_link_info *info)
 
 	if (is_iter_map_target(target_name))
 		printf("map_id %u ", info->iter.map.map_id);
+	else if (is_iter_io_uring_target(target_name))
+		printf("io_uring_inode %llu ", info->iter.io_uring.inode);
 }
 
 static int show_link_close_plain(int fd, struct bpf_link_info *info)

From patchwork Mon Nov 22 22:53:48 2021
From: Kumar Kartikeya Dwivedi
To: bpf@vger.kernel.org
Cc: Jens Axboe , Pavel Begunkov , io-uring@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Yonghong Song , Pavel Emelyanov , Alexander Mikhalitsyn , Andrei Vagin , criu@openvz.org,
 linux-fsdevel@vger.kernel.org
Subject: [PATCH bpf-next v2 06/10] selftests/bpf: Add test for io_uring BPF iterators
Date: Tue, 23 Nov 2021 04:23:48 +0530
Message-Id: <20211122225352.618453-7-memxor@gmail.com>
In-Reply-To: <20211122225352.618453-1-memxor@gmail.com>
References: <20211122225352.618453-1-memxor@gmail.com>

This exercises the io_uring_buf and io_uring_file iterators, and tests sparse file sets as well.
Cc: Jens Axboe
Cc: Pavel Begunkov
Cc: io-uring@vger.kernel.org
Signed-off-by: Kumar Kartikeya Dwivedi
---
 .../selftests/bpf/prog_tests/bpf_iter.c       | 251 ++++++++++++++++++
 .../selftests/bpf/progs/bpf_iter_io_uring.c   |  50 ++++
 2 files changed, 301 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_io_uring.c

diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
index 3e10abce3e5a..d21121d62458 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
@@ -1,6 +1,10 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright (c) 2020 Facebook */
+#include
+#include
 #include
+#include
+
 #include "bpf_iter_ipv6_route.skel.h"
 #include "bpf_iter_netlink.skel.h"
 #include "bpf_iter_bpf_map.skel.h"
@@ -26,6 +30,7 @@
 #include "bpf_iter_bpf_sk_storage_map.skel.h"
 #include "bpf_iter_test_kern5.skel.h"
 #include "bpf_iter_test_kern6.skel.h"
+#include "bpf_iter_io_uring.skel.h"
 
 static int duration;
 
@@ -1239,6 +1244,248 @@ static void test_task_vma(void)
 	bpf_iter_task_vma__destroy(skel);
 }
 
+static int sys_io_uring_setup(u32 entries, struct io_uring_params *p)
+{
+	return syscall(__NR_io_uring_setup, entries, p);
+}
+
+static int io_uring_register_bufs(int io_uring_fd, struct iovec *iovs, unsigned int nr)
+{
+	return syscall(__NR_io_uring_register, io_uring_fd,
+		       IORING_REGISTER_BUFFERS, iovs, nr);
+}
+
+static int io_uring_register_files(int io_uring_fd, int *fds, unsigned int nr)
+{
+	return syscall(__NR_io_uring_register, io_uring_fd,
+		       IORING_REGISTER_FILES, fds, nr);
+}
+
+static unsigned long long page_addr_to_pfn(unsigned long addr)
+{
+	int page_size = sysconf(_SC_PAGE_SIZE), fd, ret;
+	unsigned long long pfn;
+
+	if (page_size < 0)
+		return 0;
+	fd = open("/proc/self/pagemap", O_RDONLY);
+	if (fd < 0)
+		return 0;
+
+	ret = pread(fd, &pfn, sizeof(pfn), (addr / page_size) * 8);
+	close(fd);
+	if (ret < 0)
+		return 0;
+	/* Bits 0-54 have PFN for non-swapped page */
+	return pfn & 0x7fffffffffffff;
+}
+
+static int io_uring_inode_match(int link_fd, int io_uring_fd)
+{
+	struct bpf_link_info linfo = {};
+	__u32 info_len = sizeof(linfo);
+	struct stat st;
+	int ret;
+
+	ret = fstat(io_uring_fd, &st);
+	if (ret < 0)
+		return -errno;
+
+	ret = bpf_obj_get_info_by_fd(link_fd, &linfo, &info_len);
+	if (ret < 0)
+		return -errno;
+
+	ASSERT_EQ(st.st_ino, linfo.iter.io_uring.inode, "io_uring inode matches");
+	return 0;
+}
+
+void test_io_uring_buf(void)
+{
+	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+	char rbuf[4096], buf[4096] = "B\n";
+	union bpf_iter_link_info linfo;
+	struct bpf_iter_io_uring *skel;
+	int ret, fd, i, len = 128;
+	struct io_uring_params p;
+	struct iovec iovs[8];
+	int iter_fd;
+	char *str;
+
+	opts.link_info = &linfo;
+	opts.link_info_len = sizeof(linfo);
+
+	skel = bpf_iter_io_uring__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "bpf_iter_io_uring__open_and_load"))
+		return;
+
+	for (i = 0; i < ARRAY_SIZE(iovs); i++) {
+		iovs[i].iov_len = len;
+		iovs[i].iov_base = mmap(NULL, len, PROT_READ | PROT_WRITE,
+					MAP_ANONYMOUS | MAP_SHARED, -1, 0);
+		if (iovs[i].iov_base == MAP_FAILED)
+			goto end;
+		len *= 2;
+	}
+
+	memset(&p, 0, sizeof(p));
+	fd = sys_io_uring_setup(1, &p);
+	if (!ASSERT_GE(fd, 0, "io_uring_setup"))
+		goto end;
+
+	linfo.io_uring.io_uring_fd = fd;
+	skel->links.dump_io_uring_buf = bpf_program__attach_iter(skel->progs.dump_io_uring_buf,
+								 &opts);
+	if (!ASSERT_OK_PTR(skel->links.dump_io_uring_buf, "bpf_program__attach_iter"))
+		goto end_close_fd;
+
+	if (!ASSERT_OK(io_uring_inode_match(bpf_link__fd(skel->links.dump_io_uring_buf), fd), "inode match"))
+		goto end_close_fd;
+
+	ret = io_uring_register_bufs(fd, iovs, ARRAY_SIZE(iovs));
+	if (!ASSERT_OK(ret, "io_uring_register_bufs"))
+		goto end_close_fd;
+
+	/* "B\n" */
+	len = 2;
+	str = buf + len;
+	for (int j = 0; j < ARRAY_SIZE(iovs); j++) {
+		ret = snprintf(str, sizeof(buf) - len, "%d:0x%lx:%zu\n", j,
+			       (unsigned long)iovs[j].iov_base,
+			       iovs[j].iov_len);
+		if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf"))
+			goto end_close_fd;
+		len += ret;
+		str += ret;
+
+		ret = snprintf(str, sizeof(buf) - len, "`-PFN for bvec[0]=%llu\n",
+			       page_addr_to_pfn((unsigned long)iovs[j].iov_base));
+		if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf"))
+			goto end_close_fd;
+		len += ret;
+		str += ret;
+	}
+
+	ret = snprintf(str, sizeof(buf) - len, "E:%zu\n", ARRAY_SIZE(iovs));
+	if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf"))
+		goto end_close_fd;
+
+	iter_fd = bpf_iter_create(bpf_link__fd(skel->links.dump_io_uring_buf));
+	if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create"))
+		goto end_close_fd;
+
+	ret = read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf));
+	if (!ASSERT_GT(ret, 0, "read_fd_into_buffer"))
+		goto end_close_iter;
+
+	if (!ASSERT_OK(strcmp(rbuf, buf), "compare iterator output")) {
+		puts("=== Expected Output ===");
+		printf("%s", buf);
+		puts("==== Actual Output ====");
+		printf("%s", rbuf);
+		puts("=======================");
+	}
+end_close_iter:
+	close(iter_fd);
+end_close_fd:
+	close(fd);
+end:
+	while (i--)
+		munmap(iovs[i].iov_base, iovs[i].iov_len);
+	bpf_iter_io_uring__destroy(skel);
+}
+
+void test_io_uring_file(void)
+{
+	int reg_files[] = { [0 ... 7] = -1 };
+	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+	char buf[4096] = "B\n", rbuf[4096] = {}, *str;
+	union bpf_iter_link_info linfo = {};
+	struct bpf_iter_io_uring *skel;
+	int iter_fd, fd, len = 0, ret;
+	struct io_uring_params p;
+
+	opts.link_info = &linfo;
+	opts.link_info_len = sizeof(linfo);
+
+	skel = bpf_iter_io_uring__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "bpf_iter_io_uring__open_and_load"))
+		return;
+
+	/* "B\n" */
+	len = 2;
+	str = buf + len;
+	ret = snprintf(str, sizeof(buf) - len, "B\n");
+	for (int i = 0; i < ARRAY_SIZE(reg_files); i++) {
+		char templ[] = "/tmp/io_uringXXXXXX";
+		const char *name, *def = "";
+
+		/* create sparse set */
+		if (i & 1) {
+			name = def;
+		} else {
+			reg_files[i] = mkstemp(templ);
+			if (!ASSERT_GE(reg_files[i], 0, templ))
+				goto end_close_reg_files;
+			name = templ;
+			ASSERT_OK(unlink(name), "unlink");
+		}
+		ret = snprintf(str, sizeof(buf) - len, "%d:%s%s\n", i, name, name != def ? " (deleted)" : "");
+		if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf"))
+			goto end_close_reg_files;
+		len += ret;
+		str += ret;
+	}
+
+	ret = snprintf(str, sizeof(buf) - len, "E:%zu\n", ARRAY_SIZE(reg_files));
+	if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf"))
+		goto end_close_reg_files;
+
+	memset(&p, 0, sizeof(p));
+	fd = sys_io_uring_setup(1, &p);
+	if (!ASSERT_GE(fd, 0, "io_uring_setup"))
+		goto end_close_reg_files;
+
+	linfo.io_uring.io_uring_fd = fd;
+	skel->links.dump_io_uring_file = bpf_program__attach_iter(skel->progs.dump_io_uring_file,
+								  &opts);
+	if (!ASSERT_OK_PTR(skel->links.dump_io_uring_file, "bpf_program__attach_iter"))
+		goto end_close_fd;
+
+	if (!ASSERT_OK(io_uring_inode_match(bpf_link__fd(skel->links.dump_io_uring_file), fd), "inode match"))
+		goto end_close_fd;
+
+	iter_fd = bpf_iter_create(bpf_link__fd(skel->links.dump_io_uring_file));
+	if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create"))
+		goto end;
+
+	ret = io_uring_register_files(fd,
reg_files, ARRAY_SIZE(reg_files)); + if (!ASSERT_OK(ret, "io_uring_register_files")) + goto end_iter_fd; + + ret = read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf)); + if (!ASSERT_GT(ret, 0, "read_fd_into_buffer(iterator_fd, buf)")) + goto end_iter_fd; + + if (!ASSERT_OK(strcmp(rbuf, buf), "compare iterator output")) { + puts("=== Expected Output ==="); + printf("%s", buf); + puts("==== Actual Output ===="); + printf("%s", rbuf); + puts("======================="); + } +end_iter_fd: + close(iter_fd); +end_close_fd: + close(fd); +end_close_reg_files: + for (int i = 0; i < ARRAY_SIZE(reg_files); i++) { + if (reg_files[i] != -1) + close(reg_files[i]); + } +end: + bpf_iter_io_uring__destroy(skel); +} + void test_bpf_iter(void) { if (test__start_subtest("btf_id_or_null")) @@ -1299,4 +1546,8 @@ void test_bpf_iter(void) test_rdonly_buf_out_of_bound(); if (test__start_subtest("buf-neg-offset")) test_buf_neg_offset(); + if (test__start_subtest("io_uring_buf")) + test_io_uring_buf(); + if (test__start_subtest("io_uring_file")) + test_io_uring_file(); } diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_io_uring.c b/tools/testing/selftests/bpf/progs/bpf_iter_io_uring.c new file mode 100644 index 000000000000..caf8bd0bf8d4 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/bpf_iter_io_uring.c @@ -0,0 +1,50 @@ +// SPDX-License-Identifier: GPL-2.0 +#include "bpf_iter.h" +#include + +SEC("iter/io_uring_buf") +int dump_io_uring_buf(struct bpf_iter__io_uring_buf *ctx) +{ + struct io_mapped_ubuf *ubuf = ctx->ubuf; + struct seq_file *seq = ctx->meta->seq; + unsigned int index = ctx->index; + + if (!ctx->meta->seq_num) + BPF_SEQ_PRINTF(seq, "B\n"); + + if (ubuf) { + BPF_SEQ_PRINTF(seq, "%u:0x%lx:%lu\n", index, (unsigned long)ubuf->ubuf, + (unsigned long)ubuf->ubuf_end - ubuf->ubuf); + BPF_SEQ_PRINTF(seq, "`-PFN for bvec[0]=%lu\n", + (unsigned long)bpf_page_to_pfn(ubuf->bvec[0].bv_page)); + } else { + BPF_SEQ_PRINTF(seq, "E:%u\n", index); + } + return 0; +} + 
+SEC("iter/io_uring_file")
+int dump_io_uring_file(struct bpf_iter__io_uring_file *ctx)
+{
+	struct seq_file *seq = ctx->meta->seq;
+	unsigned int index = ctx->index;
+	struct file *file = ctx->file;
+	char buf[256] = "";
+
+	if (!ctx->meta->seq_num)
+		BPF_SEQ_PRINTF(seq, "B\n");
+	/* for io_uring_file iterator, this is the terminating condition */
+	if (ctx->ctx->nr_user_files == index) {
+		BPF_SEQ_PRINTF(seq, "E:%u\n", index);
+		return 0;
+	}
+	if (file) {
+		bpf_d_path(&file->f_path, buf, sizeof(buf));
+		BPF_SEQ_PRINTF(seq, "%u:%s\n", index, buf);
+	} else {
+		BPF_SEQ_PRINTF(seq, "%u:\n", index);
+	}
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";

From patchwork Mon Nov 22 22:53:49 2021
X-Patchwork-Submitter: Kumar Kartikeya Dwivedi
X-Patchwork-Id: 12633079
From: Kumar Kartikeya Dwivedi
To: bpf@vger.kernel.org
Cc: Alexander Viro , linux-fsdevel@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Yonghong Song , Pavel Emelyanov , Alexander Mikhalitsyn , Andrei Vagin , criu@openvz.org, io-uring@vger.kernel.org
Subject: [PATCH bpf-next v2 07/10] selftests/bpf: Add test for epoll BPF iterator
Date: Tue, 23 Nov 2021 04:23:49 +0530
Message-Id: <20211122225352.618453-8-memxor@gmail.com>
In-Reply-To: <20211122225352.618453-1-memxor@gmail.com>

This tests the epoll iterator, including peeking into the epitem to inspect the registered file and fd number, and verifying that in userspace.
Cc: Alexander Viro Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Kumar Kartikeya Dwivedi --- .../selftests/bpf/prog_tests/bpf_iter.c | 121 ++++++++++++++++++ .../selftests/bpf/progs/bpf_iter_epoll.c | 33 +++++ 2 files changed, 154 insertions(+) create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_epoll.c diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c index d21121d62458..7fb995deb22d 100644 --- a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c +++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c @@ -2,6 +2,7 @@ /* Copyright (c) 2020 Facebook */ #include #include +#include #include #include @@ -31,6 +32,7 @@ #include "bpf_iter_test_kern5.skel.h" #include "bpf_iter_test_kern6.skel.h" #include "bpf_iter_io_uring.skel.h" +#include "bpf_iter_epoll.skel.h" static int duration; @@ -1486,6 +1488,123 @@ void test_io_uring_file(void) bpf_iter_io_uring__destroy(skel); } +void test_epoll(void) +{ + const char *fmt = "B\npipe:%d\nsocket:%d\npipe:%d\nsocket:%d\nE\n"; + DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts); + char buf[4096] = {}, rbuf[4096] = {}; + union bpf_iter_link_info linfo; + int fds[2], sk[2], epfd, ret; + struct bpf_iter_epoll *skel; + struct epoll_event ev = {}; + int iter_fd, set[4]; + char *s, *t; + + opts.link_info = &linfo; + opts.link_info_len = sizeof(linfo); + + skel = bpf_iter_epoll__open_and_load(); + if (!ASSERT_OK_PTR(skel, "bpf_iter_epoll__open_and_load")) + return; + + epfd = epoll_create1(EPOLL_CLOEXEC); + if (!ASSERT_GE(epfd, 0, "epoll_create1")) + goto end; + + ret = pipe(fds); + if (!ASSERT_OK(ret, "pipe(fds)")) + goto end_epfd; + + ret = socketpair(AF_UNIX, SOCK_STREAM, 0, sk); + if (!ASSERT_OK(ret, "socketpair")) + goto end_pipe; + + ev.events = EPOLLIN; + + ret = epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev); + if (!ASSERT_OK(ret, "epoll_ctl")) + goto end_sk; + + ret = epoll_ctl(epfd, EPOLL_CTL_ADD, sk[0], &ev); + if (!ASSERT_OK(ret, "epoll_ctl")) + goto 
end_sk; + + ret = epoll_ctl(epfd, EPOLL_CTL_ADD, fds[1], &ev); + if (!ASSERT_OK(ret, "epoll_ctl")) + goto end_sk; + + ret = epoll_ctl(epfd, EPOLL_CTL_ADD, sk[1], &ev); + if (!ASSERT_OK(ret, "epoll_ctl")) + goto end_sk; + + linfo.epoll.epoll_fd = epfd; + skel->links.dump_epoll = bpf_program__attach_iter(skel->progs.dump_epoll, &opts); + if (!ASSERT_OK_PTR(skel->links.dump_epoll, "bpf_program__attach_iter")) + goto end_sk; + + iter_fd = bpf_iter_create(bpf_link__fd(skel->links.dump_epoll)); + if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create")) + goto end_sk; + + ret = epoll_ctl(epfd, EPOLL_CTL_ADD, iter_fd, &ev); + if (!ASSERT_EQ(ret, -1, "epoll_ctl add for iter_fd")) + goto end_iter_fd; + + ret = snprintf(buf, sizeof(buf), fmt, fds[0], sk[0], fds[1], sk[1]); + if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf), "snprintf")) + goto end_iter_fd; + + ret = read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf)); + if (!ASSERT_GT(ret, 0, "read_fd_into_buffer")) + goto end_iter_fd; + + puts("=== Expected Output ==="); + printf("%s", buf); + puts("==== Actual Output ===="); + printf("%s", rbuf); + puts("======================="); + + s = rbuf; + while ((s = strtok_r(s, "\n", &t))) { + int fd = -1; + + if (s[0] == 'B' || s[0] == 'E') + goto next; + ASSERT_EQ(sscanf(s, s[0] == 'p' ? 
"pipe:%d" : "socket:%d", &fd), 1, s); + if (fd == fds[0]) { + ASSERT_NEQ(set[0], 1, "pipe[0]"); + set[0] = 1; + } else if (fd == fds[1]) { + ASSERT_NEQ(set[1], 1, "pipe[1]"); + set[1] = 1; + } else if (fd == sk[0]) { + ASSERT_NEQ(set[2], 1, "sk[0]"); + set[2] = 1; + } else if (fd == sk[1]) { + ASSERT_NEQ(set[3], 1, "sk[1]"); + set[3] = 1; + } else { + ASSERT_TRUE(0, "Incorrect fd in iterator output"); + } +next: + s = NULL; + } + for (int i = 0; i < ARRAY_SIZE(set); i++) + ASSERT_EQ(set[i], 1, "fd found"); +end_iter_fd: + close(iter_fd); +end_sk: + close(sk[1]); + close(sk[0]); +end_pipe: + close(fds[1]); + close(fds[0]); +end_epfd: + close(epfd); +end: + bpf_iter_epoll__destroy(skel); +} + void test_bpf_iter(void) { if (test__start_subtest("btf_id_or_null")) @@ -1550,4 +1669,6 @@ void test_bpf_iter(void) test_io_uring_buf(); if (test__start_subtest("io_uring_file")) test_io_uring_file(); + if (test__start_subtest("epoll")) + test_epoll(); } diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_epoll.c b/tools/testing/selftests/bpf/progs/bpf_iter_epoll.c new file mode 100644 index 000000000000..0afc74d154a1 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/bpf_iter_epoll.c @@ -0,0 +1,33 @@ +// SPDX-License-Identifier: GPL-2.0 +#include "bpf_iter.h" +#include + +extern void pipefifo_fops __ksym; + +SEC("iter/epoll") +int dump_epoll(struct bpf_iter__epoll *ctx) +{ + struct seq_file *seq = ctx->meta->seq; + struct epitem *epi = ctx->epi; + char sstr[] = "socket"; + char pstr[] = "pipe"; + + if (!ctx->meta->seq_num) { + BPF_SEQ_PRINTF(seq, "B\n"); + } + if (epi) { + struct file *f = epi->ffd.file; + char *str; + + if (f->f_op == &pipefifo_fops) + str = pstr; + else + str = sstr; + BPF_SEQ_PRINTF(seq, "%s:%d\n", str, epi->ffd.fd); + } else { + BPF_SEQ_PRINTF(seq, "E\n"); + } + return 0; +} + +char _license[] SEC("license") = "GPL"; From patchwork Mon Nov 22 22:53:50 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit 
X-Patchwork-Submitter: Kumar Kartikeya Dwivedi
X-Patchwork-Id: 12633081
From: Kumar Kartikeya Dwivedi
To: bpf@vger.kernel.org
Cc: Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Yonghong Song , Pavel Emelyanov , Alexander Mikhalitsyn , Andrei Vagin , criu@openvz.org, io-uring@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: [PATCH bpf-next v2 08/10] selftests/bpf: Test partial reads for io_uring, epoll iterators
Date: Tue, 23 Nov 2021 04:23:50 +0530
Message-Id: <20211122225352.618453-9-memxor@gmail.com>
In-Reply-To: <20211122225352.618453-1-memxor@gmail.com>

Ensure that the output is consistent in the face of partial reads that return to userspace and then resume again later. To this end, we do reads in 1-byte chunks, which is a bit stupid in real life, but works well to simulate interrupted iteration. This also tests the case where the seq_file buffer is consumed (after seq_printf) on an interrupted read before the iterator invokes the BPF prog again.

Signed-off-by: Kumar Kartikeya Dwivedi
---
 .../selftests/bpf/prog_tests/bpf_iter.c | 33 ++++++++++++-------
 1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
index 7fb995deb22d..c7343a3f5155 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
@@ -73,13 +73,13 @@ static void do_dummy_read(struct bpf_program *prog)
 	bpf_link__destroy(link);
 }
 
-static int read_fd_into_buffer(int fd, char *buf, int size)
+static int __read_fd_into_buffer(int fd, char *buf, int size, size_t chunks)
 {
 	int bufleft = size;
 	int len;
 
 	do {
-		len = read(fd, buf, bufleft);
+		len = read(fd, buf, chunks ?: bufleft);
 		if (len > 0) {
 			buf += len;
 			bufleft -= len;
@@ -89,6 +89,11 @@ static int read_fd_into_buffer(int fd, char *buf, int size)
 	return len < 0 ?
len : size - bufleft; } +static int read_fd_into_buffer(int fd, char *buf, int size) +{ + return __read_fd_into_buffer(fd, buf, size, 0); +} + static void test_ipv6_route(void) { struct bpf_iter_ipv6_route *skel; @@ -1301,7 +1306,7 @@ static int io_uring_inode_match(int link_fd, int io_uring_fd) return 0; } -void test_io_uring_buf(void) +void test_io_uring_buf(bool partial) { DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts); char rbuf[4096], buf[4096] = "B\n"; @@ -1375,7 +1380,7 @@ void test_io_uring_buf(void) if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create")) goto end_close_fd; - ret = read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf)); + ret = __read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf), partial); if (!ASSERT_GT(ret, 0, "read_fd_into_buffer")) goto end_close_iter; @@ -1396,7 +1401,7 @@ void test_io_uring_buf(void) bpf_iter_io_uring__destroy(skel); } -void test_io_uring_file(void) +void test_io_uring_file(bool partial) { int reg_files[] = { [0 ... 7] = -1 }; DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts); @@ -1464,7 +1469,7 @@ void test_io_uring_file(void) if (!ASSERT_OK(ret, "io_uring_register_files")) goto end_iter_fd; - ret = read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf)); + ret = __read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf), partial); if (!ASSERT_GT(ret, 0, "read_fd_into_buffer(iterator_fd, buf)")) goto end_iter_fd; @@ -1488,7 +1493,7 @@ void test_io_uring_file(void) bpf_iter_io_uring__destroy(skel); } -void test_epoll(void) +void test_epoll(bool partial) { const char *fmt = "B\npipe:%d\nsocket:%d\npipe:%d\nsocket:%d\nE\n"; DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts); @@ -1554,7 +1559,7 @@ void test_epoll(void) if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf), "snprintf")) goto end_iter_fd; - ret = read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf)); + ret = __read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf), partial); if (!ASSERT_GT(ret, 0, "read_fd_into_buffer")) goto end_iter_fd; @@ -1666,9 +1671,15 @@ void test_bpf_iter(void) if 
(test__start_subtest("buf-neg-offset"))
 		test_buf_neg_offset();
 	if (test__start_subtest("io_uring_buf"))
-		test_io_uring_buf();
+		test_io_uring_buf(false);
 	if (test__start_subtest("io_uring_file"))
-		test_io_uring_file();
+		test_io_uring_file(false);
 	if (test__start_subtest("epoll"))
-		test_epoll();
+		test_epoll(false);
+	if (test__start_subtest("io_uring_buf-partial"))
+		test_io_uring_buf(true);
+	if (test__start_subtest("io_uring_file-partial"))
+		test_io_uring_file(true);
+	if (test__start_subtest("epoll-partial"))
+		test_epoll(true);
 }

From patchwork Mon Nov 22 22:53:51 2021
X-Patchwork-Submitter: Kumar Kartikeya Dwivedi
X-Patchwork-Id: 12633083
From: Kumar Kartikeya Dwivedi
To: bpf@vger.kernel.org
Cc: Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Yonghong Song , Pavel Emelyanov , Alexander Mikhalitsyn , Andrei Vagin , criu@openvz.org, io-uring@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: [PATCH bpf-next v2 09/10] selftests/bpf: Fix btf_dump test for bpf_iter_link_info
Date: Tue, 23 Nov 2021 04:23:51 +0530
Message-Id: <20211122225352.618453-10-memxor@gmail.com>
In-Reply-To: <20211122225352.618453-1-memxor@gmail.com>

Since we changed the definition while adding io_uring and epoll iterator support, adjust the selftest to check against the updated definition.
Signed-off-by: Kumar Kartikeya Dwivedi --- tools/testing/selftests/bpf/prog_tests/btf_dump.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/bpf/prog_tests/btf_dump.c b/tools/testing/selftests/bpf/prog_tests/btf_dump.c index d6272013a5a3..a2fc006e074a 100644 --- a/tools/testing/selftests/bpf/prog_tests/btf_dump.c +++ b/tools/testing/selftests/bpf/prog_tests/btf_dump.c @@ -736,7 +736,9 @@ static void test_btf_dump_struct_data(struct btf *btf, struct btf_dump *d, /* union with nested struct */ TEST_BTF_DUMP_DATA(btf, d, "union", str, union bpf_iter_link_info, BTF_F_COMPACT, - "(union bpf_iter_link_info){.map = (struct){.map_fd = (__u32)1,},}", + "(union bpf_iter_link_info){.map = (struct){.map_fd = (__u32)1,}," + ".io_uring = (struct){.io_uring_fd = (__u32)1,}," + ".epoll = (struct){.epoll_fd = (__u32)1,},}", { .map = { .map_fd = 1 }}); /* struct skb with nested structs/unions; because type output is so From patchwork Mon Nov 22 22:53:52 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kumar Kartikeya Dwivedi X-Patchwork-Id: 12633085 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1E3A5C433EF for ; Mon, 22 Nov 2021 22:54:45 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231341AbhKVW5r (ORCPT ); Mon, 22 Nov 2021 17:57:47 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:48774 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229672AbhKVW5d (ORCPT ); Mon, 22 Nov 2021 17:57:33 -0500 Received: from mail-pl1-x644.google.com (mail-pl1-x644.google.com [IPv6:2607:f8b0:4864:20::644]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 239F7C061574; Mon, 22 Nov 2021 14:54:26 -0800 (PST) 
Received: by mail-pl1-x644.google.com with SMTP id b11so15363227pld.12; Mon, 22 Nov 2021 14:54:26 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=4cok9Tz08yrGdyczu0wBiLU6ZEq/77kot/uk7jnqN1E=; b=R64fWR9VYltkjmImML2Kj2eq0VylG14px0w5ewqU5p4v6H3XW5yvntcelrJReBzmE7 58aXVlr03ORv7X8zDbXJT9/vSASIyC2ZymMDlzbegsyShh7VwlfEZzSvCMP+LGVWKBrD su6ftVP5DQFy/FmOkaOjGMqbfVtND6D1CMs5NV7sdl/dM+PpFgWy9qx6s/lBrTOB5t5R muJg+rcBlfGY9W2aQlUD/T3i3wsHsdRfafsBMrtLOVSeVJF4zZ3eR6sAzo9I6dN+m7Qo mC5VGZDkcGHiw2SqZCVq73pjmBimwRdqK+1NLPoVvv9gteP0Y5BY8BXg2+wjWqWb+d+B qZIQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=4cok9Tz08yrGdyczu0wBiLU6ZEq/77kot/uk7jnqN1E=; b=o55WkufXGNfnWfqt2d0pog1K0n0oJQOoOdSpEwpEhwGRMz+GHM63678NExaeJCRTZN U4hbNy4ArkBXvaOAtHKVVgiHNY49Jyx1BPjAVrl9hs8qZVPf+6vwHMas0uqnDeLMnZBg LBz078zSdtatQ7vQCyVQMQ8eUE0pompaO3WJCSzgwLPcToJTAhOuYnQ83tlyQM3vXzh+ SiKOTP8BCDC0BiGjQrT0jqEum6DPRHEtwATbCLlKyscQz+Sw7tX93Qe4T5OKRH3OX8v4 2Ot3Ga2OsNtvd20cHJD/BA3ZNI8SzbqrFOUFbTaESUdyQomFnGucvTzSEIhvfa++RoNT cwrQ== X-Gm-Message-State: AOAM533fEg9n+rv4LjIvwF5KsB3xdEaGd94JFZFzRV8FIzM8w/8nn5/6 jZoR/w5dNyJv0giPzyW12w5VPyz9Dhw= X-Google-Smtp-Source: ABdhPJwp+S1EaC5sTWsT+tWJVlCVc622hccd4Xj0lBupgc5hzhr1CsfrA013Xf1NBe666WwJJEdrtw== X-Received: by 2002:a17:90b:3b8d:: with SMTP id pc13mr588811pjb.112.1637621664785; Mon, 22 Nov 2021 14:54:24 -0800 (PST) Received: from localhost ([2405:201:6014:d064:3d4e:6265:800c:dc84]) by smtp.gmail.com with ESMTPSA id m15sm9732461pfk.186.2021.11.22.14.54.23 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 22 Nov 2021 14:54:24 -0800 (PST) From: Kumar Kartikeya Dwivedi To: bpf@vger.kernel.org Cc: Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Yonghong 
Song , Pavel Emelyanov , Alexander Mikhalitsyn , Andrei Vagin , criu@openvz.org, io-uring@vger.kernel.org, linux-fsdevel@vger.kernel.org Subject: [PATCH RFC bpf-next v2 10/10] samples/bpf: Add example to checkpoint/restore io_uring Date: Tue, 23 Nov 2021 04:23:52 +0530 Message-Id: <20211122225352.618453-11-memxor@gmail.com> X-Mailer: git-send-email 2.34.0 In-Reply-To: <20211122225352.618453-1-memxor@gmail.com> References: <20211122225352.618453-1-memxor@gmail.com> MIME-Version: 1.0 X-Developer-Signature: v=1; a=openpgp-sha256; l=35650; h=from:subject; bh=gubKs0gG51pcL083ucyNsTa+Pt9iyeRpdz3TMFpnKCM=; b=owEBbQKS/ZANAwAIAUzgyIZIvxHKAcsmYgBhnBrMEXdEu2vFCLQigSnJmZ3LZ4d+hAbw0Ni1E4On aBu9DOiJAjMEAAEIAB0WIQRLvip+Buz51YI8YRFM4MiGSL8RygUCYZwazAAKCRBM4MiGSL8RyhFTEA CjLOlz1UNGQX/UBI1ob2fUI38ePjeUHil0k7a1RvSrOOI77Mnfx7H7THe4DPU4i5xYSsAJElupHBZ7 zVPijtUU/Vb9aUFud5f1UHdnQlHcOIlZrfQmjpMlNpvKKzKIzvTizPq2Zfk0buETvZ1oZPldlmwXpH 2txqUzOczeAY7uOgsSb/rOxx8CeRhmPXd0rfsHOvg2aPfKoH17Z0tIyjfaaPfSouEEvFNaVNRmQQv0 aCUfkNdAlifVEWBRp0ttNn3GUXMtK5xuYY6MApImOmw1pbSH4FJ4/d1vegHX+MUhKoo3fmf8yH7Y6w khrCmgTsJpOWYtMj7zNw4efM6HIaaoTF8I0XVOpXgEgadxgS8jayZeagrld8DnJL6mcnerhLU1m09l p6RBOlMwb/fQtWXOeKI1AHiMLQ5LdEzda29ouHS6X6uDrzNEZ5AOv2Qdc3r2bTBEOrFbqRODhGtOtG NGjHGPe5sOkKp1CeKhoL45+Ydp+cjsfL28BDxHgWSOOROmQZKRxXZO/JUF+SgtDVyJZ69fsCcjUTj4 +2owCLz1HZGP5p2ZkzZle5emkIwMr2G968NOh0tTsVVpM+azAfw17zsFuiEnucA7lvHjf3dLEe611o 9Act/fYJ4paySfmgFaTlQh9lKgaSUmt5aJoCneRl8lJsARvDGZkbZB2mSBOQ== X-Developer-Key: i=memxor@gmail.com; a=openpgp; fpr=4BBE2A7E06ECF9D5823C61114CE0C88648BF11CA Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org The sample demonstrates how BPF iterators for task and io_uring can be used to checkpoint the state of an io_uring instance and then recreate it using that information, as a working example of how the iterator will be utilized for the same by userspace projects like CRIU. 
This closely mirrors how CRIU actually works in principle: on dump, all
data is written out to protobuf images, which are then read during
restore to reconstruct the task and its resources. Here we use a custom
binary format and pipe the io_uring "image(s)" (when wq_fd is set there
will be multiple images) to the restorer, which consumes this
information to form a total ordering of the restore actions it must
execute to reach the same state. The sample restores all features that
currently cannot be restored without BPF iterators, and is hence a good
demonstration of what we would like to achieve using these new
facilities. As is evident, a single iteration pass in each iterator is
enough to obtain all the information we require. io_uring ring buffer
restoration is orthogonal and not specific to iterators, so it has been
left out. Our example app also shares its workqueue with a parent
io_uring; the dumper tool detects this and dumps the parent io_uring
first. io_uring doesn't allow creating cycles in this case, so the
chain always terminates in practice. For now only a single parent is
supported, but it is easy to extend this to arbitrary-length chains (by
recursing with a limit in do_dump_parent after detecting wq_fd > 0).
The epoll iterator use case is similar to what we do in
dump_io_uring_file, and would significantly simplify the current
implementation [0].
[0]: https://github.com/checkpoint-restore/criu/blob/criu-dev/criu/eventpoll.c

The dry-run mode of the bpf_cr tool prints the dump image:

$ ./bpf_cr app &
PID: 318, Parent io_uring: 3, Dependent io_uring: 4
$ ./bpf_cr dump 318 4 | ./bpf_cr restore --dry-run
DUMP_SETUP:
	io_uring_fd: 3
	end: true
		flags: 14
		sq_entries: 2
		cq_entries: 4
		sq_thread_cpu: 0
		sq_thread_idle: 1500
		wq_fd: 0
DUMP_SETUP:
	io_uring_fd: 4
	end: false
		flags: 46
		sq_entries: 2
		cq_entries: 4
		sq_thread_cpu: 0
		sq_thread_idle: 1500
		wq_fd: 3
DUMP_EVENTFD:
	io_uring_fd: 4
	end: false
		eventfd: 5
		async: true
DUMP_REG_FD:
	io_uring_fd: 4
	end: false
		reg_fd: 0
		index: 0
DUMP_REG_FD:
	io_uring_fd: 4
	end: false
		reg_fd: 0
		index: 2
DUMP_REG_FD:
	io_uring_fd: 4
	end: false
		reg_fd: 0
		index: 4
DUMP_REG_BUF:
	io_uring_fd: 4
	end: false
		addr: 0
		len: 0
		index: 0
DUMP_REG_BUF:
	io_uring_fd: 4
	end: true
		addr: 140721288339216
		len: 120
		index: 1
Nothing to do, exiting...

======

The trace is as follows:

// We can shift fd numbers around randomly, it doesn't impact C/R
$ exec 3<> /dev/urandom
$ exec 4<> /dev/random
$ exec 5<> /dev/null
$ strace ./bpf_cr app &
...
io_uring_setup(2, {flags=IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF|IORING_SETUP_CQSIZE, sq_thread_cpu=0, sq_thread_idle=1500, sq_entries=2, cq_entries=4, features=IORING_FEAT_SINGLE_MMAP|IORING_FEAT_NODROP|IORING_FEAT_SUBMIT_STABLE|IORING_FEAT_RW_CUR_POS|IORING_FEAT_CUR_PERSONALITY|IORING_FEAT_FAST_POLL|IORING_FEAT_POLL_32BITS|IORING_FEAT_SQPOLL_NONFIXED|IORING_FEAT_EXT_ARG|IORING_FEAT_NATIVE_WORKERS|IORING_FEAT_RSRC_TAGS, sq_off={head=0, tail=64, ring_mask=256, ring_entries=264, flags=276, dropped=272, array=384}, cq_off={head=128, tail=192, ring_mask=260, ring_entries=268, overflow=284, cqes=320, flags=0x118 /* IORING_CQ_??? */}}) = 6
getpid() = 324
...
io_uring_setup(2, {flags=IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF|IORING_SETUP_CQSIZE|IORING_SETUP_ATTACH_WQ, sq_thread_cpu=0, sq_thread_idle=1500, wq_fd=6, sq_entries=2, cq_entries=4, features=IORING_FEAT_SINGLE_MMAP|IORING_FEAT_NODROP|IORING_FEAT_SUBMIT_STABLE|IORING_FEAT_RW_CUR_POS|IORING_FEAT_CUR_PERSONALITY|IORING_FEAT_FAST_POLL|IORING_FEAT_POLL_32BITS|IORING_FEAT_SQPOLL_NONFIXED|IORING_FEAT_EXT_ARG|IORING_FEAT_NATIVE_WORKERS|IORING_FEAT_RSRC_TAGS, sq_off={head=0, tail=64, ring_mask=256, ring_entries=264, flags=276, dropped=272, array=384}, cq_off={head=128, tail=192, ring_mask=260, ring_entries=268, overflow=284, cqes=320, flags=0x118 /* IORING_CQ_??? */}}) = 7
...
// PID: 324, Parent io_uring: 6, Dependent io_uring: 7
...
eventfd2(42, 0) = 8
io_uring_register(7, IORING_REGISTER_EVENTFD_ASYNC, [8], 1) = 0
io_uring_register(7, IORING_REGISTER_FILES, [0, -1, 1, -1, 2], 5) = 0
io_uring_register(7, IORING_REGISTER_BUFFERS, [{iov_base=NULL, iov_len=0}, {iov_base=0x7ffdf1a27680, iov_len=120}], 2) = 0

The restore trace is as follows; the restorer detects the wq_fd on its
own, and dumps and restores it as well before restoring fd 7:

$ ./bpf_cr dump 326 7 | strace ./bpf_cr restore
...
io_uring_setup(2, {flags=IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF|IORING_SETUP_CQSIZE, sq_thread_cpu=0, sq_thread_idle=1500, sq_entries=2, cq_entries=4, features=IORING_FEAT_SINGLE_MMAP|IORING_FEAT_NODROP|IORING_FEAT_SUBMIT_STABLE|IORING_FEAT_RW_CUR_POS|IORING_FEAT_CUR_PERSONALITY|IORING_FEAT_FAST_POLL|IORING_FEAT_POLL_32BITS|IORING_FEAT_SQPOLL_NONFIXED|IORING_FEAT_EXT_ARG|IORING_FEAT_NATIVE_WORKERS|IORING_FEAT_RSRC_TAGS, sq_off={head=0, tail=64, ring_mask=256, ring_entries=264, flags=276, dropped=272, array=384}, cq_off={head=128, tail=192, ring_mask=260, ring_entries=268, overflow=284, cqes=320, flags=0x118 /* IORING_CQ_??? */}}) = 6
dup2(6, 6) = 6
...
io_uring_setup(2, {flags=IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF|IORING_SETUP_CQSIZE|IORING_SETUP_ATTACH_WQ, sq_thread_cpu=0, sq_thread_idle=1500, wq_fd=6, sq_entries=2, cq_entries=4, features=IORING_FEAT_SINGLE_MMAP|IORING_FEAT_NODROP|IORING_FEAT_SUBMIT_STABLE|IORING_FEAT_RW_CUR_POS|IORING_FEAT_CUR_PERSONALITY|IORING_FEAT_FAST_POLL|IORING_FEAT_POLL_32BITS|IORING_FEAT_SQPOLL_NONFIXED|IORING_FEAT_EXT_ARG|IORING_FEAT_NATIVE_WORKERS|IORING_FEAT_RSRC_TAGS, sq_off={head=0, tail=64, ring_mask=256, ring_entries=264, flags=276, dropped=272, array=384}, cq_off={head=128, tail=192, ring_mask=260, ring_entries=268, overflow=284, cqes=320, flags=0x118 /* IORING_CQ_??? */}}) = 7
dup2(7, 7) = 7
...
eventfd2(42, 0) = 8
io_uring_register(7, IORING_REGISTER_EVENTFD_ASYNC, [8], 1) = 0
...
// fd number 0 is same as 1 and 2, hence the lowest one is used during restore,
// it doesn't matter as the underlying struct file is the same...
io_uring_register(7, IORING_REGISTER_FILES, [0, -1, 0, -1, 0], 5) = 0
// This step would happen after restoring mm, so it fails for now for the second iovec
io_uring_register(7, IORING_REGISTER_BUFFERS, [{iov_base=NULL, iov_len=0}, {iov_base=0x7ffdf1a27680, iov_len=120}], 2) = -1 EFAULT (Bad address)
...
---
 samples/bpf/.gitignore   |   1 +
 samples/bpf/Makefile     |   8 +-
 samples/bpf/bpf_cr.bpf.c | 185 +++++++++++
 samples/bpf/bpf_cr.c     | 686 +++++++++++++++++++++++++++++++++++++++
 samples/bpf/bpf_cr.h     |  48 +++
 samples/bpf/hbm_kern.h   |   2 -
 6 files changed, 926 insertions(+), 4 deletions(-)
 create mode 100644 samples/bpf/bpf_cr.bpf.c
 create mode 100644 samples/bpf/bpf_cr.c
 create mode 100644 samples/bpf/bpf_cr.h

-- 
2.34.0

diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore
index 0e7bfdbff80a..9c542431ea45 100644
--- a/samples/bpf/.gitignore
+++ b/samples/bpf/.gitignore
@@ -1,4 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0-only
+bpf_cr
 cpustat
 fds_example
 hbm
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index a886dff1ba89..a64f2e019bfc 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -53,6 +53,7 @@ tprogs-y += task_fd_query
 tprogs-y += xdp_sample_pkts
 tprogs-y += ibumad
 tprogs-y += hbm
+tprogs-y += bpf_cr
 
 tprogs-y += xdp_redirect_cpu
 tprogs-y += xdp_redirect_map_multi
@@ -118,6 +119,7 @@ task_fd_query-objs := task_fd_query_user.o $(TRACE_HELPERS)
 xdp_sample_pkts-objs := xdp_sample_pkts_user.o
 ibumad-objs := ibumad_user.o
 hbm-objs := hbm.o $(CGROUP_HELPERS)
+bpf_cr-objs := bpf_cr.o
 
 xdp_redirect_map_multi-objs := xdp_redirect_map_multi_user.o $(XDP_SAMPLE)
 xdp_redirect_cpu-objs := xdp_redirect_cpu_user.o $(XDP_SAMPLE)
@@ -198,7 +200,7 @@ BPF_EXTRA_CFLAGS += -I$(srctree)/arch/mips/include/asm/mach-generic
 endif
 endif
 
-TPROGS_CFLAGS += -Wall -O2
+TPROGS_CFLAGS += -Wall -O2 -g
 TPROGS_CFLAGS += -Wmissing-prototypes
 TPROGS_CFLAGS += -Wstrict-prototypes
@@ -337,6 +339,7 @@ $(obj)/xdp_redirect_map_multi_user.o: $(obj)/xdp_redirect_map_multi.skel.h
 $(obj)/xdp_redirect_map_user.o: $(obj)/xdp_redirect_map.skel.h
 $(obj)/xdp_redirect_user.o: $(obj)/xdp_redirect.skel.h
 $(obj)/xdp_monitor_user.o: $(obj)/xdp_monitor.skel.h
+$(obj)/bpf_cr.o: $(obj)/bpf_cr.skel.h
 
 $(obj)/tracex5_kern.o: $(obj)/syscall_nrs.h
 $(obj)/hbm_out_kern.o: $(src)/hbm.h $(src)/hbm_kern.h
@@ -392,7 +395,7 @@ $(obj)/%.bpf.o: $(src)/%.bpf.c $(obj)/vmlinux.h $(src)/xdp_sample.bpf.h $(src)/x
 		-I$(LIBBPF_INCLUDE) $(CLANG_SYS_INCLUDES) \
 		-c $(filter %.bpf.c,$^) -o $@
 
-LINKED_SKELS := xdp_redirect_cpu.skel.h xdp_redirect_map_multi.skel.h \
+LINKED_SKELS := bpf_cr.skel.h xdp_redirect_cpu.skel.h xdp_redirect_map_multi.skel.h \
 		xdp_redirect_map.skel.h xdp_redirect.skel.h xdp_monitor.skel.h
 
 clean-files += $(LINKED_SKELS)
@@ -401,6 +404,7 @@ xdp_redirect_map_multi.skel.h-deps := xdp_redirect_map_multi.bpf.o xdp_sample.bp
 xdp_redirect_map.skel.h-deps := xdp_redirect_map.bpf.o xdp_sample.bpf.o
 xdp_redirect.skel.h-deps := xdp_redirect.bpf.o xdp_sample.bpf.o
 xdp_monitor.skel.h-deps := xdp_monitor.bpf.o xdp_sample.bpf.o
+bpf_cr.skel.h-deps := bpf_cr.bpf.o
 
 LINKED_BPF_SRCS := $(patsubst %.bpf.o,%.bpf.c,$(foreach skel,$(LINKED_SKELS),$($(skel)-deps)))
diff --git a/samples/bpf/bpf_cr.bpf.c b/samples/bpf/bpf_cr.bpf.c
new file mode 100644
index 000000000000..6b0bb019f2be
--- /dev/null
+++ b/samples/bpf/bpf_cr.bpf.c
@@ -0,0 +1,185 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include "vmlinux.h"
+#include
+#include
+#include
+
+#include "bpf_cr.h"
+
+/* struct file -> int fd */
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__type(key, __u64);
+	__type(value, int);
+	__uint(max_entries, 16);
+} fdtable_map SEC(".maps");
+
+struct ctx_map_val {
+	int fd;
+	bool init;
+};
+
+/* io_ring_ctx -> int fd */
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__type(key, __u64);
+	__type(value, struct ctx_map_val);
+	__uint(max_entries, 16);
+} io_ring_ctx_map SEC(".maps");
+
+/* ctx->sq_data -> int fd */
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__type(key, __u64);
+	__type(value, int);
+	__uint(max_entries, 16);
+} sq_data_map SEC(".maps");
+
+/* eventfd_ctx -> int fd */
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__type(key, __u64);
+	__type(value, int);
+	__uint(max_entries, 16);
+} eventfd_ctx_map SEC(".maps");
+
+const volatile pid_t tgid = 0;
+
+extern void eventfd_fops __ksym;
+extern void io_uring_fops __ksym;
+
+SEC("iter/task_file")
+int dump_task(struct bpf_iter__task_file *ctx)
+{
+	struct seq_file *seq = ctx->meta->seq;
+	struct task_struct *task = ctx->task;
+	struct file *file = ctx->file;
+	struct ctx_map_val val = {};
+	__u64 f_priv;
+	int fd;
+
+	if (!task)
+		return 0;
+	if (task->tgid != tgid)
+		return 0;
+	if (!file)
+		return 0;
+
+	f_priv = (__u64)file->private_data;
+	fd = ctx->fd;
+	val.fd = fd;
+	if (file->f_op == &eventfd_fops) {
+		bpf_map_update_elem(&eventfd_ctx_map, &f_priv, &fd, 0);
+	} else if (file->f_op == &io_uring_fops) {
+		struct io_ring_ctx *ctx;
+		void *sq_data;
+		__u64 key;
+
+		bpf_map_update_elem(&io_ring_ctx_map, &f_priv, &val, 0);
+		ctx = file->private_data;
+		bpf_probe_read_kernel(&sq_data, sizeof(sq_data), &ctx->sq_data);
+		key = (__u64)sq_data;
+		bpf_map_update_elem(&sq_data_map, &key, &fd, BPF_NOEXIST);
+	}
+	f_priv = (__u64)file;
+	bpf_map_update_elem(&fdtable_map, &f_priv, &fd, BPF_NOEXIST);
+	return 0;
+}
+
+static void dump_io_ring_ctx(struct seq_file *seq, struct io_ring_ctx *ctx, int ring_fd)
+{
+	struct io_uring_dump dump;
+	struct ctx_map_val *val;
+	__u64 key;
+	int *fd;
+
+	key = (__u64)ctx;
+	val = bpf_map_lookup_elem(&io_ring_ctx_map, &key);
+	if (val && val->init)
+		return;
+	__builtin_memset(&dump, 0, sizeof(dump));
+	if (val)
+		val->init = true;
+	dump.type = DUMP_SETUP;
+	dump.io_uring_fd = ring_fd;
+	key = (__u64)ctx->sq_data;
+#define ATTACH_WQ_FLAG (1 << 5)
+	if (ctx->flags & ATTACH_WQ_FLAG) {
+		fd = bpf_map_lookup_elem(&sq_data_map, &key);
+		if (fd)
+			dump.desc.setup.wq_fd = *fd;
+	}
+	dump.desc.setup.flags = ctx->flags;
+	dump.desc.setup.sq_entries = ctx->sq_entries;
+	dump.desc.setup.cq_entries = ctx->cq_entries;
+	dump.desc.setup.sq_thread_cpu = ctx->sq_data->sq_cpu;
+	dump.desc.setup.sq_thread_idle = ctx->sq_data->sq_thread_idle;
+	bpf_seq_write(seq, &dump, sizeof(dump));
+	if (ctx->cq_ev_fd) {
+		dump.type = DUMP_EVENTFD;
+		key = (__u64)ctx->cq_ev_fd;
+		fd = bpf_map_lookup_elem(&eventfd_ctx_map, &key);
+		if (fd)
+			dump.desc.eventfd.eventfd = *fd;
+		dump.desc.eventfd.async = ctx->eventfd_async;
+		bpf_seq_write(seq, &dump, sizeof(dump));
+	}
+}
+
+SEC("iter/io_uring_buf")
+int dump_io_uring_buf(struct bpf_iter__io_uring_buf *ctx)
+{
+	struct io_mapped_ubuf *ubuf = ctx->ubuf;
+	struct seq_file *seq = ctx->meta->seq;
+	struct io_uring_dump dump;
+	__u64 key;
+	int *fd;
+
+	__builtin_memset(&dump, 0, sizeof(dump));
+	key = (__u64)ctx->ctx;
+	fd = bpf_map_lookup_elem(&io_ring_ctx_map, &key);
+	if (!ctx->meta->seq_num)
+		dump_io_ring_ctx(seq, ctx->ctx, fd ? *fd : 0);
+	if (!ubuf)
+		return 0;
+	dump.type = DUMP_REG_BUF;
+	if (fd)
+		dump.io_uring_fd = *fd;
+	dump.desc.reg_buf.index = ctx->index;
+	if (ubuf != ctx->ctx->dummy_ubuf) {
+		dump.desc.reg_buf.addr = ubuf->ubuf;
+		dump.desc.reg_buf.len = ubuf->ubuf_end - ubuf->ubuf;
+	}
+	bpf_seq_write(seq, &dump, sizeof(dump));
+	return 0;
+}
+
+SEC("iter/io_uring_file")
+int dump_io_uring_file(struct bpf_iter__io_uring_file *ctx)
+{
+	struct seq_file *seq = ctx->meta->seq;
+	struct file *file = ctx->file;
+	struct io_uring_dump dump;
+	__u64 key;
+	int *fd;
+
+	__builtin_memset(&dump, 0, sizeof(dump));
+	key = (__u64)ctx->ctx;
+	fd = bpf_map_lookup_elem(&io_ring_ctx_map, &key);
+	if (!ctx->meta->seq_num)
+		dump_io_ring_ctx(seq, ctx->ctx, fd ? *fd : 0);
+	if (!file)
+		return 0;
+	dump.type = DUMP_REG_FD;
+	if (fd)
+		dump.io_uring_fd = *fd;
+	dump.desc.reg_fd.index = ctx->index;
+	key = (__u64)file;
+	fd = bpf_map_lookup_elem(&fdtable_map, &key);
+	if (fd)
+		dump.desc.reg_fd.reg_fd = *fd;
+	bpf_seq_write(seq, &dump, sizeof(dump));
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/bpf_cr.c b/samples/bpf/bpf_cr.c
new file mode 100644
index 000000000000..53c7fd85246c
--- /dev/null
+++ b/samples/bpf/bpf_cr.c
@@ -0,0 +1,686 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * BPF C/R
+ *
+ * Tool to use BPF iterators to dump process state. This currently supports
+ * dumping io_uring fd state, by taking process PID and fd number pair, then
+ * dumping to stdout the state as binary struct, which can be passed to the
+ * tool consuming it, to recreate io_uring.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "bpf_cr.h"
+#include "bpf_cr.skel.h"
+
+/* Approx. 4096/40 */
+#define MAX_DESC 96
+size_t dump_desc_cnt;
+size_t reg_fd_cnt;
+size_t reg_buf_cnt;
+struct io_uring_dump *dump_desc[MAX_DESC];
+int fds[MAX_DESC];
+struct iovec bufs[MAX_DESC];
+
+static int sys_pidfd_open(pid_t pid, unsigned int flags)
+{
+	return syscall(__NR_pidfd_open, pid, flags);
+}
+
+static int sys_pidfd_getfd(int pidfd, int targetfd, unsigned int flags)
+{
+	return syscall(__NR_pidfd_getfd, pidfd, targetfd, flags);
+}
+
+static int sys_io_uring_setup(uint32_t entries, struct io_uring_params *p)
+{
+	return syscall(__NR_io_uring_setup, entries, p);
+}
+
+static int sys_io_uring_register(unsigned int fd, unsigned int opcode,
+				 void *arg, unsigned int nr_args)
+{
+	return syscall(__NR_io_uring_register, fd, opcode, arg, nr_args);
+}
+
+static const char *type2str[__DUMP_MAX] = {
+	[DUMP_SETUP] = "DUMP_SETUP",
+	[DUMP_EVENTFD] = "DUMP_EVENTFD",
+	[DUMP_REG_FD] = "DUMP_REG_FD",
+	[DUMP_REG_BUF] = "DUMP_REG_BUF",
+};
+
+static int do_dump_parent(struct bpf_cr *skel, int parent_fd)
+{
+	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+	union bpf_iter_link_info linfo = {};
+	int ret = 0, buf_it, file_it;
+	struct bpf_link *lb, *lf;
+	char buf[4096];
+
+	linfo.io_uring.io_uring_fd = parent_fd;
+	opts.link_info = &linfo;
+	opts.link_info_len = sizeof(linfo);
+
+	lb = bpf_program__attach_iter(skel->progs.dump_io_uring_buf, &opts);
+	if (!lb) {
+		ret = -errno;
+		fprintf(stderr, "Failed to attach to io_uring_buf: %m\n");
+		return ret;
+	}
+
+	lf = bpf_program__attach_iter(skel->progs.dump_io_uring_file, &opts);
+	if (!lf) {
+		ret = -errno;
+		fprintf(stderr, "Failed to attach io_uring_file: %m\n");
+		goto end;
+	}
+
+	buf_it = bpf_iter_create(bpf_link__fd(lb));
+	if (buf_it < 0) {
+		ret = -errno;
+		fprintf(stderr, "Failed to create io_uring_buf: %m\n");
+		goto end_lf;
+	}
+
+	file_it = bpf_iter_create(bpf_link__fd(lf));
+	if (file_it < 0) {
+		ret = -errno;
+		fprintf(stderr, "Failed to create io_uring_file: %m\n");
+		goto end_buf_it;
+	}
+
+	ret = read(file_it, buf, sizeof(buf));
+	if (ret < 0) {
+		ret = -errno;
+		fprintf(stderr, "Failed to read from io_uring_file iterator: %m\n");
+		goto end_file_it;
+	}
+
+	ret = write(STDOUT_FILENO, buf, ret);
+	if (ret < 0) {
+		ret = -errno;
+		fprintf(stderr, "Failed to write to stdout: %m\n");
+		goto end_file_it;
+	}
+
+	ret = read(buf_it, buf, sizeof(buf));
+	if (ret < 0) {
+		ret = -errno;
+		fprintf(stderr, "Failed to read from io_uring_buf iterator: %m\n");
+		goto end_file_it;
+	}
+
+	ret = write(STDOUT_FILENO, buf, ret);
+	if (ret < 0) {
+		ret = -errno;
+		fprintf(stderr, "Failed to write to stdout: %m\n");
+		goto end_file_it;
+	}
+
+end_file_it:
+	close(file_it);
+end_buf_it:
+	close(buf_it);
+end_lf:
+	bpf_link__destroy(lf);
+end:
+	bpf_link__destroy(lb);
+	return ret;
+}
+
+static int do_dump(pid_t tpid, int tfd)
+{
+	int pidfd, ret = 0, buf_it, file_it, task_it;
+	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+	union bpf_iter_link_info linfo = {};
+	const struct io_uring_dump *d;
+	struct bpf_cr *skel;
+	char buf[4096];
+
+	pidfd = sys_pidfd_open(tpid, 0);
+	if (pidfd < 0) {
+		fprintf(stderr, "Failed to open pidfd for PID %d: %m\n", tpid);
+		return 1;
+	}
+
+	tfd = sys_pidfd_getfd(pidfd, tfd, 0);
+	if (tfd < 0) {
+		fprintf(stderr, "Failed to acquire io_uring fd from PID %d: %m\n", tpid);
+		ret = 1;
+		goto end;
+	}
+
+	skel = bpf_cr__open();
+	if (!skel) {
+		fprintf(stderr, "Failed to open BPF prog: %m\n");
+		ret = 1;
+		goto end_tfd;
+	}
+	skel->rodata->tgid = tpid;
+
+	ret = bpf_cr__load(skel);
+	if (ret < 0) {
+		fprintf(stderr, "Failed to load BPF prog: %m\n");
+		ret = 1;
+		goto end_skel;
+	}
+
+	skel->links.dump_task = bpf_program__attach_iter(skel->progs.dump_task, NULL);
+	if (!skel->links.dump_task) {
+		fprintf(stderr, "Failed to attach task_file iterator: %m\n");
+		ret = 1;
+		goto end_skel;
+	}
+
+	task_it = bpf_iter_create(bpf_link__fd(skel->links.dump_task));
+	if (task_it < 0) {
+		fprintf(stderr, "Failed to create task_file iterator: %m\n");
+		ret = 1;
+		goto end_skel;
+	}
+
+	/* Drive task iterator */
+	ret = read(task_it, buf, sizeof(buf));
+	close(task_it);
+	if (ret < 0) {
+		fprintf(stderr, "Failed to read from task_file iterator: %m\n");
+		ret = 1;
+		goto end_skel;
+	}
+
+	linfo.io_uring.io_uring_fd = tfd;
+	opts.link_info = &linfo;
+	opts.link_info_len = sizeof(linfo);
+	skel->links.dump_io_uring_buf = bpf_program__attach_iter(skel->progs.dump_io_uring_buf,
+								 &opts);
+	if (!skel->links.dump_io_uring_buf) {
+		fprintf(stderr, "Failed to attach io_uring_buf iterator: %m\n");
+		ret = 1;
+		goto end_skel;
+	}
+	skel->links.dump_io_uring_file = bpf_program__attach_iter(skel->progs.dump_io_uring_file,
+								  &opts);
+	if (!skel->links.dump_io_uring_file) {
+		fprintf(stderr, "Failed to attach io_uring_file iterator: %m\n");
+		ret = 1;
+		goto end_skel;
+	}
+
+	buf_it = bpf_iter_create(bpf_link__fd(skel->links.dump_io_uring_buf));
+	if (buf_it < 0) {
+		fprintf(stderr, "Failed to create io_uring_buf iterator: %m\n");
+		ret = 1;
+		goto end_skel;
+	}
+
+	file_it = bpf_iter_create(bpf_link__fd(skel->links.dump_io_uring_file));
+	if (file_it < 0) {
+		fprintf(stderr, "Failed to create io_uring_file iterator: %m\n");
+		ret = 1;
+		goto end_buf_it;
+	}
+
+	ret = read(file_it, buf, sizeof(buf));
+	if (ret < 0) {
+		fprintf(stderr, "Failed to read from io_uring_file iterator: %m\n");
+		ret = 1;
+		goto end_file_it;
+	}
+
+	/* Check if we have to dump its parent as well, first descriptor will
+	 * always be DUMP_SETUP, if so, recurse and dump it first.
+	 */
+	d = (void *)buf;
+	if (ret >= sizeof(*d) && d->type == DUMP_SETUP && d->desc.setup.wq_fd) {
+		int r;
+
+		r = sys_pidfd_getfd(pidfd, d->desc.setup.wq_fd, 0);
+		if (r < 0) {
+			fprintf(stderr, "Failed to obtain parent io_uring: %m\n");
+			ret = 1;
+			goto end_file_it;
+		}
+		r = do_dump_parent(skel, r);
+		if (r < 0) {
+			ret = 1;
+			goto end_file_it;
+		}
+	}
+
+	ret = write(STDOUT_FILENO, buf, ret);
+	if (ret < 0) {
+		fprintf(stderr, "Failed to write to stdout: %m\n");
+		ret = 1;
+		goto end_file_it;
+	}
+
+	ret = read(buf_it, buf, sizeof(buf));
+	if (ret < 0) {
+		fprintf(stderr, "Failed to read from io_uring_buf iterator: %m\n");
+		ret = 1;
+		goto end_file_it;
+	}
+
+	ret = write(STDOUT_FILENO, buf, ret);
+	if (ret < 0) {
+		fprintf(stderr, "Failed to write to stdout: %m\n");
+		ret = 1;
+		goto end_file_it;
+	}
+
+end_file_it:
+	close(file_it);
+end_buf_it:
+	close(buf_it);
+end_skel:
+	bpf_cr__destroy(skel);
+end_tfd:
+	close(tfd);
+end:
+	close(pidfd);
+	return ret;
+}
+
+static int dump_desc_cmp(const void *a, const void *b)
+{
+	const struct io_uring_dump *da = a;
+	const struct io_uring_dump *db = b;
+	uint64_t dafd = da->io_uring_fd;
+	uint64_t dbfd = db->io_uring_fd;
+
+	if (dafd < dbfd)
+		return -1;
+	else if (dafd > dbfd)
+		return 1;
+	else if (da->type < db->type)
+		return -1;
+	else if (da->type > db->type)
+		return 1;
+	return 0;
+}
+
+static int do_restore_setup(const struct io_uring_dump *d)
+{
+	struct io_uring_params p;
+	int fd, nfd;
+
+	memset(&p, 0, sizeof(p));
+
+	p.flags = d->desc.setup.flags;
+	if (p.flags & IORING_SETUP_SQ_AFF)
+		p.sq_thread_cpu = d->desc.setup.sq_thread_cpu;
+	if (p.flags & IORING_SETUP_SQPOLL)
+		p.sq_thread_idle = d->desc.setup.sq_thread_idle;
+	if (p.flags & IORING_SETUP_ATTACH_WQ)
+		p.wq_fd = d->desc.setup.wq_fd;
+	if (p.flags & IORING_SETUP_CQSIZE)
+		p.cq_entries = d->desc.setup.cq_entries;
+
+	fd = sys_io_uring_setup(d->desc.setup.sq_entries, &p);
+	if (fd < 0) {
+		fprintf(stderr, "Failed to restore DUMP_SETUP desc: %m\n");
+		return -errno;
+	}
+
+	nfd = dup2(fd, d->io_uring_fd);
+	if (nfd < 0) {
+		fprintf(stderr, "Failed to dup io_uring_fd: %m\n");
+		close(fd);
+		return -errno;
+	}
+	return 0;
+}
+
+static int do_restore_eventfd(const struct io_uring_dump *d)
+{
+	int evfd, ret, opcode;
+
+	/* This would require restoring the eventfd first in CRIU, which would
+	 * be found using eventfd_ctx and peeking into struct file guts from
+	 * task_file iterator. Here, we just reopen a normal eventfd and
+	 * register it. The BPF program does have code which does eventfd
+	 * matching to report the fd number.
+	 */
+	evfd = eventfd(42, 0);
+	if (evfd < 0) {
+		fprintf(stderr, "Failed to open eventfd: %m\n");
+		return -errno;
+	}
+
+	opcode = d->desc.eventfd.async ? IORING_REGISTER_EVENTFD_ASYNC : IORING_REGISTER_EVENTFD;
+	ret = sys_io_uring_register(d->io_uring_fd, opcode, &evfd, 1);
+	if (ret < 0) {
+		ret = -errno;
+		fprintf(stderr, "Failed to register eventfd: %m\n");
+		goto end;
+	}
+
+	ret = 0;
+end:
+	close(evfd);
+	return ret;
+}
+
+static void print_desc(const struct io_uring_dump *d)
+{
+	printf("%s:\n\tio_uring_fd: %d\n\tend: %s\n",
+	       type2str[d->type % __DUMP_MAX], d->io_uring_fd, d->end ? "true" : "false");
+	switch (d->type) {
+	case DUMP_SETUP:
+		printf("\t\tflags: %u\n\t\tsq_entries: %u\n\t\tcq_entries: %u\n"
+		       "\t\tsq_thread_cpu: %d\n\t\tsq_thread_idle: %d\n\t\twq_fd: %d\n",
+		       d->desc.setup.flags, d->desc.setup.sq_entries,
+		       d->desc.setup.cq_entries, d->desc.setup.sq_thread_cpu,
+		       d->desc.setup.sq_thread_idle, d->desc.setup.wq_fd);
+		break;
+	case DUMP_EVENTFD:
+		printf("\t\teventfd: %d\n\t\tasync: %s\n",
+		       d->desc.eventfd.eventfd,
+		       d->desc.eventfd.async ? "true" : "false");
+		break;
+	case DUMP_REG_FD:
+		printf("\t\treg_fd: %d\n\t\tindex: %lu\n",
+		       d->desc.reg_fd.reg_fd, d->desc.reg_fd.index);
+		break;
+	case DUMP_REG_BUF:
+		printf("\t\taddr: %lu\n\t\tlen: %lu\n\t\tindex: %lu\n",
+		       d->desc.reg_buf.addr, d->desc.reg_buf.len,
+		       d->desc.reg_buf.index);
+		break;
+	default:
+		printf("\t\t{Unknown}\n");
+		break;
+	}
+}
+
+static int do_restore_reg_fd(const struct io_uring_dump *d)
+{
+	int ret;
+
+	/* In CRIU, we restore the fds to be registered before executing the
+	 * restore action that registers file descriptors to io_uring.
+	 * Our example app would register stdin/stdout/stderr in a sparse
+	 * table, so the test case in the commit works.
+	 */
+	if (reg_fd_cnt == MAX_DESC || d->desc.reg_fd.index >= MAX_DESC) {
+		fprintf(stderr, "Exceeded max fds MAX_DESC (%d)\n", MAX_DESC);
+		return -EDOM;
+	}
+	assert(reg_fd_cnt <= d->desc.reg_fd.index);
+	/* Fill sparse entries */
+	while (reg_fd_cnt < d->desc.reg_fd.index)
+		fds[reg_fd_cnt++] = -1;
+	fds[reg_fd_cnt++] = d->desc.reg_fd.reg_fd;
+	if (d->end) {
+		ret = sys_io_uring_register(d->io_uring_fd,
+					    IORING_REGISTER_FILES, &fds,
+					    reg_fd_cnt);
+		if (ret < 0) {
+			fprintf(stderr, "Failed to register files: %m\n");
+			return -errno;
+		}
+	}
+	return 0;
+}
+
+static int do_restore_reg_buf(const struct io_uring_dump *d)
+{
+	struct iovec *iov;
+	int ret;
+
+	/* This step in CRIU for buffers with intact source buffers must be
+	 * executed with care. There are primarily three cases (each with corner
+	 * cases excluded for brevity):
+	 * 1. Source VMA is intact ([ubuf->ubuf, ubuf->ubuf_end) is in VMA, base
+	 *    page PFN is same)
+	 * 2. Source VMA is split (with multiple pages of ubuf overlaying over
+	 *    holes) using munmap(s).
+	 * 3. Source VMA is absent (no VMA or full VMA with incorrect PFN).
+	 *
+	 * PFN remains unique as pages are pinned, hence one with same PFN will
+	 * not be recycled to be part of another mapping by page allocator. 2
+	 * and 3 required page contents dumping.
+	 *
+	 * VMA with holes (registered before punching holes) also needs partial
+	 * page content dumping to restore without holes, and then punch the
+	 * holes. This can be detected when buffer touches two VMAs with holes,
+	 * and base page PFN matches (split VMA case).
+	 *
+	 * All of this is too complicated to demonstrate here, and is done in
+	 * userspace, hence left out. Future patches will implement the page
+	 * dumping from ubuf iterator part.
+	 *
+	 * In usual cases we might be able to dump page contents from inside
+	 * io_uring that we are dumping, by submitting operations, but we want
+	 * to avoid manipulating the ring while dumping, and opcodes we might
+	 * need for doing that may be restricted, hence preventing dump.
+	 */
+	if (reg_buf_cnt == MAX_DESC) {
+		fprintf(stderr, "Exceeded max buffers MAX_DESC (%d)\n", MAX_DESC);
+		return -EDOM;
+	}
+	assert(d->desc.reg_buf.index == reg_buf_cnt);
+	iov = &bufs[reg_buf_cnt++];
+	iov->iov_base = (void *)d->desc.reg_buf.addr;
+	iov->iov_len = d->desc.reg_buf.len;
+	if (d->end) {
+		if (reg_fd_cnt) {
+			ret = sys_io_uring_register(d->io_uring_fd,
+						    IORING_REGISTER_FILES, &fds,
+						    reg_fd_cnt);
+			if (ret < 0) {
+				fprintf(stderr, "Failed to register files: %m\n");
+				return -errno;
+			}
+		}
+
+		ret = sys_io_uring_register(d->io_uring_fd,
+					    IORING_REGISTER_BUFFERS, &bufs,
+					    reg_buf_cnt);
+		if (ret < 0) {
+			fprintf(stderr, "Failed to register buffers: %m\n");
+			return -errno;
+		}
+	}
+	return 0;
+}
+
+static int do_restore_action(const struct io_uring_dump *d, bool dry_run)
+{
+	int ret;
+
+	print_desc(d);
+
+	if (dry_run)
+		return 0;
+
+	switch (d->type) {
+	case DUMP_SETUP:
+		ret = do_restore_setup(d);
+		break;
+	case DUMP_EVENTFD:
+		ret = do_restore_eventfd(d);
+		break;
+	case DUMP_REG_FD:
+		ret = do_restore_reg_fd(d);
+		break;
+	case DUMP_REG_BUF:
+		ret = do_restore_reg_buf(d);
+		break;
+	default:
+		fprintf(stderr, "Unknown dump descriptor\n");
+		return -EDOM;
+	}
+	return ret;
+}
+
+static int do_restore(bool dry_run)
+{
+	struct io_uring_dump dump;
+	int ret, prev_fd = 0;
+
+	while ((ret = read(STDIN_FILENO, &dump, sizeof(dump)))) {
+		struct io_uring_dump *d;
+
+		if (ret < 0) {
+			fprintf(stderr, "Failed to read descriptor: %m\n");
+			return 1;
+		}
+
+		d = calloc(1, sizeof(*d));
+		if (!d) {
+			fprintf(stderr, "Failed to allocate dump descriptor: %m\n");
+			goto free;
+		}
+
+		if (dump_desc_cnt == MAX_DESC) {
+			fprintf(stderr, "Cannot process more than MAX_DESC (%d) dump descs\n",
+				MAX_DESC);
+			goto free;
+		}
+
+		*d = dump;
+		if (!prev_fd)
+			prev_fd = d->io_uring_fd;
+		if (prev_fd != d->io_uring_fd) {
+			dump_desc[dump_desc_cnt - 1]->end = true;
+			prev_fd = d->io_uring_fd;
+		}
+		dump_desc[dump_desc_cnt++] = d;
+		qsort(dump_desc, dump_desc_cnt, sizeof(dump_desc[0]), dump_desc_cmp);
+	}
+	if (dump_desc_cnt)
+		dump_desc[dump_desc_cnt - 1]->end = true;
+
+	for (size_t i = 0; i < dump_desc_cnt; i++) {
+		ret = do_restore_action(dump_desc[i], dry_run);
+		if (ret < 0) {
+			fprintf(stderr, "Failed to execute restore action\n");
+			goto free;
+		}
+	}
+
+	if (!dry_run && dump_desc_cnt)
+		sleep(10000);
+	else
+		puts("Nothing to do, exiting...");
+	ret = 0;
+free:
+	while (dump_desc_cnt--)
+		free(dump_desc[dump_desc_cnt]);
+	return ret;
+}
+
+static int run_app(void)
+{
+	struct io_uring_params p;
+	int r, ret, fd, evfd;
+
+	memset(&p, 0, sizeof(p));
+	p.flags |= IORING_SETUP_CQSIZE | IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF;
+	p.sq_thread_idle = 1500;
+	p.cq_entries = 4;
+	/* Create a test case with parent io_uring, dependent io_uring,
+	 * registered files, eventfd (async), buffers, etc.
+	 */
+	fd = sys_io_uring_setup(2, &p);
+	if (fd < 0) {
+		fprintf(stderr, "Failed to create io_uring: %m\n");
+		return 1;
+	}
+
+	r = 1;
+	printf("PID: %d, Parent io_uring: %d, ", getpid(), fd);
+	p.flags |= IORING_SETUP_ATTACH_WQ;
+	p.wq_fd = fd;
+
+	fd = sys_io_uring_setup(2, &p);
+	if (fd < 0) {
+		fprintf(stderr, "\nFailed to create io_uring: %m\n");
+		goto end_wq_fd;
+	}
+
+	printf("Dependent io_uring: %d\n", fd);
+
+	evfd = eventfd(42, 0);
+	if (evfd < 0) {
+		fprintf(stderr, "Failed to create eventfd: %m\n");
+		goto end_fd;
+	}
+
+	ret = sys_io_uring_register(fd, IORING_REGISTER_EVENTFD_ASYNC, &evfd, 1);
+	if (ret < 0) {
+		fprintf(stderr, "Failed to register eventfd (async): %m\n");
+		goto end_evfd;
+	}
+
+	ret = sys_io_uring_register(fd, IORING_REGISTER_FILES, &(int []){0, -1, 1, -1, 2}, 5);
+	if (ret < 0) {
+		fprintf(stderr, "Failed to register files: %m\n");
+		goto end_evfd;
+	}
+
+	/* Register dummy buf as well */
+	ret = sys_io_uring_register(fd, IORING_REGISTER_BUFFERS, &(struct iovec[]){{}, {&p, sizeof(p)}}, 2);
+	if (ret < 0) {
+		fprintf(stderr, "Failed to register buffers: %m\n");
+		goto end_evfd;
+	}
+
+	pause();
+
+	r = 0;
+end_evfd:
+	close(evfd);
+end_fd:
+	close(fd);
+end_wq_fd:
+	close(p.wq_fd);
+	return r;
+}
+
+int main(int argc, char *argv[])
+{
+	if (argc < 2 || argc > 4) {
+usage:
+		fprintf(stderr, "Usage: %s dump PID FD > dump.out\n"
+				"\tcat dump.out | %s restore [--dry-run]\n"
+				"\t%s app\n", argv[0], argv[0], argv[0]);
+		return 1;
+	}
+
+	if (libbpf_set_strict_mode(LIBBPF_STRICT_ALL)) {
+		fprintf(stderr, "Failed to set libbpf strict mode\n");
+		return 1;
+	}
+
+	if (!strcmp(argv[1], "app")) {
+		return run_app();
+	} else if (!strcmp(argv[1], "dump")) {
+		if (argc != 4)
+			goto usage;
+		return do_dump(atoi(argv[2]), atoi(argv[3]));
+	} else if (!strcmp(argv[1], "restore")) {
+		if (argc < 2 || argc > 3)
+			goto usage;
+		if (argc == 3 && strcmp(argv[2], "--dry-run"))
+			goto usage;
+		return do_restore(argc == 3 /* dry_run mode */);
+	}
+
+	fprintf(stderr, "Unknown argument\n");
+	goto usage;
+}
diff --git a/samples/bpf/bpf_cr.h b/samples/bpf/bpf_cr.h
new file mode 100644
index 000000000000..74d4ca639db5
--- /dev/null
+++ b/samples/bpf/bpf_cr.h
@@ -0,0 +1,48 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#ifndef BPF_CR_H
+#define BPF_CR_H
+
+/* The order of restore actions is in order of declaration for each type,
+ * hence on restore consumed descriptors can be sorted based on their type,
+ * and then each action for the corresponding descriptor can be invoked, to
+ * recreate the io_uring.
+ */
+enum io_uring_state_type {
+	DUMP_SETUP,	/* Record setup parameters */
+	DUMP_EVENTFD,	/* eventfd registered in io_uring */
+	DUMP_REG_FD,	/* fd registered in io_uring */
+	DUMP_REG_BUF,	/* buffer registered in io_uring */
+	__DUMP_MAX,
+};
+
+struct io_uring_dump {
+	enum io_uring_state_type type;
+	int32_t io_uring_fd;
+	bool end;
+	union {
+		struct /* DUMP_SETUP */ {
+			uint32_t flags;
+			uint32_t sq_entries;
+			uint32_t cq_entries;
+			int32_t sq_thread_cpu;
+			int32_t sq_thread_idle;
+			uint32_t wq_fd;
+		} setup;
+		struct /* DUMP_EVENTFD */ {
+			uint32_t eventfd;
+			bool async;
+		} eventfd;
+		struct /* DUMP_REG_FD */ {
+			uint32_t reg_fd;
+			uint64_t index;
+		} reg_fd;
+		struct /* DUMP_REG_BUF */ {
+			uint64_t addr;
+			uint64_t len;
+			uint64_t index;
+		} reg_buf;
+	} desc;
+};
+
+#endif
diff --git a/samples/bpf/hbm_kern.h b/samples/bpf/hbm_kern.h
index 722b3fadb467..1752a46a2b05 100644
--- a/samples/bpf/hbm_kern.h
+++ b/samples/bpf/hbm_kern.h
@@ -9,8 +9,6 @@
  * Include file for sample Host Bandwidth Manager (HBM) BPF programs
  */
 #define KBUILD_MODNAME "foo"
-#include
-#include
 #include
 #include
 #include