From patchwork Thu Aug 10 01:57:37 2023
X-Patchwork-Submitter: Mina Almasry
X-Patchwork-Id: 13348704
Date: Wed, 9 Aug 2023 18:57:37 -0700
In-Reply-To: <20230810015751.3297321-1-almasrymina@google.com>
References: <20230810015751.3297321-1-almasrymina@google.com>
X-Mailer: git-send-email 2.41.0.640.ga95def55d0-goog
Message-ID: <20230810015751.3297321-2-almasrymina@google.com>
Subject: [RFC PATCH v2 01/11] net: add netdev netlink api to bind dma-buf to a net device
From: Mina Almasry
To: netdev@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org
Cc: Mina Almasry , "David S. 
Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jesper Dangaard Brouer , Ilias Apalodimas , Arnd Bergmann , David Ahern , Willem de Bruijn , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Jason Gunthorpe , Hari Ramakrishnan , Dan Williams , Andy Lutomirski , stephen@networkplumber.org, sdf@google.com Precedence: bulk List-ID: X-Mailing-List: linux-media@vger.kernel.org API takes the dma-buf fd as input, and binds it to the netdevice. The user can specify the rx queue to bind the dma-buf to. The user should be able to bind the same dma-buf to multiple queues, but that is left as a (minor) TODO in this iteration. Suggested-by: Stanislav Fomichev Signed-off-by: Mina Almasry --- Documentation/netlink/specs/netdev.yaml | 27 +++++++++++++++ include/uapi/linux/netdev.h | 10 ++++++ net/core/netdev-genl-gen.c | 14 ++++++++ net/core/netdev-genl-gen.h | 1 + net/core/netdev-genl.c | 6 ++++ tools/include/uapi/linux/netdev.h | 10 ++++++ tools/net/ynl/generated/netdev-user.c | 41 ++++++++++++++++++++++ tools/net/ynl/generated/netdev-user.h | 46 +++++++++++++++++++++++++ 8 files changed, 155 insertions(+) diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml index e41015310a6e..907a45260e95 100644 --- a/Documentation/netlink/specs/netdev.yaml +++ b/Documentation/netlink/specs/netdev.yaml @@ -68,6 +68,23 @@ attribute-sets: type: u32 checks: min: 1 + - + name: bind-dmabuf + attributes: + - + name: ifindex + doc: netdev ifindex to bind the dma-buf to. + type: u32 + checks: + min: 1 + - + name: queue-idx + doc: receive queue index to bind the dma-buf to. + type: u32 + - + name: dmabuf-fd + doc: dmabuf file descriptor to bind. + type: u32 operations: list: @@ -100,6 +117,16 @@ operations: doc: Notification about device configuration being changed. 
notify: dev-get mcgrp: mgmt + - + name: bind-rx + doc: Bind dmabuf to netdev + attribute-set: bind-dmabuf + do: + request: + attributes: + - ifindex + - dmabuf-fd + - queue-idx mcast-groups: list: diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h index bf71698a1e82..242b2b65161c 100644 --- a/include/uapi/linux/netdev.h +++ b/include/uapi/linux/netdev.h @@ -47,11 +47,21 @@ enum { NETDEV_A_DEV_MAX = (__NETDEV_A_DEV_MAX - 1) }; +enum { + NETDEV_A_BIND_DMABUF_IFINDEX = 1, + NETDEV_A_BIND_DMABUF_QUEUE_IDX, + NETDEV_A_BIND_DMABUF_DMABUF_FD, + + __NETDEV_A_BIND_DMABUF_MAX, + NETDEV_A_BIND_DMABUF_MAX = (__NETDEV_A_BIND_DMABUF_MAX - 1) +}; + enum { NETDEV_CMD_DEV_GET = 1, NETDEV_CMD_DEV_ADD_NTF, NETDEV_CMD_DEV_DEL_NTF, NETDEV_CMD_DEV_CHANGE_NTF, + NETDEV_CMD_BIND_RX, __NETDEV_CMD_MAX, NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1) diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c index ea9231378aa6..2e34ad5cccfa 100644 --- a/net/core/netdev-genl-gen.c +++ b/net/core/netdev-genl-gen.c @@ -15,6 +15,13 @@ static const struct nla_policy netdev_dev_get_nl_policy[NETDEV_A_DEV_IFINDEX + 1 [NETDEV_A_DEV_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1), }; +/* NETDEV_CMD_BIND_RX - do */ +static const struct nla_policy netdev_bind_rx_nl_policy[NETDEV_A_BIND_DMABUF_DMABUF_FD + 1] = { + [NETDEV_A_BIND_DMABUF_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1), + [NETDEV_A_BIND_DMABUF_DMABUF_FD] = { .type = NLA_U32, }, + [NETDEV_A_BIND_DMABUF_QUEUE_IDX] = { .type = NLA_U32, }, +}; + /* Ops table for netdev */ static const struct genl_split_ops netdev_nl_ops[] = { { @@ -29,6 +36,13 @@ static const struct genl_split_ops netdev_nl_ops[] = { .dumpit = netdev_nl_dev_get_dumpit, .flags = GENL_CMD_CAP_DUMP, }, + { + .cmd = NETDEV_CMD_BIND_RX, + .doit = netdev_nl_bind_rx_doit, + .policy = netdev_bind_rx_nl_policy, + .maxattr = NETDEV_A_BIND_DMABUF_DMABUF_FD, + .flags = GENL_CMD_CAP_DO, + }, }; static const struct genl_multicast_group netdev_nl_mcgrps[] = { diff --git a/net/core/netdev-genl-gen.h b/net/core/netdev-genl-gen.h index 7b370c073e7d..5aaeb435ec08 100644 --- a/net/core/netdev-genl-gen.h +++ b/net/core/netdev-genl-gen.h @@ -13,6 +13,7 @@ int netdev_nl_dev_get_doit(struct sk_buff *skb, struct genl_info *info); int netdev_nl_dev_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb); +int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info); enum { NETDEV_NLGRP_MGMT, diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c index 65ef4867fc49..bf7324dd6c36 100644 --- a/net/core/netdev-genl.c +++ b/net/core/netdev-genl.c @@ -141,6 +141,12 @@ int netdev_nl_dev_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb) return skb->len; } +/* Stub */ +int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info) +{ + return 0; +} + static int netdev_genl_netdevice_event(struct notifier_block *nb, unsigned long event, void *ptr) { diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h index bf71698a1e82..242b2b65161c 100644 --- a/tools/include/uapi/linux/netdev.h +++ b/tools/include/uapi/linux/netdev.h @@ -47,11 +47,21 @@ enum { NETDEV_A_DEV_MAX = (__NETDEV_A_DEV_MAX - 1) }; +enum { + NETDEV_A_BIND_DMABUF_IFINDEX = 1, + NETDEV_A_BIND_DMABUF_QUEUE_IDX, + NETDEV_A_BIND_DMABUF_DMABUF_FD, + + __NETDEV_A_BIND_DMABUF_MAX, + NETDEV_A_BIND_DMABUF_MAX = (__NETDEV_A_BIND_DMABUF_MAX - 1) +}; + enum { NETDEV_CMD_DEV_GET = 1, NETDEV_CMD_DEV_ADD_NTF, NETDEV_CMD_DEV_DEL_NTF, NETDEV_CMD_DEV_CHANGE_NTF, + NETDEV_CMD_BIND_RX, __NETDEV_CMD_MAX, NETDEV_CMD_MAX = 
(__NETDEV_CMD_MAX - 1) diff --git a/tools/net/ynl/generated/netdev-user.c b/tools/net/ynl/generated/netdev-user.c index 4eb8aefef0cd..2716e63820d2 100644 --- a/tools/net/ynl/generated/netdev-user.c +++ b/tools/net/ynl/generated/netdev-user.c @@ -18,6 +18,7 @@ static const char * const netdev_op_strmap[] = { [NETDEV_CMD_DEV_ADD_NTF] = "dev-add-ntf", [NETDEV_CMD_DEV_DEL_NTF] = "dev-del-ntf", [NETDEV_CMD_DEV_CHANGE_NTF] = "dev-change-ntf", + [NETDEV_CMD_BIND_RX] = "bind-rx", }; const char *netdev_op_str(int op) @@ -57,6 +58,17 @@ struct ynl_policy_nest netdev_dev_nest = { .table = netdev_dev_policy, }; +struct ynl_policy_attr netdev_bind_dmabuf_policy[NETDEV_A_BIND_DMABUF_MAX + 1] = { + [NETDEV_A_BIND_DMABUF_IFINDEX] = { .name = "ifindex", .type = YNL_PT_U32, }, + [NETDEV_A_BIND_DMABUF_QUEUE_IDX] = { .name = "queue-idx", .type = YNL_PT_U32, }, + [NETDEV_A_BIND_DMABUF_DMABUF_FD] = { .name = "dmabuf-fd", .type = YNL_PT_U32, }, +}; + +struct ynl_policy_nest netdev_bind_dmabuf_nest = { + .max_attr = NETDEV_A_BIND_DMABUF_MAX, + .table = netdev_bind_dmabuf_policy, +}; + /* Common nested types */ /* ============== NETDEV_CMD_DEV_GET ============== */ /* NETDEV_CMD_DEV_GET - do */ @@ -172,6 +184,35 @@ void netdev_dev_get_ntf_free(struct netdev_dev_get_ntf *rsp) free(rsp); } +/* ============== NETDEV_CMD_BIND_RX ============== */ +/* NETDEV_CMD_BIND_RX - do */ +void netdev_bind_rx_req_free(struct netdev_bind_rx_req *req) +{ + free(req); +} + +int netdev_bind_rx(struct ynl_sock *ys, struct netdev_bind_rx_req *req) +{ + struct nlmsghdr *nlh; + int err; + + nlh = ynl_gemsg_start_req(ys, ys->family_id, NETDEV_CMD_BIND_RX, 1); + ys->req_policy = &netdev_bind_dmabuf_nest; + + if (req->_present.ifindex) + mnl_attr_put_u32(nlh, NETDEV_A_BIND_DMABUF_IFINDEX, req->ifindex); + if (req->_present.dmabuf_fd) + mnl_attr_put_u32(nlh, NETDEV_A_BIND_DMABUF_DMABUF_FD, req->dmabuf_fd); + if (req->_present.queue_idx) + mnl_attr_put_u32(nlh, NETDEV_A_BIND_DMABUF_QUEUE_IDX, req->queue_idx); + + err = ynl_exec(ys, nlh, NULL); + if (err < 0) + return -1; + + return 0; +} + static const struct ynl_ntf_info netdev_ntf_info[] = { [NETDEV_CMD_DEV_ADD_NTF] = { .alloc_sz = sizeof(struct netdev_dev_get_ntf), diff --git a/tools/net/ynl/generated/netdev-user.h b/tools/net/ynl/generated/netdev-user.h index 5554dc69bb9c..74a43bb53627 100644 --- a/tools/net/ynl/generated/netdev-user.h +++ b/tools/net/ynl/generated/netdev-user.h @@ -82,4 +82,50 @@ struct netdev_dev_get_ntf { void netdev_dev_get_ntf_free(struct netdev_dev_get_ntf *rsp); +/* ============== NETDEV_CMD_BIND_RX ============== */ +/* NETDEV_CMD_BIND_RX - do */ +struct netdev_bind_rx_req { + struct { + __u32 ifindex:1; + __u32 dmabuf_fd:1; + __u32 queue_idx:1; + } _present; + + __u32 ifindex; + __u32 dmabuf_fd; + __u32 queue_idx; +}; + +static inline struct netdev_bind_rx_req *netdev_bind_rx_req_alloc(void) +{ + return calloc(1, sizeof(struct netdev_bind_rx_req)); +} +void netdev_bind_rx_req_free(struct netdev_bind_rx_req *req); + +static inline void +netdev_bind_rx_req_set_ifindex(struct netdev_bind_rx_req *req, __u32 ifindex) +{ + req->_present.ifindex = 1; + req->ifindex = ifindex; +} +static inline void +netdev_bind_rx_req_set_dmabuf_fd(struct netdev_bind_rx_req *req, + __u32 dmabuf_fd) +{ + req->_present.dmabuf_fd = 1; + req->dmabuf_fd = dmabuf_fd; +} +static inline void +netdev_bind_rx_req_set_queue_idx(struct netdev_bind_rx_req *req, + __u32 queue_idx) +{ + req->_present.queue_idx = 1; + req->queue_idx = queue_idx; +} + +/* + * Bind dmabuf to netdev + */ +int 
netdev_bind_rx(struct ynl_sock *ys, struct netdev_bind_rx_req *req); + #endif /* _LINUX_NETDEV_GEN_H */
From patchwork Thu Aug 10 01:57:38 2023
X-Patchwork-Submitter: Mina Almasry
X-Patchwork-Id: 13348705
Date: Wed, 9 Aug 2023 18:57:38 -0700
In-Reply-To: <20230810015751.3297321-1-almasrymina@google.com>
References: <20230810015751.3297321-1-almasrymina@google.com>
X-Mailer: git-send-email 2.41.0.640.ga95def55d0-goog
Message-ID: <20230810015751.3297321-3-almasrymina@google.com>
Subject: [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice
From: Mina Almasry
To: netdev@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org
Cc: Mina 
Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jesper Dangaard Brouer , Ilias Apalodimas , Arnd Bergmann , David Ahern , Willem de Bruijn , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Jason Gunthorpe , Hari Ramakrishnan , Dan Williams , Andy Lutomirski , stephen@networkplumber.org, sdf@google.com, Willem de Bruijn , Kaiyuan Zhang Precedence: bulk List-ID: X-Mailing-List: linux-media@vger.kernel.org Add a netdev_dmabuf_binding struct which represents the dma-buf-to-netdevice binding. The netlink API will bind the dma-buf to an rx queue on the netdevice. On the binding, the dma_buf_attach & dma_buf_map_attachment will occur. The entries in the sg_table from mapping will be inserted into a genpool to make it ready for allocation. The chunks in the genpool are owned by a dmabuf_chunk_owner struct which holds the dma-buf offset of the base of the chunk and the dma_addr of the chunk. Both are needed to use allocations that come from this chunk. We create a new type that represents an allocation from the genpool: page_pool_iov. We setup the page_pool_iov allocation size in the genpool to PAGE_SIZE for simplicity: to match the PAGE_SIZE normally allocated by the page pool and given to the drivers. The user can unbind the dmabuf from the netdevice by closing the netlink socket that established the binding. We do this so that the binding is automatically unbound even if the userspace process crashes. The binding and unbinding leaves an indicator in struct netdev_rx_queue that the given queue is bound, but the binding doesn't take effect until the driver actually reconfigures its queues, and re-initializes its page pool. This issue/weirdness is highlighted in the memory provider proposal[1], and I'm hoping that some generic solution for all memory providers will be discussed; this patch doesn't address that weirdness again. The netdev_dmabuf_binding struct is refcounted, and releases its resources only when all the refs are released. [1] https://lore.kernel.org/netdev/20230707183935.997267-1-kuba@kernel.org/ Signed-off-by: Willem de Bruijn Signed-off-by: Kaiyuan Zhang Signed-off-by: Mina Almasry --- include/linux/netdevice.h | 57 ++++++++++++ include/net/page_pool.h | 27 ++++++ net/core/dev.c | 178 ++++++++++++++++++++++++++++++++++++++ net/core/netdev-genl.c | 101 ++++++++++++++++++++- 4 files changed, 361 insertions(+), 2 deletions(-) -- 2.41.0.640.ga95def55d0-goog diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 3800d0479698..1b7c5966d2ca 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -53,6 +53,8 @@ #include #include #include +#include +#include struct netpoll_info; struct device; @@ -782,6 +784,55 @@ bool rps_may_expire_flow(struct net_device *dev, u16 rxq_index, u32 flow_id, #endif #endif /* CONFIG_RPS */ +struct netdev_dmabuf_binding { + struct dma_buf *dmabuf; + struct dma_buf_attachment *attachment; + struct sg_table *sgt; + struct net_device *dev; + struct gen_pool *chunk_pool; + + /* The user holds a ref (via the netlink API) for as long as they want + * the binding to remain alive. Each page pool using this binding holds + * a ref to keep the binding alive. Each allocated page_pool_iov holds a + * ref. + * + * The binding undos itself and unmaps the underlying dmabuf once all + * those refs are dropped and the binding is no longer desired or in + * use. + */ + refcount_t ref; + + /* The portid of the user that owns this binding. Used for netlink to + * notify us of the user dropping the bind. 
+ */ + u32 owner_nlportid; + + /* The list of bindings currently active. Used for netlink to notify us + * of the user dropping the bind. + */ + struct list_head list; + + /* rxq's this binding is active on. */ + struct xarray bound_rxq_list; +}; + +void __netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding); + +static inline void +netdev_devmem_binding_get(struct netdev_dmabuf_binding *binding) +{ + refcount_inc(&binding->ref); +} + +static inline void +netdev_devmem_binding_put(struct netdev_dmabuf_binding *binding) +{ + if (!refcount_dec_and_test(&binding->ref)) + return; + + __netdev_devmem_binding_free(binding); +} + /* This structure contains an instance of an RX queue. */ struct netdev_rx_queue { struct xdp_rxq_info xdp_rxq; @@ -796,6 +847,7 @@ struct netdev_rx_queue { #ifdef CONFIG_XDP_SOCKETS struct xsk_buff_pool *pool; #endif + struct netdev_dmabuf_binding *binding; } ____cacheline_aligned_in_smp; /* @@ -5026,6 +5078,11 @@ void netif_set_tso_max_segs(struct net_device *dev, unsigned int segs); void netif_inherit_tso_max(struct net_device *to, const struct net_device *from); +void netdev_unbind_dmabuf_to_queue(struct netdev_dmabuf_binding *binding); +int netdev_bind_dmabuf_to_queue(struct net_device *dev, unsigned int dmabuf_fd, + u32 rxq_idx, + struct netdev_dmabuf_binding **out); + static inline bool netif_is_macsec(const struct net_device *dev) { return dev->priv_flags & IFF_MACSEC; diff --git a/include/net/page_pool.h b/include/net/page_pool.h index 364fe6924258..61b2066d32b5 100644 --- a/include/net/page_pool.h +++ b/include/net/page_pool.h @@ -170,6 +170,33 @@ extern const struct pp_memory_provider_ops hugesp_ops; extern const struct pp_memory_provider_ops huge_ops; extern const struct pp_memory_provider_ops huge_1g_ops; +/* page_pool_iov support */ + +/* Owner of the dma-buf chunks inserted into the gen pool. Each scatterlist + * entry from the dmabuf is inserted into the genpool as a chunk, and needs + * this owner struct to keep track of some metadata necessary to create + * allocations from this chunk. + */ +struct dmabuf_genpool_chunk_owner { + /* Offset into the dma-buf where this chunk starts. */ + unsigned long base_virtual; + + /* dma_addr of the start of the chunk. */ + dma_addr_t base_dma_addr; + + /* Array of page_pool_iovs for this chunk. */ + struct page_pool_iov *ppiovs; + size_t num_ppiovs; + + struct netdev_dmabuf_binding *binding; +}; + +struct page_pool_iov { + struct dmabuf_genpool_chunk_owner *owner; + + refcount_t refcount; +}; + struct page_pool { struct page_pool_params p; diff --git a/net/core/dev.c b/net/core/dev.c index 8e7d0cb540cd..02a25ccf771a 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -151,6 +151,8 @@ #include #include #include +#include +#include #include "dev.h" #include "net-sysfs.h" @@ -2037,6 +2039,182 @@ static int call_netdevice_notifiers_mtu(unsigned long val, return call_netdevice_notifiers_info(val, &info.info); } +/* Device memory support */ + +static void netdev_devmem_free_chunk_owner(struct gen_pool *genpool, + struct gen_pool_chunk *chunk, + void *not_used) +{ + struct dmabuf_genpool_chunk_owner *owner = chunk->owner; + + kvfree(owner->ppiovs); + kfree(owner); +} + +void __netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding) +{ + size_t size, avail; + + gen_pool_for_each_chunk(binding->chunk_pool, + netdev_devmem_free_chunk_owner, NULL); + + size = gen_pool_size(binding->chunk_pool); + avail = gen_pool_avail(binding->chunk_pool); + + if (!WARN(size != avail, "can't destroy genpool. 
size=%lu, avail=%lu", + size, avail)) + gen_pool_destroy(binding->chunk_pool); + + dma_buf_unmap_attachment(binding->attachment, binding->sgt, + DMA_BIDIRECTIONAL); + dma_buf_detach(binding->dmabuf, binding->attachment); + dma_buf_put(binding->dmabuf); + kfree(binding); +} + +void netdev_unbind_dmabuf_to_queue(struct netdev_dmabuf_binding *binding) +{ + struct netdev_rx_queue *rxq; + unsigned long xa_idx; + + list_del_rcu(&binding->list); + + xa_for_each(&binding->bound_rxq_list, xa_idx, rxq) + if (rxq->binding == binding) + rxq->binding = NULL; + + netdev_devmem_binding_put(binding); +} + +int netdev_bind_dmabuf_to_queue(struct net_device *dev, unsigned int dmabuf_fd, + u32 rxq_idx, struct netdev_dmabuf_binding **out) +{ + struct netdev_dmabuf_binding *binding; + struct netdev_rx_queue *rxq; + struct scatterlist *sg; + struct dma_buf *dmabuf; + unsigned int sg_idx, i; + unsigned long virtual; + u32 xa_idx; + int err; + + rxq = __netif_get_rx_queue(dev, rxq_idx); + + if (rxq->binding) + return -EEXIST; + + dmabuf = dma_buf_get(dmabuf_fd); + if (IS_ERR_OR_NULL(dmabuf)) + return -EBADFD; + + binding = kzalloc_node(sizeof(*binding), GFP_KERNEL, + dev_to_node(&dev->dev)); + if (!binding) { + err = -ENOMEM; + goto err_put_dmabuf; + } + + xa_init_flags(&binding->bound_rxq_list, XA_FLAGS_ALLOC); + + refcount_set(&binding->ref, 1); + + binding->dmabuf = dmabuf; + + binding->attachment = dma_buf_attach(binding->dmabuf, dev->dev.parent); + if (IS_ERR(binding->attachment)) { + err = PTR_ERR(binding->attachment); + goto err_free_binding; + } + + binding->sgt = dma_buf_map_attachment(binding->attachment, + DMA_BIDIRECTIONAL); + if (IS_ERR(binding->sgt)) { + err = PTR_ERR(binding->sgt); + goto err_detach; + } + + /* For simplicity we expect to make PAGE_SIZE allocations, but the + * binding can be much more flexible than that. We may be able to + * allocate MTU sized chunks here. Leave that for future work... + */ + binding->chunk_pool = gen_pool_create(PAGE_SHIFT, + dev_to_node(&dev->dev)); + if (!binding->chunk_pool) { + err = -ENOMEM; + goto err_unmap; + } + + virtual = 0; + for_each_sgtable_dma_sg(binding->sgt, sg, sg_idx) { + dma_addr_t dma_addr = sg_dma_address(sg); + struct dmabuf_genpool_chunk_owner *owner; + size_t len = sg_dma_len(sg); + struct page_pool_iov *ppiov; + + owner = kzalloc_node(sizeof(*owner), GFP_KERNEL, + dev_to_node(&dev->dev)); + owner->base_virtual = virtual; + owner->base_dma_addr = dma_addr; + owner->num_ppiovs = len / PAGE_SIZE; + owner->binding = binding; + + err = gen_pool_add_owner(binding->chunk_pool, dma_addr, + dma_addr, len, dev_to_node(&dev->dev), + owner); + if (err) { + err = -EINVAL; + goto err_free_chunks; + } + + owner->ppiovs = kvmalloc_array(owner->num_ppiovs, + sizeof(*owner->ppiovs), + GFP_KERNEL); + if (!owner->ppiovs) { + err = -ENOMEM; + goto err_free_chunks; + } + + for (i = 0; i < owner->num_ppiovs; i++) { + ppiov = &owner->ppiovs[i]; + ppiov->owner = owner; + refcount_set(&ppiov->refcount, 1); + } + + dma_addr += len; + virtual += len; + } + + /* TODO: need to be able to bind to multiple rx queues on the same + * netdevice. The code should already be able to handle that with + * minimal changes, but the netlink API currently allows for 1 rx + * queue. 
+ */ + err = xa_alloc(&binding->bound_rxq_list, &xa_idx, rxq, xa_limit_32b, + GFP_KERNEL); + if (err) + goto err_free_chunks; + + rxq->binding = binding; + *out = binding; + + return 0; + +err_free_chunks: + gen_pool_for_each_chunk(binding->chunk_pool, + netdev_devmem_free_chunk_owner, NULL); + gen_pool_destroy(binding->chunk_pool); +err_unmap: + dma_buf_unmap_attachment(binding->attachment, binding->sgt, + DMA_BIDIRECTIONAL); +err_detach: + dma_buf_detach(dmabuf, binding->attachment); +err_free_binding: + kfree(binding); +err_put_dmabuf: + dma_buf_put(dmabuf); + return err; +} + #ifdef CONFIG_NET_INGRESS static DEFINE_STATIC_KEY_FALSE(ingress_needed_key); diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c index bf7324dd6c36..288ed0112995 100644 --- a/net/core/netdev-genl.c +++ b/net/core/netdev-genl.c @@ -141,10 +141,74 @@ int netdev_nl_dev_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb) return skb->len; } -/* Stub */ +static LIST_HEAD(netdev_rbinding_list); + int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info) { - return 0; + struct netdev_dmabuf_binding *out_binding; + u32 ifindex, dmabuf_fd, rxq_idx; + struct net_device *netdev; + struct sk_buff *rsp; + int err = 0; + void *hdr; + + if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_DEV_IFINDEX) || + GENL_REQ_ATTR_CHECK(info, NETDEV_A_BIND_DMABUF_DMABUF_FD) || + GENL_REQ_ATTR_CHECK(info, NETDEV_A_BIND_DMABUF_QUEUE_IDX)) + return -EINVAL; + + ifindex = nla_get_u32(info->attrs[NETDEV_A_DEV_IFINDEX]); + dmabuf_fd = nla_get_u32(info->attrs[NETDEV_A_BIND_DMABUF_DMABUF_FD]); + rxq_idx = nla_get_u32(info->attrs[NETDEV_A_BIND_DMABUF_QUEUE_IDX]); + + rtnl_lock(); + + netdev = __dev_get_by_index(genl_info_net(info), ifindex); + if (!netdev) { + err = -ENODEV; + goto err_unlock; + } + + if (rxq_idx >= netdev->num_rx_queues) { + err = -ERANGE; + goto err_unlock; + } + + if (netdev_bind_dmabuf_to_queue(netdev, dmabuf_fd, rxq_idx, + &out_binding)) { + err = -EINVAL; + goto err_unlock; + } + + out_binding->owner_nlportid = info->snd_portid; + list_add_rcu(&out_binding->list, &netdev_rbinding_list); + + rsp = genlmsg_new(GENLMSG_DEFAULT_SIZE, GFP_KERNEL); + if (!rsp) { + err = -ENOMEM; + goto err_unbind; + } + + hdr = genlmsg_put(rsp, info->snd_portid, info->snd_seq, + &netdev_nl_family, 0, info->genlhdr->cmd); + if (!hdr) { + err = -EMSGSIZE; + goto err_genlmsg_free; + } + + genlmsg_end(rsp, hdr); + + rtnl_unlock(); + + return genlmsg_reply(rsp, info); + +err_genlmsg_free: + nlmsg_free(rsp); +err_unbind: + netdev_unbind_dmabuf_to_queue(out_binding); +err_unlock: + rtnl_unlock(); + return err; } static int netdev_genl_netdevice_event(struct notifier_block *nb, @@ -167,10 +231,37 @@ static int netdev_genl_netdevice_event(struct notifier_block *nb, return NOTIFY_OK; } +static int netdev_netlink_notify(struct notifier_block *nb, unsigned long state, + void *_notify) +{ + struct netlink_notify *notify = _notify; + struct netdev_dmabuf_binding *rbinding; + + if (state != NETLINK_URELEASE || notify->protocol != NETLINK_GENERIC) + return NOTIFY_DONE; + + rcu_read_lock(); + + list_for_each_entry_rcu(rbinding, &netdev_rbinding_list, list) { + if (rbinding->owner_nlportid == notify->portid) { + netdev_unbind_dmabuf_to_queue(rbinding); + break; + } + } + + rcu_read_unlock(); + + return NOTIFY_OK; +} + static struct notifier_block netdev_genl_nb = { .notifier_call = netdev_genl_netdevice_event, }; +static struct notifier_block netdev_netlink_notifier = { + .notifier_call = netdev_netlink_notify, +}; + static int __init 
netdev_genl_init(void) { int err; @@ -183,8 +274,14 @@ static int __init netdev_genl_init(void) if (err) goto err_unreg_ntf; + err = netlink_register_notifier(&netdev_netlink_notifier); + if (err) + goto err_unreg_family; + return 0; +err_unreg_family: + genl_unregister_family(&netdev_nl_family); err_unreg_ntf: unregister_netdevice_notifier(&netdev_genl_nb); return err;
From patchwork Thu Aug 10 01:57:39 2023
X-Patchwork-Submitter: Mina Almasry
X-Patchwork-Id: 13348706
Date: Wed, 9 Aug 2023 18:57:39 -0700
In-Reply-To: <20230810015751.3297321-1-almasrymina@google.com>
References: <20230810015751.3297321-1-almasrymina@google.com>
X-Mailer: git-send-email 2.41.0.640.ga95def55d0-goog
Message-ID: <20230810015751.3297321-4-almasrymina@google.com> Subject: [RFC PATCH v2 03/11] netdev: implement netdevice devmem allocator From: Mina Almasry To: netdev@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jesper Dangaard Brouer , Ilias Apalodimas , Arnd Bergmann , David Ahern , Willem de Bruijn , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Jason Gunthorpe , Hari Ramakrishnan , Dan Williams , Andy Lutomirski , stephen@networkplumber.org, sdf@google.com, Willem de Bruijn , Kaiyuan Zhang Precedence: bulk List-ID: X-Mailing-List: linux-media@vger.kernel.org Implement netdev devmem allocator. The allocator takes a given struct netdev_dmabuf_binding as input and allocates page_pool_iov from that binding. The allocation simply delegates to the binding's genpool for the allocation logic and wraps the returned memory region in a page_pool_iov struct. page_pool_iov are refcounted and are freed back to the binding when the refcount drops to 0. Signed-off-by: Willem de Bruijn Signed-off-by: Kaiyuan Zhang Signed-off-by: Mina Almasry --- include/linux/netdevice.h | 4 ++++ include/net/page_pool.h | 26 ++++++++++++++++++++++++++ net/core/dev.c | 36 ++++++++++++++++++++++++++++++++++++ 3 files changed, 66 insertions(+) -- 2.41.0.640.ga95def55d0-goog diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 1b7c5966d2ca..bb5296e6cb00 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -5078,6 +5078,10 @@ void netif_set_tso_max_segs(struct net_device *dev, unsigned int segs); void netif_inherit_tso_max(struct net_device *to, const struct net_device *from); +struct page_pool_iov * +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding); +void netdev_free_devmem(struct page_pool_iov *ppiov); + void netdev_unbind_dmabuf_to_queue(struct netdev_dmabuf_binding *binding); int netdev_bind_dmabuf_to_queue(struct net_device *dev, unsigned int dmabuf_fd, u32 rxq_idx, diff --git a/include/net/page_pool.h b/include/net/page_pool.h index 61b2066d32b5..13ae7f668c9e 100644 --- a/include/net/page_pool.h +++ b/include/net/page_pool.h @@ -197,6 +197,32 @@ struct page_pool_iov { refcount_t refcount; }; +static inline struct dmabuf_genpool_chunk_owner * +page_pool_iov_owner(const struct page_pool_iov *ppiov) +{ + return ppiov->owner; +} + +static inline unsigned int page_pool_iov_idx(const struct page_pool_iov *ppiov) +{ + return ppiov - page_pool_iov_owner(ppiov)->ppiovs; +} + +static inline dma_addr_t +page_pool_iov_dma_addr(const struct page_pool_iov *ppiov) +{ + struct dmabuf_genpool_chunk_owner *owner = page_pool_iov_owner(ppiov); + + return owner->base_dma_addr + + ((dma_addr_t)page_pool_iov_idx(ppiov) << PAGE_SHIFT); +} + +static inline struct netdev_dmabuf_binding * +page_pool_iov_binding(const struct page_pool_iov *ppiov) +{ + return page_pool_iov_owner(ppiov)->binding; +} + struct page_pool { struct page_pool_params p; diff --git a/net/core/dev.c b/net/core/dev.c index 02a25ccf771a..0149335a25b7 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2072,6 +2072,42 @@ void __netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding) kfree(binding); } +struct page_pool_iov *netdev_alloc_devmem(struct netdev_dmabuf_binding *binding) +{ + struct dmabuf_genpool_chunk_owner *owner; + struct page_pool_iov *ppiov; + unsigned long dma_addr; + ssize_t offset; + ssize_t index; + + dma_addr = gen_pool_alloc_owner(binding->chunk_pool, PAGE_SIZE, + (void 
**)&owner); + if (!dma_addr) + return NULL; + + offset = dma_addr - owner->base_dma_addr; + index = offset / PAGE_SIZE; + ppiov = &owner->ppiovs[index]; + + netdev_devmem_binding_get(binding); + + return ppiov; +} + +void netdev_free_devmem(struct page_pool_iov *ppiov) +{ + struct netdev_dmabuf_binding *binding = page_pool_iov_binding(ppiov); + + refcount_set(&ppiov->refcount, 1); + + if (gen_pool_has_addr(binding->chunk_pool, + page_pool_iov_dma_addr(ppiov), PAGE_SIZE)) + gen_pool_free(binding->chunk_pool, + page_pool_iov_dma_addr(ppiov), PAGE_SIZE); + + netdev_devmem_binding_put(binding); +} + void netdev_unbind_dmabuf_to_queue(struct netdev_dmabuf_binding *binding) { struct netdev_rx_queue *rxq;
From patchwork Thu Aug 10 01:57:40 2023
X-Patchwork-Submitter: Mina Almasry
X-Patchwork-Id: 13348707
X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2c4:200:73ad:9ed5:e067:2b9b]) (user=almasrymina job=sendgmr) by 2002:a25:d8c2:0:b0:d43:78e8:f628 with SMTP id 
p185-20020a25d8c2000000b00d4378e8f628mr17793ybg.6.1691632691163; Wed, 09 Aug 2023 18:58:11 -0700 (PDT) Date: Wed, 9 Aug 2023 18:57:40 -0700 In-Reply-To: <20230810015751.3297321-1-almasrymina@google.com> Mime-Version: 1.0 References: <20230810015751.3297321-1-almasrymina@google.com> X-Mailer: git-send-email 2.41.0.640.ga95def55d0-goog Message-ID: <20230810015751.3297321-5-almasrymina@google.com> Subject: [RFC PATCH v2 04/11] memory-provider: updates to core provider API for devmem TCP From: Mina Almasry To: netdev@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jesper Dangaard Brouer , Ilias Apalodimas , Arnd Bergmann , David Ahern , Willem de Bruijn , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Jason Gunthorpe , Hari Ramakrishnan , Dan Williams , Andy Lutomirski , stephen@networkplumber.org, sdf@google.com Precedence: bulk List-ID: X-Mailing-List: linux-media@vger.kernel.org Implement a few updates to Jakub's RFC memory provider[1] API to make it suitable for device memory TCP: 1. Currently for devmem TCP the driver's netdev_rx_queue holds a reference to the netdev_dmabuf_binding struct and needs to pass that to the page_pool's memory provider somehow. For PoC purposes, create a pp->mp_priv field that is set by the driver. Likely needs a better API (likely dependent on the general memory provider API). 2. The current memory_provider API gives the memory_provider the option to override put_page(), but tries page_pool_clear_pp_info() after the memory provider has released the page. IMO if the page freeing is delegated to the provider then the page_pool should not modify the page after release_page() has been called. [1]: https://lore.kernel.org/netdev/20230707183935.997267-1-kuba@kernel.org/ Signed-off-by: Mina Almasry --- include/net/page_pool.h | 1 + net/core/page_pool.c | 7 ++++--- 2 files changed, 5 insertions(+), 3 deletions(-) -- 2.41.0.640.ga95def55d0-goog diff --git a/include/net/page_pool.h b/include/net/page_pool.h index 13ae7f668c9e..e395f82e182b 100644 --- a/include/net/page_pool.h +++ b/include/net/page_pool.h @@ -78,6 +78,7 @@ struct page_pool_params { struct device *dev; /* device, for DMA pre-mapping purposes */ struct napi_struct *napi; /* Sole consumer of pages, otherwise NULL */ u8 memory_provider; /* haaacks! should be user-facing */ + void *mp_priv; /* argument to pass to the memory provider */ enum dma_data_direction dma_dir; /* DMA mapping direction */ unsigned int max_len; /* max DMA sync memory size */ unsigned int offset; /* DMA addr offset */ diff --git a/net/core/page_pool.c b/net/core/page_pool.c index d50f6728e4f6..df3f431fcff3 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -241,6 +241,7 @@ static int page_pool_init(struct page_pool *pool, goto free_ptr_ring; } + pool->mp_priv = pool->p.mp_priv; if (pool->mp_ops) { err = pool->mp_ops->init(pool); if (err) { @@ -564,16 +565,16 @@ void page_pool_return_page(struct page_pool *pool, struct page *page) else __page_pool_release_page_dma(pool, page); - page_pool_clear_pp_info(page); - /* This may be the last page returned, releasing the pool, so * it is not safe to reference pool afterwards. 
*/ count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt); trace_page_pool_state_release(pool, page, count); - if (put) + if (put) { + page_pool_clear_pp_info(page); put_page(page); + } /* An optimization would be to call __free_pages(page, pool->p.order) * knowing page is not part of page-cache (thus avoiding a * __page_cache_release() call).
From patchwork Thu Aug 10 01:57:41 2023
X-Patchwork-Submitter: Mina Almasry
X-Patchwork-Id: 13348708
Date: Wed, 9 Aug 2023 18:57:41 -0700
In-Reply-To: <20230810015751.3297321-1-almasrymina@google.com>
References: <20230810015751.3297321-1-almasrymina@google.com>
X-Mailer: git-send-email 2.41.0.640.ga95def55d0-goog
Message-ID: 
<20230810015751.3297321-6-almasrymina@google.com> Subject: [RFC PATCH v2 05/11] memory-provider: implement dmabuf devmem memory provider From: Mina Almasry To: netdev@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jesper Dangaard Brouer , Ilias Apalodimas , Arnd Bergmann , David Ahern , Willem de Bruijn , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Jason Gunthorpe , Hari Ramakrishnan , Dan Williams , Andy Lutomirski , stephen@networkplumber.org, sdf@google.com, Willem de Bruijn , Kaiyuan Zhang Precedence: bulk List-ID: X-Mailing-List: linux-media@vger.kernel.org Implement a memory provider that allocates dmabuf devmem page_pool_iovs. Support of PP_FLAG_DMA_MAP, PP_FLAG_DMA_SYNC_DEV, and PP_FLAG_PAGE_FRAG is omitted. PP_FLAG_DMA_MAP is irrelevant as dma-buf devmem is already mapped. The other flags are omitted for simplicity. The provider receives a reference to the struct netdev_dmabuf_binding via the pool->mp_priv pointer. The driver needs to set this pointer for the provider in the page_pool_params. The provider obtains a reference on the netdev_dmabuf_binding which guarantees the binding and the underlying mapping remains alive until the provider is destroyed. Signed-off-by: Willem de Bruijn Signed-off-by: Kaiyuan Zhang Signed-off-by: Mina Almasry --- include/net/page_pool.h | 58 ++++++++++++++++++++++++++++++ net/core/page_pool.c | 79 +++++++++++++++++++++++++++++++++++++++++ 2 files changed, 137 insertions(+) -- 2.41.0.640.ga95def55d0-goog diff --git a/include/net/page_pool.h b/include/net/page_pool.h index e395f82e182b..537eb36115ed 100644 --- a/include/net/page_pool.h +++ b/include/net/page_pool.h @@ -32,6 +32,7 @@ #include /* Needed by ptr_ring */ #include #include +#include #define PP_FLAG_DMA_MAP BIT(0) /* Should page_pool do the DMA * map/unmap @@ -157,6 +158,7 @@ enum pp_memory_provider_type { PP_MP_HUGE_SPLIT, /* 2MB, online page alloc */ PP_MP_HUGE, /* 2MB, all memory pre-allocated */ PP_MP_HUGE_1G, /* 1G pages, MEP, pre-allocated */ + PP_MP_DMABUF_DEVMEM, /* dmabuf devmem provider */ }; struct pp_memory_provider_ops { @@ -170,9 +172,15 @@ extern const struct pp_memory_provider_ops basic_ops; extern const struct pp_memory_provider_ops hugesp_ops; extern const struct pp_memory_provider_ops huge_ops; extern const struct pp_memory_provider_ops huge_1g_ops; +extern const struct pp_memory_provider_ops dmabuf_devmem_ops; /* page_pool_iov support */ +/* We overload the LSB of the struct page pointer to indicate whether it's + * a page or page_pool_iov. + */ +#define PP_DEVMEM 0x01UL + /* Owner of the dma-buf chunks inserted into the gen pool. 
Each scatterlist * entry from the dmabuf is inserted into the genpool as a chunk, and needs * this owner struct to keep track of some metadata necessary to create @@ -196,6 +204,8 @@ struct page_pool_iov { struct dmabuf_genpool_chunk_owner *owner; refcount_t refcount; + + struct page_pool *pp; }; static inline struct dmabuf_genpool_chunk_owner * @@ -218,12 +228,60 @@ page_pool_iov_dma_addr(const struct page_pool_iov *ppiov) ((dma_addr_t)page_pool_iov_idx(ppiov) << PAGE_SHIFT); } +static inline unsigned long +page_pool_iov_virtual_addr(const struct page_pool_iov *ppiov) +{ + struct dmabuf_genpool_chunk_owner *owner = page_pool_iov_owner(ppiov); + + return owner->base_virtual + + ((unsigned long)page_pool_iov_idx(ppiov) << PAGE_SHIFT); +} + static inline struct netdev_dmabuf_binding * page_pool_iov_binding(const struct page_pool_iov *ppiov) { return page_pool_iov_owner(ppiov)->binding; } +static inline int page_pool_iov_refcount(const struct page_pool_iov *ppiov) +{ + return refcount_read(&ppiov->refcount); +} + +static inline void page_pool_iov_get_many(struct page_pool_iov *ppiov, + unsigned int count) +{ + refcount_add(count, &ppiov->refcount); +} + +void __page_pool_iov_free(struct page_pool_iov *ppiov); + +static inline void page_pool_iov_put_many(struct page_pool_iov *ppiov, + unsigned int count) +{ + if (!refcount_sub_and_test(count, &ppiov->refcount)) + return; + + __page_pool_iov_free(ppiov); +} + +/* page pool mm helpers */ + +static inline bool page_is_page_pool_iov(const struct page *page) +{ + return (unsigned long)page & PP_DEVMEM; +} + +static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page) +{ + if (page_is_page_pool_iov(page)) + return (struct page_pool_iov *)((unsigned long)page & + ~PP_DEVMEM); + + DEBUG_NET_WARN_ON_ONCE(true); + return NULL; +} + struct page_pool { struct page_pool_params p; diff --git a/net/core/page_pool.c b/net/core/page_pool.c index df3f431fcff3..0a7c08d748b8 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -20,6 +20,7 @@ #include #include #include +#include #include @@ -236,6 +237,9 @@ static int page_pool_init(struct page_pool *pool, case PP_MP_HUGE_1G: pool->mp_ops = &huge_1g_ops; break; + case PP_MP_DMABUF_DEVMEM: + pool->mp_ops = &dmabuf_devmem_ops; + break; default: err = -EINVAL; goto free_ptr_ring; @@ -1006,6 +1010,15 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe) } EXPORT_SYMBOL(page_pool_return_skb_page); +void __page_pool_iov_free(struct page_pool_iov *ppiov) +{ + if (ppiov->pp->mp_ops != &dmabuf_devmem_ops) + return; + + netdev_free_devmem(ppiov); +} +EXPORT_SYMBOL_GPL(__page_pool_iov_free); + /*********************** * Mem provider hack * ***********************/ @@ -1538,3 +1551,69 @@ const struct pp_memory_provider_ops huge_1g_ops = { .alloc_pages = mp_huge_1g_alloc_pages, .release_page = mp_huge_1g_release, }; + +/*** "Dmabuf devmem memory provider" ***/ + +static int mp_dmabuf_devmem_init(struct page_pool *pool) +{ + struct netdev_dmabuf_binding *binding = pool->mp_priv; + + if (!binding) + return -EINVAL; + + if (pool->p.flags & PP_FLAG_DMA_MAP || + pool->p.flags & PP_FLAG_DMA_SYNC_DEV || + pool->p.flags & PP_FLAG_PAGE_FRAG) + return -EOPNOTSUPP; + + netdev_devmem_binding_get(binding); + return 0; +} + +static struct page *mp_dmabuf_devmem_alloc_pages(struct page_pool *pool, + gfp_t gfp) +{ + struct netdev_dmabuf_binding *binding = pool->mp_priv; + struct page_pool_iov *ppiov; + + ppiov = netdev_alloc_devmem(binding); + if (!ppiov) + return NULL; + + ppiov->pp = pool; + 
pool->pages_state_hold_cnt++; + trace_page_pool_state_hold(pool, (struct page *)ppiov, + pool->pages_state_hold_cnt); + ppiov = (struct page_pool_iov *)((unsigned long)ppiov | PP_DEVMEM); + + return (struct page *)ppiov; +} + +static void mp_dmabuf_devmem_destroy(struct page_pool *pool) +{ + struct netdev_dmabuf_binding *binding = pool->mp_priv; + + netdev_devmem_binding_put(binding); +} + +static bool mp_dmabuf_devmem_release_page(struct page_pool *pool, + struct page *page) +{ + struct page_pool_iov *ppiov; + + if (WARN_ON_ONCE(!page_is_page_pool_iov(page))) + return false; + + ppiov = page_to_page_pool_iov(page); + page_pool_iov_put_many(ppiov, 1); + /* We don't want the page pool put_page()ing our page_pool_iovs. */ + return false; +} + +const struct pp_memory_provider_ops dmabuf_devmem_ops = { + .init = mp_dmabuf_devmem_init, + .destroy = mp_dmabuf_devmem_destroy, + .alloc_pages = mp_dmabuf_devmem_alloc_pages, + .release_page = mp_dmabuf_devmem_release_page, +}; +EXPORT_SYMBOL(dmabuf_devmem_ops);
From patchwork Thu Aug 10 01:57:42 2023
X-Patchwork-Submitter: Mina Almasry
X-Patchwork-Id: 13348709
KE0o6BIF5Ju1R3y0mz8JS0SAEHtYRQLVgeX5NQ== X-Google-Smtp-Source: AGHT+IHQXC92xGLBpjtBfTmT7ryt7BwKv8NqRJfGfu7d9AKpAvKZXXGgEuAoictW8Q26UU3lfF4ZIdPaLJbDrx1slw== X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2c4:200:73ad:9ed5:e067:2b9b]) (user=almasrymina job=sendgmr) by 2002:a25:d18e:0:b0:c78:c530:6345 with SMTP id i136-20020a25d18e000000b00c78c5306345mr16446ybg.7.1691632696924; Wed, 09 Aug 2023 18:58:16 -0700 (PDT) Date: Wed, 9 Aug 2023 18:57:42 -0700 In-Reply-To: <20230810015751.3297321-1-almasrymina@google.com> Mime-Version: 1.0 References: <20230810015751.3297321-1-almasrymina@google.com> X-Mailer: git-send-email 2.41.0.640.ga95def55d0-goog Message-ID: <20230810015751.3297321-7-almasrymina@google.com> Subject: [RFC PATCH v2 06/11] page-pool: add device memory support From: Mina Almasry To: netdev@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jesper Dangaard Brouer , Ilias Apalodimas , Arnd Bergmann , David Ahern , Willem de Bruijn , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Jason Gunthorpe , Hari Ramakrishnan , Dan Williams , Andy Lutomirski , stephen@networkplumber.org, sdf@google.com Precedence: bulk List-ID: X-Mailing-List: linux-media@vger.kernel.org Overload the LSB of struct page* to indicate that it's a page_pool_iov. Refactor mm calls on struct page * into helpers, and add page_pool_iov handling on those helpers. Modify callers of these mm APIs with calls to these helpers instead. In areas where struct page* is dereferenced, add a check for special handling of page_pool_iov. The memory providers producing page_pool_iov can set the LSB on the struct page* returned to the page pool. Note that instead of overloading the LSB of page pointers, we can instead define a new union between struct page & struct page_pool_iov and compact it in a new type. However, we'd need to implement the code churn to modify the page_pool & drivers to use this new type. For this POC that is not implemented (feedback welcome). I have a sample implementation of adding a new page_pool_token type in the page_pool to give a general idea here: https://github.com/torvalds/linux/commit/3a7628700eb7fd02a117db036003bca50779608d Full branch here: https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-pp-tokens (In the branches above, page_pool_iov is called devmem_slice). Could also add static_branch to speed up the checks in page_pool_iov memory providers are being used. 
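As a side note for reviewers, the pointer-tagging trick relied on here is the usual one: struct page pointers are at least word-aligned, so bit 0 of any valid page pointer is always clear and can be borrowed as a type tag. Below is a minimal, self-contained userspace sketch of the idiom only; fake_page, fake_iov and the helper names are illustrative stand-ins and not the helpers added by this patch:

    /* Sketch of LSB pointer tagging: tag, test, and untag a pointer. */
    #include <stdint.h>
    #include <stdio.h>

    #define TAG_BIT 0x1UL            /* assumed analogous to PP_DEVMEM */

    struct fake_page { int dummy; }; /* stand-in for struct page */
    struct fake_iov  { int dummy; }; /* stand-in for struct page_pool_iov */

    /* Mark an iov so it can travel through struct page * plumbing. */
    static struct fake_page *tag_iov(struct fake_iov *iov)
    {
            return (struct fake_page *)((uintptr_t)iov | TAG_BIT);
    }

    /* Check whether a "page" pointer is really a tagged iov. */
    static int is_iov(const struct fake_page *p)
    {
            return (uintptr_t)p & TAG_BIT;
    }

    /* Clear the tag bit to recover the original iov pointer. */
    static struct fake_iov *untag_iov(struct fake_page *p)
    {
            return (struct fake_iov *)((uintptr_t)p & ~TAG_BIT);
    }

    int main(void)
    {
            struct fake_iov iov;
            struct fake_page *tagged = tag_iov(&iov);

            printf("tagged is iov: %d, untag round-trips: %d\n",
                   is_iov(tagged), untag_iov(tagged) == &iov);
            return 0;
    }

The helpers added in page_pool.h below (page_is_page_pool_iov() and page_to_page_pool_iov(), with PP_DEVMEM as the tag bit) follow this same pattern; the tag is only set on pointers produced by the dmabuf memory provider and is stripped before the page_pool_iov is dereferenced.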
Signed-off-by: Mina Almasry --- include/net/page_pool.h | 74 ++++++++++++++++++++++++++++++++++- net/core/page_pool.c | 85 ++++++++++++++++++++++++++++------------- 2 files changed, 131 insertions(+), 28 deletions(-) -- 2.41.0.640.ga95def55d0-goog diff --git a/include/net/page_pool.h b/include/net/page_pool.h index 537eb36115ed..f08ca230d68e 100644 --- a/include/net/page_pool.h +++ b/include/net/page_pool.h @@ -282,6 +282,64 @@ static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page) return NULL; } +static inline int page_pool_page_ref_count(struct page *page) +{ + if (page_is_page_pool_iov(page)) + return page_pool_iov_refcount(page_to_page_pool_iov(page)); + + return page_ref_count(page); +} + +static inline void page_pool_page_get_many(struct page *page, + unsigned int count) +{ + if (page_is_page_pool_iov(page)) + return page_pool_iov_get_many(page_to_page_pool_iov(page), + count); + + return page_ref_add(page, count); +} + +static inline void page_pool_page_put_many(struct page *page, + unsigned int count) +{ + if (page_is_page_pool_iov(page)) + return page_pool_iov_put_many(page_to_page_pool_iov(page), + count); + + if (count > 1) + page_ref_sub(page, count - 1); + + put_page(page); +} + +static inline bool page_pool_page_is_pfmemalloc(struct page *page) +{ + if (page_is_page_pool_iov(page)) + return false; + + return page_is_pfmemalloc(page); +} + +static inline bool page_pool_page_is_pref_nid(struct page *page, int pref_nid) +{ + /* Assume page_pool_iov are on the preferred node without actually + * checking... + * + * This check is only used to check for recycling memory in the page + * pool's fast paths. Currently the only implementation of page_pool_iov + * is dmabuf device memory. It's a deliberate decision by the user to + * bind a certain dmabuf to a certain netdev, and the netdev rx queue + * would not be able to reallocate memory from another dmabuf that + * exists on the preferred node, so, this check doesn't make much sense + * in this case. Assume all page_pool_iovs can be recycled for now. + */ + if (page_is_page_pool_iov(page)) + return true; + + return page_to_nid(page) == pref_nid; +} + struct page_pool { struct page_pool_params p; @@ -434,6 +492,9 @@ static inline long page_pool_defrag_page(struct page *page, long nr) { long ret; + if (page_is_page_pool_iov(page)) + return -EINVAL; + /* If nr == pp_frag_count then we have cleared all remaining * references to the page. No need to actually overwrite it, instead * we can leave this to be overwritten by the calling function. @@ -494,7 +555,12 @@ static inline void page_pool_recycle_direct(struct page_pool *pool, static inline dma_addr_t page_pool_get_dma_addr(struct page *page) { - dma_addr_t ret = page->dma_addr; + dma_addr_t ret; + + if (page_is_page_pool_iov(page)) + return page_pool_iov_dma_addr(page_to_page_pool_iov(page)); + + ret = page->dma_addr; if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT) ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16; @@ -504,6 +570,12 @@ static inline dma_addr_t page_pool_get_dma_addr(struct page *page) static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr) { + /* page_pool_iovs are mapped and their dma-addr can't be modified. 
*/ + if (page_is_page_pool_iov(page)) { + DEBUG_NET_WARN_ON_ONCE(true); + return; + } + page->dma_addr = addr; if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT) page->dma_addr_upper = upper_32_bits(addr); diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 0a7c08d748b8..20c1f74fd844 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -318,7 +318,7 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool) if (unlikely(!page)) break; - if (likely(page_to_nid(page) == pref_nid)) { + if (likely(page_pool_page_is_pref_nid(page, pref_nid))) { pool->alloc.cache[pool->alloc.count++] = page; } else { /* NUMA mismatch; @@ -363,7 +363,15 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool, struct page *page, unsigned int dma_sync_size) { - dma_addr_t dma_addr = page_pool_get_dma_addr(page); + dma_addr_t dma_addr; + + /* page_pool_iov memory provider do not support PP_FLAG_DMA_SYNC_DEV */ + if (page_is_page_pool_iov(page)) { + DEBUG_NET_WARN_ON_ONCE(true); + return; + } + + dma_addr = page_pool_get_dma_addr(page); dma_sync_size = min(dma_sync_size, pool->p.max_len); dma_sync_single_range_for_device(pool->p.dev, dma_addr, @@ -375,6 +383,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page) { dma_addr_t dma; + if (page_is_page_pool_iov(page)) { + /* page_pool_iovs are already mapped */ + DEBUG_NET_WARN_ON_ONCE(true); + return true; + } + /* Setup DMA mapping: use 'struct page' area for storing DMA-addr * since dma_addr_t can be either 32 or 64 bits and does not always fit * into page private data (i.e 32bit cpu with 64bit DMA caps) @@ -398,14 +412,24 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page) static void page_pool_set_pp_info(struct page_pool *pool, struct page *page) { - page->pp = pool; - page->pp_magic |= PP_SIGNATURE; + if (!page_is_page_pool_iov(page)) { + page->pp = pool; + page->pp_magic |= PP_SIGNATURE; + } else { + page_to_page_pool_iov(page)->pp = pool; + } + if (pool->p.init_callback) pool->p.init_callback(page, pool->p.init_arg); } static void page_pool_clear_pp_info(struct page *page) { + if (page_is_page_pool_iov(page)) { + page_to_page_pool_iov(page)->pp = NULL; + return; + } + page->pp_magic = 0; page->pp = NULL; } @@ -615,7 +639,7 @@ static bool page_pool_recycle_in_cache(struct page *page, return false; } - /* Caller MUST have verified/know (page_ref_count(page) == 1) */ + /* Caller MUST have verified/know (page_pool_page_ref_count(page) == 1) */ pool->alloc.cache[pool->alloc.count++] = page; recycle_stat_inc(pool, cached); return true; @@ -638,9 +662,10 @@ __page_pool_put_page(struct page_pool *pool, struct page *page, * refcnt == 1 means page_pool owns page, and can recycle it. * * page is NOT reusable when allocated when system is under - * some pressure. (page_is_pfmemalloc) + * some pressure. 
(page_pool_page_is_pfmemalloc) */ - if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) { + if (likely(page_pool_page_ref_count(page) == 1 && + !page_pool_page_is_pfmemalloc(page))) { /* Read barrier done in page_ref_count / READ_ONCE */ if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV) @@ -741,7 +766,8 @@ static struct page *page_pool_drain_frag(struct page_pool *pool, if (likely(page_pool_defrag_page(page, drain_count))) return NULL; - if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) { + if (page_pool_page_ref_count(page) == 1 && + !page_pool_page_is_pfmemalloc(page)) { if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV) page_pool_dma_sync_for_device(pool, page, -1); @@ -818,9 +844,9 @@ static void page_pool_empty_ring(struct page_pool *pool) /* Empty recycle ring */ while ((page = ptr_ring_consume_bh(&pool->ring))) { /* Verify the refcnt invariant of cached pages */ - if (!(page_ref_count(page) == 1)) + if (!(page_pool_page_ref_count(page) == 1)) pr_crit("%s() page_pool refcnt %d violation\n", - __func__, page_ref_count(page)); + __func__, page_pool_page_ref_count(page)); page_pool_return_page(pool, page); } @@ -977,19 +1003,24 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe) struct page_pool *pp; bool allow_direct; - page = compound_head(page); + if (!page_is_page_pool_iov(page)) { + page = compound_head(page); - /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation - * in order to preserve any existing bits, such as bit 0 for the - * head page of compound page and bit 1 for pfmemalloc page, so - * mask those bits for freeing side when doing below checking, - * and page_is_pfmemalloc() is checked in __page_pool_put_page() - * to avoid recycling the pfmemalloc page. - */ - if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE)) - return false; + /* page->pp_magic is OR'ed with PP_SIGNATURE after the + * allocation in order to preserve any existing bits, such as + * bit 0 for the head page of compound page and bit 1 for + * pfmemalloc page, so mask those bits for freeing side when + * doing below checking, and page_pool_page_is_pfmemalloc() is + * checked in __page_pool_put_page() to avoid recycling the + * pfmemalloc page. + */ + if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE)) + return false; - pp = page->pp; + pp = page->pp; + } else { + pp = page_to_page_pool_iov(page)->pp; + } /* Allow direct recycle if we have reasons to believe that we are * in the same context as the consumer would run, so there's @@ -1273,9 +1304,9 @@ static bool mp_huge_busy(struct mp_huge *hu, unsigned int idx) for (j = 0; j < (1 << MP_HUGE_ORDER); j++) { page = hu->page[idx] + j; - if (page_ref_count(page) != 1) { + if (page_pool_page_ref_count(page) != 1) { pr_warn("Page with ref count %d at %u, %u. Can't safely destory, leaking memory!\n", - page_ref_count(page), idx, j); + page_pool_page_ref_count(page), idx, j); return true; } } @@ -1330,7 +1361,7 @@ static struct page *mp_huge_alloc_pages(struct page_pool *pool, gfp_t gfp) continue; if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE || - page_ref_count(page) != 1) { + page_pool_page_ref_count(page) != 1) { atomic_inc(&mp_huge_ins_b); continue; } @@ -1458,9 +1489,9 @@ static void mp_huge_1g_destroy(struct page_pool *pool) free = true; for (i = 0; i < MP_HUGE_1G_CNT; i++) { page = hu->page + i; - if (page_ref_count(page) != 1) { + if (page_pool_page_ref_count(page) != 1) { pr_warn("Page with ref count %d at %u. 
Can't safely destory, leaking memory!\n", - page_ref_count(page), i); + page_pool_page_ref_count(page), i); free = false; break; } @@ -1489,7 +1520,7 @@ static struct page *mp_huge_1g_alloc_pages(struct page_pool *pool, gfp_t gfp) page = hu->page + page_i; if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE || - page_ref_count(page) != 1) { + page_pool_page_ref_count(page) != 1) { atomic_inc(&mp_huge_ins_b); continue; } From patchwork Thu Aug 10 01:57:43 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 13348710 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3AFA3C001E0 for ; Thu, 10 Aug 2023 01:58:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232062AbjHJB6V (ORCPT ); Wed, 9 Aug 2023 21:58:21 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:54184 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230365AbjHJB6U (ORCPT ); Wed, 9 Aug 2023 21:58:20 -0400 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BA1541BCF for ; Wed, 9 Aug 2023 18:58:19 -0700 (PDT) Received: by mail-yb1-xb4a.google.com with SMTP id 3f1490d57ef6-d1ebc896bd7so434536276.2 for ; Wed, 09 Aug 2023 18:58:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1691632699; x=1692237499; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=230l8c/KQ2ub7YMPLfgz4BJig3YoPpRYkLaR90fiHkY=; b=UOU2WHdByV36J0UbI8RByL1ERbNx1NeSgDEZpmHEkI8h3kqdPwe4+Kdbiw7B7y0uSW AXvt36vDSALuBy8YhsOJAr3vNYeoEzi+wNYmAEkMqp31kP70XmpbNr/HJgwLC02sphw+ 5GcVWmT2QBY6LrobMNsg6VIkZiJeiSI106vFCYJKORLY+5C82dDXG5ZBVA/eNDBebHcE tDiXeKuurb882atISdP7An6L1nAeZZ4IJtt3WII2swvXRcp3Qgiu6cs+og9EzPSr+w7Q lzAXziXlagA5ULGkJealwWCQGNRPRS0s0R3tH3EznN1r9aO2zO5WPBi9pgzUk2nxaOVU DpoA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1691632699; x=1692237499; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=230l8c/KQ2ub7YMPLfgz4BJig3YoPpRYkLaR90fiHkY=; b=j/bBYqcl9Gc42qNji6C8hL4j2ac+sGKUw0ACbGJKzUsSUxxNg5wOMWIl8WYKiW4ZBy fHY24a8rpwB7ZGQ++xIP4gg0NdhIUmGZJoH4tXSyWz2JvJVh/wuKdFOQdDZRx1QALalv pDgMaDrgBfL6tfVDelMt8xgxhkh93jqph1b8hpANxF46uwDfoYb6CXRU59t/dTowWYhb Bsv8+aiMbA7VwTIe9JkDikZVCIadpoSLvvVA5LxJESDT3RHevpRVydb+21Tk9DsycMQn /1xAqJddUObH8mb5gXjmTGg2SQvL233xBbUu0bzoBV/FwCMxWQrzHXbWdKBI969DtBF7 9tDQ== X-Gm-Message-State: AOJu0YzCytVupRV30sLXh/kpWVc+mTXNvnM6IjkuTdBIuACyHCi76Den WnrMJgfVlLrVU/Fv1FMg/Ru28AOxuX5gbStTMA== X-Google-Smtp-Source: AGHT+IGOBsrytp60efHaHvAx2HEbWRVp5N3ZqokBBk6DlKyxgFv5Ol7RPor2cCHaz6v7c/yxheslvjT0xrqHee9vIQ== X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2c4:200:73ad:9ed5:e067:2b9b]) (user=almasrymina job=sendgmr) by 2002:a25:6813:0:b0:d49:e117:3a17 with SMTP id d19-20020a256813000000b00d49e1173a17mr18327ybc.4.1691632698981; Wed, 09 Aug 2023 18:58:18 -0700 (PDT) Date: Wed, 9 Aug 2023 18:57:43 -0700 In-Reply-To: <20230810015751.3297321-1-almasrymina@google.com> Mime-Version: 1.0 References: <20230810015751.3297321-1-almasrymina@google.com> X-Mailer: 
git-send-email 2.41.0.640.ga95def55d0-goog Message-ID: <20230810015751.3297321-8-almasrymina@google.com> Subject: [RFC PATCH v2 07/11] net: support non paged skb frags From: Mina Almasry To: netdev@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jesper Dangaard Brouer , Ilias Apalodimas , Arnd Bergmann , David Ahern , Willem de Bruijn , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Jason Gunthorpe , Hari Ramakrishnan , Dan Williams , Andy Lutomirski , stephen@networkplumber.org, sdf@google.com Precedence: bulk List-ID: X-Mailing-List: linux-media@vger.kernel.org Make skb_frag_page() fail in the case where the frag is not backed by a page, and fix its relevent callers to handle this case. Correctly handle skb_frag refcounting in the page_pool_iovs case. Signed-off-by: Mina Almasry --- include/linux/skbuff.h | 40 +++++++++++++++++++++++++++++++++------- net/core/gro.c | 2 +- net/core/skbuff.c | 3 +++ net/ipv4/tcp.c | 10 +++++++++- 4 files changed, 46 insertions(+), 9 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index faaba050f843..5520587050c4 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -3389,15 +3389,38 @@ static inline void skb_frag_off_copy(skb_frag_t *fragto, fragto->bv_offset = fragfrom->bv_offset; } +/* Returns true if the skb_frag contains a page_pool_iov. */ +static inline bool skb_frag_is_page_pool_iov(const skb_frag_t *frag) +{ + return page_is_page_pool_iov(frag->bv_page); +} + /** * skb_frag_page - retrieve the page referred to by a paged fragment * @frag: the paged fragment * - * Returns the &struct page associated with @frag. + * Returns the &struct page associated with @frag. Returns NULL if this frag + * has no associated page. */ static inline struct page *skb_frag_page(const skb_frag_t *frag) { - return frag->bv_page; + if (!page_is_page_pool_iov(frag->bv_page)) + return frag->bv_page; + + return NULL; +} + +/** + * skb_frag_page_pool_iov - retrieve the page_pool_iov referred to by fragment + * @frag: the fragment + * + * Returns the &struct page_pool_iov associated with @frag. Returns NULL if this + * frag has no associated page_pool_iov. 
+ */ +static inline struct page_pool_iov * +skb_frag_page_pool_iov(const skb_frag_t *frag) +{ + return page_to_page_pool_iov(frag->bv_page); } /** @@ -3408,7 +3431,7 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag) */ static inline void __skb_frag_ref(skb_frag_t *frag) { - get_page(skb_frag_page(frag)); + page_pool_page_get_many(frag->bv_page, 1); } /** @@ -3426,13 +3449,13 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f) static inline void napi_frag_unref(skb_frag_t *frag, bool recycle, bool napi_safe) { - struct page *page = skb_frag_page(frag); - #ifdef CONFIG_PAGE_POOL - if (recycle && page_pool_return_skb_page(page, napi_safe)) + if (recycle && page_pool_return_skb_page(frag->bv_page, napi_safe)) return; + page_pool_page_put_many(frag->bv_page, 1); +#else + put_page(skb_frag_page(frag)); #endif - put_page(page); } /** @@ -3472,6 +3495,9 @@ static inline void skb_frag_unref(struct sk_buff *skb, int f) */ static inline void *skb_frag_address(const skb_frag_t *frag) { + if (!skb_frag_page(frag)) + return NULL; + return page_address(skb_frag_page(frag)) + skb_frag_off(frag); } diff --git a/net/core/gro.c b/net/core/gro.c index 0759277dc14e..42d7f6755f32 100644 --- a/net/core/gro.c +++ b/net/core/gro.c @@ -376,7 +376,7 @@ static inline void skb_gro_reset_offset(struct sk_buff *skb, u32 nhoff) NAPI_GRO_CB(skb)->frag0 = NULL; NAPI_GRO_CB(skb)->frag0_len = 0; - if (!skb_headlen(skb) && pinfo->nr_frags && + if (!skb_headlen(skb) && pinfo->nr_frags && skb_frag_page(frag0) && !PageHighMem(skb_frag_page(frag0)) && (!NET_IP_ALIGN || !((skb_frag_off(frag0) + nhoff) & 3))) { NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0); diff --git a/net/core/skbuff.c b/net/core/skbuff.c index a298992060e6..ac79881a2630 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -2939,6 +2939,9 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe, for (seg = 0; seg < skb_shinfo(skb)->nr_frags; seg++) { const skb_frag_t *f = &skb_shinfo(skb)->frags[seg]; + if (WARN_ON_ONCE(!skb_frag_page(f))) + return false; + if (__splice_segment(skb_frag_page(f), skb_frag_off(f), skb_frag_size(f), offset, len, spd, false, sk, pipe)) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 88f4ebab12ac..7893df0e22ee 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2160,6 +2160,9 @@ static int tcp_zerocopy_receive(struct sock *sk, break; } page = skb_frag_page(frags); + if (WARN_ON_ONCE(!page)) + break; + prefetchw(page); pages[pages_to_map++] = page; length += PAGE_SIZE; @@ -4415,7 +4418,12 @@ int tcp_md5_hash_skb_data(struct tcp_md5sig_pool *hp, for (i = 0; i < shi->nr_frags; ++i) { const skb_frag_t *f = &shi->frags[i]; unsigned int offset = skb_frag_off(f); - struct page *page = skb_frag_page(f) + (offset >> PAGE_SHIFT); + struct page *page = skb_frag_page(f); + + if (WARN_ON_ONCE(!page)) + return 1; + + page += offset >> PAGE_SHIFT; sg_set_page(&sg, page, skb_frag_size(f), offset_in_page(offset)); From patchwork Thu Aug 10 01:57:44 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 13348711 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E4F9AC001E0 for ; Thu, 10 Aug 2023 01:58:26 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232305AbjHJB60 (ORCPT ); 
Wed, 9 Aug 2023 21:58:26 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:54296 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232251AbjHJB6Z (ORCPT ); Wed, 9 Aug 2023 21:58:25 -0400 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id AC295C6 for ; Wed, 9 Aug 2023 18:58:22 -0700 (PDT) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-5899825ae5aso6164797b3.3 for ; Wed, 09 Aug 2023 18:58:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1691632702; x=1692237502; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=Tx6ZfONqtotsLULugIdGrrRKzq2s4AhG+kEo1kI7Jo0=; b=phKzX3fEAWBvDiJNz7jia4iegX0d1RSB3/d3n7lqZQ045SsL1jD7xofLfLpv2O8FDg hjIttXMblZ6NGf3YqOkk0Sl162JcTr3CZe5dXZgHWf6Hu4F0KQF5b3kVD7JfXxfM4igr PAJjmFuDoBJ+Y1Efk/P2kUS5eiYtd9NUbG+qCxr+4Dmu/os1WQBqba8fjZB5B5i3JPwW MFDdDojNYxHBjieDu7cClM/S/kyIFTnkUmvIZTnHPaG0oEUFno81IW7DaFg7QEJvqCGL P+IztKwQ/0rsaulGbIuwT80VSPZ4as0TFja6TBtHys2OPVSaGntFstguHBC8Avqc7DgK rPNA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1691632702; x=1692237502; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=Tx6ZfONqtotsLULugIdGrrRKzq2s4AhG+kEo1kI7Jo0=; b=PGOTSfar6Fh8vKTSNo41ph7A3mf8m0Z3y3rgULREJYNBuot6sWjaymrmTFztcWjT1R C78KJJg/+EbVE+ISOl5v8FMzrMFHhNCdc+OOb3cn4EjJtXrqa4oEV1L1c6fUwQPv7tjL U2Uvbl72b9yXlmWVmMNg+xLYikGg8Dmoxlh1p3WEc1NSh7JNTAJ0XpC2WFxkkNe4XaxJ sdgm7Hfsp2bmxTOuqcF4CRZ/KUnCzbGu50RMtg2zrvOAD5cYCth/RTtvdR/cvZXojV9u IIoYMlEANyixmll9b/gQafoj+T3+FWP/GqmUTqZm73MPKkmABqOYeukS7kFHgBlzScsv /StA== X-Gm-Message-State: AOJu0YyYm/RJh4V0OAn2atrHDju2dglcdlUns7F4wQgY5XeFGK8/muLV PK1k0Dr7j0sDWk3DbASv92w/o0CWFQTTtEVSfQ== X-Google-Smtp-Source: AGHT+IEO6r0ZjyRATxOUR21AKoL40mJNLD83QAwhGNKvI6kYGWdPWzkV0rIfvOVQBM/LHDLM960r1yZ2i+9WiCWk2g== X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2c4:200:73ad:9ed5:e067:2b9b]) (user=almasrymina job=sendgmr) by 2002:a81:b65d:0:b0:586:500a:37d6 with SMTP id h29-20020a81b65d000000b00586500a37d6mr16180ywk.2.1691632701950; Wed, 09 Aug 2023 18:58:21 -0700 (PDT) Date: Wed, 9 Aug 2023 18:57:44 -0700 In-Reply-To: <20230810015751.3297321-1-almasrymina@google.com> Mime-Version: 1.0 References: <20230810015751.3297321-1-almasrymina@google.com> X-Mailer: git-send-email 2.41.0.640.ga95def55d0-goog Message-ID: <20230810015751.3297321-9-almasrymina@google.com> Subject: [RFC PATCH v2 08/11] net: add support for skbs with unreadable frags From: Mina Almasry To: netdev@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jesper Dangaard Brouer , Ilias Apalodimas , Arnd Bergmann , David Ahern , Willem de Bruijn , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Jason Gunthorpe , Hari Ramakrishnan , Dan Williams , Andy Lutomirski , stephen@networkplumber.org, sdf@google.com, Willem de Bruijn , Kaiyuan Zhang Precedence: bulk List-ID: X-Mailing-List: linux-media@vger.kernel.org For device memory TCP, we expect the skb headers to be available in host memory for access, and we expect the skb frags to be in device memory and unaccessible to the host. 
We expect there to be no mixing and matching of device memory frags (unaccessible) with host memory frags (accessible) in the same skb. Add a skb->devmem flag which indicates whether the frags in this skb are device memory frags or not. __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs, and marks the skb as skb->devmem accordingly. Add checks through the network stack to avoid accessing the frags of devmem skbs and avoid coalescing devmem skbs with non devmem skbs. Signed-off-by: Willem de Bruijn Signed-off-by: Kaiyuan Zhang Signed-off-by: Mina Almasry --- include/linux/skbuff.h | 14 +++++++- include/net/tcp.h | 5 +-- net/core/datagram.c | 6 ++++ net/core/skbuff.c | 77 ++++++++++++++++++++++++++++++++++++------ net/ipv4/tcp.c | 6 ++++ net/ipv4/tcp_input.c | 13 +++++-- net/ipv4/tcp_output.c | 5 ++- net/packet/af_packet.c | 4 +-- 8 files changed, 111 insertions(+), 19 deletions(-) -- 2.41.0.640.ga95def55d0-goog diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 5520587050c4..88a3fc7f99b7 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -806,6 +806,8 @@ typedef unsigned char *sk_buff_data_t; * @csum_level: indicates the number of consecutive checksums found in * the packet minus one that have been verified as * CHECKSUM_UNNECESSARY (max 3) + * @devmem: indicates that all the fragments in this skb are backed by + * device memory. * @dst_pending_confirm: need to confirm neighbour * @decrypted: Decrypted SKB * @slow_gro: state present at GRO time, slower prepare step required @@ -992,7 +994,7 @@ struct sk_buff { #if IS_ENABLED(CONFIG_IP_SCTP) __u8 csum_not_inet:1; #endif - + __u8 devmem:1; #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS) __u16 tc_index; /* traffic control index */ #endif @@ -1767,6 +1769,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb) __skb_zcopy_downgrade_managed(skb); } +/* Return true if frags in this skb are not readable by the host. */ +static inline bool skb_frags_not_readable(const struct sk_buff *skb) +{ + return skb->devmem; +} + static inline void skb_mark_not_on_list(struct sk_buff *skb) { skb->next = NULL; @@ -2469,6 +2477,10 @@ static inline void __skb_fill_page_desc(struct sk_buff *skb, int i, struct page *page, int off, int size) { __skb_fill_page_desc_noacc(skb_shinfo(skb), i, page, off, size); + if (page_is_page_pool_iov(page)) { + skb->devmem = true; + return; + } /* Propagate page pfmemalloc to the skb if we can. 
The problem is * that not all callers have unique ownership of the page but rely diff --git a/include/net/tcp.h b/include/net/tcp.h index c5fb90079920..1ea2d7274b8c 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -979,7 +979,7 @@ static inline int tcp_skb_mss(const struct sk_buff *skb) static inline bool tcp_skb_can_collapse_to(const struct sk_buff *skb) { - return likely(!TCP_SKB_CB(skb)->eor); + return likely(!TCP_SKB_CB(skb)->eor && !skb_frags_not_readable(skb)); } static inline bool tcp_skb_can_collapse(const struct sk_buff *to, @@ -987,7 +987,8 @@ static inline bool tcp_skb_can_collapse(const struct sk_buff *to, { return likely(tcp_skb_can_collapse_to(to) && mptcp_skb_can_collapse(to, from) && - skb_pure_zcopy_same(to, from)); + skb_pure_zcopy_same(to, from) && + skb_frags_not_readable(to) == skb_frags_not_readable(from)); } /* Events passed to congestion control interface */ diff --git a/net/core/datagram.c b/net/core/datagram.c index 176eb5834746..cdd4fb129968 100644 --- a/net/core/datagram.c +++ b/net/core/datagram.c @@ -425,6 +425,9 @@ static int __skb_datagram_iter(const struct sk_buff *skb, int offset, return 0; } + if (skb_frags_not_readable(skb)) + goto short_copy; + /* Copy paged appendix. Hmm... why does this look so complicated? */ for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { int end; @@ -616,6 +619,9 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk, { int frag; + if (skb_frags_not_readable(skb)) + return -EFAULT; + if (msg && msg->msg_ubuf && msg->sg_from_iter) return msg->sg_from_iter(sk, skb, from, length); diff --git a/net/core/skbuff.c b/net/core/skbuff.c index ac79881a2630..1814d413897e 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -1175,6 +1175,14 @@ void skb_dump(const char *level, const struct sk_buff *skb, bool full_pkt) struct page *p; u8 *vaddr; + if (skb_frag_is_page_pool_iov(frag)) { + printk("%sskb frag: not readable", level); + len -= frag->bv_len; + if (!len) + break; + continue; + } + skb_frag_foreach_page(frag, skb_frag_off(frag), skb_frag_size(frag), p, p_off, p_len, copied) { @@ -1752,6 +1760,9 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask) if (skb_shared(skb) || skb_unclone(skb, gfp_mask)) return -EINVAL; + if (skb_frags_not_readable(skb)) + return -EFAULT; + if (!num_frags) goto release; @@ -1922,8 +1933,12 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask) { int headerlen = skb_headroom(skb); unsigned int size = skb_end_offset(skb) + skb->data_len; - struct sk_buff *n = __alloc_skb(size, gfp_mask, - skb_alloc_rx_flag(skb), NUMA_NO_NODE); + struct sk_buff *n; + + if (skb_frags_not_readable(skb)) + return NULL; + + n = __alloc_skb(size, gfp_mask, skb_alloc_rx_flag(skb), NUMA_NO_NODE); if (!n) return NULL; @@ -2249,14 +2264,16 @@ struct sk_buff *skb_copy_expand(const struct sk_buff *skb, int newheadroom, int newtailroom, gfp_t gfp_mask) { - /* - * Allocate the copy buffer - */ - struct sk_buff *n = __alloc_skb(newheadroom + skb->len + newtailroom, - gfp_mask, skb_alloc_rx_flag(skb), - NUMA_NO_NODE); int oldheadroom = skb_headroom(skb); int head_copy_len, head_copy_off; + struct sk_buff *n; + + if (skb_frags_not_readable(skb)) + return NULL; + + /* Allocate the copy buffer */ + n = __alloc_skb(newheadroom + skb->len + newtailroom, gfp_mask, + skb_alloc_rx_flag(skb), NUMA_NO_NODE); if (!n) return NULL; @@ -2595,6 +2612,9 @@ void *__pskb_pull_tail(struct sk_buff *skb, int delta) */ int i, k, eat = (skb->tail + delta) - skb->end; + if (skb_frags_not_readable(skb)) + return NULL; + 
if (eat > 0 || skb_cloned(skb)) { if (pskb_expand_head(skb, 0, eat > 0 ? eat + 128 : 0, GFP_ATOMIC)) @@ -2748,6 +2768,9 @@ int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len) to += copy; } + if (skb_frags_not_readable(skb)) + goto fault; + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { int end; skb_frag_t *f = &skb_shinfo(skb)->frags[i]; @@ -2936,6 +2959,9 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe, /* * then map the fragments */ + if (skb_frags_not_readable(skb)) + return false; + for (seg = 0; seg < skb_shinfo(skb)->nr_frags; seg++) { const skb_frag_t *f = &skb_shinfo(skb)->frags[seg]; @@ -3159,6 +3185,9 @@ int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len) from += copy; } + if (skb_frags_not_readable(skb)) + goto fault; + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; int end; @@ -3238,6 +3267,9 @@ __wsum __skb_checksum(const struct sk_buff *skb, int offset, int len, pos = copy; } + if (skb_frags_not_readable(skb)) + return 0; + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { int end; skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; @@ -3338,6 +3370,9 @@ __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, pos = copy; } + if (skb_frags_not_readable(skb)) + return 0; + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { int end; @@ -3795,7 +3830,9 @@ static inline void skb_split_inside_header(struct sk_buff *skb, skb_shinfo(skb1)->frags[i] = skb_shinfo(skb)->frags[i]; skb_shinfo(skb1)->nr_frags = skb_shinfo(skb)->nr_frags; + skb1->devmem = skb->devmem; skb_shinfo(skb)->nr_frags = 0; + skb->devmem = 0; skb1->data_len = skb->data_len; skb1->len += skb1->data_len; skb->data_len = 0; @@ -3809,6 +3846,7 @@ static inline void skb_split_no_header(struct sk_buff *skb, { int i, k = 0; const int nfrags = skb_shinfo(skb)->nr_frags; + const int devmem = skb->devmem; skb_shinfo(skb)->nr_frags = 0; skb1->len = skb1->data_len = skb->len - len; @@ -3842,6 +3880,16 @@ static inline void skb_split_no_header(struct sk_buff *skb, pos += size; } skb_shinfo(skb1)->nr_frags = k; + + if (skb_shinfo(skb)->nr_frags) + skb->devmem = devmem; + else + skb->devmem = 0; + + if (skb_shinfo(skb1)->nr_frags) + skb1->devmem = devmem; + else + skb1->devmem = 0; } /** @@ -4077,6 +4125,9 @@ unsigned int skb_seq_read(unsigned int consumed, const u8 **data, return block_limit - abs_offset; } + if (skb_frags_not_readable(st->cur_skb)) + return 0; + if (st->frag_idx == 0 && !st->frag_data) st->stepped_offset += skb_headlen(st->cur_skb); @@ -5678,7 +5729,10 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from, (from->pp_recycle && skb_cloned(from))) return false; - if (len <= skb_tailroom(to)) { + if (skb_frags_not_readable(from) != skb_frags_not_readable(to)) + return false; + + if (len <= skb_tailroom(to) && !skb_frags_not_readable(from)) { if (len) BUG_ON(skb_copy_bits(from, 0, skb_put(to, len), len)); *delta_truesize = 0; @@ -5853,6 +5907,9 @@ int skb_ensure_writable(struct sk_buff *skb, unsigned int write_len) if (!pskb_may_pull(skb, write_len)) return -ENOMEM; + if (skb_frags_not_readable(skb)) + return -EFAULT; + if (!skb_cloned(skb) || skb_clone_writable(skb, write_len)) return 0; @@ -6513,7 +6570,7 @@ void skb_condense(struct sk_buff *skb) { if (skb->data_len) { if (skb->data_len > skb->end - skb->tail || - skb_cloned(skb)) + skb_cloned(skb) || skb_frags_not_readable(skb)) return; /* Nice, we can free page frag(s) right now */ diff --git 
a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 7893df0e22ee..53ec616b1fb7 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2143,6 +2143,9 @@ static int tcp_zerocopy_receive(struct sock *sk, skb = tcp_recv_skb(sk, seq, &offset); } + if (skb_frags_not_readable(skb)) + break; + if (TCP_SKB_CB(skb)->has_rxtstamp) { tcp_update_recv_tstamps(skb, tss); zc->msg_flags |= TCP_CMSG_TS; @@ -4415,6 +4418,9 @@ int tcp_md5_hash_skb_data(struct tcp_md5sig_pool *hp, if (crypto_ahash_update(req)) return 1; + if (skb_frags_not_readable(skb)) + return 1; + for (i = 0; i < shi->nr_frags; ++i) { const skb_frag_t *f = &shi->frags[i]; unsigned int offset = skb_frag_off(f); diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 670c3dab24f2..f5b12f963cd8 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -5204,6 +5204,9 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root, for (end_of_skbs = true; skb != NULL && skb != tail; skb = n) { n = tcp_skb_next(skb, list); + if (skb_frags_not_readable(skb)) + goto skip_this; + /* No new bits? It is possible on ofo queue. */ if (!before(start, TCP_SKB_CB(skb)->end_seq)) { skb = tcp_collapse_one(sk, skb, list, root); @@ -5224,17 +5227,20 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root, break; } - if (n && n != tail && mptcp_skb_can_collapse(skb, n) && + if (n && n != tail && !skb_frags_not_readable(n) && + mptcp_skb_can_collapse(skb, n) && TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(n)->seq) { end_of_skbs = false; break; } +skip_this: /* Decided to skip this, advance start seq. */ start = TCP_SKB_CB(skb)->end_seq; } if (end_of_skbs || - (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN))) + (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) || + skb_frags_not_readable(skb)) return; __skb_queue_head_init(&tmp); @@ -5278,7 +5284,8 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root, if (!skb || skb == tail || !mptcp_skb_can_collapse(nskb, skb) || - (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN))) + (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) || + skb_frags_not_readable(skb)) goto end; #ifdef CONFIG_TLS_DEVICE if (skb->decrypted != nskb->decrypted) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 2cb39b6dad02..54bc4de6bce4 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -2300,7 +2300,8 @@ static bool tcp_can_coalesce_send_queue_head(struct sock *sk, int len) if (unlikely(TCP_SKB_CB(skb)->eor) || tcp_has_tx_tstamp(skb) || - !skb_pure_zcopy_same(skb, next)) + !skb_pure_zcopy_same(skb, next) || + skb_frags_not_readable(skb) != skb_frags_not_readable(next)) return false; len -= skb->len; @@ -3169,6 +3170,8 @@ static bool tcp_can_collapse(const struct sock *sk, const struct sk_buff *skb) return false; if (skb_cloned(skb)) return false; + if (skb_frags_not_readable(skb)) + return false; /* Some heuristics for collapsing over SACK'd could be invented */ if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED) return false; diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 85ff90a03b0c..308151044032 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -2152,7 +2152,7 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev, } } - snaplen = skb->len; + snaplen = skb_frags_not_readable(skb) ? 
skb_headlen(skb) : skb->len; res = run_filter(skb, sk, snaplen); if (!res) @@ -2275,7 +2275,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, } } - snaplen = skb->len; + snaplen = skb_frags_not_readable(skb) ? skb_headlen(skb) : skb->len; res = run_filter(skb, sk, snaplen); if (!res) From patchwork Thu Aug 10 01:57:45 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 13348712 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 432BAC001E0 for ; Thu, 10 Aug 2023 01:58:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230365AbjHJB6b (ORCPT ); Wed, 9 Aug 2023 21:58:31 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35826 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232207AbjHJB6a (ORCPT ); Wed, 9 Aug 2023 21:58:30 -0400 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7E758C6 for ; Wed, 9 Aug 2023 18:58:25 -0700 (PDT) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-583fe0f84a5so6370257b3.3 for ; Wed, 09 Aug 2023 18:58:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1691632704; x=1692237504; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=A72dv4/C3nRhaaIL45Qla7D1fv5wRm/EUSGzlFwP9lc=; b=ExMQ69JxEkQEghmOjyVasRvnuhz6jC2LodZHs0V5yQ46eTTXvNsFNW4Y3Vm7mapiq1 EkHgPogZUe6tuV/HA/bG9obyt4nv21rLc90nVe1NDWLzadva4pylwBpk997/Fq9lFIYn JDoDtpud26yzjICJ9UqHXgbaytjrVPB6PxSoeqn6mOxYihp7vNnGgm62VfzX3DWsvWlB lojyqTJSfpNOBPk2xOE769LpI3SqjqGJFgAr7HFsPxe4mkBIoZuQmzUTQfpLoHNVQ3Jo PkG/Xhz4f/OMQVjAmSQB8I7CAorAkNcIQ2eioBWwrz4eGnCDMECul4hqOMVEBrxUSjpE hRxQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1691632704; x=1692237504; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=A72dv4/C3nRhaaIL45Qla7D1fv5wRm/EUSGzlFwP9lc=; b=iDUq1PVtR815uZxUg62u8iRiny1jHZaOSgYOO8n/c1dD2FBAEvRp/GrKCd2TfKDz59 qB9TNbrBbrwgNHRDQfTOwGAh+P3De8eSoTz5ipZ7XxU3S0T4kvC71dhTUZ4OjTwSAx0K sDiYbz55xxNcv6vmaPptPDiZvi1O6WxhR3ti+af6OQCc7ONZkJJsR8ogt30EpCrqs2mk woiHkmEOA8W3FS+PsePc2ByLa/Cq5oqS0AFQ46p/IrBevJbibu7GM2PG6EjJKUzYp8Gv hLj4oLqyFqCmqfogNWNr87gocWHMcY9zjs1jTkWu9wd2//JPHja1t42i8pyKFgowxWNY pLiw== X-Gm-Message-State: AOJu0YzHScF6fwamZbb/uWP/l2JbGasaSg/tu+bQxuq74Aymnx68222x nYhzWPRLEDPFdWSMAC3v+pyqk7TYmojzL8H00A== X-Google-Smtp-Source: AGHT+IHgModR+ZQzG5GYWO5cFEJqbZO8Zz/HWXObhE9KVaBkIDk3FCSw7qbAx/envuhDtjI23LEIRDR/6UkUczA98g== X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2c4:200:73ad:9ed5:e067:2b9b]) (user=almasrymina job=sendgmr) by 2002:a81:af0f:0:b0:589:a997:f9ce with SMTP id n15-20020a81af0f000000b00589a997f9cemr16711ywh.2.1691632704659; Wed, 09 Aug 2023 18:58:24 -0700 (PDT) Date: Wed, 9 Aug 2023 18:57:45 -0700 In-Reply-To: <20230810015751.3297321-1-almasrymina@google.com> Mime-Version: 1.0 References: <20230810015751.3297321-1-almasrymina@google.com> X-Mailer: git-send-email 2.41.0.640.ga95def55d0-goog Message-ID: <20230810015751.3297321-10-almasrymina@google.com> 
Subject: [RFC PATCH v2 09/11] tcp: implement recvmsg() RX path for devmem TCP From: Mina Almasry To: netdev@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jesper Dangaard Brouer , Ilias Apalodimas , Arnd Bergmann , David Ahern , Willem de Bruijn , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Jason Gunthorpe , Hari Ramakrishnan , Dan Williams , Andy Lutomirski , stephen@networkplumber.org, sdf@google.com, Willem de Bruijn , Kaiyuan Zhang Precedence: bulk List-ID: X-Mailing-List: linux-media@vger.kernel.org In tcp_recvmsg_locked(), detect if the skb being received by the user is a devmem skb. In this case - if the user provided the MSG_SOCK_DEVMEM flag - pass it to tcp_recvmsg_devmem() for custom handling. tcp_recvmsg_devmem() copies any data in the skb header to the linear buffer, and returns a cmsg to the user indicating the number of bytes returned in the linear buffer. tcp_recvmsg_devmem() then loops over the unaccessible devmem skb frags, and returns to the user a cmsg_devmem indicating the location of the data in the dmabuf device memory. cmsg_devmem contains this information: 1. the offset into the dmabuf where the payload starts. 'frag_offset'. 2. the size of the frag. 'frag_size'. 3. an opaque token 'frag_token' to return to the kernel when the buffer is to be released. The pages awaiting freeing are stored in the newly added sk->sk_user_pages, and each page passed to userspace is get_page()'d. This reference is dropped once the userspace indicates that it is done reading this page. All pages are released when the socket is destroyed. Signed-off-by: Willem de Bruijn Signed-off-by: Kaiyuan Zhang Signed-off-by: Mina Almasry --- include/linux/socket.h | 1 + include/net/sock.h | 2 + include/uapi/asm-generic/socket.h | 5 + include/uapi/linux/uio.h | 6 + net/ipv4/tcp.c | 180 +++++++++++++++++++++++++++++- net/ipv4/tcp_ipv4.c | 7 ++ 6 files changed, 196 insertions(+), 5 deletions(-) diff --git a/include/linux/socket.h b/include/linux/socket.h index 39b74d83c7c4..102733ae888d 100644 --- a/include/linux/socket.h +++ b/include/linux/socket.h @@ -326,6 +326,7 @@ struct ucred { * plain text and require encryption */ +#define MSG_SOCK_DEVMEM 0x2000000 /* Receive devmem skbs as cmsg */ #define MSG_ZEROCOPY 0x4000000 /* Use user data in kernel path */ #define MSG_SPLICE_PAGES 0x8000000 /* Splice the pages from the iterator in sendmsg() */ #define MSG_FASTOPEN 0x20000000 /* Send data in TCP SYN */ diff --git a/include/net/sock.h b/include/net/sock.h index 2eb916d1ff64..5d2a97001152 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -353,6 +353,7 @@ struct sk_filter; * @sk_txtime_unused: unused txtime flags * @ns_tracker: tracker for netns reference * @sk_bind2_node: bind node in the bhash2 table + * @sk_user_pages: xarray of pages the user is holding a reference on. 
*/ struct sock { /* @@ -545,6 +546,7 @@ struct sock { struct rcu_head sk_rcu; netns_tracker ns_tracker; struct hlist_node sk_bind2_node; + struct xarray sk_user_pages; }; enum sk_pacing { diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h index 8ce8a39a1e5f..aacb97f16b78 100644 --- a/include/uapi/asm-generic/socket.h +++ b/include/uapi/asm-generic/socket.h @@ -135,6 +135,11 @@ #define SO_PASSPIDFD 76 #define SO_PEERPIDFD 77 +#define SO_DEVMEM_HEADER 98 +#define SCM_DEVMEM_HEADER SO_DEVMEM_HEADER +#define SO_DEVMEM_OFFSET 99 +#define SCM_DEVMEM_OFFSET SO_DEVMEM_OFFSET + #if !defined(__KERNEL__) #if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__)) diff --git a/include/uapi/linux/uio.h b/include/uapi/linux/uio.h index 059b1a9147f4..ae94763b1963 100644 --- a/include/uapi/linux/uio.h +++ b/include/uapi/linux/uio.h @@ -20,6 +20,12 @@ struct iovec __kernel_size_t iov_len; /* Must be size_t (1003.1g) */ }; +struct cmsg_devmem { + __u64 frag_offset; + __u32 frag_size; + __u32 frag_token; +}; + /* * UIO_MAXIOV shall be at least 16 1003.1g (5.4.1.1) */ diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 53ec616b1fb7..7a5279b61a89 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -461,6 +461,7 @@ void tcp_init_sock(struct sock *sk) set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags); sk_sockets_allocated_inc(sk); + xa_init_flags(&sk->sk_user_pages, XA_FLAGS_ALLOC1); } EXPORT_SYMBOL(tcp_init_sock); @@ -2306,6 +2307,144 @@ static int tcp_inq_hint(struct sock *sk) return inq; } +/* On error, returns the -errno. On success, returns number of bytes sent to the + * user. May not consume all of @remaining_len. + */ +static int tcp_recvmsg_devmem(const struct sock *sk, const struct sk_buff *skb, + unsigned int offset, struct msghdr *msg, + int remaining_len) +{ + struct cmsg_devmem cmsg_devmem = { 0 }; + unsigned int start; + int i, copy, n; + int sent = 0; + int err = 0; + + do { + start = skb_headlen(skb); + + if (!skb_frags_not_readable(skb)) { + err = -ENODEV; + goto out; + } + + /* Copy header. */ + copy = start - offset; + if (copy > 0) { + copy = min(copy, remaining_len); + + n = copy_to_iter(skb->data + offset, copy, + &msg->msg_iter); + if (n != copy) { + err = -EFAULT; + goto out; + } + + offset += copy; + remaining_len -= copy; + + /* First a cmsg_devmem for # bytes copied to user + * buffer. + */ + memset(&cmsg_devmem, 0, sizeof(cmsg_devmem)); + cmsg_devmem.frag_size = copy; + err = put_cmsg(msg, SOL_SOCKET, SO_DEVMEM_HEADER, + sizeof(cmsg_devmem), &cmsg_devmem); + if (err) + goto out; + + sent += copy; + + if (remaining_len == 0) + goto out; + } + + /* after that, send information of devmem pages through a + * sequence of cmsg + */ + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + const skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + struct page_pool_iov *ppiov; + u64 frag_offset; + u32 user_token; + int end; + + /* skb_frags_not_readable() should indicate that ALL the + * frags in this skb are unreadable page_pool_iovs. + * We're checking for that flag above, but also check + * individual pages here. If the tcp stack is not + * setting skb->devmem correctly, we still don't want to + * crash here when accessing pgmap or priv below. 
+ */ + if (!skb_frag_page_pool_iov(frag)) { + net_err_ratelimited("Found non-devmem skb with page_pool_iov"); + err = -ENODEV; + goto out; + } + + ppiov = skb_frag_page_pool_iov(frag); + end = start + skb_frag_size(frag); + copy = end - offset; + + if (copy > 0) { + copy = min(copy, remaining_len); + + frag_offset = page_pool_iov_virtual_addr(ppiov) + + skb_frag_off(frag) + offset - + start; + cmsg_devmem.frag_offset = frag_offset; + cmsg_devmem.frag_size = copy; + err = xa_alloc((struct xarray *)&sk->sk_user_pages, + &user_token, frag->bv_page, + xa_limit_31b, GFP_KERNEL); + if (err) + goto out; + + cmsg_devmem.frag_token = user_token; + + offset += copy; + remaining_len -= copy; + + err = put_cmsg(msg, SOL_SOCKET, + SO_DEVMEM_OFFSET, + sizeof(cmsg_devmem), + &cmsg_devmem); + if (err) + goto out; + + page_pool_iov_get_many(ppiov, 1); + + sent += copy; + + if (remaining_len == 0) + goto out; + } + start = end; + } + + if (!remaining_len) + goto out; + + /* if remaining_len is not satisfied yet, we need to go to the + * next frag in the frag_list to satisfy remaining_len. + */ + skb = skb_shinfo(skb)->frag_list ?: skb->next; + + offset = offset - start; + } while (skb); + + if (remaining_len) { + err = -EFAULT; + goto out; + } + +out: + if (!sent) + sent = err; + + return sent; +} + /* * This routine copies from a sock struct into the user buffer. * @@ -2318,6 +2457,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len, int flags, struct scm_timestamping_internal *tss, int *cmsg_flags) { + bool last_copied_devmem, last_copied_init = false; struct tcp_sock *tp = tcp_sk(sk); int copied = 0; u32 peek_seq; @@ -2492,15 +2632,45 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len, } if (!(flags & MSG_TRUNC)) { - err = skb_copy_datagram_msg(skb, offset, msg, used); - if (err) { - /* Exception. Bailout! */ - if (!copied) - copied = -EFAULT; + if (last_copied_init && + last_copied_devmem != skb->devmem) break; + + if (!skb->devmem) { + err = skb_copy_datagram_msg(skb, offset, msg, + used); + if (err) { + /* Exception. Bailout! */ + if (!copied) + copied = -EFAULT; + break; + } + } else { + if (!(flags & MSG_SOCK_DEVMEM)) { + /* skb->devmem skbs can only be received + * with the MSG_SOCK_DEVMEM flag. 
+ */ + if (!copied) + copied = -EFAULT; + + break; + } + + err = tcp_recvmsg_devmem(sk, skb, offset, msg, + used); + if (err <= 0) { + if (!copied) + copied = -EFAULT; + + break; + } + used = err; } } + last_copied_devmem = skb->devmem; + last_copied_init = true; + WRITE_ONCE(*seq, *seq + used); copied += used; len -= used; diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index cecd5a135e64..4472b9357569 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -2295,6 +2295,13 @@ static int tcp_v4_init_sock(struct sock *sk) void tcp_v4_destroy_sock(struct sock *sk) { struct tcp_sock *tp = tcp_sk(sk); + struct page *page; + unsigned long index; + + xa_for_each(&sk->sk_user_pages, index, page) + page_pool_page_put_many(page, 1); + + xa_destroy(&sk->sk_user_pages); trace_tcp_destroy_sock(sk); From patchwork Thu Aug 10 01:57:46 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 13348713 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 46002EB64DD for ; Thu, 10 Aug 2023 01:58:37 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232324AbjHJB6g (ORCPT ); Wed, 9 Aug 2023 21:58:36 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36020 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232207AbjHJB6f (ORCPT ); Wed, 9 Aug 2023 21:58:35 -0400 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0AB4D2106 for ; Wed, 9 Aug 2023 18:58:28 -0700 (PDT) Received: by mail-yb1-xb4a.google.com with SMTP id 3f1490d57ef6-c8f360a07a2so399680276.2 for ; Wed, 09 Aug 2023 18:58:28 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1691632707; x=1692237507; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=/4QeO+pagKrTYOIfbZJRrXpwc/dNI+RJG2rrrCnM7Sk=; b=QOGVvS1pbyhBQujGmuVNxv+/RrIDY5iWys+aLMNejmgkPzbwbsQ39t7Ryf0s6uiGEV +XJsoLagljlLbHu4+IIsQZcLz/1h0S2ktpRk8BFOpe0tHuX0RzFFrJGBfqeL1uu9Fw5m o1xkhCmM8JO7BW4LqDZ9cRQT2LS10ZCbkFc7bTRUF/S6uJwdTsIKQ8YjP+1edPCNHEJB VKLNg7O/tdQ+pZiMw2WQOueY3Pw9UjkUhegULxMkoAeutnNdrTGl3v7afyzgTxi1nkwD 3ASZcw5xdSozHd1Hh/yUiZqUPrFO+yrdEoAFiaPgkznP4h/24CK+RZ9Xypyq2g/k0hFm hdRA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1691632707; x=1692237507; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=/4QeO+pagKrTYOIfbZJRrXpwc/dNI+RJG2rrrCnM7Sk=; b=BKMrwvaAVG2IT17U19mBxhQ+Z0uHtzPanHGrms83v2t85LG2m6QInTHsEGqkTznmBG HFbAFV0mh/3BxyCpZhMPKqt2wXR7/6jpXdwoq0C4nigO+txOl2rqHYL6+/kQL3+IOMZ1 e5sqp5NeW22E+Qh90lhLcCPTPoLKYafypS5NxuJQe8Jutf/iDLYaKtW+b+2booAoH8yj yI2gaQ4NFh++kAkljt1OMC1Hek5tbxdie30MOPIfAEDZ3yha6VFpgflX+Z6j1IlT03Dm 7B/ngjRrqNlOGWWbTrXV6nOgnhDo+EW/c9/1BWk88LhbdfwoYnIlkc+1WyHEiM6qUlUj onFQ== X-Gm-Message-State: AOJu0YwFvycztMlDF4jyr99AIAV3BTE7Ki2I2ety2jfap9tuB8n6cEJS Yb0bfCwbzfVpEX+ALQrb0BTmXEemeXVlIQxAGQ== X-Google-Smtp-Source: AGHT+IG6tqnzv0CWLPyFfi6YUDF997bmI60WCtTuPYwi+ZupJbatUqvnXIgWHRoDy1fJOcPQx7Y9AftVpScO5FVfEQ== X-Received: from almasrymina.svl.corp.google.com 
([2620:15c:2c4:200:73ad:9ed5:e067:2b9b]) (user=almasrymina job=sendgmr) by 2002:a25:a2c4:0:b0:d62:e781:5f02 with SMTP id c4-20020a25a2c4000000b00d62e7815f02mr17141ybn.13.1691632707331; Wed, 09 Aug 2023 18:58:27 -0700 (PDT) Date: Wed, 9 Aug 2023 18:57:46 -0700 In-Reply-To: <20230810015751.3297321-1-almasrymina@google.com> Mime-Version: 1.0 References: <20230810015751.3297321-1-almasrymina@google.com> X-Mailer: git-send-email 2.41.0.640.ga95def55d0-goog Message-ID: <20230810015751.3297321-11-almasrymina@google.com> Subject: [RFC PATCH v2 10/11] net: add SO_DEVMEM_DONTNEED setsockopt to release RX pages From: Mina Almasry To: netdev@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jesper Dangaard Brouer , Ilias Apalodimas , Arnd Bergmann , David Ahern , Willem de Bruijn , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Jason Gunthorpe , Hari Ramakrishnan , Dan Williams , Andy Lutomirski , stephen@networkplumber.org, sdf@google.com, Willem de Bruijn , Kaiyuan Zhang Precedence: bulk List-ID: X-Mailing-List: linux-media@vger.kernel.org Add an interface for the user to notify the kernel that it is done reading the NET_RX dmabuf pages returned as cmsg. The kernel will drop the reference on the NET_RX pages to make them available for re-use. Signed-off-by: Willem de Bruijn Signed-off-by: Kaiyuan Zhang Signed-off-by: Mina Almasry --- include/uapi/asm-generic/socket.h | 1 + include/uapi/linux/uio.h | 4 ++++ net/core/sock.c | 36 +++++++++++++++++++++++++++++++ 3 files changed, 41 insertions(+) diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h index aacb97f16b78..eb93b43394d4 100644 --- a/include/uapi/asm-generic/socket.h +++ b/include/uapi/asm-generic/socket.h @@ -135,6 +135,7 @@ #define SO_PASSPIDFD 76 #define SO_PEERPIDFD 77 +#define SO_DEVMEM_DONTNEED 97 #define SO_DEVMEM_HEADER 98 #define SCM_DEVMEM_HEADER SO_DEVMEM_HEADER #define SO_DEVMEM_OFFSET 99 diff --git a/include/uapi/linux/uio.h b/include/uapi/linux/uio.h index ae94763b1963..71314bf41590 100644 --- a/include/uapi/linux/uio.h +++ b/include/uapi/linux/uio.h @@ -26,6 +26,10 @@ struct cmsg_devmem { __u32 frag_token; }; +struct devmemtoken { + __u32 token_start; + __u32 token_count; +}; /* * UIO_MAXIOV shall be at least 16 1003.1g (5.4.1.1) */ diff --git a/net/core/sock.c b/net/core/sock.c index ab1e8d1bd5a1..2736b770a399 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1049,6 +1049,39 @@ static int sock_reserve_memory(struct sock *sk, int bytes) return 0; } +static noinline_for_stack int +sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen) +{ + struct devmemtoken tokens[128]; + unsigned int num_tokens, i, j; + int ret; + + if (sk->sk_type != SOCK_STREAM || sk->sk_protocol != IPPROTO_TCP) + return -EBADF; + + if (optlen % sizeof(struct devmemtoken) || optlen > sizeof(tokens)) + return -EINVAL; + + num_tokens = optlen / sizeof(struct devmemtoken); + if (copy_from_sockptr(tokens, optval, optlen)) + return -EFAULT; + + ret = 0; + for (i = 0; i < num_tokens; i++) { + for (j = 0; j < tokens[i].token_count; j++) { + struct page *page = xa_erase(&sk->sk_user_pages, + tokens[i].token_start + j); + + if (page) { + page_pool_page_put_many(page, 1); + ret++; + } + } + } + + return ret; +} + void sockopt_lock_sock(struct sock *sk) { /* When current->bpf_ctx is set, the setsockopt is called from @@ -1528,6 +1561,9 @@ int sk_setsockopt(struct sock *sk, int level, int optname, 
WRITE_ONCE(sk->sk_txrehash, (u8)val); break; + case SO_DEVMEM_DONTNEED: + ret = sock_devmem_dontneed(sk, optval, optlen); + break; default: ret = -ENOPROTOOPT; break; From patchwork Thu Aug 10 01:57:47 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 13348714 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9581EC001B0 for ; Thu, 10 Aug 2023 01:58:38 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232288AbjHJB6h (ORCPT ); Wed, 9 Aug 2023 21:58:37 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36070 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232207AbjHJB6h (ORCPT ); Wed, 9 Aug 2023 21:58:37 -0400 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 824341999 for ; Wed, 9 Aug 2023 18:58:30 -0700 (PDT) Received: by mail-yb1-xb49.google.com with SMTP id 3f1490d57ef6-d5e792a163dso430781276.1 for ; Wed, 09 Aug 2023 18:58:30 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1691632710; x=1692237510; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=9Caa76uxp69dq+E2Vi9h630g4vh+SvLVlx5BvK0Ixnc=; b=7PLGGQxtydGLGfTypGpnE6gs/Ks2xBSVr4A5cmDSf0B7qvw2LuNNNxYqdiIc3V/Dul hODEH6Oo32zK5eW/9I8+dJ7EraWFGLxoJajaGa7Gm+/EiZa4jX4fjm32twQKyi0dQPhn cFjV1lTdx8uOeg6L/AwDC25Vc8AxmweLCgT+lMGFmPPIJ1l1EoIFf/fsXa1a8EZHl3ub eGezHvqpPVwKipazbsKvubrBsG2BMYP/L6no+6+Cz2d+NQb/A09qqz533l3JtHrDdUG2 AweMQTJjelETVhcM7oaut0gPRpMH0krhhQ33tChGluoUsS8ZWx91r6h290cd8mRciDoQ v2QA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1691632710; x=1692237510; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=9Caa76uxp69dq+E2Vi9h630g4vh+SvLVlx5BvK0Ixnc=; b=IeFfqFTSmvuMLRjht0x7XRm/8RY70SY8hp/p9Tr9gBTBUgwA7MT08uHLT3CaqPbJWn CjXRsA/J3x+RcLHYeF/SLUsd9ZOGJnrlZponIQQfLPYvU+7ml03VgboS5XH0xyuXbDG0 VrOiHv1yxq0hWJ97QL8lR4rnQnyZZgybTVqCXvga/C9PjEKfbGQaI3Jp9jEzeIPjaREq cG9AEqOr1iA7/4eJ3HiC7atEpu1sjX5FXpCJMt5AkBuOZi1+hmszwImwsICUNbctaCOG m1foKustgvnsr3X2tcgcmcK4GOHvvmHkokOg6A+msL/nY2U15iEnrxHjslt7gvAllQEO 6Ffg== X-Gm-Message-State: AOJu0YwyXGT0CZs2D51dCJ8mLAZlfUH+2LHvwL8uA/16KU4C91FJl2GI zdULtsKHcRHdOzJvUlRhm0888KrQakuaIW56+Q== X-Google-Smtp-Source: AGHT+IEig7NAv7ccrOqMwa8RpGx/YS7/Pj9SteVtedOxAZTrR2IA1eglRl+i4zjywkcEjJvhawHOLdfXqWOajvDi0Q== X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2c4:200:73ad:9ed5:e067:2b9b]) (user=almasrymina job=sendgmr) by 2002:a25:4294:0:b0:d45:daf4:1fc5 with SMTP id p142-20020a254294000000b00d45daf41fc5mr17517yba.3.1691632709828; Wed, 09 Aug 2023 18:58:29 -0700 (PDT) Date: Wed, 9 Aug 2023 18:57:47 -0700 In-Reply-To: <20230810015751.3297321-1-almasrymina@google.com> Mime-Version: 1.0 References: <20230810015751.3297321-1-almasrymina@google.com> X-Mailer: git-send-email 2.41.0.640.ga95def55d0-goog Message-ID: <20230810015751.3297321-12-almasrymina@google.com> Subject: [RFC PATCH v2 11/11] selftests: add ncdevmem, netcat for devmem TCP From: Mina Almasry To: netdev@vger.kernel.org, 
linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jesper Dangaard Brouer , Ilias Apalodimas , Arnd Bergmann , David Ahern , Willem de Bruijn , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Jason Gunthorpe , Hari Ramakrishnan , Dan Williams , Andy Lutomirski , stephen@networkplumber.org, sdf@google.com Precedence: bulk List-ID: X-Mailing-List: linux-media@vger.kernel.org ncdevmem is a devmem TCP netcat. It works similarly to netcat, but it sends and receives data using the devmem TCP APIs. It uses udmabuf as the dmabuf provider. It is compatible with a regular netcat running on a peer, or a ncdevmem running on a peer. In addition to normal netcat support, ncdevmem has a validation mode, where it sends a specific pattern and validates this pattern on the receiver side to ensure data integrity. Suggested-by: Stanislav Fomichev Signed-off-by: Mina Almasry --- tools/testing/selftests/net/.gitignore | 1 + tools/testing/selftests/net/Makefile | 5 + tools/testing/selftests/net/ncdevmem.c | 534 +++++++++++++++++++++++++ 3 files changed, 540 insertions(+) create mode 100644 tools/testing/selftests/net/ncdevmem.c diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore index 501854a89cc0..5f2f8f01c800 100644 --- a/tools/testing/selftests/net/.gitignore +++ b/tools/testing/selftests/net/.gitignore @@ -16,6 +16,7 @@ ipsec ipv6_flowlabel ipv6_flowlabel_mgr msg_zerocopy +ncdevmem nettest psock_fanout psock_snd diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile index ae53c26af51b..3181611552d3 100644 --- a/tools/testing/selftests/net/Makefile +++ b/tools/testing/selftests/net/Makefile @@ -5,6 +5,10 @@ CFLAGS += -Wall -Wl,--no-as-needed -O2 -g CFLAGS += -I../../../../usr/include/ $(KHDR_INCLUDES) # Additional include paths needed by kselftest.h CFLAGS += -I../ +CFLAGS += -I../../../net/ynl/generated/ +CFLAGS += -I../../../net/ynl/lib/ + +LDLIBS += ../../../net/ynl/lib/ynl.a ../../../net/ynl/generated/protos.a LDLIBS += -lmnl @@ -90,6 +94,7 @@ TEST_PROGS += test_vxlan_mdb.sh TEST_PROGS += test_bridge_neigh_suppress.sh TEST_PROGS += test_vxlan_nolocalbypass.sh TEST_PROGS += test_bridge_backup_port.sh +TEST_GEN_FILES += ncdevmem TEST_FILES := settings diff --git a/tools/testing/selftests/net/ncdevmem.c b/tools/testing/selftests/net/ncdevmem.c new file mode 100644 index 000000000000..2efcc98f6067 --- /dev/null +++ b/tools/testing/selftests/net/ncdevmem.c @@ -0,0 +1,534 @@ +// SPDX-License-Identifier: GPL-2.0 +#define _GNU_SOURCE +#define __EXPORTED_HEADERS__ + +#include +#include +#include +#include +#include +#include +#include +#define __iovec_defined +#include +#include + +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "netdev-user.h" +#include + +#define PAGE_SHIFT 12 +#define PAGE_SIZE 4096 +#define TEST_PREFIX "ncdevmem" +#define NUM_PAGES 16000 + +#ifndef MSG_SOCK_DEVMEM +#define MSG_SOCK_DEVMEM 0x2000000 +#endif + +/* + * tcpdevmem netcat. Works similarly to netcat but does device memory TCP + * instead of regular TCP. Uses udmabuf to mock a dmabuf provider. 
+ * + * Usage: + * + * * Without validation: + * + * On server: + * ncdevmem -s -c -f eth1 -n 0000:06:00.0 -l \ + * -p 5201 + * + * On client: + * ncdevmem -s -c -f eth1 -n 0000:06:00.0 -p 5201 + * + * * With Validation: + * On server: + * ncdevmem -s -c -l -f eth1 -n 0000:06:00.0 \ + * -p 5202 -v 1 + * + * On client: + * ncdevmem -s -c -f eth1 -n 0000:06:00.0 -p 5202 \ + * -v 100000 + * + * Note this is compatible with regular netcat. i.e. the sender or receiver can + * be replaced with regular netcat to test the RX or TX path in isolation. + */ + +static char *server_ip = "192.168.1.4"; +static char *client_ip = "192.168.1.2"; +static char *port = "5201"; +static size_t do_validation; +static int queue_num = 15; +static char *ifname = "eth1"; +static char *nic_pci_addr = "0000:06:00.0"; +static unsigned int iterations; + +void print_bytes(void *ptr, size_t size) +{ + unsigned char *p = ptr; + int i; + for (i = 0; i < size; i++) { + printf("%02hhX ", p[i]); + } + printf("\n"); +} + +void print_nonzero_bytes(void *ptr, size_t size) +{ + unsigned char *p = ptr; + unsigned int i; + for (i = 0; i < size; i++) { + if (p[i]) + printf("%c", p[i]); + } + printf("\n"); +} + +void initialize_validation(void *line, size_t size) +{ + static unsigned char seed = 1; + unsigned char *ptr = line; + for (size_t i = 0; i < size; i++) { + ptr[i] = seed; + seed++; + if (seed == 254) + seed = 0; + } +} + +void validate_buffer(void *line, size_t size) +{ + static unsigned char seed = 1; + int errors = 0; + + unsigned char *ptr = line; + for (size_t i = 0; i < size; i++) { + if (ptr[i] != seed) { + fprintf(stderr, + "Failed validation: expected=%u, " + "actual=%u, index=%lu\n", + seed, ptr[i], i); + errors++; + if (errors > 20) + exit(1); + } + seed++; + if (seed == do_validation) + seed = 0; + } + + fprintf(stdout, "Validated buffer\n"); +} + +/* Triggers a driver reset... + * + * The proper way to do this is probably 'ethtool --reset', but I don't have + * that supported on my current test bed. I resort to changing this + * configuration in the driver which also causes a driver reset... + */ +static void reset_flow_steering(void) +{ + char command[256]; + memset(command, 0, sizeof(command)); + snprintf(command, sizeof(command), "sudo ethtool -K %s ntuple off", + ifname); + system(command); + + memset(command, 0, sizeof(command)); + snprintf(command, sizeof(command), "sudo ethtool -K %s ntuple on", + ifname); + system(command); +} + +static void configure_flow_steering(void) +{ + char command[256]; + memset(command, 0, sizeof(command)); + snprintf(command, sizeof(command), + "sudo ethtool -N %s flow-type tcp4 src-ip %s dst-ip %s " + "src-port %s dst-port %s queue %d", + ifname, client_ip, server_ip, port, port, queue_num); + system(command); +} + +/* Triggers a device reset, which causes the dmabuf pages binding to take + * effect. A better and more generic way to do this may be ethtool --reset.
+ */ +static void trigger_device_reset(void) +{ + char command[256]; + memset(command, 0, sizeof(command)); + snprintf(command, sizeof(command), + "sudo ethtool --set-priv-flags %s enable-header-split off", + ifname); + system(command); + + memset(command, 0, sizeof(command)); + snprintf(command, sizeof(command), + "sudo ethtool --set-priv-flags %s enable-header-split on", + ifname); + system(command); +} + +static void bind_rx_queue(unsigned int ifindex, unsigned int dmabuf_fd, + unsigned int queue_num, struct ynl_sock **ys) +{ + struct ynl_error yerr; + + struct netdev_bind_rx_req *req; + int ret = 0; + + *ys = ynl_sock_create(&ynl_netdev_family, &yerr); + if (!*ys) { + fprintf(stderr, "YNL: %s\n", yerr.msg); + return; + } + + if (ynl_subscribe(*ys, "mgmt")) + goto err_close; + + req = netdev_bind_rx_req_alloc(); + netdev_bind_rx_req_set_ifindex(req, ifindex); + netdev_bind_rx_req_set_dmabuf_fd(req, dmabuf_fd); + netdev_bind_rx_req_set_queue_idx(req, queue_num); + + ret = netdev_bind_rx(*ys, req); + netdev_bind_rx_req_free(req); + if (!ret) { + perror("netdev_bind_rx"); + goto err_close; + } + + return; + +err_close: + fprintf(stderr, "YNL failed: %s\n", (*ys)->err.msg); + ynl_sock_destroy(*ys); + exit(1); + return; +} + +static void create_udmabuf(int *devfd, int *memfd, int *buf, size_t dmabuf_size) +{ + struct udmabuf_create create; + int ret; + + *devfd = open("/dev/udmabuf", O_RDWR); + if (*devfd < 0) { + fprintf(stderr, + "%s: [skip,no-udmabuf: Unable to access DMA " + "buffer device file]\n", + TEST_PREFIX); + exit(70); + } + + *memfd = memfd_create("udmabuf-test", MFD_ALLOW_SEALING); + if (*memfd < 0) { + printf("%s: [skip,no-memfd]\n", TEST_PREFIX); + exit(72); + } + + ret = fcntl(*memfd, F_ADD_SEALS, F_SEAL_SHRINK); + if (ret < 0) { + printf("%s: [skip,fcntl-add-seals]\n", TEST_PREFIX); + exit(73); + } + + ret = ftruncate(*memfd, dmabuf_size); + if (ret == -1) { + printf("%s: [FAIL,memfd-truncate]\n", TEST_PREFIX); + exit(74); + } + + memset(&create, 0, sizeof(create)); + + create.memfd = *memfd; + create.offset = 0; + create.size = dmabuf_size; + *buf = ioctl(*devfd, UDMABUF_CREATE, &create); + if (*buf < 0) { + printf("%s: [FAIL, create udmabuf]\n", TEST_PREFIX); + exit(75); + } +} + +int do_server(void) +{ + int devfd, memfd, buf, ret; + size_t dmabuf_size; + struct ynl_sock *ys; + + dmabuf_size = getpagesize() * NUM_PAGES; + + create_udmabuf(&devfd, &memfd, &buf, dmabuf_size); + + bind_rx_queue(3 /* index for eth1 */, buf, queue_num, &ys); + + char *buf_mem = NULL; + buf_mem = mmap(NULL, dmabuf_size, PROT_READ | PROT_WRITE, MAP_SHARED, + buf, 0); + if (buf_mem == MAP_FAILED) { + perror("mmap()"); + exit(1); + } + + /* Need to trigger the NIC to reallocate its RX pages, otherwise the + * bind doesn't take effect. 
+ */ + trigger_device_reset(); + + sleep(1); + + reset_flow_steering(); + configure_flow_steering(); + + struct sockaddr_in server_sin; + server_sin.sin_family = AF_INET; + server_sin.sin_port = htons(atoi(port)); + + ret = inet_pton(server_sin.sin_family, server_ip, &server_sin.sin_addr); + if (ret != 1) { + printf("%s: [FAIL, parse server address]\n", TEST_PREFIX); + exit(79); + } + + int socket_fd = socket(server_sin.sin_family, SOCK_STREAM, 0); + if (socket_fd < 0) { + printf("%s: [FAIL, create socket]\n", TEST_PREFIX); + exit(76); + } + + int opt = 1; + ret = setsockopt(socket_fd, SOL_SOCKET, + SO_REUSEADDR | SO_REUSEPORT | SO_ZEROCOPY, &opt, + sizeof(opt)); + if (ret) { + printf("%s: [FAIL, set sock opt]: %s\n", TEST_PREFIX, + strerror(errno)); + exit(76); + } + + printf("binding to address %s:%d\n", server_ip, + ntohs(server_sin.sin_port)); + + ret = bind(socket_fd, &server_sin, sizeof(server_sin)); + if (ret) { + printf("%s: [FAIL, bind]: %s\n", TEST_PREFIX, strerror(errno)); + exit(76); + } + + ret = listen(socket_fd, 1); + if (ret) { + printf("%s: [FAIL, listen]: %s\n", TEST_PREFIX, + strerror(errno)); + exit(76); + } + + struct sockaddr_in client_addr; + socklen_t client_addr_len = sizeof(client_addr); + + char buffer[256]; + + inet_ntop(server_sin.sin_family, &server_sin.sin_addr, buffer, + sizeof(buffer)); + printf("Waiting for connection on %s:%d\n", buffer, + ntohs(server_sin.sin_port)); + int client_fd = accept(socket_fd, &client_addr, &client_addr_len); + + inet_ntop(client_addr.sin_family, &client_addr.sin_addr, buffer, + sizeof(buffer)); + printf("Got connection from %s:%d\n", buffer, + ntohs(client_addr.sin_port)); + + char iobuf[819200]; + char ctrl_data[sizeof(int) * 20000]; + + size_t total_received = 0; + size_t i = 0; + size_t page_aligned_frags = 0; + size_t non_page_aligned_frags = 0; + while (1) { + bool is_devmem = false; + printf("\n\n"); + + struct msghdr msg = { 0 }; + struct iovec iov = { .iov_base = iobuf, + .iov_len = sizeof(iobuf) }; + msg.msg_iov = &iov; + msg.msg_iovlen = 1; + msg.msg_control = ctrl_data; + msg.msg_controllen = sizeof(ctrl_data); + ssize_t ret = recvmsg(client_fd, &msg, MSG_SOCK_DEVMEM); + printf("recvmsg ret=%ld\n", ret); + if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) { + continue; + } + if (ret < 0) { + perror("recvmsg"); + continue; + } + if (ret == 0) { + printf("client exited\n"); + goto cleanup; + } + + i++; + struct cmsghdr *cm = NULL; + struct cmsg_devmem *cmsg_devmem = NULL; + for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) { + if (cm->cmsg_level != SOL_SOCKET || + (cm->cmsg_type != SCM_DEVMEM_OFFSET && + cm->cmsg_type != SCM_DEVMEM_HEADER)) { + fprintf(stdout, "skipping non-devmem cmsg\n"); + continue; + } + + cmsg_devmem = (struct cmsg_devmem *)CMSG_DATA(cm); + is_devmem = true; + + if (cm->cmsg_type == SCM_DEVMEM_HEADER) { + /* TODO: process data copied from skb's linear + * buffer. + */ + fprintf(stdout, + "SCM_DEVMEM_HEADER.
" + "cmsg_devmem->frag_size=%u\n", + cmsg_devmem->frag_size); + + continue; + } + + struct devmemtoken token = { cmsg_devmem->frag_token, + 1 }; + + total_received += cmsg_devmem->frag_size; + printf("received frag_page=%llu, in_page_offset=%llu," + " frag_offset=%llu, frag_size=%u, token=%u" + " total_received=%lu\n", + cmsg_devmem->frag_offset >> PAGE_SHIFT, + cmsg_devmem->frag_offset % PAGE_SIZE, + cmsg_devmem->frag_offset, cmsg_devmem->frag_size, + cmsg_devmem->frag_token, total_received); + + if (cmsg_devmem->frag_size % PAGE_SIZE) + non_page_aligned_frags++; + else + page_aligned_frags++; + + struct dma_buf_sync sync = { 0 }; + sync.flags = DMA_BUF_SYNC_READ | DMA_BUF_SYNC_START; + ioctl(buf, DMA_BUF_IOCTL_SYNC, &sync); + + if (do_validation) + validate_buffer( + ((unsigned char *)buf_mem) + + cmsg_devmem->frag_offset, + cmsg_devmem->frag_size); + else + print_nonzero_bytes( + ((unsigned char *)buf_mem) + + cmsg_devmem->frag_offset, + cmsg_devmem->frag_size); + + sync.flags = DMA_BUF_SYNC_READ | DMA_BUF_SYNC_END; + ioctl(buf, DMA_BUF_IOCTL_SYNC, &sync); + + ret = setsockopt(client_fd, SOL_SOCKET, + SO_DEVMEM_DONTNEED, &token, + sizeof(token)); + if (ret != 1) { + perror("SO_DEVMEM_DONTNEED not enough tokens"); + exit(1); + } + } + if (!is_devmem) + printf("flow steering error\n"); + + printf("total_received=%lu\n", total_received); + } + +cleanup: + fprintf(stdout, "%s: ok\n", TEST_PREFIX); + + fprintf(stdout, "page_aligned_frags=%lu, non_page_aligned_frags=%lu\n", + page_aligned_frags, non_page_aligned_frags); + + fprintf(stdout, "page_aligned_frags=%lu, non_page_aligned_frags=%lu\n", + page_aligned_frags, non_page_aligned_frags); + + munmap(buf_mem, dmabuf_size); + close(client_fd); + close(socket_fd); + close(buf); + close(memfd); + close(devfd); + ynl_sock_destroy(ys); + trigger_device_reset(); + + return 0; +} + +int main(int argc, char *argv[]) +{ + int is_server = 0, opt; + + while ((opt = getopt(argc, argv, "ls:c:p:v:q:f:n:i:")) != -1) { + switch (opt) { + case 'l': + is_server = 1; + break; + case 's': + server_ip = optarg; + break; + case 'c': + client_ip = optarg; + break; + case 'p': + port = optarg; + break; + case 'v': + do_validation = atoll(optarg); + break; + case 'q': + queue_num = atoi(optarg); + break; + case 'f': + ifname = optarg; + break; + case 'n': + nic_pci_addr = optarg; + break; + case 'i': + iterations = atoll(optarg); + break; + case '?': + printf("unknown option: %c\n", optopt); + break; + } + } + + for (; optind < argc; optind++) { + printf("extra arguments: %s\n", argv[optind]); + } + + if (is_server) + return do_server(); + + return 0; +}