From patchwork Mon Jul 10 22:32:52 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 13307758 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 2ACCBEB64D9 for ; Mon, 10 Jul 2023 22:33:49 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 28DAD10E195; Mon, 10 Jul 2023 22:33:48 +0000 (UTC) Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com [IPv6:2607:f8b0:4864:20::1149]) by gabe.freedesktop.org (Postfix) with ESMTPS id BF0C310E195 for ; Mon, 10 Jul 2023 22:33:45 +0000 (UTC) Received: by mail-yw1-x1149.google.com with SMTP id 00721157ae682-56942667393so60545467b3.2 for ; Mon, 10 Jul 2023 15:33:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1689028425; x=1691620425; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=n3XnWoX3mq+b99bSYScr9hgrqKqfzZzVVDFO4lEAixo=; b=aKK4x+BPpAO/YJnlbxEPDSQ7Or5Frwhwee2iD2GBEfe1fPtmZVHgx7xOzFYwlpwBd0 CIAhwJFzySe6vd1iF5hHC7H5Bx4ikw2GiXtyeS+3mlpQ9IGJE/tNmOre93nGmlNqG7go H7sT0O60Au84IyYseY02KCEXg7RyIWAeAXh7dCgGut854L9f3xU2dkEJqjFqZYXmGFlG Joyc0e+Us0w0oYp3zs6SdDLqnd4mFFLgNGOx2cS/wRtoXMK+OGZP1jT/03NO1qBDNTNY VajQypKncfQ8ybRP4+obiiz7E94H3YhLRSSy4tUt/CNwfeshxUkcAM5tsYxecciDD/PR 1PlQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1689028425; x=1691620425; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=n3XnWoX3mq+b99bSYScr9hgrqKqfzZzVVDFO4lEAixo=; b=Ed8LVRixQm2hhGuKPanoQ9QZHyw5zaQ3jSUAbwn6hZ5yG0qkIoVyiY1kiq3Mjv4O2E DF/+PJkFP15a3bQTXEX/8cBLyaQ9nxbn4XY4akfjdfzXURpnC2/ZSgNXskg/se3oiwX8 Zuc4GZGtDmA3/OK+ytHlYs0PRPQtCnUegs2Y2Ikjt42wCLWjVAvOxV7Mnh4gefubHoU6 5y3Qlq2Fq16zByYiJrS9TsRL2D+R8YLlxWvx0w8h5Fcf6assvmqBGE0LkiopHO4b+oJQ kn6eiqfUjSnepCelWwFuESywOLRHE1m0BmysBiVhpD/sOULgLus7jCZFAQLuhBVW2yoq 6p/g== X-Gm-Message-State: ABy/qLY9LWFtDVTJIyD6NLlzrKwj4kXrINnP+dgwJ6ANhbdWJW3pOI0b FAUMDntPTE6CorAL7LcRnHoulNRnVZMzD7IfLA== X-Google-Smtp-Source: APBJJlH/b3oD7gi2ZFJXKajzk/Ysv+e4OCddKdR0c+g1aGsA92FGsICZLym7YRNcyjFr1h3bBzs5D0UkTD/ZG9VMLw== X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2c4:200:4c0f:bfb6:9942:8c53]) (user=almasrymina job=sendgmr) by 2002:a81:cf0c:0:b0:570:b1:ca37 with SMTP id u12-20020a81cf0c000000b0057000b1ca37mr97547ywi.5.1689028424874; Mon, 10 Jul 2023 15:33:44 -0700 (PDT) Date: Mon, 10 Jul 2023 15:32:52 -0700 In-Reply-To: <20230710223304.1174642-1-almasrymina@google.com> Mime-Version: 1.0 References: <20230710223304.1174642-1-almasrymina@google.com> X-Mailer: git-send-email 2.41.0.390.g38632f3daf-goog Message-ID: <20230710223304.1174642-2-almasrymina@google.com> Subject: [RFC PATCH 01/10] dma-buf: add support for paged attachment mappings From: Mina Almasry To: linux-kernel@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org, netdev@vger.kernel.org, linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Mina Almasry , Willem de Bruijn , Jesper Dangaard Brouer , Arnd Bergmann , =?utf-8?q?Christian_K=C3=B6nig?= , David Ahern , Ilias Apalodimas , Sumit Semwal , jgg@ziepe.ca, Eric Dumazet , Jakub Kicinski , Paolo Abeni , Shuah Khan , "David S. Miller" Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" Currently dmabuf p2p memory doesn't present itself in the form of struct pages and the memory can't be easily used with code that expects memory in that form. Add support for paged attachment mappings. We use existing dmabuf APIs to create a mapped attachment (dma_buf_attach() & dma_buf_map_attachment()), and we create struct pages for this mapping. We write the dma_addr's from the sg_table into the created pages. These pages can then be passed into code that expects struct pages and can largely operate on these pages with minimal modifications: 1. The pages need not be dma mapped. The dma addr can be queried from page->zone_device_data and used directly. 2. The pages are not kmappable. Add a new ioctl that enables the user to create a struct page backed dmabuf attachment mapping. This ioctl returns a new file descriptor which represents the dmabuf pages. The pages are freed when (a) the user closes the file, and (b) the struct pages backing the dmabuf are no longer in use. Once the pages are no longer in use, the mapped attachment is removed. The support added in this patch should be generic - the pages are created by the base code, but the user specifies the type of page to create using the dmabuf_create_pages_info->type flag. The base code hands of any handling specific to the use case of the ops of that page type. Signed-off-by: Mina Almasry --- drivers/dma-buf/dma-buf.c | 223 +++++++++++++++++++++++++++++++++++ include/linux/dma-buf.h | 90 ++++++++++++++ include/uapi/linux/dma-buf.h | 9 ++ 3 files changed, 322 insertions(+) diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index aa4ea8530cb3..50b1d813cf5c 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -22,6 +22,7 @@ #include #include #include +#include #include #include #include @@ -442,12 +443,16 @@ static long dma_buf_import_sync_file(struct dma_buf *dmabuf, } #endif +static long dma_buf_create_pages(struct file *file, + struct dma_buf_create_pages_info *create_info); + static long dma_buf_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { struct dma_buf *dmabuf; struct dma_buf_sync sync; enum dma_data_direction direction; + struct dma_buf_create_pages_info create_info; int ret; dmabuf = file->private_data; @@ -484,6 +489,12 @@ static long dma_buf_ioctl(struct file *file, case DMA_BUF_SET_NAME_A: case DMA_BUF_SET_NAME_B: return dma_buf_set_name(dmabuf, (const char __user *)arg); + case DMA_BUF_CREATE_PAGES: + if (copy_from_user(&create_info, (void __user *)arg, + sizeof(create_info))) + return -EFAULT; + + return dma_buf_create_pages(file, &create_info); #if IS_ENABLED(CONFIG_SYNC_FILE) case DMA_BUF_IOCTL_EXPORT_SYNC_FILE: @@ -1613,6 +1624,218 @@ void dma_buf_vunmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map) } EXPORT_SYMBOL_NS_GPL(dma_buf_vunmap_unlocked, DMA_BUF); +static int dma_buf_pages_release(struct inode *inode, struct file *file) +{ + struct dma_buf_pages *priv = file->private_data; + + if (priv->type_ops->dma_buf_pages_release) + priv->type_ops->dma_buf_pages_release(priv, file); + + percpu_ref_kill(&priv->pgmap.ref); + /* Drop initial ref after percpu_ref_kill(). */ + percpu_ref_put(&priv->pgmap.ref); + + return 0; +} + +static void dma_buf_page_free(struct page *page) +{ + struct dma_buf_pages *priv; + struct dev_pagemap *pgmap; + + pgmap = page->pgmap; + priv = container_of(pgmap, struct dma_buf_pages, pgmap); + + if (priv->type_ops->dma_buf_page_free) + priv->type_ops->dma_buf_page_free(priv, page); +} + +const struct dev_pagemap_ops dma_buf_pgmap_ops = { + .page_free = dma_buf_page_free, +}; +EXPORT_SYMBOL_GPL(dma_buf_pgmap_ops); + +const struct file_operations dma_buf_pages_fops = { + .release = dma_buf_pages_release, +}; +EXPORT_SYMBOL_GPL(dma_buf_pages_fops); + +#ifdef CONFIG_ZONE_DEVICE +static void dma_buf_pages_destroy(struct percpu_ref *ref) +{ + struct dma_buf_pages *priv; + struct dev_pagemap *pgmap; + + pgmap = container_of(ref, struct dev_pagemap, ref); + priv = container_of(pgmap, struct dma_buf_pages, pgmap); + + if (priv->type_ops->dma_buf_pages_destroy) + priv->type_ops->dma_buf_pages_destroy(priv); + + kvfree(priv->pages); + kfree(priv); + + dma_buf_unmap_attachment(priv->attachment, priv->sgt, priv->direction); + dma_buf_detach(priv->dmabuf, priv->attachment); + dma_buf_put(priv->dmabuf); + pci_dev_put(priv->pci_dev); +} + +static long dma_buf_create_pages(struct file *file, + struct dma_buf_create_pages_info *create_info) +{ + int err, fd, i, pg_idx; + struct scatterlist *sg; + struct dma_buf_pages *priv; + struct file *new_file; + + fd = get_unused_fd_flags(O_RDWR | O_CLOEXEC); + if (fd < 0) { + err = fd; + goto out_err; + } + + priv = kzalloc(sizeof(*priv), GFP_KERNEL); + if (!priv) { + err = -ENOMEM; + goto out_put_fd; + } + + priv->pgmap.type = MEMORY_DEVICE_PRIVATE; + priv->pgmap.ops = &dma_buf_pgmap_ops; + init_completion(&priv->pgmap.done); + + /* This refcount is incremented every time a page in priv->pages is + * allocated, and decremented every time a page is freed. When + * it drops to 0, the dma_buf_pages can be destroyed. An initial ref is + * held and the dma_buf_pages is not destroyed until that is dropped. + */ + err = percpu_ref_init(&priv->pgmap.ref, dma_buf_pages_destroy, 0, + GFP_KERNEL); + if (err) + goto out_free_priv; + + /* Initial ref to be dropped after percpu_ref_kill(). */ + percpu_ref_get(&priv->pgmap.ref); + + priv->pci_dev = pci_get_domain_bus_and_slot( + 0, create_info->pci_bdf[0], + PCI_DEVFN(create_info->pci_bdf[1], create_info->pci_bdf[2])); + if (!priv->pci_dev) { + err = -ENODEV; + goto out_exit_percpu_ref; + } + + priv->dmabuf = dma_buf_get(create_info->dma_buf_fd); + if (IS_ERR(priv->dmabuf)) { + err = PTR_ERR(priv->dmabuf); + goto out_put_pci_dev; + } + + if (priv->dmabuf->size % PAGE_SIZE != 0) { + err = -EINVAL; + goto out_put_dma_buf; + } + + priv->attachment = dma_buf_attach(priv->dmabuf, &priv->pci_dev->dev); + if (IS_ERR(priv->attachment)) { + err = PTR_ERR(priv->attachment); + goto out_put_dma_buf; + } + + priv->num_pages = priv->dmabuf->size / PAGE_SIZE; + priv->pages = kvmalloc_array(priv->num_pages, sizeof(struct page), + GFP_KERNEL); + if (!priv->pages) { + err = -ENOMEM; + goto out_detach_dma_buf; + } + + for (i = 0; i < priv->num_pages; i++) { + struct page *page = &priv->pages[i]; + + mm_zero_struct_page(page); + set_page_zone(page, ZONE_DEVICE); + set_page_count(page, 1); + page->pgmap = &priv->pgmap; + } + + priv->direction = DMA_BIDIRECTIONAL; + priv->sgt = dma_buf_map_attachment(priv->attachment, priv->direction); + if (IS_ERR(priv->sgt)) { + err = PTR_ERR(priv->sgt); + goto out_free_pages; + } + + /* write each dma addresses from sgt to each page */ + pg_idx = 0; + for_each_sgtable_dma_sg(priv->sgt, sg, i) { + size_t len = sg_dma_len(sg); + dma_addr_t dma_addr = sg_dma_address(sg); + + BUG_ON(!PAGE_ALIGNED(len)); + while (len > 0) { + priv->pages[pg_idx].zone_device_data = (void *)dma_addr; + pg_idx++; + dma_addr += PAGE_SIZE; + len -= PAGE_SIZE; + } + } + + new_file = anon_inode_getfile("[dma_buf_pages]", &dma_buf_pages_fops, + (void *)priv, O_RDWR | O_CLOEXEC); + if (IS_ERR(new_file)) { + err = PTR_ERR(new_file); + goto out_unmap_dma_buf; + } + + priv->type = create_info->type; + priv->create_flags = create_info->create_flags; + + switch (priv->type) { + default: + err = -EINVAL; + goto out_put_new_file; + } + + if (priv->type_ops->dma_buf_pages_init) { + err = priv->type_ops->dma_buf_pages_init(priv, new_file); + if (err) + goto out_put_new_file; + } + + fd_install(fd, new_file); + return fd; + +out_put_new_file: + fput(new_file); +out_unmap_dma_buf: + dma_buf_unmap_attachment(priv->attachment, priv->sgt, priv->direction); +out_free_pages: + kvfree(priv->pages); +out_detach_dma_buf: + dma_buf_detach(priv->dmabuf, priv->attachment); +out_put_dma_buf: + dma_buf_put(priv->dmabuf); +out_put_pci_dev: + pci_dev_put(priv->pci_dev); +out_exit_percpu_ref: + percpu_ref_exit(&priv->pgmap.ref); +out_free_priv: + kfree(priv); +out_put_fd: + put_unused_fd(fd); +out_err: + return err; +} +#else +static long dma_buf_create_pages(struct file *file, + struct dma_buf_create_pages_info *create_info) +{ + return -ENOTSUPP; +} +#endif + #ifdef CONFIG_DEBUG_FS static int dma_buf_debug_show(struct seq_file *s, void *unused) { diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 3f31baa3293f..5789006180ea 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -540,6 +540,36 @@ struct dma_buf_export_info { void *priv; }; +struct dma_buf_pages; + +struct dma_buf_pages_type_ops { + int (*dma_buf_pages_init)(struct dma_buf_pages *priv, + struct file *file); + void (*dma_buf_pages_release)(struct dma_buf_pages *priv, + struct file *file); + void (*dma_buf_pages_destroy)(struct dma_buf_pages *priv); + void (*dma_buf_page_free)(struct dma_buf_pages *priv, + struct page *page); +}; + +struct dma_buf_pages { + /* fields for dmabuf */ + struct dma_buf *dmabuf; + struct dma_buf_attachment *attachment; + struct sg_table *sgt; + struct pci_dev *pci_dev; + enum dma_data_direction direction; + + /* fields for dma-buf pages */ + size_t num_pages; + struct page *pages; + struct dev_pagemap pgmap; + + unsigned int type; + const struct dma_buf_pages_type_ops *type_ops; + __u64 create_flags; +}; + /** * DEFINE_DMA_BUF_EXPORT_INFO - helper macro for exporters * @name: export-info name @@ -631,4 +661,64 @@ int dma_buf_vmap(struct dma_buf *dmabuf, struct iosys_map *map); void dma_buf_vunmap(struct dma_buf *dmabuf, struct iosys_map *map); int dma_buf_vmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map); void dma_buf_vunmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map); + +#ifdef CONFIG_DMA_SHARED_BUFFER +extern const struct file_operations dma_buf_pages_fops; +extern const struct dev_pagemap_ops dma_buf_pgmap_ops; + +static inline bool is_dma_buf_pages_file(struct file *file) +{ + return file->f_op == &dma_buf_pages_fops; +} + +static inline bool is_dma_buf_page(struct page *page) +{ + return (is_zone_device_page(page) && page->pgmap && + page->pgmap->ops == &dma_buf_pgmap_ops); +} + +static inline dma_addr_t dma_buf_page_to_dma_addr(struct page *page) +{ + return (dma_addr_t)page->zone_device_data; +} + +static inline int dma_buf_map_sg(struct device *dev, struct scatterlist *sg, + int nents, enum dma_data_direction dir) +{ + struct scatterlist *s; + int i; + + for_each_sg(sg, s, nents, i) { + struct page *pg = sg_page(s); + + s->dma_address = dma_buf_page_to_dma_addr(pg); + sg_dma_len(s) = s->length; + } + + return nents; +} +#else +static inline bool is_dma_buf_page(struct page *page) +{ + return false; +} + +static inline bool is_dma_buf_pages_file(struct file *file) +{ + return false; +} + +static inline dma_addr_t dma_buf_page_to_dma_addr(struct page *page) +{ + return 0; +} + +static inline int dma_buf_map_sg(struct device *dev, struct scatterlist *sg, + int nents, enum dma_data_direction dir) +{ + return 0; +} +#endif + + #endif /* __DMA_BUF_H__ */ diff --git a/include/uapi/linux/dma-buf.h b/include/uapi/linux/dma-buf.h index 5a6fda66d9ad..d0f63a2ab7e4 100644 --- a/include/uapi/linux/dma-buf.h +++ b/include/uapi/linux/dma-buf.h @@ -179,4 +179,13 @@ struct dma_buf_import_sync_file { #define DMA_BUF_IOCTL_EXPORT_SYNC_FILE _IOWR(DMA_BUF_BASE, 2, struct dma_buf_export_sync_file) #define DMA_BUF_IOCTL_IMPORT_SYNC_FILE _IOW(DMA_BUF_BASE, 3, struct dma_buf_import_sync_file) +struct dma_buf_create_pages_info { + __u8 pci_bdf[3]; + __s32 dma_buf_fd; + __u32 type; + __u64 create_flags; +}; + +#define DMA_BUF_CREATE_PAGES _IOW(DMA_BUF_BASE, 4, struct dma_buf_create_pages_info) + #endif From patchwork Mon Jul 10 22:32:53 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 13307759 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id EAA9BEB64D9 for ; Mon, 10 Jul 2023 22:33:52 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 26B9010E2E0; Mon, 10 Jul 2023 22:33:52 +0000 (UTC) Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by gabe.freedesktop.org (Postfix) with ESMTPS id 4E3F310E2E0 for ; Mon, 10 Jul 2023 22:33:50 +0000 (UTC) Received: by mail-yb1-xb49.google.com with SMTP id 3f1490d57ef6-c595cadae4bso5724281276.1 for ; Mon, 10 Jul 2023 15:33:50 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1689028429; x=1691620429; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=K10CMsALExOmOcR/YTwAYoH6sOUce0sACyLlyAd3oCA=; b=PmjrhD360OT99Lc8iPVcJYoR6DDqdeqIAZ5+tJAhFT5ahoktvUJx3QBnBPzN/WyXln kbA5Ozqmx/gY8WIxPOw00lEwu6Irt1PeBniAsNqQA8De/a7Gmw+d3Ar8j+YKrZssZ8uA BuSorRLnfwVf9kLGmV78yHD3Id1Ztae6MVmUiqkzc97p1exlkKRRIKvklbv9OqJSUqmP Qk97r7IUXC5lwlf/gsOz/X4tKxwtoTHHXF4t3kR8h/ivJz2kuJv7koREvjKNpq8Fdh4o MhP81DlKHR9GRSfe72HCOY8i4aU5RoyUS+8Tp6rfoY7xD/75fsJWZLCBFtePU5WPjwxt ZKhA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1689028429; x=1691620429; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=K10CMsALExOmOcR/YTwAYoH6sOUce0sACyLlyAd3oCA=; b=aOz+WsKlOu0HLwhiWEjkoOkx+WVw3knU/LNOHOtmJrDv4Y2tOiqyWK/3yAz7R0Mc0p Wb8gwvkbZP7m3A9Zo//wb6evmKg8BPDI91g2qxvMy9EAGSSuqNhtzm9neTO1P8P+Vwg8 xseJdz+4dQFziKWpJhiR+JoZLkc9/kKDdcQG28FSycxg9FyQbkNmTZ+RwahHxLFxy+Mg vewoShOZr4PPFf6s941J6H8GYsnZTy1t11X947/OjcjLpDl0cuZl6m6Dv1trp9s4xSKR THcSzQjc326bR+zQ/snGcQ/S3dpurVDYjl8jq2rFn8C9cwZhP2L6WSDRV6qRgyyiTw2c MZwA== X-Gm-Message-State: ABy/qLYXcb5mZsIekquZ5abUcRMLHcaifGLEw44RL4kfyy7jW0jd8g1Y ZcobGEkIzPjZ02PrFtgL7Sbv5xH+s7ZhHyGdhA== X-Google-Smtp-Source: APBJJlELfkUsONGQv1JgPO9H1X9zJoqx1LLA0iy04SlqfvxZOZRiNmkJzPsHKMMO+0eNN0K5dDAd6ysBuYoO3e4CPQ== X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2c4:200:4c0f:bfb6:9942:8c53]) (user=almasrymina job=sendgmr) by 2002:a25:1ed4:0:b0:c48:b822:36db with SMTP id e203-20020a251ed4000000b00c48b82236dbmr78934ybe.10.1689028429046; Mon, 10 Jul 2023 15:33:49 -0700 (PDT) Date: Mon, 10 Jul 2023 15:32:53 -0700 In-Reply-To: <20230710223304.1174642-1-almasrymina@google.com> Mime-Version: 1.0 References: <20230710223304.1174642-1-almasrymina@google.com> X-Mailer: git-send-email 2.41.0.390.g38632f3daf-goog Message-ID: <20230710223304.1174642-3-almasrymina@google.com> Subject: [RFC PATCH 02/10] dma-buf: add support for NET_RX pages From: Mina Almasry To: linux-kernel@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org, netdev@vger.kernel.org, linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Mina Almasry , Willem de Bruijn , Jesper Dangaard Brouer , Arnd Bergmann , =?utf-8?q?Christian_K=C3=B6nig?= , David Ahern , Ilias Apalodimas , Sumit Semwal , jgg@ziepe.ca, Eric Dumazet , Jakub Kicinski , Paolo Abeni , Shuah Khan , "David S. Miller" Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" Use the paged attachment mappings support to create NET_RX pages. NET_RX pages are pages that can be used in the networking receive path: Bind the pages to the driver's rx queues specified by the create_flags param, and create a gen_pool to hold the free pages available for the driver to allocate. Signed-off-by: Mina Almasry --- drivers/dma-buf/dma-buf.c | 174 +++++++++++++++++++++++++++++++++++ include/linux/dma-buf.h | 20 ++++ include/linux/netdevice.h | 1 + include/uapi/linux/dma-buf.h | 2 + 4 files changed, 197 insertions(+) diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 50b1d813cf5c..acb86bf406f4 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include @@ -1681,6 +1682,8 @@ static void dma_buf_pages_destroy(struct percpu_ref *ref) pci_dev_put(priv->pci_dev); } +const struct dma_buf_pages_type_ops net_rx_ops; + static long dma_buf_create_pages(struct file *file, struct dma_buf_create_pages_info *create_info) { @@ -1793,6 +1796,9 @@ static long dma_buf_create_pages(struct file *file, priv->create_flags = create_info->create_flags; switch (priv->type) { + case DMA_BUF_PAGES_NET_RX: + priv->type_ops = &net_rx_ops; + break; default: err = -EINVAL; goto out_put_new_file; @@ -1966,3 +1972,171 @@ static void __exit dma_buf_deinit(void) dma_buf_uninit_sysfs_statistics(); } __exitcall(dma_buf_deinit); + +/******************************** + * dma_buf_pages_net_rx * + ********************************/ + +void dma_buf_pages_net_rx_release(struct dma_buf_pages *priv, struct file *file) +{ + struct netdev_rx_queue *rxq; + unsigned long xa_idx; + + xa_for_each(&priv->net_rx.bound_rxq_list, xa_idx, rxq) + if (rxq->dmabuf_pages == file) + rxq->dmabuf_pages = NULL; +} + +static int dev_is_class(struct device *dev, void *class) +{ + if (dev->class != NULL && !strcmp(dev->class->name, class)) + return 1; + + return 0; +} + +int dma_buf_pages_net_rx_init(struct dma_buf_pages *priv, struct file *file) +{ + struct netdev_rx_queue *rxq; + struct net_device *netdev; + int xa_id, err, rxq_idx; + struct device *device; + + priv->net_rx.page_pool = + gen_pool_create(PAGE_SHIFT, dev_to_node(&priv->pci_dev->dev)); + + if (!priv->net_rx.page_pool) + return -ENOMEM; + + /* + * We start with PAGE_SIZE instead of 0 since gen_pool_alloc_*() returns + * NULL on error + */ + err = gen_pool_add_virt(priv->net_rx.page_pool, PAGE_SIZE, 0, + PAGE_SIZE * priv->num_pages, + dev_to_node(&priv->pci_dev->dev)); + if (err) + goto out_destroy_pool; + + xa_init_flags(&priv->net_rx.bound_rxq_list, XA_FLAGS_ALLOC); + + device = device_find_child(&priv->pci_dev->dev, "net", dev_is_class); + if (!device) { + err = -ENODEV; + goto out_destroy_xarray; + } + + netdev = to_net_dev(device); + if (!netdev) { + err = -ENODEV; + goto out_put_dev; + } + + for (rxq_idx = 0; rxq_idx < (sizeof(priv->create_flags) * 8); + rxq_idx++) { + if (!(priv->create_flags & (1ULL << rxq_idx))) + continue; + + if (rxq_idx >= netdev->num_rx_queues) { + err = -ERANGE; + goto out_release_rx; + } + + rxq = __netif_get_rx_queue(netdev, rxq_idx); + + err = xa_alloc(&priv->net_rx.bound_rxq_list, &xa_id, rxq, + xa_limit_32b, GFP_KERNEL); + if (err) + goto out_release_rx; + + /* We previously have done a dma_buf_attach(), which validates + * that the net_device we're trying to attach to can reach the + * dmabuf, so we don't need to check here as well. + */ + rxq->dmabuf_pages = file; + } + put_device(device); + return 0; + +out_release_rx: + dma_buf_pages_net_rx_release(priv, file); +out_put_dev: + put_device(device); +out_destroy_xarray: + xa_destroy(&priv->net_rx.bound_rxq_list); +out_destroy_pool: + gen_pool_destroy(priv->net_rx.page_pool); + return err; +} + +void dma_buf_pages_net_rx_free(struct dma_buf_pages *priv) +{ + xa_destroy(&priv->net_rx.bound_rxq_list); + gen_pool_destroy(priv->net_rx.page_pool); +} + +static unsigned long dma_buf_page_to_gen_pool_addr(struct page *page) +{ + struct dma_buf_pages *priv; + struct dev_pagemap *pgmap; + unsigned long offset; + + pgmap = page->pgmap; + priv = container_of(pgmap, struct dma_buf_pages, pgmap); + offset = page - priv->pages; + /* Offset + 1 is due to the fact that we want to avoid 0 virt address + * returned from the gen_pool. The gen_pool returns 0 on error, and virt + * address 0 is indistinguishable from an error. + */ + return (offset + 1) << PAGE_SHIFT; +} + +static struct page * +dma_buf_gen_pool_addr_to_page(unsigned long addr, struct dma_buf_pages *priv) +{ + /* - 1 is due to the fact that we want to avoid 0 virt address + * returned from the gen_pool. See comment in dma_buf_create_pages() + * for details. + */ + unsigned long offset = (addr >> PAGE_SHIFT) - 1; + return &priv->pages[offset]; +} + +void dma_buf_page_free_net_rx(struct dma_buf_pages *priv, struct page *page) +{ + unsigned long addr = dma_buf_page_to_gen_pool_addr(page); + + if (gen_pool_has_addr(priv->net_rx.page_pool, addr, PAGE_SIZE)) + gen_pool_free(priv->net_rx.page_pool, addr, PAGE_SIZE); +} + +const struct dma_buf_pages_type_ops net_rx_ops = { + .dma_buf_pages_init = dma_buf_pages_net_rx_init, + .dma_buf_pages_release = dma_buf_pages_net_rx_release, + .dma_buf_pages_destroy = dma_buf_pages_net_rx_free, + .dma_buf_page_free = dma_buf_page_free_net_rx, +}; + +struct page *dma_buf_pages_net_rx_alloc(struct dma_buf_pages *priv) +{ + unsigned long gen_pool_addr; + struct page *pg; + + if (!(priv->type & DMA_BUF_PAGES_NET_RX)) + return NULL; + + gen_pool_addr = gen_pool_alloc(priv->net_rx.page_pool, PAGE_SIZE); + if (!gen_pool_addr) + return NULL; + + if (!PAGE_ALIGNED(gen_pool_addr)) { + net_err_ratelimited("dmabuf page pool allocation not aligned"); + gen_pool_free(priv->net_rx.page_pool, gen_pool_addr, PAGE_SIZE); + return NULL; + } + + pg = dma_buf_gen_pool_addr_to_page(gen_pool_addr, priv); + + percpu_ref_get(&priv->pgmap.ref); + return pg; +} diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 5789006180ea..e8e66d6407d0 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -22,6 +22,9 @@ #include #include #include +#include +#include +#include struct device; struct dma_buf; @@ -552,6 +555,11 @@ struct dma_buf_pages_type_ops { struct page *page); }; +struct dma_buf_pages_net_rx { + struct gen_pool *page_pool; + struct xarray bound_rxq_list; +}; + struct dma_buf_pages { /* fields for dmabuf */ struct dma_buf *dmabuf; @@ -568,6 +576,10 @@ struct dma_buf_pages { unsigned int type; const struct dma_buf_pages_type_ops *type_ops; __u64 create_flags; + + union { + struct dma_buf_pages_net_rx net_rx; + }; }; /** @@ -671,6 +683,8 @@ static inline bool is_dma_buf_pages_file(struct file *file) return file->f_op == &dma_buf_pages_fops; } +struct page *dma_buf_pages_net_rx_alloc(struct dma_buf_pages *priv); + static inline bool is_dma_buf_page(struct page *page) { return (is_zone_device_page(page) && page->pgmap && @@ -718,6 +732,12 @@ static inline int dma_buf_map_sg(struct device *dev, struct scatterlist *sg, { return 0; } + +static inline struct page *dma_buf_pages_net_rx_alloc(struct dma_buf_pages *priv) +{ + return NULL; +} + #endif diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index c2f0c6002a84..7a087ffa9baa 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -796,6 +796,7 @@ struct netdev_rx_queue { #ifdef CONFIG_XDP_SOCKETS struct xsk_buff_pool *pool; #endif + struct file __rcu *dmabuf_pages; } ____cacheline_aligned_in_smp; /* diff --git a/include/uapi/linux/dma-buf.h b/include/uapi/linux/dma-buf.h index d0f63a2ab7e4..b392cef9d3c6 100644 --- a/include/uapi/linux/dma-buf.h +++ b/include/uapi/linux/dma-buf.h @@ -186,6 +186,8 @@ struct dma_buf_create_pages_info { __u64 create_flags; }; +#define DMA_BUF_PAGES_NET_RX (1 << 0) + #define DMA_BUF_CREATE_PAGES _IOW(DMA_BUF_BASE, 4, struct dma_buf_create_pages_info) #endif From patchwork Mon Jul 10 22:32:54 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 13307760 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id A01E6C001B0 for ; Mon, 10 Jul 2023 22:33:55 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 0109110E2E2; Mon, 10 Jul 2023 22:33:55 +0000 (UTC) Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by gabe.freedesktop.org (Postfix) with ESMTPS id 5CB6E10E2E2 for ; Mon, 10 Jul 2023 22:33:53 +0000 (UTC) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-57003dac4a8so87446727b3.1 for ; Mon, 10 Jul 2023 15:33:53 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1689028432; x=1691620432; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=ZtAFtzT26HwoXDlgQMNRwGQzav+ArXt3GcR/tyBO9gA=; b=HkBlwsebRlph1pf39ww19ir3TXRKAaMo7qsHRnHvMwd7eaBtsd+aEsyn2PFrjYVPN/ Q1m2zbfqNxMH7b2QmwqEmx7Hw8+0VDdPsjPbv5ClyeuSbRANbqwhxJwBlxKfNPSzGspo 7yQorKtO9RR3lY8Caocl1cEReHVAh3gKZjRNe8K/5mwfYof1P1cHyHw8LwIbAtpadOQi xt/cqEvGCmmTBBOu377T/OItahks2lVVcdnKMKB3565gPdYG/TvH20z3G9f0jRiObQC7 /Ikbp8b7IOidW9ZdGHVTIfQ7XQf/3jcuOVmetsVhHpuKf5wo9wt+kW67ruZRHH/8+c9U xfJg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1689028432; x=1691620432; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=ZtAFtzT26HwoXDlgQMNRwGQzav+ArXt3GcR/tyBO9gA=; b=LEMua2jgp2xjH5J+St5GOniA29ckSdnMk2tC6iTBAkfAi02H/w40EWaLq8swThtyOJ K+sdnz1TBYu3Pt+6vMRxmjNH2D6jlr8jG7vV4c7wXju4CuzVmJVYVJ0nUSXLiqEiTUxn 1DRNKuT4iKV24ez6ClWXkohmuFcu9QEFd5IigaGrmhGP49Nnab/stKNXnp1FNZSOVuQ1 qJA0tb7RojZo+CaVnofLIEAulkd4hmTIRO7G4v4iJB5CTtlKlR4Gah9g9Hn9BXndDiSU DzqmUe4VNap4y9ysirsYKUD5S6qy2FOQi7kX/YThqV76o2m/G33r18jAjXasC7GFn641 kg/g== X-Gm-Message-State: ABy/qLYGflD9Ueq1mzNPzxn6AMXW7ygyl2OZJeA98LQgFYKPK6B3Sk3F SkoZtveVZNzxchtn6Ve6fCzCWN2VsMQULYG55w== X-Google-Smtp-Source: APBJJlHZU5kmoFW1exgCNXZVHrLAP9FADaXQZjpO/QBcTqC0+9vh1t2Y55mCZ3uaTrm2ya/0CivoRDUtK7ota9l6xA== X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2c4:200:4c0f:bfb6:9942:8c53]) (user=almasrymina job=sendgmr) by 2002:a81:7e0c:0:b0:56c:e9fe:3cb4 with SMTP id o12-20020a817e0c000000b0056ce9fe3cb4mr204351ywn.1.1689028432231; Mon, 10 Jul 2023 15:33:52 -0700 (PDT) Date: Mon, 10 Jul 2023 15:32:54 -0700 In-Reply-To: <20230710223304.1174642-1-almasrymina@google.com> Mime-Version: 1.0 References: <20230710223304.1174642-1-almasrymina@google.com> X-Mailer: git-send-email 2.41.0.390.g38632f3daf-goog Message-ID: <20230710223304.1174642-4-almasrymina@google.com> Subject: [RFC PATCH 03/10] dma-buf: add support for NET_TX pages From: Mina Almasry To: linux-kernel@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org, netdev@vger.kernel.org, linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Mina Almasry , Willem de Bruijn , Jesper Dangaard Brouer , Arnd Bergmann , =?utf-8?q?Christian_K=C3=B6nig?= , David Ahern , Ilias Apalodimas , Sumit Semwal , jgg@ziepe.ca, Eric Dumazet , Jakub Kicinski , Paolo Abeni , Shuah Khan , "David S. Miller" Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" Used the paged attachment mappings support to create NET_TX pages. NET_TX pages can be used in the networking transmit path: 1. Create an iov_iter & bio_vec entries to represent this dmabuf. 2. Initialize the bio_vec with the backing dmabuf pages. Signed-off-by: Mina Almasry --- drivers/dma-buf/dma-buf.c | 47 ++++++++++++++++++++++++++++++++++++ include/linux/dma-buf.h | 7 ++++++ include/uapi/linux/dma-buf.h | 1 + 3 files changed, 55 insertions(+) diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index acb86bf406f4..3ca71297b9b4 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -1683,6 +1683,7 @@ static void dma_buf_pages_destroy(struct percpu_ref *ref) } const struct dma_buf_pages_type_ops net_rx_ops; +const struct dma_buf_pages_type_ops net_tx_ops; static long dma_buf_create_pages(struct file *file, struct dma_buf_create_pages_info *create_info) @@ -1799,6 +1800,9 @@ static long dma_buf_create_pages(struct file *file, case DMA_BUF_PAGES_NET_RX: priv->type_ops = &net_rx_ops; break; + case DMA_BUF_PAGES_NET_TX: + priv->type_ops = &net_tx_ops; + break; default: err = -EINVAL; goto out_put_new_file; @@ -2140,3 +2144,46 @@ struct page *dma_buf_pages_net_rx_alloc(struct dma_buf_pages *priv) percpu_ref_get(&priv->pgmap.ref); return pg; } + +/******************************** + * dma_buf_pages_net_tx * + ********************************/ + +static void dma_buf_pages_net_tx_release(struct dma_buf_pages *priv, + struct file *file) +{ + int i; + for (i = 0; i < priv->num_pages; i++) + put_page(&priv->pages[i]); +} + +static int dma_buf_pages_net_tx_init(struct dma_buf_pages *priv, + struct file *file) +{ + int i; + priv->net_tx.tx_bv = kvmalloc_array(priv->num_pages, + sizeof(struct bio_vec), GFP_KERNEL); + if (!priv->net_tx.tx_bv) + return -ENOMEM; + + for (i = 0; i < priv->num_pages; i++) { + priv->net_tx.tx_bv[i].bv_page = &priv->pages[i]; + priv->net_tx.tx_bv[i].bv_offset = 0; + priv->net_tx.tx_bv[i].bv_len = PAGE_SIZE; + } + percpu_ref_get_many(&priv->pgmap.ref, priv->num_pages); + iov_iter_bvec(&priv->net_tx.iter, WRITE, priv->net_tx.tx_bv, + priv->num_pages, priv->dmabuf->size); + return 0; +} + +static void dma_buf_pages_net_tx_free(struct dma_buf_pages *priv) +{ + kvfree(priv->net_tx.tx_bv); +} + +const struct dma_buf_pages_type_ops net_tx_ops = { + .dma_buf_pages_init = dma_buf_pages_net_tx_init, + .dma_buf_pages_release = dma_buf_pages_net_tx_release, + .dma_buf_pages_destroy = dma_buf_pages_net_tx_free, +}; diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index e8e66d6407d0..93228a2fec47 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -22,6 +22,7 @@ #include #include #include +#include #include #include #include @@ -555,6 +556,11 @@ struct dma_buf_pages_type_ops { struct page *page); }; +struct dma_buf_pages_net_tx { + struct iov_iter iter; + struct bio_vec *tx_bv; +}; + struct dma_buf_pages_net_rx { struct gen_pool *page_pool; struct xarray bound_rxq_list; @@ -579,6 +585,7 @@ struct dma_buf_pages { union { struct dma_buf_pages_net_rx net_rx; + struct dma_buf_pages_net_tx net_tx; }; }; diff --git a/include/uapi/linux/dma-buf.h b/include/uapi/linux/dma-buf.h index b392cef9d3c6..546f211a7556 100644 --- a/include/uapi/linux/dma-buf.h +++ b/include/uapi/linux/dma-buf.h @@ -187,6 +187,7 @@ struct dma_buf_create_pages_info { }; #define DMA_BUF_PAGES_NET_RX (1 << 0) +#define DMA_BUF_PAGES_NET_TX (2 << 0) #define DMA_BUF_CREATE_PAGES _IOW(DMA_BUF_BASE, 4, struct dma_buf_create_pages_info) From patchwork Mon Jul 10 22:32:55 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 13307761 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id AD08DEB64D9 for ; Mon, 10 Jul 2023 22:33:59 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id CA07C10E2E5; Mon, 10 Jul 2023 22:33:58 +0000 (UTC) Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by gabe.freedesktop.org (Postfix) with ESMTPS id 0D13710E2E4 for ; Mon, 10 Jul 2023 22:33:56 +0000 (UTC) Received: by mail-yb1-xb4a.google.com with SMTP id 3f1490d57ef6-c690b153f67so4281204276.2 for ; Mon, 10 Jul 2023 15:33:56 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1689028436; x=1691620436; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=QDPrrkJolhYj8+JA2ecQJjxVsJNIQec1wNqvH72xEAE=; b=bliRqYBYA7SP+SPxK2IjcbatX7Omsm+v4X2DL5ulNehUSHXhYkbBCjCoPA4+lde/+r Mv4vn1NPHsq9zQaJ/BS5q4yOT28xnB+GR7b90+Msgrod+u9besvi+CHAzr3d2lqkB/AA yxX32+4hCEHv/rYkC3zNVkxpRlZhCActP39jEwWjryIEygF1XkF2wIN5uXGWgJII1qn3 4AzQQnHnn+AHSw/qKKc5jLJMbzVKVGyCcgjIEEOOjxsfJU/XS8Za+1qyJC6zoQp8SC9Z eZmHcI7zYkSKm3we5mSp3N6GMumLtl7F3sh7WJbaYkTjS79SpPYbISoFN9wtRlENHs+q GjoQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1689028436; x=1691620436; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=QDPrrkJolhYj8+JA2ecQJjxVsJNIQec1wNqvH72xEAE=; b=QI9R5BN/vRZV/R8rnDhtlM5w4o35k5+QCGhBYGDpVpShK0nEJPM+qsKr2MUTkYtBfB DspOc2wMARuuXG7Gr1q4iRb87l6Cb3/8/0Nry54/hUAdW5pLu1tCwqXW70GMJXUWsnEz Y3TpO9N73Q0ImEV6DGeMk6fs2X1dILiQhOecxva9cB6+tYYXcboHVSNCThQ0A8EJbKhA Hc9VgEhduZ82zZYQsQe53SVMNfUak6HARFKQH9skM0iJoWXNwD8fjz99EkZPufs+60cI ocAbuDYL+xOAX3RIsR/JUyi3bO1ISB2zCP/FOxZ+2hEqhMATidYaAdoDeG1YyU8Y0xcG gTWw== X-Gm-Message-State: ABy/qLYwMszvsq4V+Op5cvvpQVCsvDCVMWeM36YHtWHft9y3bhVZAjpT ctHoy5p9ZWPCihyPB11AUQuqVD7T5uBaEDhmpA== X-Google-Smtp-Source: APBJJlG3bwEoPIkQ3o2GIl51Ddf3aoB7B54zxP7HixWCIvf2JZmkwqiiEqZGTtJ7VvQ3rxT2A7pGP8pckAgyDOt8qg== X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2c4:200:4c0f:bfb6:9942:8c53]) (user=almasrymina job=sendgmr) by 2002:a25:41ca:0:b0:c6b:c65d:6d02 with SMTP id o193-20020a2541ca000000b00c6bc65d6d02mr65909yba.9.1689028435810; Mon, 10 Jul 2023 15:33:55 -0700 (PDT) Date: Mon, 10 Jul 2023 15:32:55 -0700 In-Reply-To: <20230710223304.1174642-1-almasrymina@google.com> Mime-Version: 1.0 References: <20230710223304.1174642-1-almasrymina@google.com> X-Mailer: git-send-email 2.41.0.390.g38632f3daf-goog Message-ID: <20230710223304.1174642-5-almasrymina@google.com> Subject: [RFC PATCH 04/10] net: add support for skbs with unreadable frags From: Mina Almasry To: linux-kernel@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org, netdev@vger.kernel.org, linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Mina Almasry , Willem de Bruijn , Jesper Dangaard Brouer , Arnd Bergmann , =?utf-8?q?Christian_K=C3=B6nig?= , David Ahern , Ilias Apalodimas , Sumit Semwal , jgg@ziepe.ca, Eric Dumazet , Jakub Kicinski , Paolo Abeni , Shuah Khan , "David S. Miller" Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" For device memory TCP, we expect the skb headers to be available in host memory for access, and we expect the skb frags to be in device memory and unaccessible to the host. We expect there to be no mixing and matching of device memory frags (unaccessible) with host memory frags (accessible) in the same skb. Add a skb->devmem flag which indicates whether the frags in this skb are device memory frags or not. __skb_fill_page_desc() & skb_fill_page_desc_noacc() now checks frags added to skbs for dmabuf pages, and marks the skb as skb->devmem if the page is a device memory page. Add checks through the network stack to avoid accessing the frags of devmem skbs and avoid coallescing devmem skbs with non devmem skbs. Signed-off-by: Mina Almasry --- include/linux/skbuff.h | 15 +++++++++ include/net/tcp.h | 6 ++-- net/core/skbuff.c | 73 ++++++++++++++++++++++++++++++++++-------- net/ipv4/tcp.c | 3 ++ net/ipv4/tcp_input.c | 13 ++++++-- net/ipv4/tcp_output.c | 5 ++- net/packet/af_packet.c | 4 +-- 7 files changed, 97 insertions(+), 22 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 0b40417457cd..f5e03aa84160 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -38,6 +38,7 @@ #endif #include #include +#include /** * DOC: skb checksums @@ -805,6 +806,8 @@ typedef unsigned char *sk_buff_data_t; * @csum_level: indicates the number of consecutive checksums found in * the packet minus one that have been verified as * CHECKSUM_UNNECESSARY (max 3) + * @devmem: indicates that all the fragments in this skb is backed by + * device memory. * @dst_pending_confirm: need to confirm neighbour * @decrypted: Decrypted SKB * @slow_gro: state present at GRO time, slower prepare step required @@ -992,6 +995,7 @@ struct sk_buff { __u8 csum_not_inet:1; #endif + __u8 devmem:1; #ifdef CONFIG_NET_SCHED __u16 tc_index; /* traffic control index */ #endif @@ -1766,6 +1770,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb) __skb_zcopy_downgrade_managed(skb); } +/* Return true if frags in this skb are not readable by the host. */ +static inline bool skb_frags_not_readable(const struct sk_buff *skb) +{ + return skb->devmem; +} + static inline void skb_mark_not_on_list(struct sk_buff *skb) { skb->next = NULL; @@ -2469,6 +2479,8 @@ static inline void __skb_fill_page_desc(struct sk_buff *skb, int i, page = compound_head(page); if (page_is_pfmemalloc(page)) skb->pfmemalloc = true; + if (is_dma_buf_page(page)) + skb->devmem = true; } /** @@ -2511,6 +2523,9 @@ static inline void skb_fill_page_desc_noacc(struct sk_buff *skb, int i, __skb_fill_page_desc_noacc(shinfo, i, page, off, size); shinfo->nr_frags = i + 1; + + if (is_dma_buf_page(page)) + skb->devmem = true; } void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off, diff --git a/include/net/tcp.h b/include/net/tcp.h index 5066e4586cf0..6d86ed3736ad 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -986,7 +986,7 @@ static inline int tcp_skb_mss(const struct sk_buff *skb) static inline bool tcp_skb_can_collapse_to(const struct sk_buff *skb) { - return likely(!TCP_SKB_CB(skb)->eor); + return likely(!TCP_SKB_CB(skb)->eor && !skb_frags_not_readable(skb)); } static inline bool tcp_skb_can_collapse(const struct sk_buff *to, @@ -994,7 +994,9 @@ static inline bool tcp_skb_can_collapse(const struct sk_buff *to, { return likely(tcp_skb_can_collapse_to(to) && mptcp_skb_can_collapse(to, from) && - skb_pure_zcopy_same(to, from)); + skb_pure_zcopy_same(to, from) && + skb_frags_not_readable(to) == + skb_frags_not_readable(from)); } /* Events passed to congestion control interface */ diff --git a/net/core/skbuff.c b/net/core/skbuff.c index cea28d30abb5..9b83da794641 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -1191,11 +1191,16 @@ void skb_dump(const char *level, const struct sk_buff *skb, bool full_pkt) skb_frag_size(frag), p, p_off, p_len, copied) { seg_len = min_t(int, p_len, len); - vaddr = kmap_atomic(p); - print_hex_dump(level, "skb frag: ", - DUMP_PREFIX_OFFSET, - 16, 1, vaddr + p_off, seg_len, false); - kunmap_atomic(vaddr); + if (!is_dma_buf_page(p)) { + vaddr = kmap_atomic(p); + print_hex_dump(level, "skb frag: ", + DUMP_PREFIX_OFFSET, 16, 1, + vaddr + p_off, seg_len, false); + kunmap_atomic(vaddr); + } else { + printk("%sskb frag: devmem", level); + } + len -= seg_len; if (!len) break; @@ -1764,6 +1769,9 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask) if (skb_shared(skb) || skb_unclone(skb, gfp_mask)) return -EINVAL; + if (skb_frags_not_readable(skb)) + return -EFAULT; + if (!num_frags) goto release; @@ -1934,8 +1942,10 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask) { int headerlen = skb_headroom(skb); unsigned int size = skb_end_offset(skb) + skb->data_len; - struct sk_buff *n = __alloc_skb(size, gfp_mask, - skb_alloc_rx_flag(skb), NUMA_NO_NODE); + struct sk_buff *n = skb_frags_not_readable(skb) ? NULL : + __alloc_skb(size, gfp_mask, + skb_alloc_rx_flag(skb), + NUMA_NO_NODE); if (!n) return NULL; @@ -2266,9 +2276,10 @@ struct sk_buff *skb_copy_expand(const struct sk_buff *skb, /* * Allocate the copy buffer */ - struct sk_buff *n = __alloc_skb(newheadroom + skb->len + newtailroom, - gfp_mask, skb_alloc_rx_flag(skb), - NUMA_NO_NODE); + struct sk_buff *n = skb_frags_not_readable(skb) ? NULL : + __alloc_skb(newheadroom + skb->len + newtailroom, + gfp_mask, skb_alloc_rx_flag(skb), + NUMA_NO_NODE); int oldheadroom = skb_headroom(skb); int head_copy_len, head_copy_off; @@ -2609,6 +2620,9 @@ void *__pskb_pull_tail(struct sk_buff *skb, int delta) */ int i, k, eat = (skb->tail + delta) - skb->end; + if (skb_frags_not_readable(skb)) + return NULL; + if (eat > 0 || skb_cloned(skb)) { if (pskb_expand_head(skb, 0, eat > 0 ? eat + 128 : 0, GFP_ATOMIC)) @@ -2762,6 +2776,9 @@ int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len) to += copy; } + if (skb_frags_not_readable(skb)) + goto fault; + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { int end; skb_frag_t *f = &skb_shinfo(skb)->frags[i]; @@ -2835,7 +2852,7 @@ static struct page *linear_to_page(struct page *page, unsigned int *len, { struct page_frag *pfrag = sk_page_frag(sk); - if (!sk_page_frag_refill(sk, pfrag)) + if (!sk_page_frag_refill(sk, pfrag) || is_dma_buf_page(pfrag->page)) return NULL; *len = min_t(unsigned int, *len, pfrag->size - pfrag->offset); @@ -3164,6 +3181,9 @@ int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len) from += copy; } + if (skb_frags_not_readable(skb)) + goto fault; + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; int end; @@ -3243,6 +3263,9 @@ __wsum __skb_checksum(const struct sk_buff *skb, int offset, int len, pos = copy; } + if (skb_frags_not_readable(skb)) + return 0; + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { int end; skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; @@ -3343,6 +3366,9 @@ __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, pos = copy; } + if (skb_frags_not_readable(skb)) + return 0; + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { int end; @@ -3800,7 +3826,9 @@ static inline void skb_split_inside_header(struct sk_buff *skb, skb_shinfo(skb1)->frags[i] = skb_shinfo(skb)->frags[i]; skb_shinfo(skb1)->nr_frags = skb_shinfo(skb)->nr_frags; + skb1->devmem = skb->devmem; skb_shinfo(skb)->nr_frags = 0; + skb->devmem = 0; skb1->data_len = skb->data_len; skb1->len += skb1->data_len; skb->data_len = 0; @@ -3814,11 +3842,13 @@ static inline void skb_split_no_header(struct sk_buff *skb, { int i, k = 0; const int nfrags = skb_shinfo(skb)->nr_frags; + const int devmem = skb->devmem; skb_shinfo(skb)->nr_frags = 0; skb1->len = skb1->data_len = skb->len - len; skb->len = len; skb->data_len = len - pos; + skb->devmem = skb1->devmem = 0; for (i = 0; i < nfrags; i++) { int size = skb_frag_size(&skb_shinfo(skb)->frags[i]); @@ -3847,6 +3877,12 @@ static inline void skb_split_no_header(struct sk_buff *skb, pos += size; } skb_shinfo(skb1)->nr_frags = k; + + if (skb_shinfo(skb)->nr_frags) + skb->devmem = devmem; + + if (skb_shinfo(skb1)->nr_frags) + skb1->devmem = devmem; } /** @@ -4082,6 +4118,9 @@ unsigned int skb_seq_read(unsigned int consumed, const u8 **data, return block_limit - abs_offset; } + if (skb_frags_not_readable(st->cur_skb)) + return 0; + if (st->frag_idx == 0 && !st->frag_data) st->stepped_offset += skb_headlen(st->cur_skb); @@ -5681,7 +5720,10 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from, (from->pp_recycle && skb_cloned(from))) return false; - if (len <= skb_tailroom(to)) { + if (skb_frags_not_readable(from) != skb_frags_not_readable(to)) + return false; + + if (len <= skb_tailroom(to) && !skb_frags_not_readable(from)) { if (len) BUG_ON(skb_copy_bits(from, 0, skb_put(to, len), len)); *delta_truesize = 0; @@ -5997,6 +6039,9 @@ int skb_ensure_writable(struct sk_buff *skb, unsigned int write_len) if (!pskb_may_pull(skb, write_len)) return -ENOMEM; + if (skb_frags_not_readable(skb)) + return -EFAULT; + if (!skb_cloned(skb) || skb_clone_writable(skb, write_len)) return 0; @@ -6656,8 +6701,8 @@ EXPORT_SYMBOL(pskb_extract); void skb_condense(struct sk_buff *skb) { if (skb->data_len) { - if (skb->data_len > skb->end - skb->tail || - skb_cloned(skb)) + if (skb->data_len > skb->end - skb->tail || skb_cloned(skb) || + skb_frags_not_readable(skb)) return; /* Nice, we can free page frag(s) right now */ diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 8d20d9221238..51e8d5872670 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -4520,6 +4520,9 @@ int tcp_md5_hash_skb_data(struct tcp_md5sig_pool *hp, if (crypto_ahash_update(req)) return 1; + if (skb_frags_not_readable(skb)) + return 1; + for (i = 0; i < shi->nr_frags; ++i) { const skb_frag_t *f = &shi->frags[i]; unsigned int offset = skb_frag_off(f); diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index bf8b22218dd4..8d28d96a3c24 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -5188,6 +5188,9 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root, for (end_of_skbs = true; skb != NULL && skb != tail; skb = n) { n = tcp_skb_next(skb, list); + if (skb_frags_not_readable(skb)) + goto skip_this; + /* No new bits? It is possible on ofo queue. */ if (!before(start, TCP_SKB_CB(skb)->end_seq)) { skb = tcp_collapse_one(sk, skb, list, root); @@ -5208,17 +5211,20 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root, break; } - if (n && n != tail && mptcp_skb_can_collapse(skb, n) && + if (n && n != tail && !skb_frags_not_readable(n) && + mptcp_skb_can_collapse(skb, n) && TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(n)->seq) { end_of_skbs = false; break; } +skip_this: /* Decided to skip this, advance start seq. */ start = TCP_SKB_CB(skb)->end_seq; } if (end_of_skbs || - (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN))) + (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) || + skb_frags_not_readable(skb)) return; __skb_queue_head_init(&tmp); @@ -5262,7 +5268,8 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root, if (!skb || skb == tail || !mptcp_skb_can_collapse(nskb, skb) || - (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN))) + (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) || + skb_frags_not_readable(skb)) goto end; #ifdef CONFIG_TLS_DEVICE if (skb->decrypted != nskb->decrypted) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index cfe128b81a01..eddade864c7f 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -2310,7 +2310,8 @@ static bool tcp_can_coalesce_send_queue_head(struct sock *sk, int len) if (unlikely(TCP_SKB_CB(skb)->eor) || tcp_has_tx_tstamp(skb) || - !skb_pure_zcopy_same(skb, next)) + !skb_pure_zcopy_same(skb, next) || + skb->devmem != next->devmem) return false; len -= skb->len; @@ -3087,6 +3088,8 @@ static bool tcp_can_collapse(const struct sock *sk, const struct sk_buff *skb) return false; if (skb_cloned(skb)) return false; + if (skb_frags_not_readable(skb)) + return false; /* Some heuristics for collapsing over SACK'd could be invented */ if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED) return false; diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index a2dbeb264f26..9b31f688163c 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -2152,7 +2152,7 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev, } } - snaplen = skb->len; + snaplen = skb_frags_not_readable(skb) ? skb_headlen(skb) : skb->len; res = run_filter(skb, sk, snaplen); if (!res) @@ -2275,7 +2275,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, } } - snaplen = skb->len; + snaplen = skb_frags_not_readable(skb) ? skb_headlen(skb) : skb->len; res = run_filter(skb, sk, snaplen); if (!res) From patchwork Mon Jul 10 22:32:56 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 13307762 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 1FBBCC04A6A for ; Mon, 10 Jul 2023 22:34:05 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 8053410E2E6; Mon, 10 Jul 2023 22:34:04 +0000 (UTC) Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by gabe.freedesktop.org (Postfix) with ESMTPS id 3FF8710E2E6 for ; Mon, 10 Jul 2023 22:34:03 +0000 (UTC) Received: by mail-yb1-xb4a.google.com with SMTP id 3f1490d57ef6-c83284edf0eso1866272276.3 for ; Mon, 10 Jul 2023 15:34:03 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1689028442; x=1691620442; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=cs78TMZKTTsnyLaGsD47TqXXpiU8IKn3E+LzkzIF7BU=; b=3a5uxA+qQl4NZTJNL69iIkTvOyWVY7//EUdVncePAl7DuYdYysCLTB+Scqjs0DbaRb knVSDHY19O2jw3TmS+RU9Ak2mdsBgSK6tfW6u5GV1mUKkWyYVITRFk6IlJEv5ytnZkeh zBT9t0KymYJ4QcgVINn2Gq3UU2ZggoPreRx+Zv/eoJI962CJj+J+KIPmh9/akMc2ypBs hRWw+J3GhzD0inq7pRnsKc7SkPkoyPF2IBr7akov++47y3RLln3OTm3ZRyFBuMw9dTux qWTBDMWGCdy7dhQBGYAsQnhwui5piSP/JLPmLDeSdKvpAE1zn/5xU46UzOfFbxNEAau/ kuKQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1689028442; x=1691620442; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=cs78TMZKTTsnyLaGsD47TqXXpiU8IKn3E+LzkzIF7BU=; b=TkmEKK6bENpq7925VAW+vbW9rTZIE35Cc9Fg1T5OS4BCgjl43uWIZMSgI2NCGRGAST yyU2bI+WHxRhMOyn4pLqcN/Y30rBpEuLfKpkHs6ITWKoyvrCc4h4hhbFGolEFREO+YrB OA7Rlme5cSebnhWAceuIBEH0emWY9QYUuwGJq66uGte/rPtX3jsP91Et6qrGNHpOjhVi QOd8FRCZKM0aa20srbT54HM0ufeW/+/KBRUObxCqgmP1TmGx0C8XUD1lQhJDj9O24hIw AzOfvhIA9Ll8RL3fzkz/MaIanQ0HKjlJ16XqG343bhs1CRjB9jVuX0fBdEPOE1iGPPiN U0Hg== X-Gm-Message-State: ABy/qLYmdkaESK23pej7XXmKR44eAzk8VJphRXTN0cx8a4l30+U97+Ag wMn7d1fdleI2sEdgDfwEy+TQ5pTeIxVGpxLCsA== X-Google-Smtp-Source: APBJJlHdH7YiMUAwux3ndVFThsAxLjTWbqRUoy5A1wbBCVTRZZjWp0g5h005ViK1cG50WI4QslyEnITMzBCMfQV5rg== X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2c4:200:4c0f:bfb6:9942:8c53]) (user=almasrymina job=sendgmr) by 2002:a25:6c44:0:b0:c5d:b034:28e2 with SMTP id h65-20020a256c44000000b00c5db03428e2mr73516ybc.7.1689028442390; Mon, 10 Jul 2023 15:34:02 -0700 (PDT) Date: Mon, 10 Jul 2023 15:32:56 -0700 In-Reply-To: <20230710223304.1174642-1-almasrymina@google.com> Mime-Version: 1.0 References: <20230710223304.1174642-1-almasrymina@google.com> X-Mailer: git-send-email 2.41.0.390.g38632f3daf-goog Message-ID: <20230710223304.1174642-6-almasrymina@google.com> Subject: [RFC PATCH 05/10] tcp: implement recvmsg() RX path for devmem TCP From: Mina Almasry To: linux-kernel@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org, netdev@vger.kernel.org, linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Mina Almasry , Willem de Bruijn , Jesper Dangaard Brouer , Arnd Bergmann , =?utf-8?q?Christian_K=C3=B6nig?= , David Ahern , Ilias Apalodimas , Sumit Semwal , jgg@ziepe.ca, Eric Dumazet , Jakub Kicinski , Paolo Abeni , Shuah Khan , "David S. Miller" Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" In tcp_recvmsg_locked(), detect if the skb being received by the user is a devmem skb. In this case - if the user provided the MSG_SOCK_DEVMEM flag - pass it to tcp_recvmsg_devmem() for custom handling. tcp_recvmsg_devmem() copies any data in the skb header to the linear buffer, and returns a cmsg to the user indicating the number of bytes returned in the linear buffer. tcp_recvmsg_devmem() then loops over the unaccessible devmem skb frags, and returns to the user a cmsg_devmem indicating the location of the data in the dmabuf device memory. cmsg_devmem contains this information: 1. the offset into the dmabuf where the payload starts. 'frag_offset'. 2. the size of the frag. 'frag_size'. 3. an opaque token 'frag_token' to return to the kernel when the buffer is to be released. The pages awaiting freeing are stored in the newly added sk->sk_pagepool, and each page passed to userspace is get_page()'d. This reference is dropped once the userspace indicates that it is done reading this page. All pages are released when the socket is destroyed. Signed-off-by: Mina Almasry --- include/linux/socket.h | 1 + include/net/sock.h | 2 + include/uapi/asm-generic/socket.h | 5 + include/uapi/linux/uio.h | 6 + net/core/datagram.c | 3 + net/ipv4/tcp.c | 186 +++++++++++++++++++++++++++++- net/ipv4/tcp_ipv4.c | 8 ++ 7 files changed, 209 insertions(+), 2 deletions(-) diff --git a/include/linux/socket.h b/include/linux/socket.h index 13c3a237b9c9..12905b2f1215 100644 --- a/include/linux/socket.h +++ b/include/linux/socket.h @@ -326,6 +326,7 @@ struct ucred { * plain text and require encryption */ +#define MSG_SOCK_DEVMEM 0x2000000 /* Receive devmem skbs as cmsg */ #define MSG_ZEROCOPY 0x4000000 /* Use user data in kernel path */ #define MSG_FASTOPEN 0x20000000 /* Send data in TCP SYN */ #define MSG_CMSG_CLOEXEC 0x40000000 /* Set close_on_exec for file diff --git a/include/net/sock.h b/include/net/sock.h index 6f428a7f3567..c615666ff19a 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -353,6 +353,7 @@ struct sk_filter; * @sk_txtime_unused: unused txtime flags * @ns_tracker: tracker for netns reference * @sk_bind2_node: bind node in the bhash2 table + * @sk_pagepool: page pool associated with this socket. */ struct sock { /* @@ -545,6 +546,7 @@ struct sock { struct rcu_head sk_rcu; netns_tracker ns_tracker; struct hlist_node sk_bind2_node; + struct xarray sk_pagepool; }; enum sk_pacing { diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h index 638230899e98..88f9234f78cb 100644 --- a/include/uapi/asm-generic/socket.h +++ b/include/uapi/asm-generic/socket.h @@ -132,6 +132,11 @@ #define SO_RCVMARK 75 +#define SO_DEVMEM_HEADER 98 +#define SCM_DEVMEM_HEADER SO_DEVMEM_HEADER +#define SO_DEVMEM_OFFSET 99 +#define SCM_DEVMEM_OFFSET SO_DEVMEM_OFFSET + #if !defined(__KERNEL__) #if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__)) diff --git a/include/uapi/linux/uio.h b/include/uapi/linux/uio.h index 059b1a9147f4..8b0be0f50838 100644 --- a/include/uapi/linux/uio.h +++ b/include/uapi/linux/uio.h @@ -20,6 +20,12 @@ struct iovec __kernel_size_t iov_len; /* Must be size_t (1003.1g) */ }; +struct cmsg_devmem { + __u32 frag_offset; + __u32 frag_size; + __u32 frag_token; +}; + /* * UIO_MAXIOV shall be at least 16 1003.1g (5.4.1.1) */ diff --git a/net/core/datagram.c b/net/core/datagram.c index 176eb5834746..3a82598aa6ed 100644 --- a/net/core/datagram.c +++ b/net/core/datagram.c @@ -455,6 +455,9 @@ static int __skb_datagram_iter(const struct sk_buff *skb, int offset, skb_walk_frags(skb, frag_iter) { int end; + if (frag_iter->devmem) + goto short_copy; + WARN_ON(start > offset + len); end = start + frag_iter->len; diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 51e8d5872670..a894b8a9dbb0 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -279,6 +279,7 @@ #include #include #include +#include /* Track pending CMSGs. */ enum { @@ -460,6 +461,7 @@ void tcp_init_sock(struct sock *sk) set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags); sk_sockets_allocated_inc(sk); + xa_init_flags(&sk->sk_pagepool, XA_FLAGS_ALLOC); } EXPORT_SYMBOL(tcp_init_sock); @@ -2408,6 +2410,165 @@ static int tcp_inq_hint(struct sock *sk) return inq; } +static int tcp_recvmsg_devmem(const struct sock *sk, const struct sk_buff *skb, + unsigned int offset, struct msghdr *msg, int len) +{ + unsigned int start = skb_headlen(skb); + struct cmsg_devmem cmsg_devmem = { 0 }; + unsigned int tokens_added_idx = 0; + int i, copy = start - offset, n; + struct sk_buff *frag_iter; + u32 *tokens_added; + int err = 0; + + if (!skb->devmem) + return -ENODEV; + + tokens_added = kzalloc(sizeof(u32) * skb_shinfo(skb)->nr_frags, + GFP_KERNEL); + + if (!tokens_added) + return -ENOMEM; + + /* Copy header. */ + if (copy > 0) { + copy = min(copy, len); + + n = copy_to_iter(skb->data + offset, copy, &msg->msg_iter); + if (n != copy) { + err = -EFAULT; + goto err_release_pages; + } + + offset += copy; + len -= copy; + + /* First a cmsg_devmem for # bytes copied to user buffer */ + cmsg_devmem.frag_size = copy; + err = put_cmsg(msg, SOL_SOCKET, SO_DEVMEM_HEADER, + sizeof(cmsg_devmem), &cmsg_devmem); + if (err) + goto err_release_pages; + + if (len == 0) + goto out; + } + + /* after that, send information of devmem pages through a sequence + * of cmsg + */ + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + const skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + struct page *page = skb_frag_page(frag); + struct dma_buf_pages *priv; + u32 user_token, frag_offset; + struct page *dmabuf_pages; + int end; + + /* skb->devmem should indicate that ALL the pages in this skb + * are dma buf pages. We're checking for that flag above, but + * also check individual pages here. If the driver is not + * setting skb->devmem correctly, we still don't want to crash + * here when accessing pgmap or priv below. + */ + if (!is_dma_buf_page(page)) { + net_err_ratelimited("Found non-devmem skb with dma_buf " + "page"); + err = -ENODEV; + goto err_release_pages; + } + + end = start + skb_frag_size(frag); + copy = end - offset; + memset(&cmsg_devmem, 0, sizeof(cmsg_devmem)); + + if (copy > 0) { + copy = min(copy, len); + + priv = (struct dma_buf_pages *)page->pp->mp_priv; + + dmabuf_pages = priv->pages; + frag_offset = ((page - dmabuf_pages) << PAGE_SHIFT) + + skb_frag_off(frag) + offset - start; + cmsg_devmem.frag_offset = frag_offset; + cmsg_devmem.frag_size = copy; + err = xa_alloc((struct xarray *)&sk->sk_pagepool, + &user_token, page, xa_limit_31b, + GFP_KERNEL); + if (err) + goto err_release_pages; + + tokens_added[tokens_added_idx++] = user_token; + + get_page(page); + cmsg_devmem.frag_token = user_token; + + offset += copy; + len -= copy; + + err = put_cmsg(msg, SOL_SOCKET, SO_DEVMEM_OFFSET, + sizeof(cmsg_devmem), &cmsg_devmem); + if (err) { + put_page(page); + goto err_release_pages; + } + + if (len == 0) + goto out; + } + start = end; + } + + if (!len) + goto out; + + /* if len is not satisfied yet, we need to skb_walk_frags() to satisfy + * len + */ + skb_walk_frags(skb, frag_iter) + { + int end; + + if (!frag_iter->devmem) { + err = -EFAULT; + goto err_release_pages; + } + + WARN_ON(start > offset + len); + end = start + frag_iter->len; + copy = end - offset; + if (copy > 0) { + if (copy > len) + copy = len; + err = tcp_recvmsg_devmem(sk, frag_iter, offset - start, + msg, copy); + if (err) + goto err_release_pages; + len -= copy; + if (len == 0) + goto out; + offset += copy; + } + start = end; + } + + if (len) { + err = -EFAULT; + goto err_release_pages; + } + + goto out; + +err_release_pages: + for (i = 0; i < tokens_added_idx; i++) + put_page(xa_erase((struct xarray *)&sk->sk_pagepool, + tokens_added[i])); + +out: + kfree(tokens_added); + return err; +} + /* * This routine copies from a sock struct into the user buffer. * @@ -2428,7 +2589,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len, int err; int target; /* Read at least this many bytes */ long timeo; - struct sk_buff *skb, *last; + struct sk_buff *skb, *last, *skb_last_copied = NULL; u32 urg_hole = 0; err = -ENOTCONN; @@ -2593,7 +2754,27 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len, } } - if (!(flags & MSG_TRUNC)) { + if (skb_last_copied && skb_last_copied->devmem != skb->devmem) + break; + + if (skb->devmem) { + if (!(flags & MSG_SOCK_DEVMEM)) { + /* skb->devmem skbs can only be received with + * the MSG_SOCK_DEVMEM flag. + */ + + copied = -EFAULT; + break; + } + + err = tcp_recvmsg_devmem(sk, skb, offset, msg, used); + if (err) { + if (!copied) + copied = -EFAULT; + break; + } + skb_last_copied = skb; + } else if (!(flags & MSG_TRUNC)) { err = skb_copy_datagram_msg(skb, offset, msg, used); if (err) { /* Exception. Bailout! */ @@ -2601,6 +2782,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len, copied = -EFAULT; break; } + skb_last_copied = skb; } WRITE_ONCE(*seq, *seq + used); diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 06d2573685ca..d7dee38e0410 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -2291,6 +2291,14 @@ void tcp_v4_destroy_sock(struct sock *sk) { struct tcp_sock *tp = tcp_sk(sk); + unsigned long index; + struct page *page; + + xa_for_each(&sk->sk_pagepool, index, page) + put_page(page); + + xa_destroy(&sk->sk_pagepool); + trace_tcp_destroy_sock(sk); tcp_clear_xmit_timers(sk); From patchwork Mon Jul 10 22:32:57 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 13307763 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 7FD9CEB64DC for ; Mon, 10 Jul 2023 22:34:10 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id E17A710E2E8; Mon, 10 Jul 2023 22:34:09 +0000 (UTC) Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com [IPv6:2607:f8b0:4864:20::1149]) by gabe.freedesktop.org (Postfix) with ESMTPS id B62C210E2E9 for ; Mon, 10 Jul 2023 22:34:07 +0000 (UTC) Received: by mail-yw1-x1149.google.com with SMTP id 00721157ae682-565a33c35b1so59603167b3.0 for ; Mon, 10 Jul 2023 15:34:07 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1689028447; x=1691620447; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=eMznknX6LhCZzxkpGK1anYqC2AReGQzyhNKGt0dhRMs=; b=BtdvUbd29zwhezN9vWnqcJKpf0bV2LKN/+c8VWC9oiyT99FFcNEJDOvUA5o+m21DRJ e9aNejKksk1rdwKl8O1MZ9yP/O1OkMbpULehfNB94qYTmzji2HykYi+osFgOuhdAF+qD KGUZyMgcGBl8J91gT7Vb/XlYX07mdXJEBkEmKhCq++SE1RIb3ka/J0r4X0WZmUp1LyEo SjYNX6E3UyUwTUHmBfJ7fNPviqSEF3jRznivgjuuqoMmWVaZBvnRoFgKdOZiBTRle0wA 2DGPZWSVDlrn2z8oZhGDAjAlxNPaIc+/JHcTW0CD86xRjmhJfySPnG/Os836KYedkH0U WpvA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1689028447; x=1691620447; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=eMznknX6LhCZzxkpGK1anYqC2AReGQzyhNKGt0dhRMs=; b=N1JcPEItXDtR+UdmEj62N7x3pLIzoMIZ0ym7MiKTPjKmXclc+i7Qq4ySNKZBHUuBAv QjrmxjRaLmgRijwAhTVRIX4av4C7vZlTep8Nda+htGutAPYrgWmi/bgFBnk3TeQk2FZL qV10f2Vv7NG315SNYzKHzPwubCQhYHBaPEvVRWcKZN8gzctQfo4Kmna/TSndhG8pF19W Fzk5MsBBcbE7eZHPi6Kb9HNFwrmWoi7SdkpTFqnyo/f5y5y58m9k5/gIXyOBG/G+6QPw I2CN0eL57Q5VlikzTAmkrBqHfLf2mzF66nf0LSzGX/bmeqj2Hoj7ia/5x2XDykRWrSjJ TPzA== X-Gm-Message-State: ABy/qLbVki+k57o1cAvQ6ycc9rYQvIH8cRzTLX/LySOqr88Vv8FPUXmz lMI6NoiESPMs7fM0UPGCvcyNco3psetHu2Nxfw== X-Google-Smtp-Source: APBJJlHZxtTyAip5d5fJgZLpVb+rUoGbj/Pap2HolJ5oEOujITpsR+hl7i9eHYLZIaU+exV6ykL+cfJhHKzieUdutg== X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2c4:200:4c0f:bfb6:9942:8c53]) (user=almasrymina job=sendgmr) by 2002:a81:e90b:0:b0:573:285a:c2a3 with SMTP id d11-20020a81e90b000000b00573285ac2a3mr95478ywm.1.1689028446916; Mon, 10 Jul 2023 15:34:06 -0700 (PDT) Date: Mon, 10 Jul 2023 15:32:57 -0700 In-Reply-To: <20230710223304.1174642-1-almasrymina@google.com> Mime-Version: 1.0 References: <20230710223304.1174642-1-almasrymina@google.com> X-Mailer: git-send-email 2.41.0.390.g38632f3daf-goog Message-ID: <20230710223304.1174642-7-almasrymina@google.com> Subject: [RFC PATCH 06/10] net: add SO_DEVMEM_DONTNEED setsockopt to release RX pages From: Mina Almasry To: linux-kernel@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org, netdev@vger.kernel.org, linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Mina Almasry , Willem de Bruijn , Jesper Dangaard Brouer , Arnd Bergmann , =?utf-8?q?Christian_K=C3=B6nig?= , David Ahern , Ilias Apalodimas , Sumit Semwal , jgg@ziepe.ca, Eric Dumazet , Jakub Kicinski , Paolo Abeni , Shuah Khan , "David S. Miller" Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" Add an interface for the user to notify the kernel that it is done reading the NET_RX dmabuf pages returned as cmsg. The kernel will drop the reference on the NET_RX pages to make them available for re-use. Signed-off-by: Mina Almasry --- include/uapi/asm-generic/socket.h | 1 + include/uapi/linux/uio.h | 4 +++ net/core/sock.c | 41 +++++++++++++++++++++++++++++++ 3 files changed, 46 insertions(+) diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h index 88f9234f78cb..2a5a7f5da358 100644 --- a/include/uapi/asm-generic/socket.h +++ b/include/uapi/asm-generic/socket.h @@ -132,6 +132,7 @@ #define SO_RCVMARK 75 +#define SO_DEVMEM_DONTNEED 97 #define SO_DEVMEM_HEADER 98 #define SCM_DEVMEM_HEADER SO_DEVMEM_HEADER #define SO_DEVMEM_OFFSET 99 diff --git a/include/uapi/linux/uio.h b/include/uapi/linux/uio.h index 8b0be0f50838..faaa765fd5a4 100644 --- a/include/uapi/linux/uio.h +++ b/include/uapi/linux/uio.h @@ -26,6 +26,10 @@ struct cmsg_devmem { __u32 frag_token; }; +struct devmemtoken { + __u32 token_start; + __u32 token_count; +}; /* * UIO_MAXIOV shall be at least 16 1003.1g (5.4.1.1) */ diff --git a/net/core/sock.c b/net/core/sock.c index 24f2761bdb1d..f9b9d9ec7322 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1531,7 +1531,48 @@ int sk_setsockopt(struct sock *sk, int level, int optname, /* Paired with READ_ONCE() in tcp_rtx_synack() */ WRITE_ONCE(sk->sk_txrehash, (u8)val); break; + case SO_DEVMEM_DONTNEED: { + struct devmemtoken tokens[128]; + unsigned int num_tokens, i, j; + if (sk->sk_type != SOCK_STREAM || + sk->sk_protocol != IPPROTO_TCP) { + ret = -EBADF; + break; + } + + if (optlen % sizeof(struct devmemtoken) || + optlen > sizeof(tokens)) { + ret = -EINVAL; + break; + } + + num_tokens = optlen / sizeof(struct devmemtoken); + if (copy_from_sockptr(tokens, optval, optlen)) { + ret = -EFAULT; + break; + } + + ret = 0; + + for (i = 0; i < num_tokens; i++) { + for (j = 0; j < tokens[i].token_count; j++) { + struct page *pg = xa_erase(&sk->sk_pagepool, + tokens[i].token_start + j); + + if (pg) + put_page(pg); + else + /* -EINTR here notifies the userspace + * that not all tokens passed to it have + * been freed. + */ + ret = -EINTR; + } + } + + break; + } default: ret = -ENOPROTOOPT; break; From patchwork Mon Jul 10 22:32:58 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 13307764 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 071A1EB64DA for ; Mon, 10 Jul 2023 22:34:14 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 1A61710E2E9; Mon, 10 Jul 2023 22:34:13 +0000 (UTC) Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by gabe.freedesktop.org (Postfix) with ESMTPS id D65F810E2ED for ; Mon, 10 Jul 2023 22:34:10 +0000 (UTC) Received: by mail-yb1-xb49.google.com with SMTP id 3f1490d57ef6-c4cda9d3823so4304333276.1 for ; Mon, 10 Jul 2023 15:34:10 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1689028450; x=1691620450; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=mZRt6IHRVYwjrDgVxc5BdSo0qSi/7hDrkUlzlo5nha0=; b=F1oYk6PYa7tb7uIVT7ncRImo9dzRRTMW5Ebj/UOQML5cCAoP1bLxMrobu7CbsW0S0Y /G53r5rrIvyYyG7hi6OPzXlzkrD5CGMaY2hGiwXxL/WiSz79VteXRkiY81i7pEKLwp3l 17i21BWEE+TiQROAnL1CRW8Ot45TQa2M7Zp54xbYqCQ+Shk7eiW886z2C/Mfw+okEbQ3 +Ztb3SAnWz8mhB7hcleaQBtg3wQ1hi2zzpD2QPLMozZVO5zAIhJQh5CKSeVrTq/Oq2dZ VwWPCjINgVuULzor/doPFvcq/lk4R/bGAoXL66l0NggTBvJcD8v6CnyLUTPM0bsChWTJ PBKA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1689028450; x=1691620450; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=mZRt6IHRVYwjrDgVxc5BdSo0qSi/7hDrkUlzlo5nha0=; b=B26DzciQEsjMlL6b+EtK+mkceBizo2MfFW/Ro/4l1q+nXfh/C9fUxBGe1jzfUAyLQc 020Gu4bRNqhwHQLRqFIl6vWsBXRWo1De+HiQHD+N60+SPGB9cd+nqfnczUd65MiSX6CJ /lrphFr/LPoiA2Kr0Xjjj5PEAFr3332iilluZ4akaGE+zNgbzWkmVo7K6U/Q3JKCqTaq 1IhJBxEp/8FVmy9FoDJb2DVMLBHSdFvZaFT1v/byJaFNdNny1W0LOybKy8whyOfhaPKc M+qkqzMcwynxfFk+TelzpeCy90I18C+71WeDYYLVTGbAjgiiPMWAM7AQq4BHMGxKxUmk umOg== X-Gm-Message-State: ABy/qLYg0gSQ6boCfKCgndA9VRpfNn13Oy7rPQ4WH0ck6ihc8+aMlxdM zUMDMBfUaae7W1ppVTb8Bb8J+2C38HfJ6xUT2g== X-Google-Smtp-Source: APBJJlGwcwSQfIa1k2C4AKA9b3D+Y30QHjPYL3qtWa/GF75GmX5sDpCzR7Nk3qYThT2cxoPLhXyLaIWlt8Ip9FM82g== X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2c4:200:4c0f:bfb6:9942:8c53]) (user=almasrymina job=sendgmr) by 2002:a25:c902:0:b0:c61:7151:6727 with SMTP id z2-20020a25c902000000b00c6171516727mr76617ybf.10.1689028449938; Mon, 10 Jul 2023 15:34:09 -0700 (PDT) Date: Mon, 10 Jul 2023 15:32:58 -0700 In-Reply-To: <20230710223304.1174642-1-almasrymina@google.com> Mime-Version: 1.0 References: <20230710223304.1174642-1-almasrymina@google.com> X-Mailer: git-send-email 2.41.0.390.g38632f3daf-goog Message-ID: <20230710223304.1174642-8-almasrymina@google.com> Subject: [RFC PATCH 07/10] tcp: implement sendmsg() TX path for for devmem tcp From: Mina Almasry To: linux-kernel@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org, netdev@vger.kernel.org, linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Mina Almasry , Willem de Bruijn , Jesper Dangaard Brouer , Arnd Bergmann , =?utf-8?q?Christian_K=C3=B6nig?= , David Ahern , Ilias Apalodimas , Sumit Semwal , jgg@ziepe.ca, Eric Dumazet , Jakub Kicinski , Paolo Abeni , Shuah Khan , "David S. Miller" Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" For device memory TCP, we let the user provide the kernel with a cmsg container 2 items: 1. the dmabuf pages fd that the user would like to send data from. 2. the offset into this dmabuf that the user would like to start sending from. In tcp_sendmsg_locked(), if this cmsg is provided, we send the data using the dmabuf NET_TX pages bio_vec. Also provide drivers with a new skb_devmem_frag_dma_map() helper. This helper is similar to skb_frag_dma_map(), but it first checks whether the frag being mapped is backed by dmabuf NET_TX pages, and provides the correct dma_addr if so. Signed-off-by: Mina Almasry --- include/linux/skbuff.h | 19 +++++++++-- include/net/sock.h | 2 ++ net/core/skbuff.c | 8 ++--- net/core/sock.c | 6 ++++ net/ipv4/tcp.c | 73 +++++++++++++++++++++++++++++++++++++++++- 5 files changed, 101 insertions(+), 7 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index f5e03aa84160..ad4e7bfcab07 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1660,8 +1660,8 @@ static inline int skb_zerocopy_iter_dgram(struct sk_buff *skb, } int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb, - struct msghdr *msg, int len, - struct ubuf_info *uarg); + struct msghdr *msg, struct iov_iter *iov_iter, + int len, struct ubuf_info *uarg); /* Internal */ #define skb_shinfo(SKB) ((struct skb_shared_info *)(skb_end_pointer(SKB))) @@ -3557,6 +3557,21 @@ static inline dma_addr_t skb_frag_dma_map(struct device *dev, skb_frag_off(frag) + offset, size, dir); } +/* Similar to skb_frag_dma_map, but handles devmem skbs correctly. */ +static inline dma_addr_t skb_devmem_frag_dma_map(struct device *dev, + const struct sk_buff *skb, + const skb_frag_t *frag, + size_t offset, size_t size, + enum dma_data_direction dir) +{ + if (unlikely(skb->devmem && is_dma_buf_page(skb_frag_page(frag)))) { + dma_addr_t dma_addr = + dma_buf_page_to_dma_addr(skb_frag_page(frag)); + return dma_addr + skb_frag_off(frag) + offset; + } + return skb_frag_dma_map(dev, frag, offset, size, dir); +} + static inline struct sk_buff *pskb_copy(struct sk_buff *skb, gfp_t gfp_mask) { diff --git a/include/net/sock.h b/include/net/sock.h index c615666ff19a..733865f89635 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1890,6 +1890,8 @@ struct sockcm_cookie { u64 transmit_time; u32 mark; u32 tsflags; + u32 devmem_fd; + u32 devmem_offset; }; static inline void sockcm_init(struct sockcm_cookie *sockc, diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 9b83da794641..b1e28e7ad6a8 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -1685,8 +1685,8 @@ void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref) EXPORT_SYMBOL_GPL(msg_zerocopy_put_abort); int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb, - struct msghdr *msg, int len, - struct ubuf_info *uarg) + struct msghdr *msg, struct iov_iter *iov_iter, + int len, struct ubuf_info *uarg) { struct ubuf_info *orig_uarg = skb_zcopy(skb); int err, orig_len = skb->len; @@ -1697,12 +1697,12 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb, if (orig_uarg && uarg != orig_uarg) return -EEXIST; - err = __zerocopy_sg_from_iter(msg, sk, skb, &msg->msg_iter, len); + err = __zerocopy_sg_from_iter(msg, sk, skb, iov_iter, len); if (err == -EFAULT || (err == -EMSGSIZE && skb->len == orig_len)) { struct sock *save_sk = skb->sk; /* Streams do not free skb on error. Reset to prev state. */ - iov_iter_revert(&msg->msg_iter, skb->len - orig_len); + iov_iter_revert(iov_iter, skb->len - orig_len); skb->sk = sk; ___pskb_trim(skb, orig_len); skb->sk = save_sk; diff --git a/net/core/sock.c b/net/core/sock.c index f9b9d9ec7322..854624bee5d0 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -2813,6 +2813,12 @@ int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg, return -EINVAL; sockc->transmit_time = get_unaligned((u64 *)CMSG_DATA(cmsg)); break; + case SCM_DEVMEM_OFFSET: + if (cmsg->cmsg_len != CMSG_LEN(2 * sizeof(u32))) + return -EINVAL; + sockc->devmem_fd = ((u32 *)CMSG_DATA(cmsg))[0]; + sockc->devmem_offset = ((u32 *)CMSG_DATA(cmsg))[1]; + break; /* SCM_RIGHTS and SCM_CREDENTIALS are semantically in SOL_UNIX. */ case SCM_RIGHTS: case SCM_CREDENTIALS: diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index a894b8a9dbb0..85d6cdc832ef 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -280,6 +280,7 @@ #include #include #include +#include /* Track pending CMSGs. */ enum { @@ -1216,6 +1217,52 @@ int tcp_sendmsg_fastopen(struct sock *sk, struct msghdr *msg, int *copied, return err; } +static int tcp_prepare_devmem_data(struct msghdr *msg, int devmem_fd, + unsigned int devmem_offset, + struct file **devmem_file, + struct iov_iter *devmem_tx_iter, size_t size) +{ + struct dma_buf_pages *priv; + int err = 0; + + *devmem_file = fget_raw(devmem_fd); + if (!*devmem_file) { + err = -EINVAL; + goto err; + } + + if (!is_dma_buf_pages_file(*devmem_file)) { + err = -EBADF; + goto err_fput; + } + + priv = (*devmem_file)->private_data; + if (!priv) { + WARN_ONCE(!priv, "dma_buf_pages_file has no private_data"); + err = -EINTR; + goto err_fput; + } + + if (!(priv->type & DMA_BUF_PAGES_NET_TX)) + return -EINVAL; + + if (devmem_offset + size > priv->dmabuf->size) { + err = -ENOSPC; + goto err_fput; + } + + *devmem_tx_iter = priv->net_tx.iter; + iov_iter_advance(devmem_tx_iter, devmem_offset); + + return 0; + +err_fput: + fput(*devmem_file); + *devmem_file = NULL; +err: + return err; +} + int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) { struct tcp_sock *tp = tcp_sk(sk); @@ -1227,6 +1274,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) int process_backlog = 0; bool zc = false; long timeo; + struct file *devmem_file = NULL; + struct iov_iter devmem_tx_iter; flags = msg->msg_flags; @@ -1295,6 +1344,14 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) } } + if (sockc.devmem_fd) { + err = tcp_prepare_devmem_data(msg, sockc.devmem_fd, + sockc.devmem_offset, &devmem_file, + &devmem_tx_iter, size); + if (err) + goto out_err; + } + /* This should be in poll */ sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk); @@ -1408,7 +1465,17 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) goto wait_for_space; } - err = skb_zerocopy_iter_stream(sk, skb, msg, copy, uarg); + if (devmem_file) { + err = skb_zerocopy_iter_stream(sk, skb, msg, + &devmem_tx_iter, + copy, uarg); + if (err > 0) + iov_iter_advance(&msg->msg_iter, err); + } else { + err = skb_zerocopy_iter_stream(sk, skb, msg, + &msg->msg_iter, + copy, uarg); + } if (err == -EMSGSIZE || err == -EEXIST) { tcp_mark_push(tp, skb); goto new_segment; @@ -1462,6 +1529,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) } out_nopush: net_zcopy_put(uarg); + if (devmem_file) + fput(devmem_file); return copied + copied_syn; do_error: @@ -1470,6 +1539,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) if (copied + copied_syn) goto out; out_err: + if (devmem_file) + fput(devmem_file); net_zcopy_put_abort(uarg, true); err = sk_stream_error(sk, flags, err); /* make sure we wake any epoll edge trigger waiter */ From patchwork Mon Jul 10 22:32:59 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 13307765 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 5A900EB64D9 for ; Mon, 10 Jul 2023 22:34:18 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 8B37310E2EB; Mon, 10 Jul 2023 22:34:17 +0000 (UTC) Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com [IPv6:2607:f8b0:4864:20::1149]) by gabe.freedesktop.org (Postfix) with ESMTPS id 8029E10E2EB for ; Mon, 10 Jul 2023 22:34:16 +0000 (UTC) Received: by mail-yw1-x1149.google.com with SMTP id 00721157ae682-5707177ff8aso57554037b3.2 for ; Mon, 10 Jul 2023 15:34:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1689028455; x=1691620455; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=oKO6sogguZyjcmLGJ/hPDJOveqN84Swf5N5/pg86eqs=; b=Xc1Br1GD5LT+ysV69O+uNAFOOUAbf02O+HrvNnkbfJ1izSzFy6Ok06vRSUdPziTGaE +3glhnpZcNM+j4eKCMIQwzsvEIs7TUtCF6YQ7B3LB5UhdlozLITb0gWXFx15pEiIQwHB lb1XngFtKIZWG1x07iobyzwkMGwrncaUqi7OjBAV9mfOg4lMhlt3qQue+doX9PnCJO43 rujdU8bT5yD1p9K7vmBkF+fpO8gWOxm5COgUqgl1cz9uYCBrniBM+9z6RMAxOXhqziPJ biD0LDnt/JaMQ2Wj+vQZk9IEDe9NJK+hbJMwVlywjwf9Mrl5ow415UQIgWSBJjmiua/K kleg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1689028455; x=1691620455; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=oKO6sogguZyjcmLGJ/hPDJOveqN84Swf5N5/pg86eqs=; b=MTZwC9wwnWo4B88Bfzg5qzp1+/X+yM0Zya6gsebesr0qCLW72LMunoYC/y0cJeqiS3 +4l6lkGHm3K9tHN3XN6cMv0EmbXjkNc57m1OtPDFgA0EJruj0aORa1TJw8TXEQdMCshY OabiOVfDZnPXPJy8NUcg8qj95CeYEZK/JArS1Xyk/Ha4Ky8mKKMpNTFTQa6iSjCm/h9o txMrHIRc7MIXQ1qptAkH4HtBdlZnE+Fes6AVJCrFkanpGTPhMfSs8dfxx7DXGNcvuKQF qHbufn4JlVGYjbRfCuFYIW0hbHVMumAV4gL9KUXt3JRSDZMhAGTdJMlvcEYN0LBO9Tjp 8vMQ== X-Gm-Message-State: ABy/qLan8qUPS3dNcyd5cCa8B9P485FekEOMAWpgNMxBQFjxrRBdYD04 h1C5G7Qt9mlOGtOj69OermZ9XWCjUoOa7QUHGQ== X-Google-Smtp-Source: APBJJlG5e3EZCuprxXAZBHD12ycHywImC9zBDIViU2VYzdwhsM3l+IXw2PlsHbyEhMIScVpgSI4Mos+P/4wWIVOHRw== X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2c4:200:4c0f:bfb6:9942:8c53]) (user=almasrymina job=sendgmr) by 2002:a81:d305:0:b0:56d:2a88:49e5 with SMTP id y5-20020a81d305000000b0056d2a8849e5mr98145ywi.2.1689028455695; Mon, 10 Jul 2023 15:34:15 -0700 (PDT) Date: Mon, 10 Jul 2023 15:32:59 -0700 In-Reply-To: <20230710223304.1174642-1-almasrymina@google.com> Mime-Version: 1.0 References: <20230710223304.1174642-1-almasrymina@google.com> X-Mailer: git-send-email 2.41.0.390.g38632f3daf-goog Message-ID: <20230710223304.1174642-9-almasrymina@google.com> Subject: [RFC PATCH 08/10] selftests: add ncdevmem, netcat for devmem TCP From: Mina Almasry To: linux-kernel@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org, netdev@vger.kernel.org, linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Mina Almasry , Willem de Bruijn , Jesper Dangaard Brouer , Arnd Bergmann , =?utf-8?q?Christian_K=C3=B6nig?= , David Ahern , Ilias Apalodimas , Sumit Semwal , jgg@ziepe.ca, Eric Dumazet , Jakub Kicinski , Paolo Abeni , Shuah Khan , "David S. Miller" Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" ncdevmem is a devmem TCP netcat. It works similarly to netcat, but it sends and receives data using the devmem TCP APIs. It uses udmabuf as the dmabuf provider. It is compatible with a regular netcat running on a peer, or a ncdevmem running on a peer. In addition to normal netcat support, ncdevmem has a validation mode, where it sends a specific pattern and validates this pattern on the receiver side to ensure data integrity. Signed-off-by: Mina Almasry --- tools/testing/selftests/net/.gitignore | 1 + tools/testing/selftests/net/Makefile | 1 + tools/testing/selftests/net/ncdevmem.c | 693 +++++++++++++++++++++++++ 3 files changed, 695 insertions(+) create mode 100644 tools/testing/selftests/net/ncdevmem.c diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore index f27a7338b60e..1153ba177f1b 100644 --- a/tools/testing/selftests/net/.gitignore +++ b/tools/testing/selftests/net/.gitignore @@ -16,6 +16,7 @@ ipsec ipv6_flowlabel ipv6_flowlabel_mgr msg_zerocopy +ncdevmem nettest psock_fanout psock_snd diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile index c12df57d5539..ea5ccbd99d3b 100644 --- a/tools/testing/selftests/net/Makefile +++ b/tools/testing/selftests/net/Makefile @@ -84,6 +84,7 @@ TEST_GEN_FILES += ip_local_port_range TEST_GEN_FILES += bind_wildcard TEST_PROGS += test_vxlan_mdb.sh TEST_PROGS += test_bridge_neigh_suppress.sh +TEST_GEN_FILES += ncdevmem TEST_FILES := settings diff --git a/tools/testing/selftests/net/ncdevmem.c b/tools/testing/selftests/net/ncdevmem.c new file mode 100644 index 000000000000..4f62f22bf763 --- /dev/null +++ b/tools/testing/selftests/net/ncdevmem.c @@ -0,0 +1,693 @@ +// SPDX-License-Identifier: GPL-2.0 +#define _GNU_SOURCE +#define __EXPORTED_HEADERS__ + +#include +#include +#include +#include +#include +#include +#include +#define __iovec_defined +#include +#include + +#include +#include +#include +#include +#include + +#include +#include +#include +#include + +#define PAGE_SHIFT 12 +#define PAGE_SIZE 4096 +#define TEST_PREFIX "ncdevmem" +#define NUM_PAGES 16000 + +#ifndef MSG_SOCK_DEVMEM +#define MSG_SOCK_DEVMEM 0x2000000 +#endif + +/* + * tcpdevmem netcat. Works similarly to netcat but does device memory TCP + * instead of regular TCP. Uses udmabuf to mock a dmabuf provider. + * + * Usage: + * + * * Without validation: + * + * On server: + * ncdevmem -s -c -f eth1 -n 0000:06:00.0 -l \ + * -p 5201 + * + * On client: + * ncdevmem -s -c -f eth1 -n 0000:06:00.0 -p 5201 + * + * * With Validation: + * On server: + * ncdevmem -s -c -l -f eth1 -n 0000:06:00.0 \ + * -p 5202 -v 1 + * + * On client: + * ncdevmem -s -c -f eth1 -n 0000:06:00.0 -p 5202 \ + * -v 100000 + * + * Note this is compatible with regular netcat. i.e. the sender or receiver can + * be replaced with regular netcat to test the RX or TX path in isolation. + */ + +static char *server_ip = "192.168.1.4"; +static char *client_ip = "192.168.1.2"; +static char *port = "5201"; +static size_t do_validation = 0; +static int queue_num = 15; +static char *ifname = "eth1"; +static char *nic_pci_addr = "0000:06:00.0"; +static unsigned int iterations = 0; + +void print_bytes(void *ptr, size_t size) +{ + unsigned char *p = ptr; + int i; + for (i = 0; i < size; i++) { + printf("%02hhX ", p[i]); + } + printf("\n"); +} + +void print_nonzero_bytes(void *ptr, size_t size) +{ + unsigned char *p = ptr; + unsigned int i; + for (i = 0; i < size; i++) { + if (p[i]) + printf("%c", p[i]); + } + printf("\n"); +} + +void initialize_validation(void *line, size_t size) +{ + static unsigned char seed = 1; + unsigned char *ptr = line; + for (size_t i = 0; i < size; i++) { + ptr[i] = seed; + seed++; + if (seed == 254) + seed = 0; + } +} + +void validate_buffer(void *line, size_t size) +{ + static unsigned char seed = 1; + int errors = 0; + + unsigned char *ptr = line; + for (size_t i = 0; i < size; i++) { + if (ptr[i] != seed) { + fprintf(stderr, + "Failed validation: expected=%u, " + "actual=%u, index=%lu\n", + seed, ptr[i], i); + errors++; + if (errors > 20) + exit(1); + } + seed++; + if (seed == 254) + seed = 0; + } + + fprintf(stdout, "Validated buffer\n"); +} + +/* Triggers a driver reset... + * + * The proper way to do this is probably 'ethtool --reset', but I don't have + * that supported on my current test bed. I resort to changing this + * configuration in the driver which also causes a driver reset... + */ +static void reset_flow_steering() +{ + char command[256]; + memset(command, 0, sizeof(command)); + snprintf(command, sizeof(command), "sudo ethtool -K %s ntuple off", + "eth1"); + system(command); + + memset(command, 0, sizeof(command)); + snprintf(command, sizeof(command), "sudo ethtool -K %s ntuple on", + "eth1"); + system(command); +} + +static void configure_flow_steering() +{ + char command[256]; + memset(command, 0, sizeof(command)); + snprintf(command, sizeof(command), + "sudo ethtool -N %s flow-type tcp4 src-ip %s dst-ip %s " + "src-port %s dst-port %s queue %d", + ifname, client_ip, server_ip, port, port, queue_num); + system(command); +} + +/* Triggers a device reset, which causes the dmabuf pages binding to take + * effect. A better and more generic way to do this may be ethtool --reset. + */ +static void trigger_device_reset() +{ + char command[256]; + memset(command, 0, sizeof(command)); + snprintf(command, sizeof(command), + "sudo ethtool --set-priv-flags %s enable-header-split off", + ifname); + system(command); + + memset(command, 0, sizeof(command)); + snprintf(command, sizeof(command), + "sudo ethtool --set-priv-flags %s enable-header-split on", + ifname); + system(command); +} + +static void create_udmabuf(int *devfd, int *memfd, int *buf, size_t dmabuf_size) +{ + struct udmabuf_create create; + int ret; + + *devfd = open("/dev/udmabuf", O_RDWR); + if (*devfd < 0) { + fprintf(stderr, + "%s: [skip,no-udmabuf: Unable to access DMA " + "buffer device file]\n", + TEST_PREFIX); + exit(70); + } + + *memfd = memfd_create("udmabuf-test", MFD_ALLOW_SEALING); + if (*memfd < 0) { + printf("%s: [skip,no-memfd]\n", TEST_PREFIX); + exit(72); + } + + ret = fcntl(*memfd, F_ADD_SEALS, F_SEAL_SHRINK); + if (ret < 0) { + printf("%s: [skip,fcntl-add-seals]\n", TEST_PREFIX); + exit(73); + } + + ret = ftruncate(*memfd, dmabuf_size); + if (ret == -1) { + printf("%s: [FAIL,memfd-truncate]\n", TEST_PREFIX); + exit(74); + } + + memset(&create, 0, sizeof(create)); + + create.memfd = *memfd; + create.offset = 0; + create.size = dmabuf_size; + *buf = ioctl(*devfd, UDMABUF_CREATE, &create); + if (*buf < 0) { + printf("%s: [FAIL, create udmabuf]\n", TEST_PREFIX); + exit(75); + } +} + + +int do_server() +{ + struct dma_buf_create_pages_info pages_create_info = { 0 }; + int devfd, memfd, buf, buf_pages, ret; + size_t dmabuf_size; + + dmabuf_size = getpagesize() * NUM_PAGES; + + create_udmabuf(&devfd, &memfd, &buf, dmabuf_size); + + pages_create_info.dma_buf_fd = buf; + pages_create_info.type = DMA_BUF_PAGES_NET_RX; + pages_create_info.create_flags = (1 << queue_num); + + ret = sscanf(nic_pci_addr, "0000:%hhx:%hhx.%hhx", + &pages_create_info.pci_bdf[0], + &pages_create_info.pci_bdf[1], + &pages_create_info.pci_bdf[2]); + + if (ret != 3) { + printf("%s: [FAIL, parse fail]\n", TEST_PREFIX); + exit(76); + } + + buf_pages = ioctl(buf, DMA_BUF_CREATE_PAGES, &pages_create_info); + if (buf_pages < 0) { + perror("create pages"); + exit(77); + } + + char *buf_mem = NULL; + buf_mem = mmap(NULL, dmabuf_size, PROT_READ | PROT_WRITE, MAP_SHARED, + buf, 0); + if (buf_mem == MAP_FAILED) { + perror("mmap()"); + exit(1); + } + + /* Need to trigger the NIC to reallocate its RX pages, otherwise the + * bind doesn't take effect. + */ + trigger_device_reset(); + + sleep(1); + + reset_flow_steering(); + configure_flow_steering(); + + struct sockaddr_in server_sin; + server_sin.sin_family = AF_INET; + server_sin.sin_port = htons(atoi(port)); + + ret = inet_pton(server_sin.sin_family, server_ip, &server_sin.sin_addr); + if (socket < 0) { + printf("%s: [FAIL, create socket]\n", TEST_PREFIX); + exit(79); + } + + int socket_fd = socket(server_sin.sin_family, SOCK_STREAM, 0); + if (socket < 0) { + printf("%s: [FAIL, create socket]\n", TEST_PREFIX); + exit(76); + } + + int opt = 1; + ret = setsockopt(socket_fd, SOL_SOCKET, + SO_REUSEADDR | SO_REUSEPORT | SO_ZEROCOPY, &opt, + sizeof(opt)); + if (ret) { + printf("%s: [FAIL, set sock opt]: %s\n", TEST_PREFIX, + strerror(errno)); + exit(76); + } + + printf("binding to address %s:%d\n", server_ip, + ntohs(server_sin.sin_port)); + + ret = bind(socket_fd, &server_sin, sizeof(server_sin)); + if (ret) { + printf("%s: [FAIL, bind]: %s\n", TEST_PREFIX, strerror(errno)); + exit(76); + } + + ret = listen(socket_fd, 1); + if (ret) { + printf("%s: [FAIL, listen]: %s\n", TEST_PREFIX, + strerror(errno)); + exit(76); + } + + struct sockaddr_in client_addr; + socklen_t client_addr_len = sizeof(client_addr); + + char buffer[256]; + + inet_ntop(server_sin.sin_family, &server_sin.sin_addr, buffer, + sizeof(buffer)); + printf("Waiting or connection on %s:%d\n", buffer, + ntohs(server_sin.sin_port)); + int client_fd = accept(socket_fd, &client_addr, &client_addr_len); + + inet_ntop(client_addr.sin_family, &client_addr.sin_addr, buffer, + sizeof(buffer)); + printf("Got connection from %s:%d\n", buffer, + ntohs(client_addr.sin_port)); + + char iobuf[819200]; + char ctrl_data[sizeof(int) * 20000]; + + size_t total_received = 0; + size_t i = 0; + size_t page_aligned_frags = 0; + size_t non_page_aligned_frags = 0; + while (1) { + bool is_devmem = false; + printf("\n\n"); + + struct msghdr msg = { 0 }; + struct iovec iov = { .iov_base = iobuf, + .iov_len = sizeof(iobuf) }; + msg.msg_iov = &iov; + msg.msg_iovlen = 1; + msg.msg_control = ctrl_data; + msg.msg_controllen = sizeof(ctrl_data); + ssize_t ret = recvmsg(client_fd, &msg, MSG_SOCK_DEVMEM); + printf("recvmsg ret=%ld\n", ret); + if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) { + continue; + } + if (ret < 0) { + perror("recvmsg"); + continue; + } + if (ret == 0) { + printf("client exited\n"); + goto cleanup; + } + + i++; + struct cmsghdr *cm = NULL; + struct cmsg_devmem *cmsg_devmem = NULL; + for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) { + if (cm->cmsg_level != SOL_SOCKET || + (cm->cmsg_type != SCM_DEVMEM_OFFSET && + cm->cmsg_type != SCM_DEVMEM_HEADER)) { + fprintf(stdout, "skipping non-devmem cmsg\n"); + continue; + } + + cmsg_devmem = (struct cmsg_devmem *)CMSG_DATA(cm); + is_devmem = true; + + if (cm->cmsg_type == SCM_DEVMEM_HEADER) { + // TODO: process data copied from skb's linear + // buffer. + fprintf(stdout, + "SCM_DEVMEM_HEADER. " + "cmsg_devmem->frag_size=%u\n", + cmsg_devmem->frag_size); + + continue; + } + + struct devmemtoken token = { cmsg_devmem->frag_token, + 1 }; + + total_received += cmsg_devmem->frag_size; + printf("received frag_page=%u, in_page_offset=%u," + " frag_offset=%u, frag_size=%u, token=%u" + " total_received=%lu\n", + cmsg_devmem->frag_offset >> PAGE_SHIFT, + cmsg_devmem->frag_offset % PAGE_SIZE, + cmsg_devmem->frag_offset, cmsg_devmem->frag_size, + cmsg_devmem->frag_token, total_received); + + if (cmsg_devmem->frag_size % PAGE_SIZE) + non_page_aligned_frags++; + else + page_aligned_frags++; + + struct dma_buf_sync sync = { 0 }; + sync.flags = DMA_BUF_SYNC_READ | DMA_BUF_SYNC_START; + ioctl(buf, DMA_BUF_IOCTL_SYNC, &sync); + + if (do_validation) + validate_buffer( + ((unsigned char *)buf_mem) + + cmsg_devmem->frag_offset, + cmsg_devmem->frag_size); + else + print_nonzero_bytes( + ((unsigned char *)buf_mem) + + cmsg_devmem->frag_offset, + cmsg_devmem->frag_size); + + sync.flags = DMA_BUF_SYNC_READ | DMA_BUF_SYNC_END; + ioctl(buf, DMA_BUF_IOCTL_SYNC, &sync); + + ret = setsockopt(client_fd, SOL_SOCKET, + SO_DEVMEM_DONTNEED, &token, + sizeof(token)); + if (ret) { + perror("SO_DEVMEM_DONTNEED"); + exit(1); + } + } + if (!is_devmem) + printf("flow steering error\n"); + + printf("total_received=%lu\n", total_received); + } + +cleanup: + fprintf(stdout, "%s: ok\n", TEST_PREFIX); + + fprintf(stdout, "page_aligned_frags=%lu, non_page_aligned_frags=%lu\n", + page_aligned_frags, non_page_aligned_frags); + + fprintf(stdout, "page_aligned_frags=%lu, non_page_aligned_frags=%lu\n", + page_aligned_frags, non_page_aligned_frags); + + munmap(buf_mem, dmabuf_size); + close(client_fd); + close(socket_fd); + close(buf_pages); + close(buf); + close(memfd); + close(devfd); + trigger_device_reset(); + + return 0; +} + +int do_client() +{ + printf("doing client\n"); + + struct dma_buf_create_pages_info pages_create_info = { 0 }; + int devfd, memfd, buf, buf_pages, ret; + size_t dmabuf_size; + + dmabuf_size = getpagesize() * NUM_PAGES; + + create_udmabuf(&devfd, &memfd, &buf, dmabuf_size); + + pages_create_info.dma_buf_fd = buf; + pages_create_info.type = DMA_BUF_PAGES_NET_TX; + + ret = sscanf(nic_pci_addr, "0000:%hhx:%hhx.%hhx", + &pages_create_info.pci_bdf[0], + &pages_create_info.pci_bdf[1], + &pages_create_info.pci_bdf[2]); + + if (ret != 3) { + printf("%s: [FAIL, parse fail]\n", TEST_PREFIX); + exit(76); + } + + buf_pages = ioctl(buf, DMA_BUF_CREATE_PAGES, &pages_create_info); + if (buf_pages < 0) { + perror("create pages"); + exit(77); + } + + char *buf_mem = NULL; + buf_mem = mmap(NULL, dmabuf_size, PROT_READ | PROT_WRITE, MAP_SHARED, + buf, 0); + if (buf_mem == MAP_FAILED) { + perror("mmap()"); + exit(1); + } + + struct sockaddr_in server_sin; + server_sin.sin_family = AF_INET; + server_sin.sin_port = htons(atoi(port)); + + ret = inet_pton(server_sin.sin_family, server_ip, &server_sin.sin_addr); + if (socket < 0) { + printf("%s: [FAIL, create socket]\n", TEST_PREFIX); + exit(79); + } + + int socket_fd = socket(server_sin.sin_family, SOCK_STREAM, 0); + if (socket < 0) { + printf("%s: [FAIL, create socket]\n", TEST_PREFIX); + exit(76); + } + + int opt = 1; + ret = setsockopt(socket_fd, SOL_SOCKET, + SO_REUSEADDR | SO_REUSEPORT | SO_ZEROCOPY, &opt, + sizeof(opt)); + if (ret) { + printf("%s: [FAIL, set sock opt]: %s\n", TEST_PREFIX, + strerror(errno)); + exit(76); + } + + struct sockaddr_in client_sin; + client_sin.sin_family = AF_INET; + client_sin.sin_port = htons(atoi(port)); + + ret = inet_pton(client_sin.sin_family, client_ip, &client_sin.sin_addr); + if (socket < 0) { + printf("%s: [FAIL, create socket]\n", TEST_PREFIX); + exit(79); + } + + ret = bind(socket_fd, &client_sin, sizeof(client_sin)); + if (ret) { + printf("%s: [FAIL, bind]: %s\n", TEST_PREFIX, strerror(errno)); + exit(76); + } + + ret = setsockopt(socket_fd, SOL_SOCKET, SO_ZEROCOPY, &opt, sizeof(opt)); + if (ret) { + printf("%s: [FAIL, set sock opt]: %s\n", TEST_PREFIX, + strerror(errno)); + exit(76); + } + + ret = connect(socket_fd, &server_sin, sizeof(server_sin)); + if (ret) { + printf("%s: [FAIL, connect]: %s\n", TEST_PREFIX, + strerror(errno)); + exit(76); + } + + char *line = NULL; + size_t line_size = 0; + size_t len = 0; + + size_t i = 0; + while (!iterations || i < iterations) { + i++; + free(line); + line = NULL; + if (do_validation) { + line = malloc(do_validation); + if (!line) { + fprintf(stderr, "Failed to allocate\n"); + exit(1); + } + memset(line, 0, do_validation); + initialize_validation(line, do_validation); + line_size = do_validation; + } else { + line_size = getline(&line, &len, stdin); + } + + if (line_size < 0) + continue; + + fprintf(stdout, "DEBUG: read line_size=%lu\n", line_size); + + struct dma_buf_sync sync = { 0 }; + sync.flags = DMA_BUF_SYNC_WRITE | DMA_BUF_SYNC_START; + ioctl(buf, DMA_BUF_IOCTL_SYNC, &sync); + + memset(buf_mem, 0, dmabuf_size); + memcpy(buf_mem, line, line_size); + + if (do_validation) + validate_buffer(buf_mem, line_size); + + struct iovec iov = { .iov_base = NULL, .iov_len = line_size }; + + struct msghdr msg = { 0 }; + + msg.msg_iov = &iov; + msg.msg_iovlen = 1; + + char ctrl_data[CMSG_SPACE(sizeof(int) * 2)]; + msg.msg_control = ctrl_data; + msg.msg_controllen = sizeof(ctrl_data); + + struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg); + cmsg->cmsg_level = SOL_SOCKET; + cmsg->cmsg_type = SCM_DEVMEM_OFFSET; + cmsg->cmsg_len = CMSG_LEN(sizeof(int) * 2); + *((int *)CMSG_DATA(cmsg)) = buf_pages; + ((int *)CMSG_DATA(cmsg))[1] = 0; + + ret = sendmsg(socket_fd, &msg, MSG_ZEROCOPY); + if (ret < 0) + perror("sendmsg"); + else + fprintf(stdout, "DEBUG: sendmsg_ret=%d\n", ret); + + sync.flags = DMA_BUF_SYNC_WRITE | DMA_BUF_SYNC_END; + ioctl(buf, DMA_BUF_IOCTL_SYNC, &sync); + + // sleep for a bit before we overwrite the dmabuf that's being + // sent. + if (do_validation) + sleep(1); + } + + fprintf(stdout, "%s: ok\n", TEST_PREFIX); + + while (1) + sleep(10); + + munmap(buf_mem, dmabuf_size); + + free(line); + close(socket_fd); + close(buf_pages); + close(buf); + close(memfd); + close(devfd); + trigger_device_reset(); + + return 0; +} + +int main(int argc, char *argv[]) +{ + int is_server = 0, opt; + + // put ':' in the starting of the + // string so that program can + //distinguish between '?' and ':' + while ((opt = getopt(argc, argv, "ls:c:p:v:q:f:n:i:")) != -1) { + switch (opt) { + case 'l': + is_server = 1; + break; + case 's': + server_ip = optarg; + break; + case 'c': + client_ip = optarg; + break; + case 'p': + port = optarg; + break; + case 'v': + do_validation = atoll(optarg); + break; + case 'q': + queue_num = atoi(optarg); + break; + case 'f': + ifname = optarg; + break; + case 'n': + nic_pci_addr = optarg; + break; + case 'i': + iterations = atoll(optarg); + break; + case '?': + printf("unknown option: %c\n", optopt); + break; + } + } + + // optind is for the extra arguments + // which are not parsed + for (; optind < argc; optind++) { + printf("extra arguments: %s\n", argv[optind]); + } + + if (is_server) + return do_server(); + + return do_client(); +} From patchwork Mon Jul 10 22:33:00 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 13307766 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 7091AC001DE for ; Mon, 10 Jul 2023 22:34:21 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id CE68C10E305; Mon, 10 Jul 2023 22:34:20 +0000 (UTC) Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by gabe.freedesktop.org (Postfix) with ESMTPS id D461110E2F6 for ; Mon, 10 Jul 2023 22:34:19 +0000 (UTC) Received: by mail-yb1-xb4a.google.com with SMTP id 3f1490d57ef6-c850943cf9aso1651770276.1 for ; Mon, 10 Jul 2023 15:34:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1689028459; x=1691620459; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=UnMZL8UhzWb/0aBP8bCpuACSexb4BfJiKzPPZTajpks=; b=ZycEIM/tzN0od6DMdpG2ZBlkJqdG/YaEAiXfT+38tW9bimX5mFjLfnmha8sMd1FwJJ P1YTcXkuScRdFZ+6Nke3Sdy4NjMA2W+JEp3F6Wzh3QRgc3G5LGn1Zu1iHZ4RyIqBjoHq k9jtQN4DN+tPVpc1G/NOC+sNs6OMyg6jWMhgxeuaJcoDHd6HbM1Xh1fktrKhCJh7t8TZ 2mOqQGiKrSnTwbDIJsz/RpKI8MeOMnJHXoAtfD0XSqgWhXjbLkfzJLjhD09BOkgZES0H docmIVXKHe4UX6CgXt/bh6mysKACg9wyOus054kiN2CxJmmkeSAQzjRyHNih7/Suny73 Z7kA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1689028459; x=1691620459; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=UnMZL8UhzWb/0aBP8bCpuACSexb4BfJiKzPPZTajpks=; b=CVN9oVs2dwZysk/xHqcuTq9SiVMs0TyZ98xwYAhcqOuEjoiBUyIjricxPFXy8qgEzj FqOtjfLshFEXqnNhuDl9PunRdRJm5ulzf0v8DdWJHNmCbiH9vSMMcG/4Trijcx32/9Nv OrxUPTUoXdtJweRJeoYe7vZAHAyuSdRVmk2G0Zx6LuhOvgKKRCqKa2ZqUfgq9HK7x/y8 GPNUPNpMUOmUdMe2bMgRfXA+UQmz/dBV3u8O4kRkrxQLVYLlWnRCg+UITRuhAWJ2E/QK sqGeyVHtoV5zW7iIVcbFmVsFXLcxC7TxGuaN2DPMt6x8hE808jz1HaUIMjZ4oc8f1G8o AedQ== X-Gm-Message-State: ABy/qLasoSQJckT3Rg1GG+SD/r0v5tpE8XRqg0/CUq0URW6O0Y0hCnsv clXb0SWKFfqkgNBAxvjpJ6PZeglpnWQ2Os6wPg== X-Google-Smtp-Source: APBJJlEXtctaXcsu/1jDVZnB9gs4bZSM839Y7nw3IybEX7aGlwuQfZwRkT28XLvKo+wcjjO6pG3pnxDTATxBbfxAfA== X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2c4:200:4c0f:bfb6:9942:8c53]) (user=almasrymina job=sendgmr) by 2002:a25:11c6:0:b0:c85:934:7ad2 with SMTP id 189-20020a2511c6000000b00c8509347ad2mr21545ybr.8.1689028458996; Mon, 10 Jul 2023 15:34:18 -0700 (PDT) Date: Mon, 10 Jul 2023 15:33:00 -0700 In-Reply-To: <20230710223304.1174642-1-almasrymina@google.com> Mime-Version: 1.0 References: <20230710223304.1174642-1-almasrymina@google.com> X-Mailer: git-send-email 2.41.0.390.g38632f3daf-goog Message-ID: <20230710223304.1174642-10-almasrymina@google.com> Subject: [RFC PATCH 09/10] memory-provider: updates core provider API for devmem TCP From: Mina Almasry To: linux-kernel@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org, netdev@vger.kernel.org, linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Mina Almasry , Willem de Bruijn , Jesper Dangaard Brouer , Arnd Bergmann , =?utf-8?q?Christian_K=C3=B6nig?= , David Ahern , Ilias Apalodimas , Sumit Semwal , jgg@ziepe.ca, Eric Dumazet , Jakub Kicinski , Paolo Abeni , Shuah Khan , "David S. Miller" Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" Implement a few updates to Jakub's RFC memory provider API to make it suitable for device memory TCP: 1. Currently for devmem TCP the driver's netdev_rx_queue holds a reference to the dma_buf_pages struct and needs to pass that to the page_pool's memory provider somehow. For PoC purposes, create a pp->mp_priv field that is set by the driver. Likely needs a better API (likely dependant on the general memory provider API). 2. The current memory_provider API gives the memory_provider the option to override put_page(), but tries page_pool_clear_pp_info() after the memory provider has released the page. IMO if the page freeing is delegated to the provider then the page_pool should not modify the page after release_page() has been called. Signed-off-by: Mina Almasry --- include/net/page_pool.h | 1 + net/core/page_pool.c | 7 ++++--- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/include/net/page_pool.h b/include/net/page_pool.h index 364fe6924258..7b6668479baf 100644 --- a/include/net/page_pool.h +++ b/include/net/page_pool.h @@ -78,6 +78,7 @@ struct page_pool_params { struct device *dev; /* device, for DMA pre-mapping purposes */ struct napi_struct *napi; /* Sole consumer of pages, otherwise NULL */ u8 memory_provider; /* haaacks! should be user-facing */ + void *mp_priv; /* argument to pass to the memory provider */ enum dma_data_direction dma_dir; /* DMA mapping direction */ unsigned int max_len; /* max DMA sync memory size */ unsigned int offset; /* DMA addr offset */ diff --git a/net/core/page_pool.c b/net/core/page_pool.c index d50f6728e4f6..df3f431fcff3 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -241,6 +241,7 @@ static int page_pool_init(struct page_pool *pool, goto free_ptr_ring; } + pool->mp_priv = pool->p.mp_priv; if (pool->mp_ops) { err = pool->mp_ops->init(pool); if (err) { @@ -564,16 +565,16 @@ void page_pool_return_page(struct page_pool *pool, struct page *page) else __page_pool_release_page_dma(pool, page); - page_pool_clear_pp_info(page); - /* This may be the last page returned, releasing the pool, so * it is not safe to reference pool afterwards. */ count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt); trace_page_pool_state_release(pool, page, count); - if (put) + if (put) { + page_pool_clear_pp_info(page); put_page(page); + } /* An optimization would be to call __free_pages(page, pool->p.order) * knowing page is not part of page-cache (thus avoiding a * __page_cache_release() call). From patchwork Mon Jul 10 22:33:01 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 13307767 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 72F8CEB64D9 for ; Mon, 10 Jul 2023 22:34:26 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id CDA2710E2EE; Mon, 10 Jul 2023 22:34:25 +0000 (UTC) Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by gabe.freedesktop.org (Postfix) with ESMTPS id F0E8810E2EF for ; Mon, 10 Jul 2023 22:34:23 +0000 (UTC) Received: by mail-yb1-xb4a.google.com with SMTP id 3f1490d57ef6-c595cadae4bso5724509276.1 for ; Mon, 10 Jul 2023 15:34:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1689028463; x=1691620463; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=oQJYRa4cYbFlxJ1mDmha+0HxPANtp6LWuy6ed2Yjbu4=; b=CetzuFarHJgPUiGBwDAWLpdzt0w3V9hQbqMPhuZfIenfXirykHD2HLGYdWQSkINvJh xJG2uW65yyAAb+o06RvuUp3WpLKZAzNYnV7EddNjt5dfXFCFq8AZmqCWdcr58enYZIQE n5JK8gg0yI5kVZJS/Tnc0gRJy7edUPwC0tcHLkojiMtzkNIt/J3L+ITk1kszc47h7Vz7 7HfhvqbXuGyD3D189Y0m4jp9NOnOMRg/24QxNfglicfLYXQlHIKScA6XdF716DYeVEDF C2/NnZaF4OaiJaWHPvY80aoJTwZFVHVTEJFSi6QbKnAMO9h++YLyvxmo53PBx0z1bgkV UbJg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1689028463; x=1691620463; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=oQJYRa4cYbFlxJ1mDmha+0HxPANtp6LWuy6ed2Yjbu4=; b=hQ0I6QkurfZPpXskF1fHO0VjcJsfDgwHeIxKEfsZqTq6HLAt/IqH/IJk4iJ3Hb2bge keNm1ZKNYO0edjck+8aDrFJWjrYse+MmQtmzNDX4KbnZTDrGeNRVMPZfBjd7pr4uj8Tl QiGDKjGH16VGAqg/G/5WW/T9w0JyjyHZEymBS0CEwvaYBXhmeHZgv67RZvF28Di4KgzJ Gg6DyjbPh/qos2eqG6VxEKfNPoC36iK+dqeFSDBcL+HDtocNVOF2zIc6yG6rq45cGPty i0cl54QWWzNbEUwS13d95CzAnEYZfDPb0W652qrT92bN/EdyLKREHaalztaAlN3ItEgM +SRQ== X-Gm-Message-State: ABy/qLZTpPbsSxAPGxcBXOjcF2BMhpLdwdrP7sfWounJ6FmfsyRIvjM6 F2m7cm/YBDe1QsmxMOS3+6sCBBcle4tu+r527w== X-Google-Smtp-Source: APBJJlHJiw9FD0eEFF7lSrjMDR+hUHliNX9UFzA9i3ufkboc97+XMgqpqjgotP7HgV453EARzYkdxteZjWVekbKkug== X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2c4:200:4c0f:bfb6:9942:8c53]) (user=almasrymina job=sendgmr) by 2002:a25:4252:0:b0:c6f:6ffe:f904 with SMTP id p79-20020a254252000000b00c6f6ffef904mr50579yba.9.1689028463039; Mon, 10 Jul 2023 15:34:23 -0700 (PDT) Date: Mon, 10 Jul 2023 15:33:01 -0700 In-Reply-To: <20230710223304.1174642-1-almasrymina@google.com> Mime-Version: 1.0 References: <20230710223304.1174642-1-almasrymina@google.com> X-Mailer: git-send-email 2.41.0.390.g38632f3daf-goog Message-ID: <20230710223304.1174642-11-almasrymina@google.com> Subject: [RFC PATCH 10/10] memory-provider: add dmabuf devmem provider From: Mina Almasry To: linux-kernel@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org, netdev@vger.kernel.org, linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Mina Almasry , Willem de Bruijn , Jesper Dangaard Brouer , Arnd Bergmann , =?utf-8?q?Christian_K=C3=B6nig?= , David Ahern , Ilias Apalodimas , Sumit Semwal , jgg@ziepe.ca, Eric Dumazet , Jakub Kicinski , Paolo Abeni , Shuah Khan , "David S. Miller" Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" Use Jakub's memory provider PoC API: https://github.com/kuba-moo/linux/tree/pp-providers To implement a dmabuf devmem memory provider. The provider allocates NET_RX dmabuf pages to the page pool. This abstracts any custom memory allocation or freeing changes for devmem TCP from drivers using the page pool. The memory provider allocates NET_RX pages from the dmabuf pages provided by the driver. These pages are ZONE_DEVICE pages with the sg dma_addrs stored in the zone_device_data entry in the page. The page pool entries in struct page are in a union with the ZONE_DEVICE entries, and - without special handling - the page pool would accidentally overwrite the data in the ZONE_DEVICE fields. To solve this, the memory provider converts the page from a ZONE_DEVICE page to a ZONE_NORMAL page upon giving it to the page pool, and converts it back to ZONE_DEVICE page upon getting it back from the page pool. This is safe to do because the NET_RX pages are dmabuf pages created to hold the dma_addr in the dma_buf_map_attachement sg_table entries, and are only used with code that handles them specifically. However, since dmabuf pages can now also be page pool page, we need to update 2 places to detect this correctly: 1. is_dma_buf_page() needs to be updated to correctly detect dmabuf pages after they've been inserted into the pool. 2. dma_buf_page_to_dma_addr() needs to be updated. For page pool pages, the dma_addr exists in page->dma_addr. For non page pool pages, the dma_addr exists in page->zone_device_data. Signed-off-by: Mina Almasry --- include/linux/dma-buf.h | 29 ++++++++++- include/net/page_pool.h | 20 ++++++++ net/core/page_pool.c | 104 ++++++++++++++++++++++++++++++++++++---- 3 files changed, 143 insertions(+), 10 deletions(-) diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 93228a2fec47..896359fa998d 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -692,15 +692,26 @@ static inline bool is_dma_buf_pages_file(struct file *file) struct page *dma_buf_pages_net_rx_alloc(struct dma_buf_pages *priv); +static inline bool is_dma_buf_page_net_rx(struct page *page) +{ + struct dma_buf_pages *priv; + + return (is_page_pool_page(page) && (priv = page->pp->mp_priv) && + priv->pgmap.ops == &dma_buf_pgmap_ops); +} + static inline bool is_dma_buf_page(struct page *page) { return (is_zone_device_page(page) && page->pgmap && - page->pgmap->ops == &dma_buf_pgmap_ops); + page->pgmap->ops == &dma_buf_pgmap_ops) || + is_dma_buf_page_net_rx(page); } static inline dma_addr_t dma_buf_page_to_dma_addr(struct page *page) { - return (dma_addr_t)page->zone_device_data; + return is_dma_buf_page_net_rx(page) ? + (dma_addr_t)page->dma_addr : + (dma_addr_t)page->zone_device_data; } static inline int dma_buf_map_sg(struct device *dev, struct scatterlist *sg, @@ -718,6 +729,16 @@ static inline int dma_buf_map_sg(struct device *dev, struct scatterlist *sg, return nents; } + +static inline bool is_dma_buf_pages_priv(void *ptr) +{ + struct dma_buf_pages *priv = (struct dma_buf_pages *)ptr; + + if (!priv || priv->pgmap.ops != &dma_buf_pgmap_ops) + return false; + + return true; +} #else static inline bool is_dma_buf_page(struct page *page) { @@ -745,6 +766,10 @@ static inline struct page *dma_buf_pages_net_rx_alloc(struct dma_buf_pages *priv return NULL; } +static inline bool is_dma_buf_pages_priv(void *ptr) +{ + return false; +} #endif diff --git a/include/net/page_pool.h b/include/net/page_pool.h index 7b6668479baf..a57757a13cc8 100644 --- a/include/net/page_pool.h +++ b/include/net/page_pool.h @@ -157,6 +157,7 @@ enum pp_memory_provider_type { PP_MP_HUGE_SPLIT, /* 2MB, online page alloc */ PP_MP_HUGE, /* 2MB, all memory pre-allocated */ PP_MP_HUGE_1G, /* 1G pages, MEP, pre-allocated */ + PP_MP_DMABUF_DEVMEM, /* dmabuf devmem provider */ }; struct pp_memory_provider_ops { @@ -170,6 +171,7 @@ extern const struct pp_memory_provider_ops basic_ops; extern const struct pp_memory_provider_ops hugesp_ops; extern const struct pp_memory_provider_ops huge_ops; extern const struct pp_memory_provider_ops huge_1g_ops; +extern const struct pp_memory_provider_ops dmabuf_devmem_ops; struct page_pool { struct page_pool_params p; @@ -420,4 +422,22 @@ static inline void page_pool_nid_changed(struct page_pool *pool, int new_nid) page_pool_update_nid(pool, new_nid); } +static inline bool is_page_pool_page(struct page *page) +{ + /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation + * in order to preserve any existing bits, such as bit 0 for the + * head page of compound page and bit 1 for pfmemalloc page, so + * mask those bits for freeing side when doing below checking, + * and page_is_pfmemalloc() is checked in __page_pool_put_page() + * to avoid recycling the pfmemalloc page. + */ + if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE)) + return false; + + if (!page->pp) + return false; + + return true; +} + #endif /* _NET_PAGE_POOL_H */ diff --git a/net/core/page_pool.c b/net/core/page_pool.c index df3f431fcff3..e626d4e309c1 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -236,6 +236,9 @@ static int page_pool_init(struct page_pool *pool, case PP_MP_HUGE_1G: pool->mp_ops = &huge_1g_ops; break; + case PP_MP_DMABUF_DEVMEM: + pool->mp_ops = &dmabuf_devmem_ops; + break; default: err = -EINVAL; goto free_ptr_ring; @@ -975,14 +978,7 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe) page = compound_head(page); - /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation - * in order to preserve any existing bits, such as bit 0 for the - * head page of compound page and bit 1 for pfmemalloc page, so - * mask those bits for freeing side when doing below checking, - * and page_is_pfmemalloc() is checked in __page_pool_put_page() - * to avoid recycling the pfmemalloc page. - */ - if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE)) + if (!is_page_pool_page(page)) return false; pp = page->pp; @@ -1538,3 +1534,95 @@ const struct pp_memory_provider_ops huge_1g_ops = { .alloc_pages = mp_huge_1g_alloc_pages, .release_page = mp_huge_1g_release, }; + +/*** "Dmabuf devmem page" ***/ + +/* Dmabuf devmem memory provider allocates DMA_BUF_PAGES_NET_RX pages which are + * backing the dma_buf_map_attachment() from the NIC to the device memory. + * + * These pages are wrappers around the dma_addr of the sg entries in the + * sg_table returned from dma_buf_map_attachment(). They can be passed to the + * networking stack, which will generate devmem skbs from them and process them + * correctly. + */ +static int mp_dmabuf_devmem_init(struct page_pool *pool) +{ + struct dma_buf_pages *priv; + + priv = pool->mp_priv; + if (!is_dma_buf_pages_priv(priv)) + return -EINVAL; + + return 0; +} + +static void mp_dmabuf_devmem_destroy(struct page_pool *pool) +{ +} + +static struct page *mp_dmabuf_devmem_alloc_pages(struct page_pool *pool, + gfp_t gfp) +{ + struct dma_buf_pages *priv = pool->mp_priv; + dma_addr_t dma_addr; + struct page *page; + + page = dma_buf_pages_net_rx_alloc(priv); + if (!page) + return page; + + /* It shouldn't be possible for the allocation to give us a page not + * belonging to this page_pool's pgmap. + */ + BUG_ON(page->pgmap != &priv->pgmap); + + /* netdev_rxq_alloc_dma_buf_page() allocates a ZONE_DEVICE page. + * Prepare to convert it into a page_pool page. We need to hold pgmap + * and zone_device_data (which holds the dma_addr). + * + * DMA_BUF_PAGES_NET_RX are dmabuf pages created specifically to wrap + * the dma_addr of the sg_table into a struct page. These pages are + * used by code specifically equipped to handle them, so this + * conversation from ZONE_DEVICE page to page pool page should be safe. + */ + dma_addr = (dma_addr_t)page->zone_device_data; + + set_page_zone(page, ZONE_NORMAL); + page->pp_magic = 0; + page_pool_set_pp_info(pool, page); + + page->dma_addr = dma_addr; + + return page; +} + +static bool mp_dmabuf_devmem_release_page(struct page_pool *pool, + struct page *page) +{ + struct dma_buf_pages *priv = pool->mp_priv; + unsigned long dma_addr = page->dma_addr; + + page_pool_clear_pp_info(page); + + /* As the page pool releases the page, restore it back to a ZONE_DEVICE + * page so it gets freed according to the + * page->pgmap->ops->page_free(). + */ + set_page_zone(page, ZONE_DEVICE); + page->zone_device_data = (void*)dma_addr; + page->pgmap = &priv->pgmap; + put_page(page); + + /* Return false here as we don't want the page pool touching the page + * after it's released to us. + */ + return false; +} + +const struct pp_memory_provider_ops dmabuf_devmem_ops = { + .init = mp_dmabuf_devmem_init, + .destroy = mp_dmabuf_devmem_destroy, + .alloc_pages = mp_dmabuf_devmem_alloc_pages, + .release_page = mp_dmabuf_devmem_release_page, +}; +EXPORT_SYMBOL(dmabuf_devmem_ops);