From patchwork Mon Jan 6 13:01:09 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yunsheng Lin X-Patchwork-Id: 13927413 X-Patchwork-Delegate: kuba@kernel.org Received: from szxga04-in.huawei.com (szxga04-in.huawei.com [45.249.212.190]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1D2C327726; Mon, 6 Jan 2025 13:08:29 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=45.249.212.190 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736168913; cv=none; b=NWcdPQZercydvNKi9eJppTKNo9915BjggU0B8izDUka4phlLgWK969seslpNMpGjJkZ7fJe8djAaqx1Dy26XnTp2zlhX7+6LHhP4sabSgEDbj0yMd4HliXUztEGRdTM/mYTM6Nxjl0dj13pWPlrLOHh8NCZUH9mDzb12v54x16g= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736168913; c=relaxed/simple; bh=+D4VNYTgIwwWMxTiK1lJjo3rk6IBtMA9Ka3FjNPbRRw=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=DE3xyBm3EaeGlP+EGGCgS5bWNjCG1MpVeXOjfWOO3fOcT7nukym2rmViMjGJ9JdMxNg99Ru38vwBJoHdW5G5BTm8olHR/SjUE4LBy1+Qc3XrGBogFsaOAmXYzwl1/8u4Uh+CCfwOuMtTg6GYdH7qV3/qpwEIsPr7UibBWZtc03c= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=45.249.212.190 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.19.163.44]) by szxga04-in.huawei.com (SkyGuard) with ESMTP id 4YRZF22Jdcz22kfs; Mon, 6 Jan 2025 21:06:14 +0800 (CST) Received: from dggpemf200006.china.huawei.com (unknown [7.185.36.61]) by mail.maildlp.com (Postfix) with ESMTPS id 53F69140135; Mon, 6 Jan 2025 21:08:27 +0800 (CST) Received: from localhost.localdomain (10.90.30.45) by dggpemf200006.china.huawei.com (7.185.36.61) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Mon, 6 Jan 2025 21:08:26 +0800 From: Yunsheng Lin To: , , CC: , , , Yunsheng Lin , Wei Fang , Shenwei Wang , Clark Wang , Andrew Lunn , Eric Dumazet , Jeroen de Borst , Praveen Kaligineedi , Shailend Chand , Tony Nguyen , Przemek Kitszel , Alexander Lobakin , Alexei Starovoitov , Daniel Borkmann , Jesper Dangaard Brouer , John Fastabend , Saeed Mahameed , Leon Romanovsky , Tariq Toukan , Felix Fietkau , Lorenzo Bianconi , Ryder Lee , Shayne Chen , Sean Wang , Kalle Valo , Matthias Brugger , AngeloGioacchino Del Regno , Simon Horman , Ilias Apalodimas , , , , , , , , , Subject: [PATCH net-next v6 1/8] page_pool: introduce page_pool_get_pp() API Date: Mon, 6 Jan 2025 21:01:09 +0800 Message-ID: <20250106130116.457938-2-linyunsheng@huawei.com> X-Mailer: git-send-email 2.30.0 In-Reply-To: <20250106130116.457938-1-linyunsheng@huawei.com> References: <20250106130116.457938-1-linyunsheng@huawei.com> Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: dggems705-chm.china.huawei.com (10.3.19.182) To dggpemf200006.china.huawei.com (7.185.36.61) X-Patchwork-Delegate: kuba@kernel.org introduce page_pool_get_pp() API to avoid caller accessing page->pp directly. 
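As a rough usage sketch (not taken from this patch), a converted driver call
site could look like the following; example_free_rx_page() and its driver
context are hypothetical, while page_pool_get_pp() (added below) and
page_pool_put_full_page() are the calls used in the conversions in this patch:

    #include <net/page_pool/helpers.h>

    /* Hypothetical driver teardown path: resolve the owning pool via the
     * new helper instead of dereferencing page->pp directly, so later
     * patches are free to change how the pool pointer is stored.
     */
    static void example_free_rx_page(struct page *page)
    {
            struct page_pool *pool = page_pool_get_pp(page);

            /* allow_direct == false: not called from the pool's NAPI context */
            page_pool_put_full_page(pool, page, false);
    }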
Signed-off-by: Yunsheng Lin --- drivers/net/ethernet/freescale/fec_main.c | 8 +++++--- .../net/ethernet/google/gve/gve_buffer_mgmt_dqo.c | 2 +- drivers/net/ethernet/intel/iavf/iavf_txrx.c | 6 ++++-- drivers/net/ethernet/intel/idpf/idpf_txrx.c | 14 +++++++++----- drivers/net/ethernet/intel/libeth/rx.c | 2 +- drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c | 3 ++- drivers/net/netdevsim/netdev.c | 6 ++++-- drivers/net/wireless/mediatek/mt76/mt76.h | 2 +- include/net/libeth/rx.h | 3 ++- include/net/page_pool/helpers.h | 5 +++++ 10 files changed, 34 insertions(+), 17 deletions(-) diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c index b2daed55bf6c..18d2119dbec1 100644 --- a/drivers/net/ethernet/freescale/fec_main.c +++ b/drivers/net/ethernet/freescale/fec_main.c @@ -1009,7 +1009,8 @@ static void fec_enet_bd_init(struct net_device *dev) struct page *page = txq->tx_buf[i].buf_p; if (page) - page_pool_put_page(page->pp, page, 0, false); + page_pool_put_page(page_pool_get_pp(page), + page, 0, false); } txq->tx_buf[i].buf_p = NULL; @@ -1549,7 +1550,7 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id, int budget) xdp_return_frame_rx_napi(xdpf); } else { /* recycle pages of XDP_TX frames */ /* The dma_sync_size = 0 as XDP_TX has already synced DMA for_device */ - page_pool_put_page(page->pp, page, 0, true); + page_pool_put_page(page_pool_get_pp(page), page, 0, true); } txq->tx_buf[index].buf_p = NULL; @@ -3307,7 +3308,8 @@ static void fec_enet_free_buffers(struct net_device *ndev) } else { struct page *page = txq->tx_buf[i].buf_p; - page_pool_put_page(page->pp, page, 0, false); + page_pool_put_page(page_pool_get_pp(page), + page, 0, false); } txq->tx_buf[i].buf_p = NULL; diff --git a/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c b/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c index 403f0f335ba6..87422b8828ff 100644 --- a/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c +++ b/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c @@ -210,7 +210,7 @@ void gve_free_to_page_pool(struct gve_rx_ring *rx, if (!page) return; - page_pool_put_full_page(page->pp, page, allow_direct); + page_pool_put_full_page(page_pool_get_pp(page), page, allow_direct); buf_state->page_info.page = NULL; } diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c index 26b424fd6718..e1bf5554f6e3 100644 --- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c +++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c @@ -1050,7 +1050,8 @@ static void iavf_add_rx_frag(struct sk_buff *skb, const struct libeth_fqe *rx_buffer, unsigned int size) { - u32 hr = rx_buffer->page->pp->p.offset; + struct page_pool *pool = page_pool_get_pp(rx_buffer->page); + u32 hr = pool->p.offset; skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, rx_buffer->page, rx_buffer->offset + hr, size, rx_buffer->truesize); @@ -1067,7 +1068,8 @@ static void iavf_add_rx_frag(struct sk_buff *skb, static struct sk_buff *iavf_build_skb(const struct libeth_fqe *rx_buffer, unsigned int size) { - u32 hr = rx_buffer->page->pp->p.offset; + struct page_pool *pool = page_pool_get_pp(rx_buffer->page); + u32 hr = pool->p.offset; struct sk_buff *skb; void *va; diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c index 2fa9c36e33c9..04f2347716ca 100644 --- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c +++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c @@ -385,7 +385,8 @@ static void idpf_rx_page_rel(struct libeth_fqe *rx_buf) if 
(unlikely(!rx_buf->page)) return; - page_pool_put_full_page(rx_buf->page->pp, rx_buf->page, false); + page_pool_put_full_page(page_pool_get_pp(rx_buf->page), rx_buf->page, + false); rx_buf->page = NULL; rx_buf->offset = 0; @@ -3098,7 +3099,8 @@ idpf_rx_process_skb_fields(struct idpf_rx_queue *rxq, struct sk_buff *skb, void idpf_rx_add_frag(struct idpf_rx_buf *rx_buf, struct sk_buff *skb, unsigned int size) { - u32 hr = rx_buf->page->pp->p.offset; + struct page_pool *pool = page_pool_get_pp(rx_buf->page); + u32 hr = pool->p.offset; skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, rx_buf->page, rx_buf->offset + hr, size, rx_buf->truesize); @@ -3130,8 +3132,10 @@ static u32 idpf_rx_hsplit_wa(const struct libeth_fqe *hdr, if (!libeth_rx_sync_for_cpu(buf, copy)) return 0; - dst = page_address(hdr->page) + hdr->offset + hdr->page->pp->p.offset; - src = page_address(buf->page) + buf->offset + buf->page->pp->p.offset; + dst = page_address(hdr->page) + hdr->offset + + page_pool_get_pp(hdr->page)->p.offset; + src = page_address(buf->page) + buf->offset + + page_pool_get_pp(buf->page)->p.offset; memcpy(dst, src, LARGEST_ALIGN(copy)); buf->offset += copy; @@ -3149,7 +3153,7 @@ static u32 idpf_rx_hsplit_wa(const struct libeth_fqe *hdr, */ struct sk_buff *idpf_rx_build_skb(const struct libeth_fqe *buf, u32 size) { - u32 hr = buf->page->pp->p.offset; + u32 hr = page_pool_get_pp(buf->page)->p.offset; struct sk_buff *skb; void *va; diff --git a/drivers/net/ethernet/intel/libeth/rx.c b/drivers/net/ethernet/intel/libeth/rx.c index 66d1d23b8ad2..8de0c3a3b146 100644 --- a/drivers/net/ethernet/intel/libeth/rx.c +++ b/drivers/net/ethernet/intel/libeth/rx.c @@ -207,7 +207,7 @@ EXPORT_SYMBOL_NS_GPL(libeth_rx_fq_destroy, "LIBETH"); */ void libeth_rx_recycle_slow(struct page *page) { - page_pool_recycle_direct(page->pp, page); + page_pool_recycle_direct(page_pool_get_pp(page), page); } EXPORT_SYMBOL_NS_GPL(libeth_rx_recycle_slow, "LIBETH"); diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c index 94b291662087..30baca49c71e 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c @@ -716,7 +716,8 @@ static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq, /* No need to check ((page->pp_magic & ~0x3UL) == PP_SIGNATURE) * as we know this is a page_pool page. 
*/ - page_pool_recycle_direct(page->pp, page); + page_pool_recycle_direct(page_pool_get_pp(page), + page); } while (++n < num); break; diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c index e068a9761c09..e583415f9088 100644 --- a/drivers/net/netdevsim/netdev.c +++ b/drivers/net/netdevsim/netdev.c @@ -632,7 +632,8 @@ nsim_pp_hold_write(struct file *file, const char __user *data, if (!ns->page) ret = -ENOMEM; } else { - page_pool_put_full_page(ns->page->pp, ns->page, false); + page_pool_put_full_page(page_pool_get_pp(ns->page), ns->page, + false); ns->page = NULL; } @@ -831,7 +832,8 @@ void nsim_destroy(struct netdevsim *ns) /* Put this intentionally late to exercise the orphaning path */ if (ns->page) { - page_pool_put_full_page(ns->page->pp, ns->page, false); + page_pool_put_full_page(page_pool_get_pp(ns->page), ns->page, + false); ns->page = NULL; } diff --git a/drivers/net/wireless/mediatek/mt76/mt76.h b/drivers/net/wireless/mediatek/mt76/mt76.h index ca2dba3ac65d..4d0e41a7bf4a 100644 --- a/drivers/net/wireless/mediatek/mt76/mt76.h +++ b/drivers/net/wireless/mediatek/mt76/mt76.h @@ -1688,7 +1688,7 @@ static inline void mt76_put_page_pool_buf(void *buf, bool allow_direct) { struct page *page = virt_to_head_page(buf); - page_pool_put_full_page(page->pp, page, allow_direct); + page_pool_put_full_page(page_pool_get_pp(page), page, allow_direct); } static inline void * diff --git a/include/net/libeth/rx.h b/include/net/libeth/rx.h index 43574bd6612f..f4ae75f9cc1b 100644 --- a/include/net/libeth/rx.h +++ b/include/net/libeth/rx.h @@ -137,7 +137,8 @@ static inline bool libeth_rx_sync_for_cpu(const struct libeth_fqe *fqe, return false; } - page_pool_dma_sync_for_cpu(page->pp, page, fqe->offset, len); + page_pool_dma_sync_for_cpu(page_pool_get_pp(page), page, fqe->offset, + len); return true; } diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h index 543f54fa3020..9c4dbd2289b1 100644 --- a/include/net/page_pool/helpers.h +++ b/include/net/page_pool/helpers.h @@ -83,6 +83,11 @@ static inline u64 *page_pool_ethtool_stats_get(u64 *data, const void *stats) } #endif +static inline struct page_pool *page_pool_get_pp(struct page *page) +{ + return page->pp; +} + /** * page_pool_dev_alloc_pages() - allocate a page. 
* @pool: pool from which to allocate From patchwork Mon Jan 6 13:01:10 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yunsheng Lin X-Patchwork-Id: 13927418 X-Patchwork-Delegate: kuba@kernel.org Received: from szxga05-in.huawei.com (szxga05-in.huawei.com [45.249.212.191]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 29E2E1DE4EC; Mon, 6 Jan 2025 13:08:37 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=45.249.212.191 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736168920; cv=none; b=s45Q/5yOMb/WYQX+MRIvh2uLs3G65/zbCPLIy80PqXLxqygNBp82AwVoVidBmU7gSWWzHN4UqfPVyaPJECvtxlP33p5eRrh+e9/zIzZ4b7qeHQbtEY7qwPBzCXP1TeGqBn08KbMqN4yoIt8jQqq/Ws2wurEkmHJ0UKPafHj7HAI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736168920; c=relaxed/simple; bh=SE6agWJD0vY2chxfdQBW+/+kxBRBLrA9FT5tlI79wUw=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=mq1/KXGbXeMoufMgxMsqpuVCqXqxvqyKCHcZhNDrZ0EME/vEAZ38lxTGsDDWu+9OdluLbt2X2k0WOkuJO8Bjk+HHP5GypPZv34JrDnqXUDn6pW7EUoy+xbbzjyvE5Y1xMiKnCJSMdVB/Z/ygFFpib+qyXAMiK+S3IwemuOY4Muw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=45.249.212.191 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.19.88.163]) by szxga05-in.huawei.com (SkyGuard) with ESMTP id 4YRZDB29q3z1kxrQ; Mon, 6 Jan 2025 21:05:30 +0800 (CST) Received: from dggpemf200006.china.huawei.com (unknown [7.185.36.61]) by mail.maildlp.com (Postfix) with ESMTPS id 1EFD718001B; Mon, 6 Jan 2025 21:08:29 +0800 (CST) Received: from localhost.localdomain (10.90.30.45) by dggpemf200006.china.huawei.com (7.185.36.61) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Mon, 6 Jan 2025 21:08:28 +0800 From: Yunsheng Lin To: , , CC: , , , Yunsheng Lin , Alexander Lobakin , Xuan Zhuo , Jesper Dangaard Brouer , Ilias Apalodimas , Eric Dumazet , Simon Horman , , Subject: [PATCH net-next v6 2/8] page_pool: fix timing for checking and disabling napi_local Date: Mon, 6 Jan 2025 21:01:10 +0800 Message-ID: <20250106130116.457938-3-linyunsheng@huawei.com> X-Mailer: git-send-email 2.30.0 In-Reply-To: <20250106130116.457938-1-linyunsheng@huawei.com> References: <20250106130116.457938-1-linyunsheng@huawei.com> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: dggems705-chm.china.huawei.com (10.3.19.182) To dggpemf200006.china.huawei.com (7.185.36.61) X-Patchwork-Delegate: kuba@kernel.org page_pool page may be freed from skb_defer_free_flush() in softirq context without binding to any specific napi, it may cause use-after-free problem due to the below time window, as below, CPU1 may still access napi->list_owner after CPU0 free the napi memory: CPU 0 CPU1 page_pool_destroy() skb_defer_free_flush() . . . napi = READ_ONCE(pool->p.napi); . . page_pool_disable_direct_recycling() . driver free napi memory . . . . napi && READ_ONCE(napi->list_owner) == cpuid . . 
Use the RCU mechanism to avoid the above problem.

Note: the above was found during code review of how to fix the problem in [1].
As the following IOMMU fix patch depends on the synchronize_rcu() added in this
patch, and the time window is so small that this does not seem to be an urgent
fix, target net-next as the IOMMU fix patch does.

1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/

Fixes: dd64b232deb8 ("page_pool: unlink from napi during destroy")
Signed-off-by: Yunsheng Lin
CC: Alexander Lobakin
Reviewed-by: Xuan Zhuo
---
 net/core/page_pool.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 9733206d6406..1aa7b93bdcc8 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -799,6 +799,7 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem,
 static bool page_pool_napi_local(const struct page_pool *pool)
 {
 	const struct napi_struct *napi;
+	bool napi_local;
 	u32 cpuid;

 	if (unlikely(!in_softirq()))
@@ -814,9 +815,15 @@ static bool page_pool_napi_local(const struct page_pool *pool)
 	if (READ_ONCE(pool->cpuid) == cpuid)
 		return true;

+	/* Synchronized with page_pool_destroy() to avoid use-after-free
+	 * for 'napi'.
+	 */
+	rcu_read_lock();
 	napi = READ_ONCE(pool->p.napi);
+	napi_local = napi && READ_ONCE(napi->list_owner) == cpuid;
+	rcu_read_unlock();

-	return napi && READ_ONCE(napi->list_owner) == cpuid;
+	return napi_local;
 }

 void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
@@ -1165,6 +1172,12 @@ void page_pool_destroy(struct page_pool *pool)
 	if (!page_pool_release(pool))
 		return;

+	/* Paired with the RCU read lock in page_pool_napi_local() to ensure
+	 * the clearing of pool->p.napi in page_pool_disable_direct_recycling()
+	 * is seen before returning to the driver to free the napi instance.
+ */ + synchronize_rcu(); + page_pool_detached(pool); pool->defer_start = jiffies; pool->defer_warn = jiffies + DEFER_WARN_INTERVAL; From patchwork Mon Jan 6 13:01:11 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yunsheng Lin X-Patchwork-Id: 13927419 X-Patchwork-Delegate: kuba@kernel.org Received: from szxga03-in.huawei.com (szxga03-in.huawei.com [45.249.212.189]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DAC181DE4D3; Mon, 6 Jan 2025 13:08:35 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=45.249.212.189 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736168920; cv=none; b=kkIEToU8zj7o5BMjjMc6h0vAz+a5sJ6qpVF8pStl/4qZU84fWgGRvhRSOKm7p+BqJoXHoBV6H5Dk3KIgqmOEwOv2uvX4aapzk1BVkGOlcO+qqUowLFg6WgAxsYGgPLg3BQB34VRgNWDQqpLoAgMOGoDE3gY/AUHzQOk+nA0h5MQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736168920; c=relaxed/simple; bh=K2KmTHJgNsem0mMkk7xJT+VccUc/EeNLQfnCmS+c64A=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=DepRQwkQXpOzNPRxIvamTk0rqvzkCy/H47MYLSK87LKoRGs5ON+uIdJKUKzFv+RsgPjcPbKqbhQt+pMbgNZ81efp7d4AXPT1H5parZR1hvfIc+96kecQnhEFr4ycaRKEJ0yhRkYKQU450itSayfeJBg6v6hnmgDIZZXp8FDVWrU= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=45.249.212.189 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.19.162.254]) by szxga03-in.huawei.com (SkyGuard) with ESMTP id 4YRZF80FqqzRhlv; Mon, 6 Jan 2025 21:06:20 +0800 (CST) Received: from dggpemf200006.china.huawei.com (unknown [7.185.36.61]) by mail.maildlp.com (Postfix) with ESMTPS id 4307D18010A; Mon, 6 Jan 2025 21:08:33 +0800 (CST) Received: from localhost.localdomain (10.90.30.45) by dggpemf200006.china.huawei.com (7.185.36.61) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Mon, 6 Jan 2025 21:08:32 +0800 From: Yunsheng Lin To: , , CC: , , , Yunsheng Lin , Robin Murphy , Alexander Duyck , IOMMU , Andrew Morton , Eric Dumazet , Simon Horman , Jesper Dangaard Brouer , Ilias Apalodimas , , , Subject: [PATCH net-next v6 3/8] page_pool: fix IOMMU crash when driver has already unbound Date: Mon, 6 Jan 2025 21:01:11 +0800 Message-ID: <20250106130116.457938-4-linyunsheng@huawei.com> X-Mailer: git-send-email 2.30.0 In-Reply-To: <20250106130116.457938-1-linyunsheng@huawei.com> References: <20250106130116.457938-1-linyunsheng@huawei.com> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: dggems705-chm.china.huawei.com (10.3.19.182) To dggpemf200006.china.huawei.com (7.185.36.61) X-Patchwork-Delegate: kuba@kernel.org Networking driver with page_pool support may hand over page still with dma mapping to network stack and try to reuse that page after network stack is done with it and passes it back to page_pool to avoid the penalty of dma mapping/unmapping. 
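The reuse model described above looks roughly like the following sketch
(assuming a pool created with PP_FLAG_DMA_MAP; example_rx_refill_and_recycle()
is hypothetical, the page_pool calls are the existing API); the next paragraphs
describe why this model becomes a problem once the driver unbinds:

    #include <net/page_pool/helpers.h>

    static void example_rx_refill_and_recycle(struct page_pool *pool)
    {
            struct page *page = page_pool_dev_alloc_pages(pool); /* DMA-mapped once here */
            dma_addr_t dma;

            if (!page)
                    return;

            /* The mapping is stored in the page by the page_pool and is
             * reused on every recycle, so there is no per-packet map/unmap.
             */
            dma = page_pool_get_dma_addr(page);

            /* ... post 'dma' to the RX ring and hand the page to the stack
             * via an skb; when the stack frees the skb the page returns to
             * the pool with its DMA mapping still live ...
             */
            page_pool_put_full_page(pool, page, false);
    }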
With all the caching in the network stack, some pages may be held in the
network stack without being returned to the page_pool soon enough, and when a
VF disable causes the driver to be unbound, the page_pool does not block the
driver's unbinding work. Instead, the page_pool uses a workqueue to
periodically check whether pages have come back from the network stack, and if
so it does the DMA-unmapping-related cleanup work.

As mentioned in [1], attempting DMA unmaps after the driver has already
unbound may leak resources or at worst corrupt memory. Fundamentally, the
page_pool code cannot allow DMA mappings to outlive the driver they belong to.

Currently there seem to be at least two cases in which pages are not released
fast enough, causing the DMA unmapping to happen after the driver has already
unbound:
1. IPv4 packet defragmentation timeout: this seems to cause a delay of up to
   30 secs.
2. skb_defer_free_flush(): this may cause an indefinite delay if there is
   nothing to trigger net_rx_action().

To avoid calling DMA APIs to do the DMA unmapping after the driver has already
unbound, and to avoid stalling the unloading of the networking driver, use
pre-allocated item blocks to record inflight pages, including the ones handed
over to the network stack, so that the page_pool can do the DMA unmapping for
those pages when page_pool_destroy() is called. As the pre-allocated item
blocks need to be large enough to avoid performance degradation, add an
'item_fast_empty' stat to indicate when the pre-allocated item blocks run out.

The overhead of tracking inflight pages is about 10ns~20ns, which causes about
a 10% performance degradation for the time_bench_page_pool03_slow() test case
in [2].

Note: the devmem patchset seems to make the bug harder to fix, and may make
backporting harder too. As there is no actual user of devmem yet and how to
fix it for devmem is unclear for now, this patch does not try to fix the
devmem case yet.

1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
2. https://github.com/netoptimizer/prototype-kernel

CC: Robin Murphy
CC: Alexander Duyck
CC: IOMMU
Fixes: f71fec47c2df ("page_pool: make sure struct device is stable")
Signed-off-by: Yunsheng Lin
Tested-by: Yonglong Liu
---
 include/linux/mm_types.h | 2 +-
 include/linux/skbuff.h | 1 +
 include/net/netmem.h | 24 ++-
 include/net/page_pool/helpers.h | 8 +-
 include/net/page_pool/types.h | 36 +++-
 net/core/devmem.c | 4 +-
 net/core/netmem_priv.h | 5 +-
 net/core/page_pool.c | 304 +++++++++++++++++++++++++++-----
 net/core/page_pool_priv.h | 10 +-
 9 files changed, 338 insertions(+), 56 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7361a8f3ab68..abbb19f3244a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -120,7 +120,7 @@ struct page {
 			 * page_pool allocated pages.
*/ unsigned long pp_magic; - struct page_pool *pp; + struct page_pool_item *pp_item; unsigned long _pp_mapping_pad; unsigned long dma_addr; atomic_long_t pp_ref_count; diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index bb2b751d274a..a47a23527724 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -39,6 +39,7 @@ #include #include #include +#include /** * DOC: skb checksums diff --git a/include/net/netmem.h b/include/net/netmem.h index 1b58faa4f20f..c848a48b8e96 100644 --- a/include/net/netmem.h +++ b/include/net/netmem.h @@ -23,7 +23,7 @@ DECLARE_STATIC_KEY_FALSE(page_pool_mem_providers); struct net_iov { unsigned long __unused_padding; unsigned long pp_magic; - struct page_pool *pp; + struct page_pool_item *pp_item; struct dmabuf_genpool_chunk_owner *owner; unsigned long dma_addr; atomic_long_t pp_ref_count; @@ -33,7 +33,7 @@ struct net_iov { * * struct { * unsigned long pp_magic; - * struct page_pool *pp; + * struct page_pool_item *pp_item; * unsigned long _pp_mapping_pad; * unsigned long dma_addr; * atomic_long_t pp_ref_count; @@ -49,7 +49,7 @@ struct net_iov { static_assert(offsetof(struct page, pg) == \ offsetof(struct net_iov, iov)) NET_IOV_ASSERT_OFFSET(pp_magic, pp_magic); -NET_IOV_ASSERT_OFFSET(pp, pp); +NET_IOV_ASSERT_OFFSET(pp_item, pp_item); NET_IOV_ASSERT_OFFSET(dma_addr, dma_addr); NET_IOV_ASSERT_OFFSET(pp_ref_count, pp_ref_count); #undef NET_IOV_ASSERT_OFFSET @@ -67,6 +67,11 @@ NET_IOV_ASSERT_OFFSET(pp_ref_count, pp_ref_count); */ typedef unsigned long __bitwise netmem_ref; +/* Mirror page_pool_item_block, see include/net/page_pool/types.h */ +struct netmem_item_block { + struct page_pool *pp; +}; + static inline bool netmem_is_net_iov(const netmem_ref netmem) { return (__force unsigned long)netmem & NET_IOV; @@ -154,6 +159,11 @@ static inline struct net_iov *__netmem_clear_lsb(netmem_ref netmem) return (struct net_iov *)((__force unsigned long)netmem & ~NET_IOV); } +static inline struct page_pool_item *netmem_get_pp_item(netmem_ref netmem) +{ + return __netmem_clear_lsb(netmem)->pp_item; +} + /** * __netmem_get_pp - unsafely get pointer to the &page_pool backing @netmem * @netmem: netmem reference to get the pointer from @@ -167,12 +177,16 @@ static inline struct net_iov *__netmem_clear_lsb(netmem_ref netmem) */ static inline struct page_pool *__netmem_get_pp(netmem_ref netmem) { - return __netmem_to_page(netmem)->pp; + struct page_pool_item *item = __netmem_to_page(netmem)->pp_item; + struct netmem_item_block *block; + + block = (struct netmem_item_block *)((unsigned long)item & PAGE_MASK); + return block->pp; } static inline struct page_pool *netmem_get_pp(netmem_ref netmem) { - return __netmem_clear_lsb(netmem)->pp; + return __netmem_get_pp((__force netmem_ref)__netmem_clear_lsb(netmem)); } static inline atomic_long_t *netmem_get_pp_ref_count_ref(netmem_ref netmem) diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h index 9c4dbd2289b1..d4f2ec0898a5 100644 --- a/include/net/page_pool/helpers.h +++ b/include/net/page_pool/helpers.h @@ -83,9 +83,15 @@ static inline u64 *page_pool_ethtool_stats_get(u64 *data, const void *stats) } #endif +static inline struct page_pool_item_block * +page_pool_item_to_block(struct page_pool_item *item) +{ + return (struct page_pool_item_block *)((unsigned long)item & PAGE_MASK); +} + static inline struct page_pool *page_pool_get_pp(struct page *page) { - return page->pp; + return page_pool_item_to_block(page->pp_item)->pp; } /** diff --git a/include/net/page_pool/types.h 
b/include/net/page_pool/types.h index ed4cd114180a..240f940b5ddc 100644 --- a/include/net/page_pool/types.h +++ b/include/net/page_pool/types.h @@ -102,6 +102,7 @@ struct page_pool_params { * @refill: an allocation which triggered a refill of the cache * @waive: pages obtained from the ptr ring that cannot be added to * the cache due to a NUMA mismatch + * @item_fast_empty: pre-allocated item cache is empty */ struct page_pool_alloc_stats { u64 fast; @@ -110,6 +111,7 @@ struct page_pool_alloc_stats { u64 empty; u64 refill; u64 waive; + u64 item_fast_empty; }; /** @@ -142,6 +144,30 @@ struct page_pool_stats { }; #endif +struct page_pool_item { + unsigned long state; + + union { + netmem_ref pp_netmem; + struct llist_node lentry; + }; +}; + +/* The size of item_block is always PAGE_SIZE, so that the address of item_block + * for a specific item can be calculated using 'item & PAGE_MASK' + */ +struct page_pool_item_block { + struct page_pool *pp; + struct list_head list; + struct page_pool_item items[]; +}; + +/* Ensure the offset of 'pp' field for both 'page_pool_item_block' and + * 'netmem_item_block' are the same. + */ +static_assert(offsetof(struct page_pool_item_block, pp) == \ + offsetof(struct netmem_item_block, pp)); + /* The whole frag API block must stay within one cacheline. On 32-bit systems, * sizeof(long) == sizeof(int), so that the block size is ``3 * sizeof(long)``. * On 64-bit systems, the actual size is ``2 * sizeof(long) + sizeof(int)``. @@ -161,6 +187,7 @@ struct page_pool { int cpuid; u32 pages_state_hold_cnt; + struct llist_head hold_items; bool has_init_callback:1; /* slow::init_callback is set */ bool dma_map:1; /* Perform DMA mapping */ @@ -223,13 +250,20 @@ struct page_pool { #endif atomic_t pages_state_release_cnt; + /* Lock to protect item_blocks list when allocating and freeing + * item block memory dynamically when destroy_cnt is zero. + */ + spinlock_t item_lock; + struct list_head item_blocks; + struct llist_head release_items; + /* A page_pool is strictly tied to a single RX-queue being * protected by NAPI, due to above pp_alloc_cache. This * refcnt serves purpose is to simplify drivers error handling. 
*/ refcount_t user_cnt; - u64 destroy_cnt; + unsigned long destroy_cnt; /* Slow/Control-path information follows */ struct page_pool_params_slow slow; diff --git a/net/core/devmem.c b/net/core/devmem.c index 0b6ed7525b22..cc7093f00af1 100644 --- a/net/core/devmem.c +++ b/net/core/devmem.c @@ -85,7 +85,7 @@ net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding) niov = &owner->niovs[index]; niov->pp_magic = 0; - niov->pp = NULL; + niov->pp_item = NULL; atomic_long_set(&niov->pp_ref_count, 0); return niov; @@ -380,7 +380,7 @@ bool mp_dmabuf_devmem_release_page(struct page_pool *pool, netmem_ref netmem) if (WARN_ON_ONCE(refcount != 1)) return false; - page_pool_clear_pp_info(netmem); + page_pool_clear_pp_info(pool, netmem); net_devmem_free_dmabuf(netmem_to_net_iov(netmem)); diff --git a/net/core/netmem_priv.h b/net/core/netmem_priv.h index 7eadb8393e00..3173f6070cf7 100644 --- a/net/core/netmem_priv.h +++ b/net/core/netmem_priv.h @@ -18,9 +18,10 @@ static inline void netmem_clear_pp_magic(netmem_ref netmem) __netmem_clear_lsb(netmem)->pp_magic = 0; } -static inline void netmem_set_pp(netmem_ref netmem, struct page_pool *pool) +static inline void netmem_set_pp_item(netmem_ref netmem, + struct page_pool_item *item) { - __netmem_clear_lsb(netmem)->pp = pool; + __netmem_clear_lsb(netmem)->pp_item = item; } static inline void netmem_set_dma_addr(netmem_ref netmem, diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 1aa7b93bdcc8..e6ffbb000e53 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -61,6 +61,7 @@ static const char pp_stats[][ETH_GSTRING_LEN] = { "rx_pp_alloc_empty", "rx_pp_alloc_refill", "rx_pp_alloc_waive", + "rx_pp_alloc_item_fast_empty", "rx_pp_recycle_cached", "rx_pp_recycle_cache_full", "rx_pp_recycle_ring", @@ -94,6 +95,7 @@ bool page_pool_get_stats(const struct page_pool *pool, stats->alloc_stats.empty += pool->alloc_stats.empty; stats->alloc_stats.refill += pool->alloc_stats.refill; stats->alloc_stats.waive += pool->alloc_stats.waive; + stats->alloc_stats.item_fast_empty += pool->alloc_stats.item_fast_empty; for_each_possible_cpu(cpu) { const struct page_pool_recycle_stats *pcpu = @@ -139,6 +141,7 @@ u64 *page_pool_ethtool_stats_get(u64 *data, const void *stats) *data++ = pool_stats->alloc_stats.empty; *data++ = pool_stats->alloc_stats.refill; *data++ = pool_stats->alloc_stats.waive; + *data++ = pool_stats->alloc_stats.item_fast_empty; *data++ = pool_stats->recycle_stats.cached; *data++ = pool_stats->recycle_stats.cache_full; *data++ = pool_stats->recycle_stats.ring; @@ -268,6 +271,7 @@ static int page_pool_init(struct page_pool *pool, return -ENOMEM; } + spin_lock_init(&pool->item_lock); atomic_set(&pool->pages_state_release_cnt, 0); /* Driver calling page_pool_create() also call page_pool_destroy() */ @@ -325,6 +329,195 @@ static void page_pool_uninit(struct page_pool *pool) #endif } +#define PAGE_POOL_ITEM_USED 0 +#define PAGE_POOL_ITEM_MAPPED 1 + +#define ITEMS_PER_PAGE ((PAGE_SIZE - \ + offsetof(struct page_pool_item_block, items)) / \ + sizeof(struct page_pool_item)) + +#define page_pool_item_init_state(item) \ +({ \ + (item)->state = 0; \ +}) + +#if defined(CONFIG_DEBUG_NET) +#define page_pool_item_set_used(item) \ + __set_bit(PAGE_POOL_ITEM_USED, &(item)->state) + +#define page_pool_item_clear_used(item) \ + __clear_bit(PAGE_POOL_ITEM_USED, &(item)->state) + +#define page_pool_item_is_used(item) \ + test_bit(PAGE_POOL_ITEM_USED, &(item)->state) +#else +#define page_pool_item_set_used(item) +#define page_pool_item_clear_used(item) 
+#define page_pool_item_is_used(item) false +#endif + +#define page_pool_item_set_mapped(item) \ + __set_bit(PAGE_POOL_ITEM_MAPPED, &(item)->state) + +/* Only clear_mapped and is_mapped need to be atomic as they can be + * called concurrently. + */ +#define page_pool_item_clear_mapped(item) \ + clear_bit(PAGE_POOL_ITEM_MAPPED, &(item)->state) + +#define page_pool_item_is_mapped(item) \ + test_bit(PAGE_POOL_ITEM_MAPPED, &(item)->state) + +static __always_inline void __page_pool_release_page_dma(struct page_pool *pool, + netmem_ref netmem, + bool destroyed) +{ + struct page_pool_item *item; + dma_addr_t dma; + + if (!pool->dma_map) + /* Always account for inflight pages, even if we didn't + * map them + */ + return; + + dma = page_pool_get_dma_addr_netmem(netmem); + item = netmem_get_pp_item(netmem); + + /* dma unmapping is always needed when page_pool_destory() is not called + * yet. + */ + DEBUG_NET_WARN_ON_ONCE(!destroyed && !page_pool_item_is_mapped(item)); + if (unlikely(destroyed && !page_pool_item_is_mapped(item))) + return; + + /* When page is unmapped, it cannot be returned to our pool */ + dma_unmap_page_attrs(pool->p.dev, dma, + PAGE_SIZE << pool->p.order, pool->p.dma_dir, + DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING); + page_pool_set_dma_addr_netmem(netmem, 0); + page_pool_item_clear_mapped(item); +} + +static void __page_pool_item_init(struct page_pool *pool, struct page *page) +{ + struct page_pool_item_block *block = page_address(page); + struct page_pool_item *items = block->items; + unsigned int i; + + list_add(&block->list, &pool->item_blocks); + block->pp = pool; + + for (i = 0; i < ITEMS_PER_PAGE; i++) { + page_pool_item_init_state(&items[i]); + __llist_add(&items[i].lentry, &pool->hold_items); + } +} + +static int page_pool_item_init(struct page_pool *pool) +{ +#define PAGE_POOL_MIN_INFLIGHT_ITEMS 512 + struct page_pool_item_block *block; + int item_cnt; + + INIT_LIST_HEAD(&pool->item_blocks); + init_llist_head(&pool->hold_items); + init_llist_head(&pool->release_items); + + item_cnt = pool->p.pool_size * 2 + PP_ALLOC_CACHE_SIZE + + PAGE_POOL_MIN_INFLIGHT_ITEMS; + while (item_cnt > 0) { + struct page *page; + + page = alloc_pages_node(pool->p.nid, GFP_KERNEL, 0); + if (!page) + goto err; + + __page_pool_item_init(pool, page); + item_cnt -= ITEMS_PER_PAGE; + } + + return 0; +err: + list_for_each_entry(block, &pool->item_blocks, list) + put_page(virt_to_page(block)); + + return -ENOMEM; +} + +static void page_pool_item_unmap(struct page_pool *pool, + struct page_pool_item *item) +{ + spin_lock_bh(&pool->item_lock); + __page_pool_release_page_dma(pool, item->pp_netmem, true); + spin_unlock_bh(&pool->item_lock); +} + +static void page_pool_items_unmap(struct page_pool *pool) +{ + struct page_pool_item_block *block; + + if (!pool->dma_map || pool->mp_priv) + return; + + list_for_each_entry(block, &pool->item_blocks, list) { + struct page_pool_item *items = block->items; + int i; + + for (i = 0; i < ITEMS_PER_PAGE; i++) { + struct page_pool_item *item = &items[i]; + + if (!page_pool_item_is_mapped(item)) + continue; + + page_pool_item_unmap(pool, item); + } + } +} + +static void page_pool_item_uninit(struct page_pool *pool) +{ + struct page_pool_item_block *block; + + list_for_each_entry(block, &pool->item_blocks, list) + put_page(virt_to_page(block)); +} + +static bool page_pool_item_add(struct page_pool *pool, netmem_ref netmem) +{ + struct page_pool_item *item; + struct llist_node *node; + + if (unlikely(llist_empty(&pool->hold_items))) { + pool->hold_items.first = 
llist_del_all(&pool->release_items); + + if (unlikely(llist_empty(&pool->hold_items))) { + alloc_stat_inc(pool, item_fast_empty); + return false; + } + } + + node = pool->hold_items.first; + pool->hold_items.first = node->next; + item = llist_entry(node, struct page_pool_item, lentry); + item->pp_netmem = netmem; + page_pool_item_set_used(item); + netmem_set_pp_item(netmem, item); + return true; +} + +static void page_pool_item_del(struct page_pool *pool, netmem_ref netmem) +{ + struct page_pool_item *item = netmem_get_pp_item(netmem); + + DEBUG_NET_WARN_ON_ONCE(item->pp_netmem != netmem); + DEBUG_NET_WARN_ON_ONCE(page_pool_item_is_mapped(item)); + DEBUG_NET_WARN_ON_ONCE(!page_pool_item_is_used(item)); + page_pool_item_clear_used(item); + netmem_set_pp_item(netmem, NULL); + llist_add(&item->lentry, &pool->release_items); +} + /** * page_pool_create_percpu() - create a page pool for a given cpu. * @params: parameters, see struct page_pool_params @@ -344,12 +537,18 @@ page_pool_create_percpu(const struct page_pool_params *params, int cpuid) if (err < 0) goto err_free; - err = page_pool_list(pool); + err = page_pool_item_init(pool); if (err) goto err_uninit; + err = page_pool_list(pool); + if (err) + goto err_item_uninit; + return pool; +err_item_uninit: + page_pool_item_uninit(pool); err_uninit: page_pool_uninit(pool); err_free: @@ -369,7 +568,8 @@ struct page_pool *page_pool_create(const struct page_pool_params *params) } EXPORT_SYMBOL(page_pool_create); -static void page_pool_return_page(struct page_pool *pool, netmem_ref netmem); +static void __page_pool_return_page(struct page_pool *pool, netmem_ref netmem, + bool destroyed); static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool) { @@ -407,7 +607,7 @@ static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool) * (2) break out to fallthrough to alloc_pages_node. * This limit stress on page buddy alloactor. 
*/ - page_pool_return_page(pool, netmem); + __page_pool_return_page(pool, netmem, false); alloc_stat_inc(pool, waive); netmem = 0; break; @@ -464,6 +664,7 @@ page_pool_dma_sync_for_device(const struct page_pool *pool, static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem) { + struct page_pool_item *item; dma_addr_t dma; /* Setup DMA mapping: use 'struct page' area for storing DMA-addr @@ -481,6 +682,9 @@ static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem) if (page_pool_set_dma_addr_netmem(netmem, dma)) goto unmap_failed; + item = netmem_get_pp_item(netmem); + DEBUG_NET_WARN_ON_ONCE(page_pool_item_is_mapped(item)); + page_pool_item_set_mapped(item); page_pool_dma_sync_for_device(pool, netmem, pool->p.max_len); return true; @@ -503,19 +707,24 @@ static struct page *__page_pool_alloc_page_order(struct page_pool *pool, if (unlikely(!page)) return NULL; - if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page)))) { - put_page(page); - return NULL; - } + if (unlikely(!page_pool_set_pp_info(pool, page_to_netmem(page)))) + goto err_alloc; + + if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page)))) + goto err_set_info; alloc_stat_inc(pool, slow_high_order); - page_pool_set_pp_info(pool, page_to_netmem(page)); /* Track how many pages are held 'in-flight' */ pool->pages_state_hold_cnt++; trace_page_pool_state_hold(pool, page_to_netmem(page), pool->pages_state_hold_cnt); return page; +err_set_info: + page_pool_clear_pp_info(pool, page_to_netmem(page)); +err_alloc: + put_page(page); + return NULL; } /* slow path */ @@ -550,12 +759,18 @@ static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool, */ for (i = 0; i < nr_pages; i++) { netmem = pool->alloc.cache[i]; + + if (unlikely(!page_pool_set_pp_info(pool, netmem))) { + put_page(netmem_to_page(netmem)); + continue; + } + if (dma_map && unlikely(!page_pool_dma_map(pool, netmem))) { + page_pool_clear_pp_info(pool, netmem); put_page(netmem_to_page(netmem)); continue; } - page_pool_set_pp_info(pool, netmem); pool->alloc.cache[pool->alloc.count++] = netmem; /* Track how many pages are held 'in-flight' */ pool->pages_state_hold_cnt++; @@ -627,9 +842,11 @@ s32 page_pool_inflight(const struct page_pool *pool, bool strict) return inflight; } -void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem) +bool page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem) { - netmem_set_pp(netmem, pool); + if (unlikely(!page_pool_item_add(pool, netmem))) + return false; + netmem_or_pp_magic(netmem, PP_SIGNATURE); /* Ensuring all pages have been split into one fragment initially: @@ -641,32 +858,14 @@ void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem) page_pool_fragment_netmem(netmem, 1); if (pool->has_init_callback) pool->slow.init_callback(netmem, pool->slow.init_arg); -} -void page_pool_clear_pp_info(netmem_ref netmem) -{ - netmem_clear_pp_magic(netmem); - netmem_set_pp(netmem, NULL); + return true; } -static __always_inline void __page_pool_release_page_dma(struct page_pool *pool, - netmem_ref netmem) +void page_pool_clear_pp_info(struct page_pool *pool, netmem_ref netmem) { - dma_addr_t dma; - - if (!pool->dma_map) - /* Always account for inflight pages, even if we didn't - * map them - */ - return; - - dma = page_pool_get_dma_addr_netmem(netmem); - - /* When page is unmapped, it cannot be returned to our pool */ - dma_unmap_page_attrs(pool->p.dev, dma, - PAGE_SIZE << pool->p.order, pool->p.dma_dir, - DMA_ATTR_SKIP_CPU_SYNC | 
DMA_ATTR_WEAK_ORDERING); - page_pool_set_dma_addr_netmem(netmem, 0); + netmem_clear_pp_magic(netmem); + page_pool_item_del(pool, netmem); } /* Disconnects a page (from a page_pool). API users can have a need @@ -674,7 +873,8 @@ static __always_inline void __page_pool_release_page_dma(struct page_pool *pool, * a regular page (that will eventually be returned to the normal * page-allocator via put_page). */ -void page_pool_return_page(struct page_pool *pool, netmem_ref netmem) +void __page_pool_return_page(struct page_pool *pool, netmem_ref netmem, + bool destroyed) { int count; bool put; @@ -683,7 +883,7 @@ void page_pool_return_page(struct page_pool *pool, netmem_ref netmem) if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_priv) put = mp_dmabuf_devmem_release_page(pool, netmem); else - __page_pool_release_page_dma(pool, netmem); + __page_pool_release_page_dma(pool, netmem, destroyed); /* This may be the last page returned, releasing the pool, so * it is not safe to reference pool afterwards. @@ -692,7 +892,7 @@ void page_pool_return_page(struct page_pool *pool, netmem_ref netmem) trace_page_pool_state_release(pool, netmem, count); if (put) { - page_pool_clear_pp_info(netmem); + page_pool_clear_pp_info(pool, netmem); put_page(netmem_to_page(netmem)); } /* An optimization would be to call __free_pages(page, pool->p.order) @@ -701,6 +901,27 @@ void page_pool_return_page(struct page_pool *pool, netmem_ref netmem) */ } +/* Called from page_pool_put_*() path, need to synchronizated with + * page_pool_destory() path. + */ +static void page_pool_return_page(struct page_pool *pool, netmem_ref netmem) +{ + unsigned int destroy_cnt; + + rcu_read_lock(); + + destroy_cnt = READ_ONCE(pool->destroy_cnt); + if (unlikely(destroy_cnt)) { + spin_lock_bh(&pool->item_lock); + __page_pool_return_page(pool, netmem, true); + spin_unlock_bh(&pool->item_lock); + } else { + __page_pool_return_page(pool, netmem, false); + } + + rcu_read_unlock(); +} + static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem) { int ret; @@ -963,7 +1184,7 @@ static netmem_ref page_pool_drain_frag(struct page_pool *pool, return netmem; } - page_pool_return_page(pool, netmem); + __page_pool_return_page(pool, netmem, false); return 0; } @@ -977,7 +1198,7 @@ static void page_pool_free_frag(struct page_pool *pool) if (!netmem || page_pool_unref_netmem(netmem, drain_count)) return; - page_pool_return_page(pool, netmem); + __page_pool_return_page(pool, netmem, false); } netmem_ref page_pool_alloc_frag_netmem(struct page_pool *pool, @@ -1053,6 +1274,7 @@ static void __page_pool_destroy(struct page_pool *pool) if (pool->disconnect) pool->disconnect(pool); + page_pool_item_uninit(pool); page_pool_unlist(pool); page_pool_uninit(pool); @@ -1084,7 +1306,7 @@ static void page_pool_empty_alloc_cache_once(struct page_pool *pool) static void page_pool_scrub(struct page_pool *pool) { page_pool_empty_alloc_cache_once(pool); - pool->destroy_cnt++; + WRITE_ONCE(pool->destroy_cnt, pool->destroy_cnt + 1); /* No more consumers should exist, but producers could still * be in-flight. 
@@ -1178,6 +1400,8 @@ void page_pool_destroy(struct page_pool *pool) */ synchronize_rcu(); + page_pool_items_unmap(pool); + page_pool_detached(pool); pool->defer_start = jiffies; pool->defer_warn = jiffies + DEFER_WARN_INTERVAL; @@ -1198,7 +1422,7 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid) /* Flush pool alloc cache, as refill will check NUMA node */ while (pool->alloc.count) { netmem = pool->alloc.cache[--pool->alloc.count]; - page_pool_return_page(pool, netmem); + __page_pool_return_page(pool, netmem, false); } } EXPORT_SYMBOL(page_pool_update_nid); diff --git a/net/core/page_pool_priv.h b/net/core/page_pool_priv.h index 57439787b9c2..5d85f862a30a 100644 --- a/net/core/page_pool_priv.h +++ b/net/core/page_pool_priv.h @@ -36,16 +36,18 @@ static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr) } #if defined(CONFIG_PAGE_POOL) -void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem); -void page_pool_clear_pp_info(netmem_ref netmem); +bool page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem); +void page_pool_clear_pp_info(struct page_pool *pool, netmem_ref netmem); int page_pool_check_memory_provider(struct net_device *dev, struct netdev_rx_queue *rxq); #else -static inline void page_pool_set_pp_info(struct page_pool *pool, +static inline bool page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem) { + return true; } -static inline void page_pool_clear_pp_info(netmem_ref netmem) +static inline void page_pool_clear_pp_info(struct page_pool *pool, + netmem_ref netmem) { } static inline int page_pool_check_memory_provider(struct net_device *dev, From patchwork Mon Jan 6 13:01:12 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yunsheng Lin X-Patchwork-Id: 13927417 X-Patchwork-Delegate: kuba@kernel.org Received: from szxga06-in.huawei.com (szxga06-in.huawei.com [45.249.212.32]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2468D1DE2A5; Mon, 6 Jan 2025 13:08:38 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=45.249.212.32 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736168920; cv=none; b=WtlPZwe+GsXfgQ6i5yZXtHUKygHw9JZ9oT8g9HHk9WHfg3A3FFjC/Wi+4vBPsTD47c64lPmcg+9I7YGUFGU92JcohH/PnJka3M3N9FYhC5so94i70pAhG0E8CHtkFPo1HBVTS+Gq9Gm1XrschyDqKoIjId4q6veJc9egpQ4d690= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736168920; c=relaxed/simple; bh=fyB3b/Uvp9sOv/RNRc08ilToGU5wOABPP2SmnKUqHrg=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=nmHTItCJ/cikBYdtjgX8cCouEsmVKVbZSApiTK6F9vpYXqyNA+/ehvG0Ji/U5GzdBNv2n79kFuQq3ApocUly3LXcReH0fh0vxcQOoLvGIj9+JL+YFNzco4OUDMyGtmcGlJS6f3u7TlrVQUlUtF3RAu3GfBP4/HdYIq/NXD+PT9w= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=45.249.212.32 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.19.88.214]) by szxga06-in.huawei.com (SkyGuard) with ESMTP id 4YRZJ969Yqz20njP; Mon, 6 Jan 2025 21:08:57 +0800 (CST) Received: from dggpemf200006.china.huawei.com (unknown [7.185.36.61]) 
by mail.maildlp.com (Postfix) with ESMTPS id 89ECF1A016C; Mon, 6 Jan 2025 21:08:35 +0800 (CST) Received: from localhost.localdomain (10.90.30.45) by dggpemf200006.china.huawei.com (7.185.36.61) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Mon, 6 Jan 2025 21:08:35 +0800 From: Yunsheng Lin To: , , CC: , , , Yunsheng Lin , Robin Murphy , Alexander Duyck , IOMMU , Jesper Dangaard Brouer , Ilias Apalodimas , Eric Dumazet , Simon Horman , , Subject: [PATCH net-next v6 4/8] page_pool: support unlimited number of inflight pages Date: Mon, 6 Jan 2025 21:01:12 +0800 Message-ID: <20250106130116.457938-5-linyunsheng@huawei.com> X-Mailer: git-send-email 2.30.0 In-Reply-To: <20250106130116.457938-1-linyunsheng@huawei.com> References: <20250106130116.457938-1-linyunsheng@huawei.com> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: dggems705-chm.china.huawei.com (10.3.19.182) To dggpemf200006.china.huawei.com (7.185.36.61) X-Patchwork-Delegate: kuba@kernel.org Currently a fixed size of pre-allocated memory is used to keep track of the inflight pages, in order to use the DMA API correctly. As mentioned [1], the number of inflight pages can be up to 73203 depending on the use cases. Allocate memory dynamically to keep track of the inflight pages when pre-allocated memory runs out. The overhead of using dynamic memory allocation is about 10ns~ 20ns, which causes 5%~10% performance degradation for the test case of time_bench_page_pool03_slow() in [2]. 1. https://lore.kernel.org/all/b8b7818a-e44b-45f5-91c2-d5eceaa5dd5b@kernel.org/ 2. https://github.com/netoptimizer/prototype-kernel CC: Robin Murphy CC: Alexander Duyck CC: IOMMU Fixes: f71fec47c2df ("page_pool: make sure struct device is stable") Signed-off-by: Yunsheng Lin --- include/net/page_pool/types.h | 6 ++ net/core/devmem.c | 2 +- net/core/page_pool.c | 109 +++++++++++++++++++++++++++++++--- net/core/page_pool_priv.h | 6 +- 4 files changed, 111 insertions(+), 12 deletions(-) diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h index 240f940b5ddc..ae9b6159ca9f 100644 --- a/include/net/page_pool/types.h +++ b/include/net/page_pool/types.h @@ -103,6 +103,7 @@ struct page_pool_params { * @waive: pages obtained from the ptr ring that cannot be added to * the cache due to a NUMA mismatch * @item_fast_empty: pre-allocated item cache is empty + * @item_slow_failed: failed to allocate memory for item_block */ struct page_pool_alloc_stats { u64 fast; @@ -112,6 +113,7 @@ struct page_pool_alloc_stats { u64 refill; u64 waive; u64 item_fast_empty; + u64 item_slow_failed; }; /** @@ -159,6 +161,8 @@ struct page_pool_item { struct page_pool_item_block { struct page_pool *pp; struct list_head list; + unsigned int flags; + refcount_t ref; struct page_pool_item items[]; }; @@ -188,6 +192,8 @@ struct page_pool { int cpuid; u32 pages_state_hold_cnt; struct llist_head hold_items; + struct page_pool_item_block *item_blk; + unsigned int item_blk_idx; bool has_init_callback:1; /* slow::init_callback is set */ bool dma_map:1; /* Perform DMA mapping */ diff --git a/net/core/devmem.c b/net/core/devmem.c index cc7093f00af1..4d8b751d6f9c 100644 --- a/net/core/devmem.c +++ b/net/core/devmem.c @@ -380,7 +380,7 @@ bool mp_dmabuf_devmem_release_page(struct page_pool *pool, netmem_ref netmem) if (WARN_ON_ONCE(refcount != 1)) return false; - page_pool_clear_pp_info(pool, netmem); + page_pool_clear_pp_info(pool, netmem, 
false); net_devmem_free_dmabuf(netmem_to_net_iov(netmem)); diff --git a/net/core/page_pool.c b/net/core/page_pool.c index e6ffbb000e53..76992a5195f0 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -62,6 +62,7 @@ static const char pp_stats[][ETH_GSTRING_LEN] = { "rx_pp_alloc_refill", "rx_pp_alloc_waive", "rx_pp_alloc_item_fast_empty", + "rx_pp_alloc_item_slow_failed", "rx_pp_recycle_cached", "rx_pp_recycle_cache_full", "rx_pp_recycle_ring", @@ -96,6 +97,7 @@ bool page_pool_get_stats(const struct page_pool *pool, stats->alloc_stats.refill += pool->alloc_stats.refill; stats->alloc_stats.waive += pool->alloc_stats.waive; stats->alloc_stats.item_fast_empty += pool->alloc_stats.item_fast_empty; + stats->alloc_stats.item_slow_failed += pool->alloc_stats.item_slow_failed; for_each_possible_cpu(cpu) { const struct page_pool_recycle_stats *pcpu = @@ -142,6 +144,7 @@ u64 *page_pool_ethtool_stats_get(u64 *data, const void *stats) *data++ = pool_stats->alloc_stats.refill; *data++ = pool_stats->alloc_stats.waive; *data++ = pool_stats->alloc_stats.item_fast_empty; + *data++ = pool_stats->alloc_stats.item_slow_failed; *data++ = pool_stats->recycle_stats.cached; *data++ = pool_stats->recycle_stats.cache_full; *data++ = pool_stats->recycle_stats.ring; @@ -407,6 +410,8 @@ static void __page_pool_item_init(struct page_pool *pool, struct page *page) list_add(&block->list, &pool->item_blocks); block->pp = pool; + block->flags = 0; + refcount_set(&block->ref, 0); for (i = 0; i < ITEMS_PER_PAGE; i++) { page_pool_item_init_state(&items[i]); @@ -479,8 +484,82 @@ static void page_pool_item_uninit(struct page_pool *pool) { struct page_pool_item_block *block; - list_for_each_entry(block, &pool->item_blocks, list) + list_for_each_entry(block, &pool->item_blocks, list) { + WARN_ON(refcount_read(&block->ref)); put_page(virt_to_page(block)); + } +} + +#define PAGE_POOL_ITEM_BLK_DYNAMIC_BIT BIT(0) + +static bool page_pool_item_blk_add(struct page_pool *pool, netmem_ref netmem) +{ + struct page_pool_item *item; + + if (unlikely(!pool->item_blk || pool->item_blk_idx >= ITEMS_PER_PAGE)) { + struct page_pool_item_block *block; + struct page *page; + + page = alloc_pages_node(pool->p.nid, GFP_ATOMIC | __GFP_NOWARN | + __GFP_ZERO, 0); + if (!page) { + alloc_stat_inc(pool, item_slow_failed); + return false; + } + + block = page_address(page); + spin_lock_bh(&pool->item_lock); + list_add(&block->list, &pool->item_blocks); + spin_unlock_bh(&pool->item_lock); + + block->pp = pool; + block->flags |= PAGE_POOL_ITEM_BLK_DYNAMIC_BIT; + refcount_set(&block->ref, ITEMS_PER_PAGE); + pool->item_blk = block; + pool->item_blk_idx = 0; + } + + item = &pool->item_blk->items[pool->item_blk_idx++]; + item->pp_netmem = netmem; + page_pool_item_set_used(item); + netmem_set_pp_item(netmem, item); + return true; +} + +static void __page_pool_item_blk_del(struct page_pool *pool, + struct page_pool_item_block *block) +{ + spin_lock_bh(&pool->item_lock); + list_del(&block->list); + spin_unlock_bh(&pool->item_lock); + + put_page(virt_to_page(block)); +} + +static void page_pool_item_blk_free(struct page_pool *pool) +{ + struct page_pool_item_block *block = pool->item_blk; + + if (!block || pool->item_blk_idx >= ITEMS_PER_PAGE) + return; + + if (refcount_sub_and_test(ITEMS_PER_PAGE - pool->item_blk_idx, + &block->ref)) + __page_pool_item_blk_del(pool, block); +} + +static void page_pool_item_blk_del(struct page_pool *pool, + struct page_pool_item_block *block, + bool destroyed) +{ + /* Only call __page_pool_item_blk_del() when 
page_pool_destroy() + * is not called yet as alloc API is not allowed to be called at + * this point and pool->item_lock is reused to avoid concurrent + * dma unmapping when page_pool_destroy() is called, taking the + * lock in __page_pool_item_blk_del() causes deadlock. + */ + if (refcount_dec_and_test(&block->ref) && !destroyed) + __page_pool_item_blk_del(pool, block); } static bool page_pool_item_add(struct page_pool *pool, netmem_ref netmem) @@ -493,7 +572,7 @@ static bool page_pool_item_add(struct page_pool *pool, netmem_ref netmem) if (unlikely(llist_empty(&pool->hold_items))) { alloc_stat_inc(pool, item_fast_empty); - return false; + return page_pool_item_blk_add(pool, netmem); } } @@ -506,16 +585,26 @@ static bool page_pool_item_add(struct page_pool *pool, netmem_ref netmem) return true; } -static void page_pool_item_del(struct page_pool *pool, netmem_ref netmem) +static void page_pool_item_del(struct page_pool *pool, netmem_ref netmem, + bool destroyed) { struct page_pool_item *item = netmem_get_pp_item(netmem); + struct page_pool_item_block *block; DEBUG_NET_WARN_ON_ONCE(item->pp_netmem != netmem); DEBUG_NET_WARN_ON_ONCE(page_pool_item_is_mapped(item)); DEBUG_NET_WARN_ON_ONCE(!page_pool_item_is_used(item)); page_pool_item_clear_used(item); netmem_set_pp_item(netmem, NULL); - llist_add(&item->lentry, &pool->release_items); + + block = page_pool_item_to_block(item); + if (likely(!(block->flags & PAGE_POOL_ITEM_BLK_DYNAMIC_BIT))) { + DEBUG_NET_WARN_ON_ONCE(refcount_read(&block->ref)); + llist_add(&item->lentry, &pool->release_items); + return; + } + + page_pool_item_blk_del(pool, block, destroyed); } /** @@ -721,7 +810,7 @@ static struct page *__page_pool_alloc_page_order(struct page_pool *pool, pool->pages_state_hold_cnt); return page; err_set_info: - page_pool_clear_pp_info(pool, page_to_netmem(page)); + page_pool_clear_pp_info(pool, page_to_netmem(page), false); err_alloc: put_page(page); return NULL; @@ -766,7 +855,7 @@ static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool, } if (dma_map && unlikely(!page_pool_dma_map(pool, netmem))) { - page_pool_clear_pp_info(pool, netmem); + page_pool_clear_pp_info(pool, netmem, false); put_page(netmem_to_page(netmem)); continue; } @@ -862,10 +951,11 @@ bool page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem) return true; } -void page_pool_clear_pp_info(struct page_pool *pool, netmem_ref netmem) +void page_pool_clear_pp_info(struct page_pool *pool, netmem_ref netmem, + bool destroyed) { netmem_clear_pp_magic(netmem); - page_pool_item_del(pool, netmem); + page_pool_item_del(pool, netmem, destroyed); } /* Disconnects a page (from a page_pool). 
API users can have a need @@ -892,7 +982,7 @@ void __page_pool_return_page(struct page_pool *pool, netmem_ref netmem, trace_page_pool_state_release(pool, netmem, count); if (put) { - page_pool_clear_pp_info(pool, netmem); + page_pool_clear_pp_info(pool, netmem, destroyed); put_page(netmem_to_page(netmem)); } /* An optimization would be to call __free_pages(page, pool->p.order) @@ -1390,6 +1480,7 @@ void page_pool_destroy(struct page_pool *pool) page_pool_disable_direct_recycling(pool); page_pool_free_frag(pool); + page_pool_item_blk_free(pool); if (!page_pool_release(pool)) return; diff --git a/net/core/page_pool_priv.h b/net/core/page_pool_priv.h index 5d85f862a30a..643f707838e8 100644 --- a/net/core/page_pool_priv.h +++ b/net/core/page_pool_priv.h @@ -37,7 +37,8 @@ static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr) #if defined(CONFIG_PAGE_POOL) bool page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem); -void page_pool_clear_pp_info(struct page_pool *pool, netmem_ref netmem); +void page_pool_clear_pp_info(struct page_pool *pool, netmem_ref netmem, + bool destroyed); int page_pool_check_memory_provider(struct net_device *dev, struct netdev_rx_queue *rxq); #else @@ -47,7 +48,8 @@ static inline bool page_pool_set_pp_info(struct page_pool *pool, return true; } static inline void page_pool_clear_pp_info(struct page_pool *pool, - netmem_ref netmem) + netmem_ref netmem, + bool destroyed) { } static inline int page_pool_check_memory_provider(struct net_device *dev, From patchwork Mon Jan 6 13:01:13 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yunsheng Lin X-Patchwork-Id: 13927420 X-Patchwork-Delegate: kuba@kernel.org Received: from szxga03-in.huawei.com (szxga03-in.huawei.com [45.249.212.189]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 223231DE2CC; Mon, 6 Jan 2025 13:08:39 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=45.249.212.189 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736168922; cv=none; b=XJcmsx3LJnoQhwF/I4My6Kzh4W/O5TLQ8ZRZRs7x/8kM9stSXPV0BU2aUBXVOeSUhSsKm8HtTneHBhLZ2+JUn1IKQs4ysfrmu8/rXamkT1XWUOtQsgpY7pv3SPphyL/uNzuYJt8KuYrKou7MOhicrKg6NOBH0Dht4wwzx2Xp3L4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736168922; c=relaxed/simple; bh=VbfrbCNET5rGlpZ80e+CemOgHzbhLZi1YPx7o2QJ2NQ=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=MlkfFAnh/e1CAl38M6sUgnv2MrgkprKvLre7pOY3T6r/BnKu8rVHV0Ga9qi4zz5fHWlKemQLtXdd38E/TGYCaXowDbjkXY+p90VXWsVnDehW0yGmIHvex7UbcrXL84q+OTPunLlGu05sYK+JcY2gdSdyaogwXdnfj9D0IL0wGSs= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=45.249.212.189 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.19.162.254]) by szxga03-in.huawei.com (SkyGuard) with ESMTP id 4YRZFC5xJgzRkXl; Mon, 6 Jan 2025 21:06:23 +0800 (CST) Received: from dggpemf200006.china.huawei.com (unknown [7.185.36.61]) by mail.maildlp.com (Postfix) with ESMTPS id 13B2118010A; Mon, 6 Jan 2025 21:08:37 +0800 (CST) Received: from 
localhost.localdomain (10.90.30.45) by dggpemf200006.china.huawei.com (7.185.36.61) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Mon, 6 Jan 2025 21:08:36 +0800 From: Yunsheng Lin To: , , CC: , , , Yunsheng Lin , Robin Murphy , Alexander Duyck , IOMMU , Jesper Dangaard Brouer , Ilias Apalodimas , Eric Dumazet , Simon Horman , , Subject: [PATCH net-next v6 5/8] page_pool: skip dma sync operation for inflight pages Date: Mon, 6 Jan 2025 21:01:13 +0800 Message-ID: <20250106130116.457938-6-linyunsheng@huawei.com> X-Mailer: git-send-email 2.30.0 In-Reply-To: <20250106130116.457938-1-linyunsheng@huawei.com> References: <20250106130116.457938-1-linyunsheng@huawei.com> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: dggems705-chm.china.huawei.com (10.3.19.182) To dggpemf200006.china.huawei.com (7.185.36.61) X-Patchwork-Delegate: kuba@kernel.org Skip dma sync operation for inflight pages before the page_pool_destroy() returns to the driver as DMA API expects to be called with a valid device bound to a driver as mentioned in [1]. After page_pool_destroy() is called, the page is not expected to be recycled back to pool->alloc cache and dma sync operation is not needed when the page is not recyclable or pool->ring is full, so only skip the dma sync operation for the infilght pages by clearing the pool->dma_sync, as synchronize_rcu() in page_pool_destroy() is paired with rcu lock in page_pool_recycle_in_ring() to ensure that there is no dma syncoperation called after page_pool_destroy() is returned. 1. https://lore.kernel.org/all/caf31b5e-0e8f-4844-b7ba-ef59ed13b74e@arm.com/ CC: Robin Murphy CC: Alexander Duyck CC: IOMMU Fixes: f71fec47c2df ("page_pool: make sure struct device is stable") Signed-off-by: Yunsheng Lin --- net/core/page_pool.c | 57 ++++++++++++++++++++++++++++++++------------ 1 file changed, 42 insertions(+), 15 deletions(-) diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 76992a5195f0..bce58dc43498 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -280,9 +280,6 @@ static int page_pool_init(struct page_pool *pool, /* Driver calling page_pool_create() also call page_pool_destroy() */ refcount_set(&pool->user_cnt, 1); - if (pool->dma_map) - get_device(pool->p.dev); - if (pool->slow.flags & PP_FLAG_ALLOW_UNREADABLE_NETMEM) { /* We rely on rtnl_lock()ing to make sure netdev_rx_queue * configuration doesn't change while we're initializing @@ -323,9 +320,6 @@ static void page_pool_uninit(struct page_pool *pool) { ptr_ring_cleanup(&pool->ring, NULL); - if (pool->dma_map) - put_device(pool->p.dev); - #ifdef CONFIG_PAGE_POOL_STATS if (!pool->system) free_percpu(pool->recycle_stats); @@ -751,6 +745,25 @@ page_pool_dma_sync_for_device(const struct page_pool *pool, __page_pool_dma_sync_for_device(pool, netmem, dma_sync_size); } +static __always_inline void +page_pool_dma_sync_for_device_rcu(const struct page_pool *pool, + netmem_ref netmem, + u32 dma_sync_size) +{ + if (!pool->dma_sync) + return; + + rcu_read_lock(); + + /* Recheck the dma_sync under rcu lock to pair with synchronize_rcu() in + * page_pool_destroy(). 
+ */ + if (pool->dma_sync && dma_dev_need_sync(pool->p.dev)) + __page_pool_dma_sync_for_device(pool, netmem, dma_sync_size); + + rcu_read_unlock(); +} + static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem) { struct page_pool_item *item; @@ -1012,7 +1025,8 @@ static void page_pool_return_page(struct page_pool *pool, netmem_ref netmem) rcu_read_unlock(); } -static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem) +static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem, + unsigned int dma_sync_size) { int ret; /* BH protection not needed if current is softirq */ @@ -1021,12 +1035,12 @@ static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem) else ret = ptr_ring_produce_bh(&pool->ring, (__force void *)netmem); - if (!ret) { + if (likely(!ret)) { + page_pool_dma_sync_for_device_rcu(pool, netmem, dma_sync_size); recycle_stat_inc(pool, ring); - return true; } - return false; + return !ret; } /* Only allow direct recycling in special circumstances, into the @@ -1079,10 +1093,10 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem, if (likely(__page_pool_page_can_be_recycled(netmem))) { /* Read barrier done in page_ref_count / READ_ONCE */ - page_pool_dma_sync_for_device(pool, netmem, dma_sync_size); - - if (allow_direct && page_pool_recycle_in_cache(netmem, pool)) + if (allow_direct && page_pool_recycle_in_cache(netmem, pool)) { + page_pool_dma_sync_for_device(pool, netmem, dma_sync_size); return 0; + } /* Page found as candidate for recycling */ return netmem; @@ -1145,7 +1159,7 @@ void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem, netmem = __page_pool_put_page(pool, netmem, dma_sync_size, allow_direct); - if (netmem && !page_pool_recycle_in_ring(pool, netmem)) { + if (netmem && !page_pool_recycle_in_ring(pool, netmem, dma_sync_size)) { /* Cache full, fallback to free pages */ recycle_stat_inc(pool, ring_full); page_pool_return_page(pool, netmem); @@ -1171,14 +1185,17 @@ static void page_pool_recycle_ring_bulk(struct page_pool *pool, /* Bulk produce into ptr_ring page_pool cache */ in_softirq = page_pool_producer_lock(pool); + rcu_read_lock(); for (i = 0; i < bulk_len; i++) { if (__ptr_ring_produce(&pool->ring, (__force void *)bulk[i])) { /* ring full */ recycle_stat_inc(pool, ring_full); break; } + page_pool_dma_sync_for_device(pool, (__force netmem_ref)bulk[i], + -1); } - + rcu_read_unlock(); page_pool_producer_unlock(pool, in_softirq); recycle_stat_add(pool, ring, i); @@ -1485,6 +1502,16 @@ void page_pool_destroy(struct page_pool *pool) if (!page_pool_release(pool)) return; + /* After page_pool_destroy() is called, the page is not expected to be + * recycled back to pool->alloc cache and dma sync operation is not + * needed when the page is not recyclable or pool->ring is full, so only + * skip the dma sync operation for the infilght pages by clearing the + * pool->dma_sync, and the synchronize_rcu() is paired with rcu lock in + * page_pool_recycle_in_ring() to ensure that there is no dma sync + * operation called after page_pool_destroy() is returned. + */ + pool->dma_sync = false; + /* Paired with rcu lock in page_pool_napi_local() to enable clearing * of pool->p.napi in page_pool_disable_direct_recycling() is seen * before returning to driver to free the napi instance. 
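
[Editorial illustration] The synchronization used in this patch is a classic "flag plus RCU grace period" pattern: the destroy path clears pool->dma_sync and then waits out a grace period, while the recycle path rechecks the flag inside an RCU read-side critical section before touching the DMA API. Below is a minimal, self-contained sketch of that pattern; the names (my_pool, recycle_sync, pool_destroy, do_dma_sync) are hypothetical and only the synchronization-relevant parts of the real code are mirrored.

#include <linux/rcupdate.h>
#include <linux/compiler.h>
#include <linux/types.h>

struct my_pool {
	bool dma_sync;		/* cleared once the pool is being destroyed */
};

/* Hypothetical stand-in for __page_pool_dma_sync_for_device(). */
static void do_dma_sync(void *buf, u32 size);

static void recycle_sync(struct my_pool *pool, void *buf, u32 size)
{
	rcu_read_lock();
	/* Recheck under the read lock: pairs with synchronize_rcu() below. */
	if (READ_ONCE(pool->dma_sync))
		do_dma_sync(buf, size);
	rcu_read_unlock();
}

static void pool_destroy(struct my_pool *pool)
{
	WRITE_ONCE(pool->dma_sync, false);
	/* Once this returns, every reader that could still have observed
	 * dma_sync == true has left its read-side critical section, so no
	 * DMA sync can happen after pool_destroy() finishes.
	 */
	synchronize_rcu();
}

In the actual patch the recheck additionally tests dma_dev_need_sync(pool->p.dev), and the grace period is the synchronize_rcu() that page_pool_destroy() already issues for the NAPI case, as described in the commit message above.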
From patchwork Mon Jan 6 13:01:14 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yunsheng Lin X-Patchwork-Id: 13927421 X-Patchwork-Delegate: kuba@kernel.org Received: from szxga04-in.huawei.com (szxga04-in.huawei.com [45.249.212.190]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2CBFE1DE898; Mon, 6 Jan 2025 13:08:41 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=45.249.212.190 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736168923; cv=none; b=C1wezwm2W67BG7lL/htFouiy4vdcAG0FlMXGlqNs6ozsQqny4ep37F2dJPr2sbxbiUZVmBamMyWjRXqTgunMMzt3SNwvub/FPtbEEWeab4mOv8hZltXEqR2gJ+lgbh1R4QF3phHmHXVOFn3MqUxnZN6o/mVSfM1wHtfSKo8UMhY= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736168923; c=relaxed/simple; bh=6DXF7WVgPKp4uA0PYgFCFgJecrr4zLl3CnOlR0Pbkng=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=kBdHCTpOXVhPp0+pNzl1jalnm7bImMA2v9PanRXgHEF2KnFT2DKvL0h8KSTS9RdXLb9kTMEbBEgFm4BYHbbsfBdazK48orpLr515fioM2w5HpeXPzU/blNNGA8bX4LzbJt3K4UD7ZNPSSvd2knQqkzb5RZehotn/HuHhxRyWMbA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=45.249.212.190 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.19.163.44]) by szxga04-in.huawei.com (SkyGuard) with ESMTP id 4YRZDN0bWJz2Dk4X; Mon, 6 Jan 2025 21:05:40 +0800 (CST) Received: from dggpemf200006.china.huawei.com (unknown [7.185.36.61]) by mail.maildlp.com (Postfix) with ESMTPS id 15DFB140135; Mon, 6 Jan 2025 21:08:39 +0800 (CST) Received: from localhost.localdomain (10.90.30.45) by dggpemf200006.china.huawei.com (7.185.36.61) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Mon, 6 Jan 2025 21:08:38 +0800 From: Yunsheng Lin To: , , CC: , , , Yunsheng Lin , Robin Murphy , Alexander Duyck , IOMMU , Jesper Dangaard Brouer , Ilias Apalodimas , Eric Dumazet , Simon Horman , , Subject: [PATCH net-next v6 6/8] page_pool: use list instead of ptr_ring for ring cache Date: Mon, 6 Jan 2025 21:01:14 +0800 Message-ID: <20250106130116.457938-7-linyunsheng@huawei.com> X-Mailer: git-send-email 2.30.0 In-Reply-To: <20250106130116.457938-1-linyunsheng@huawei.com> References: <20250106130116.457938-1-linyunsheng@huawei.com> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: dggems705-chm.china.huawei.com (10.3.19.182) To dggpemf200006.china.huawei.com (7.185.36.61) X-Patchwork-Delegate: kuba@kernel.org As 'struct page_pool_item' is added to fix the DMA API misuse problem, which adds some performance and memory overhead. Utilize the 'state' of 'struct page_pool_item' for a pointer to a next item as only lower 2 bits of 'state' are used, in order to avoid some performance and memory overhead of adding 'struct page_pool_item'. As there is only one producer due to the NAPI context protection, multiple consumers can be allowed using the similar lockless operation like llist in llist.h. 
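
A short, illustrative sketch of the idea (not the patch's exact macros): because items are word-aligned, the two low bits of encoded_next are free to carry the USED/MAPPED state while the remaining bits carry the next pointer, and a cmpxchg loop links items in without any lock, in the same way llist_add() does. The helper names below are simplified stand-ins for the page_pool_item_set_next()/page_pool_item_get_next() macros added in this patch.

#include <linux/atomic.h>
#include <linux/compiler.h>

/* Two low bits of the (aligned) next pointer double as item state. */
#define STATE_MASK	0x3UL

struct item {
	unsigned long encoded_next;	/* next pointer | 2 state bits */
};

static inline void item_set_next(struct item *it, struct item *next)
{
	it->encoded_next = (it->encoded_next & STATE_MASK) |
			   (unsigned long)next;
}

static inline struct item *item_get_next(const struct item *it)
{
	return (struct item *)(it->encoded_next & ~STATE_MASK);
}

/* llist_add()-style lockless push: concurrent pushers simply retry the
 * cmpxchg until their item is linked at the head of the list.
 */
static void item_push(struct item **head, struct item *it)
{
	struct item *first = READ_ONCE(*head);

	do {
		item_set_next(it, first);
	} while (!try_cmpxchg(head, &first, it));
}
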
Testing shows there is about 10ns~20ns improvement for performance of 'time_bench_page_pool02_ptr_ring' test case CC: Robin Murphy CC: Alexander Duyck CC: IOMMU Signed-off-by: Yunsheng Lin --- include/net/page_pool/types.h | 16 ++-- net/core/page_pool.c | 133 ++++++++++++++++++---------------- 2 files changed, 82 insertions(+), 67 deletions(-) diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h index ae9b6159ca9f..923bcb01a830 100644 --- a/include/net/page_pool/types.h +++ b/include/net/page_pool/types.h @@ -4,7 +4,6 @@ #define _NET_PAGE_POOL_TYPES_H #include -#include #include #include @@ -147,7 +146,10 @@ struct page_pool_stats { #endif struct page_pool_item { - unsigned long state; + /* An 'encoded_next' is a pointer to next item, lower 2 bits is used to + * indicate the state of current item. + */ + unsigned long encoded_next; union { netmem_ref pp_netmem; @@ -155,6 +157,11 @@ struct page_pool_item { }; }; +struct pp_ring_cache { + struct page_pool_item *list; + atomic_t count; +}; + /* The size of item_block is always PAGE_SIZE, so that the address of item_block * for a specific item can be calculated using 'item & PAGE_MASK' */ @@ -241,12 +248,9 @@ struct page_pool { * wise, because free's can happen on remote CPUs, with no * association with allocation resource. * - * Use ptr_ring, as it separates consumer and producer - * efficiently, it a way that doesn't bounce cache-lines. - * * TODO: Implement bulk return pages into this structure. */ - struct ptr_ring ring; + struct pp_ring_cache ring ____cacheline_aligned_in_smp; void *mp_priv; diff --git a/net/core/page_pool.c b/net/core/page_pool.c index bce58dc43498..b8f46c26b5ef 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -161,29 +161,6 @@ EXPORT_SYMBOL(page_pool_ethtool_stats_get); #define recycle_stat_add(pool, __stat, val) #endif -static bool page_pool_producer_lock(struct page_pool *pool) - __acquires(&pool->ring.producer_lock) -{ - bool in_softirq = in_softirq(); - - if (in_softirq) - spin_lock(&pool->ring.producer_lock); - else - spin_lock_bh(&pool->ring.producer_lock); - - return in_softirq; -} - -static void page_pool_producer_unlock(struct page_pool *pool, - bool in_softirq) - __releases(&pool->ring.producer_lock) -{ - if (in_softirq) - spin_unlock(&pool->ring.producer_lock); - else - spin_unlock_bh(&pool->ring.producer_lock); -} - static void page_pool_struct_check(void) { CACHELINE_ASSERT_GROUP_MEMBER(struct page_pool, frag, frag_users); @@ -266,14 +243,6 @@ static int page_pool_init(struct page_pool *pool, } #endif - if (ptr_ring_init(&pool->ring, ring_qsize, GFP_KERNEL) < 0) { -#ifdef CONFIG_PAGE_POOL_STATS - if (!pool->system) - free_percpu(pool->recycle_stats); -#endif - return -ENOMEM; - } - spin_lock_init(&pool->item_lock); atomic_set(&pool->pages_state_release_cnt, 0); @@ -299,7 +268,7 @@ static int page_pool_init(struct page_pool *pool, if (err) { pr_warn("%s() mem-provider init failed %d\n", __func__, err); - goto free_ptr_ring; + goto free_stats; } static_branch_inc(&page_pool_mem_providers); @@ -307,8 +276,7 @@ static int page_pool_init(struct page_pool *pool, return 0; -free_ptr_ring: - ptr_ring_cleanup(&pool->ring, NULL); +free_stats: #ifdef CONFIG_PAGE_POOL_STATS if (!pool->system) free_percpu(pool->recycle_stats); @@ -318,8 +286,6 @@ static int page_pool_init(struct page_pool *pool, static void page_pool_uninit(struct page_pool *pool) { - ptr_ring_cleanup(&pool->ring, NULL); - #ifdef CONFIG_PAGE_POOL_STATS if (!pool->system) free_percpu(pool->recycle_stats); @@ -328,6 
+294,8 @@ static void page_pool_uninit(struct page_pool *pool) #define PAGE_POOL_ITEM_USED 0 #define PAGE_POOL_ITEM_MAPPED 1 +#define PAGE_POOL_ITEM_STATE_MASK (BIT(PAGE_POOL_ITEM_USED) |\ + BIT(PAGE_POOL_ITEM_MAPPED)) #define ITEMS_PER_PAGE ((PAGE_SIZE - \ offsetof(struct page_pool_item_block, items)) / \ @@ -335,18 +303,18 @@ static void page_pool_uninit(struct page_pool *pool) #define page_pool_item_init_state(item) \ ({ \ - (item)->state = 0; \ + (item)->encoded_next &= ~PAGE_POOL_ITEM_STATE_MASK; \ }) #if defined(CONFIG_DEBUG_NET) #define page_pool_item_set_used(item) \ - __set_bit(PAGE_POOL_ITEM_USED, &(item)->state) + __set_bit(PAGE_POOL_ITEM_USED, &(item)->encoded_next) #define page_pool_item_clear_used(item) \ - __clear_bit(PAGE_POOL_ITEM_USED, &(item)->state) + __clear_bit(PAGE_POOL_ITEM_USED, &(item)->encoded_next) #define page_pool_item_is_used(item) \ - test_bit(PAGE_POOL_ITEM_USED, &(item)->state) + test_bit(PAGE_POOL_ITEM_USED, &(item)->encoded_next) #else #define page_pool_item_set_used(item) #define page_pool_item_clear_used(item) @@ -354,16 +322,69 @@ static void page_pool_uninit(struct page_pool *pool) #endif #define page_pool_item_set_mapped(item) \ - __set_bit(PAGE_POOL_ITEM_MAPPED, &(item)->state) + __set_bit(PAGE_POOL_ITEM_MAPPED, &(item)->encoded_next) /* Only clear_mapped and is_mapped need to be atomic as they can be * called concurrently. */ #define page_pool_item_clear_mapped(item) \ - clear_bit(PAGE_POOL_ITEM_MAPPED, &(item)->state) + clear_bit(PAGE_POOL_ITEM_MAPPED, &(item)->encoded_next) #define page_pool_item_is_mapped(item) \ - test_bit(PAGE_POOL_ITEM_MAPPED, &(item)->state) + test_bit(PAGE_POOL_ITEM_MAPPED, &(item)->encoded_next) + +#define page_pool_item_set_next(item, next) \ +({ \ + struct page_pool_item *__item = item; \ + \ + __item->encoded_next &= PAGE_POOL_ITEM_STATE_MASK; \ + __item->encoded_next |= (unsigned long)(next); \ +}) + +#define page_pool_item_get_next(item) \ +({ \ + struct page_pool_item *__next; \ + \ + __next = (struct page_pool_item *) \ + ((item)->encoded_next & ~PAGE_POOL_ITEM_STATE_MASK); \ + __next; \ +}) + +static bool __page_pool_recycle_in_ring(struct page_pool *pool, + netmem_ref netmem) +{ + struct page_pool_item *item, *list; + + if (unlikely(atomic_read(&pool->ring.count) > pool->p.pool_size)) + return false; + + item = netmem_get_pp_item(netmem); + list = READ_ONCE(pool->ring.list); + + do { + page_pool_item_set_next(item, list); + } while (!try_cmpxchg(&pool->ring.list, &list, item)); + + atomic_inc(&pool->ring.count); + return true; +} + +static netmem_ref page_pool_consume_ring(struct page_pool *pool) +{ + struct page_pool_item *next, *list; + + list = READ_ONCE(pool->ring.list); + + do { + if (unlikely(!list)) + return (__force netmem_ref)0; + + next = page_pool_item_get_next(list); + } while (!try_cmpxchg(&pool->ring.list, &list, next)); + + atomic_dec(&pool->ring.count); + return list->pp_netmem; +} static __always_inline void __page_pool_release_page_dma(struct page_pool *pool, netmem_ref netmem, @@ -656,12 +677,11 @@ static void __page_pool_return_page(struct page_pool *pool, netmem_ref netmem, static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool) { - struct ptr_ring *r = &pool->ring; netmem_ref netmem; int pref_nid; /* preferred NUMA node */ /* Quicker fallback, avoid locks when ring is empty */ - if (__ptr_ring_empty(r)) { + if (unlikely(!READ_ONCE(pool->ring.list))) { alloc_stat_inc(pool, empty); return 0; } @@ -678,7 +698,7 @@ static noinline netmem_ref 
page_pool_refill_alloc_cache(struct page_pool *pool) /* Refill alloc array, but only if NUMA match */ do { - netmem = (__force netmem_ref)__ptr_ring_consume(r); + netmem = page_pool_consume_ring(pool); if (unlikely(!netmem)) break; @@ -1028,19 +1048,14 @@ static void page_pool_return_page(struct page_pool *pool, netmem_ref netmem) static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem, unsigned int dma_sync_size) { - int ret; - /* BH protection not needed if current is softirq */ - if (in_softirq()) - ret = ptr_ring_produce(&pool->ring, (__force void *)netmem); - else - ret = ptr_ring_produce_bh(&pool->ring, (__force void *)netmem); + bool ret = __page_pool_recycle_in_ring(pool, netmem); - if (likely(!ret)) { + if (likely(ret)) { page_pool_dma_sync_for_device_rcu(pool, netmem, dma_sync_size); recycle_stat_inc(pool, ring); } - return !ret; + return ret; } /* Only allow direct recycling in special circumstances, into the @@ -1179,15 +1194,12 @@ static void page_pool_recycle_ring_bulk(struct page_pool *pool, netmem_ref *bulk, u32 bulk_len) { - bool in_softirq; u32 i; - /* Bulk produce into ptr_ring page_pool cache */ - in_softirq = page_pool_producer_lock(pool); - rcu_read_lock(); for (i = 0; i < bulk_len; i++) { - if (__ptr_ring_produce(&pool->ring, (__force void *)bulk[i])) { + if (!__page_pool_recycle_in_ring(pool, + (__force netmem_ref)bulk[i])) { /* ring full */ recycle_stat_inc(pool, ring_full); break; @@ -1196,7 +1208,6 @@ static void page_pool_recycle_ring_bulk(struct page_pool *pool, -1); } rcu_read_unlock(); - page_pool_producer_unlock(pool, in_softirq); recycle_stat_add(pool, ring, i); /* Hopefully all pages were returned into ptr_ring */ @@ -1366,7 +1377,7 @@ static void page_pool_empty_ring(struct page_pool *pool) netmem_ref netmem; /* Empty recycle ring */ - while ((netmem = (__force netmem_ref)ptr_ring_consume_bh(&pool->ring))) { + while ((netmem = page_pool_consume_ring(pool))) { /* Verify the refcnt invariant of cached pages */ if (!(netmem_ref_count(netmem) == 1)) pr_crit("%s() page_pool refcnt %d violation\n", From patchwork Mon Jan 6 13:01:15 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yunsheng Lin X-Patchwork-Id: 13927423 X-Patchwork-Delegate: kuba@kernel.org Received: from szxga07-in.huawei.com (szxga07-in.huawei.com [45.249.212.35]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 228B41DF738; Mon, 6 Jan 2025 13:08:48 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=45.249.212.35 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736168932; cv=none; b=aHe+n+BF0oaUFNzG5zbaeZOpHe4pooYB4BuDmTsYD9LryFV/OMdvPKJ25/BAqD3kclmeDAIab32gmTOUAgMfBNuoYlCPYvRkD8hlmQ9ivECuX7OeGKcwvgFBtOCn6155UJaxEG0TwprDXLU9aqgyZ5BBqDlK3fwt3VifUwV1KcU= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736168932; c=relaxed/simple; bh=xCTpS6ipjGJeQEohtMZNmbeIzH6qk2MfcthnLIMYa9c=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=GP7rT5OR1flq9IptDQ0Rzq/i1IHNtZNfZDSjYydADrLP5TgHRTJneEAz9kvEwiYe6B6IxCaxrdpxvMG1oIrMD16Il24CWkg5ieaItu1f1iuqj+IAycE9Y7DSWBw2mm9QwbrmhJgT8T+gAxs8VSzUy1R+k18vjLIrRsLLi/uml8w= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none 
smtp.client-ip=45.249.212.35 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.19.162.112]) by szxga07-in.huawei.com (SkyGuard) with ESMTP id 4YRZDP19kTz1V4Nx; Mon, 6 Jan 2025 21:05:41 +0800 (CST) Received: from dggpemf200006.china.huawei.com (unknown [7.185.36.61]) by mail.maildlp.com (Postfix) with ESMTPS id B7FCB14037B; Mon, 6 Jan 2025 21:08:40 +0800 (CST) Received: from localhost.localdomain (10.90.30.45) by dggpemf200006.china.huawei.com (7.185.36.61) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Mon, 6 Jan 2025 21:08:40 +0800 From: Yunsheng Lin To: , , CC: , , , Yunsheng Lin , Robin Murphy , Alexander Duyck , IOMMU , Jesper Dangaard Brouer , Ilias Apalodimas , Eric Dumazet , Simon Horman , , Subject: [PATCH net-next v6 7/8] page_pool: batch refilling pages to reduce atomic operation Date: Mon, 6 Jan 2025 21:01:15 +0800 Message-ID: <20250106130116.457938-8-linyunsheng@huawei.com> X-Mailer: git-send-email 2.30.0 In-Reply-To: <20250106130116.457938-1-linyunsheng@huawei.com> References: <20250106130116.457938-1-linyunsheng@huawei.com> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: dggems705-chm.china.huawei.com (10.3.19.182) To dggpemf200006.china.huawei.com (7.185.36.61) X-Patchwork-Delegate: kuba@kernel.org Add refill variable in alloc cache to keep batched refilled pages to avoid doing the atomic operation for each page. Testing shows there is about 10ns improvement for the performance of 'time_bench_page_pool02_ptr_ring' test case. CC: Robin Murphy CC: Alexander Duyck CC: IOMMU Signed-off-by: Yunsheng Lin --- include/net/page_pool/types.h | 5 +++++ net/core/page_pool.c | 25 +++++++++++++++++++++---- 2 files changed, 26 insertions(+), 4 deletions(-) diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h index 923bcb01a830..ac83abea24b0 100644 --- a/include/net/page_pool/types.h +++ b/include/net/page_pool/types.h @@ -51,6 +51,11 @@ struct pp_alloc_cache { u32 count; netmem_ref cache[PP_ALLOC_CACHE_SIZE]; + + /* Keep batched refilled pages here to avoid doing the atomic operation + * for each page. 
+ */ + struct page_pool_item *refill; }; /** diff --git a/net/core/page_pool.c b/net/core/page_pool.c index b8f46c26b5ef..429d44ede074 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -677,11 +677,13 @@ static void __page_pool_return_page(struct page_pool *pool, netmem_ref netmem, static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool) { + struct page_pool_item *refill; netmem_ref netmem; int pref_nid; /* preferred NUMA node */ /* Quicker fallback, avoid locks when ring is empty */ - if (unlikely(!READ_ONCE(pool->ring.list))) { + refill = pool->alloc.refill; + if (unlikely(!refill && !READ_ONCE(pool->ring.list))) { alloc_stat_inc(pool, empty); return 0; } @@ -698,10 +700,14 @@ static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool) /* Refill alloc array, but only if NUMA match */ do { - netmem = page_pool_consume_ring(pool); - if (unlikely(!netmem)) - break; + if (unlikely(!refill)) { + refill = xchg(&pool->ring.list, NULL); + if (!refill) + break; + } + netmem = refill->pp_netmem; + refill = page_pool_item_get_next(refill); if (likely(netmem_is_pref_nid(netmem, pref_nid))) { pool->alloc.cache[pool->alloc.count++] = netmem; } else { @@ -711,14 +717,18 @@ static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool) * This limit stress on page buddy alloactor. */ __page_pool_return_page(pool, netmem, false); + atomic_dec(&pool->ring.count); alloc_stat_inc(pool, waive); netmem = 0; break; } } while (pool->alloc.count < PP_ALLOC_CACHE_REFILL); + pool->alloc.refill = refill; + /* Return last page */ if (likely(pool->alloc.count > 0)) { + atomic_sub(pool->alloc.count, &pool->ring.count); netmem = pool->alloc.cache[--pool->alloc.count]; alloc_stat_inc(pool, refill); } @@ -1406,6 +1416,7 @@ static void __page_pool_destroy(struct page_pool *pool) static void page_pool_empty_alloc_cache_once(struct page_pool *pool) { + struct page_pool_item *refill; netmem_ref netmem; if (pool->destroy_cnt) @@ -1419,6 +1430,12 @@ static void page_pool_empty_alloc_cache_once(struct page_pool *pool) netmem = pool->alloc.cache[--pool->alloc.count]; page_pool_return_page(pool, netmem); } + + while ((refill = pool->alloc.refill)) { + pool->alloc.refill = page_pool_item_get_next(refill); + page_pool_return_page(pool, refill->pp_netmem); + atomic_dec(&pool->ring.count); + } } static void page_pool_scrub(struct page_pool *pool) From patchwork Mon Jan 6 13:01:16 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yunsheng Lin X-Patchwork-Id: 13927422 X-Patchwork-Delegate: kuba@kernel.org Received: from szxga04-in.huawei.com (szxga04-in.huawei.com [45.249.212.190]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A17461DEFF7; Mon, 6 Jan 2025 13:08:44 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=45.249.212.190 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736168927; cv=none; b=NYoDkCYNJIlS9oJZfbIn5ZJRmeEtam7gUCnI1XRTOJgjJl/LR4uRwWulbG2nRNA21ucKRZH/C/sVJjYkJWfokaa++gyjIMJiofptm0DkIuCSRNNwJrbybe+pvBhmobIg6rYgBR7WmASRZZI5tLrL4PkYNMIj/Q2Q7QnJ7SXwSe0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736168927; c=relaxed/simple; bh=pK90hb9orWRZ9OqTozGQCdwNJyO2Ck9ddwhLcvosOUI=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; 
b=C9Y0bmdt69aj2WIGlx7/WxOIQTQrqPYY0yMIgjlgw6yAj4NBLXckfW1NJxNdzifGp4QRPvwbgJtsx7GpXNYor/1pv3sUr8QpbAcc2DoCzVKYkKZwp5VkJG5sdXA539b0WLWRBH0ug0FFoY6dLhzlisuIYKb5dAPJdUBT8bzpMq8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=45.249.212.190 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.19.163.44]) by szxga04-in.huawei.com (SkyGuard) with ESMTP id 4YRZDR3gs0z2Dk4n; Mon, 6 Jan 2025 21:05:43 +0800 (CST) Received: from dggpemf200006.china.huawei.com (unknown [7.185.36.61]) by mail.maildlp.com (Postfix) with ESMTPS id 8045A140135; Mon, 6 Jan 2025 21:08:42 +0800 (CST) Received: from localhost.localdomain (10.90.30.45) by dggpemf200006.china.huawei.com (7.185.36.61) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Mon, 6 Jan 2025 21:08:42 +0800 From: Yunsheng Lin To: , , CC: , , , Yunsheng Lin , Robin Murphy , Alexander Duyck , IOMMU , Jesper Dangaard Brouer , Ilias Apalodimas , Eric Dumazet , Simon Horman , , Subject: [PATCH net-next v6 8/8] page_pool: use list instead of array for alloc cache Date: Mon, 6 Jan 2025 21:01:16 +0800 Message-ID: <20250106130116.457938-9-linyunsheng@huawei.com> X-Mailer: git-send-email 2.30.0 In-Reply-To: <20250106130116.457938-1-linyunsheng@huawei.com> References: <20250106130116.457938-1-linyunsheng@huawei.com> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: dggems705-chm.china.huawei.com (10.3.19.182) To dggpemf200006.china.huawei.com (7.185.36.61) X-Patchwork-Delegate: kuba@kernel.org As the alloc cache is always protected by NAPI context protection, use encoded_next as a pointer to a next item to avoid the using the array. Testing shows there is about 3ns improvement for the performance of 'time_bench_page_pool01_fast_path' test case. CC: Robin Murphy CC: Alexander Duyck CC: IOMMU Signed-off-by: Yunsheng Lin --- include/net/page_pool/types.h | 2 +- net/core/page_pool.c | 59 +++++++++++++++++++++++++---------- 2 files changed, 43 insertions(+), 18 deletions(-) diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h index ac83abea24b0..97f548c79e22 100644 --- a/include/net/page_pool/types.h +++ b/include/net/page_pool/types.h @@ -50,7 +50,7 @@ #define PP_ALLOC_CACHE_REFILL 64 struct pp_alloc_cache { u32 count; - netmem_ref cache[PP_ALLOC_CACHE_SIZE]; + struct page_pool_item *list; /* Keep batched refilled pages here to avoid doing the atomic operation * for each page. 
diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 429d44ede074..59aea3abf0eb 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -386,6 +386,27 @@ static netmem_ref page_pool_consume_ring(struct page_pool *pool) return list->pp_netmem; } +static netmem_ref __page_pool_consume_alloc(struct page_pool *pool) +{ + struct page_pool_item *item = pool->alloc.list; + + pool->alloc.list = page_pool_item_get_next(item); + pool->alloc.count--; + + return item->pp_netmem; +} + +static void __page_pool_recycle_in_alloc(struct page_pool *pool, + netmem_ref netmem) +{ + struct page_pool_item *item; + + item = netmem_get_pp_item(netmem); + page_pool_item_set_next(item, pool->alloc.list); + pool->alloc.list = item; + pool->alloc.count++; +} + static __always_inline void __page_pool_release_page_dma(struct page_pool *pool, netmem_ref netmem, bool destroyed) @@ -677,10 +698,12 @@ static void __page_pool_return_page(struct page_pool *pool, netmem_ref netmem, static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool) { - struct page_pool_item *refill; + struct page_pool_item *refill, *alloc, *curr; netmem_ref netmem; int pref_nid; /* preferred NUMA node */ + DEBUG_NET_WARN_ON_ONCE(pool->alloc.count || pool->alloc.list); + /* Quicker fallback, avoid locks when ring is empty */ refill = pool->alloc.refill; if (unlikely(!refill && !READ_ONCE(pool->ring.list))) { @@ -698,6 +721,7 @@ static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool) pref_nid = numa_mem_id(); /* will be zero like page_to_nid() */ #endif + alloc = NULL; /* Refill alloc array, but only if NUMA match */ do { if (unlikely(!refill)) { @@ -706,10 +730,13 @@ static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool) break; } + curr = refill; netmem = refill->pp_netmem; refill = page_pool_item_get_next(refill); if (likely(netmem_is_pref_nid(netmem, pref_nid))) { - pool->alloc.cache[pool->alloc.count++] = netmem; + page_pool_item_set_next(curr, alloc); + pool->alloc.count++; + alloc = curr; } else { /* NUMA mismatch; * (1) release 1 page to page-allocator and @@ -729,7 +756,8 @@ static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool) /* Return last page */ if (likely(pool->alloc.count > 0)) { atomic_sub(pool->alloc.count, &pool->ring.count); - netmem = pool->alloc.cache[--pool->alloc.count]; + pool->alloc.list = page_pool_item_get_next(alloc); + pool->alloc.count--; alloc_stat_inc(pool, refill); } @@ -744,7 +772,7 @@ static netmem_ref __page_pool_get_cached(struct page_pool *pool) /* Caller MUST guarantee safe non-concurrent access, e.g. 
softirq */ if (likely(pool->alloc.count)) { /* Fast-path */ - netmem = pool->alloc.cache[--pool->alloc.count]; + netmem = __page_pool_consume_alloc(pool); alloc_stat_inc(pool, fast); } else { netmem = page_pool_refill_alloc_cache(pool); @@ -863,6 +891,7 @@ static struct page *__page_pool_alloc_page_order(struct page_pool *pool, static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool, gfp_t gfp) { + netmem_ref netmems[PP_ALLOC_CACHE_REFILL] = {0}; const int bulk = PP_ALLOC_CACHE_REFILL; unsigned int pp_order = pool->p.order; bool dma_map = pool->dma_map; @@ -873,16 +902,12 @@ static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool, if (unlikely(pp_order)) return page_to_netmem(__page_pool_alloc_page_order(pool, gfp)); - /* Unnecessary as alloc cache is empty, but guarantees zero count */ - if (unlikely(pool->alloc.count > 0)) - return pool->alloc.cache[--pool->alloc.count]; - - /* Mark empty alloc.cache slots "empty" for alloc_pages_bulk_array */ - memset(&pool->alloc.cache, 0, sizeof(void *) * bulk); + /* alloc cache should be empty */ + DEBUG_NET_WARN_ON_ONCE(pool->alloc.count || pool->alloc.list); nr_pages = alloc_pages_bulk_array_node(gfp, pool->p.nid, bulk, - (struct page **)pool->alloc.cache); + (struct page **)netmems); if (unlikely(!nr_pages)) return 0; @@ -890,7 +915,7 @@ static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool, * page element have not been (possibly) DMA mapped. */ for (i = 0; i < nr_pages; i++) { - netmem = pool->alloc.cache[i]; + netmem = netmems[i]; if (unlikely(!page_pool_set_pp_info(pool, netmem))) { put_page(netmem_to_page(netmem)); @@ -903,7 +928,7 @@ static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool, continue; } - pool->alloc.cache[pool->alloc.count++] = netmem; + __page_pool_recycle_in_alloc(pool, netmem); /* Track how many pages are held 'in-flight' */ pool->pages_state_hold_cnt++; trace_page_pool_state_hold(pool, netmem, @@ -912,7 +937,7 @@ static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool, /* Return last page */ if (likely(pool->alloc.count > 0)) { - netmem = pool->alloc.cache[--pool->alloc.count]; + netmem = __page_pool_consume_alloc(pool); alloc_stat_inc(pool, slow); } else { netmem = 0; @@ -1082,7 +1107,7 @@ static bool page_pool_recycle_in_cache(netmem_ref netmem, } /* Caller MUST have verified/know (page_ref_count(page) == 1) */ - pool->alloc.cache[pool->alloc.count++] = netmem; + __page_pool_recycle_in_alloc(pool, netmem); recycle_stat_inc(pool, cached); return true; } @@ -1427,7 +1452,7 @@ static void page_pool_empty_alloc_cache_once(struct page_pool *pool) * call concurrently. */ while (pool->alloc.count) { - netmem = pool->alloc.cache[--pool->alloc.count]; + netmem = __page_pool_consume_alloc(pool); page_pool_return_page(pool, netmem); } @@ -1567,7 +1592,7 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid) /* Flush pool alloc cache, as refill will check NUMA node */ while (pool->alloc.count) { - netmem = pool->alloc.cache[--pool->alloc.count]; + netmem = __page_pool_consume_alloc(pool); __page_pool_return_page(pool, netmem, false); } }
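
[Editorial illustration] Taken together, the last two patches turn both caches into singly linked item lists: the refill path detaches the whole ring list with a single xchg() and the NAPI-protected fast path pushes and pops items on pool->alloc.list without per-page atomics. The sketch below condenses those hunks into one function for readability; it reuses the field and helper names from the patches but drops the NUMA-mismatch (waive) handling, the statistics and the debug checks, so it is an outline rather than the exact merged code.

/* Condensed outline of the refill path after patches 7 and 8; NUMA
 * waiving, stats and debug checks are omitted.
 */
static netmem_ref refill_alloc_cache_sketch(struct page_pool *pool)
{
	struct page_pool_item *refill = pool->alloc.refill;
	netmem_ref netmem = 0;

	do {
		if (!refill) {
			/* Detach the whole recycle list in one shot. */
			refill = xchg(&pool->ring.list, NULL);
			if (!refill)
				break;
		}

		netmem = refill->pp_netmem;
		refill = page_pool_item_get_next(refill);

		/* Link the item into the NAPI-protected alloc list;
		 * this also bumps pool->alloc.count.
		 */
		__page_pool_recycle_in_alloc(pool, netmem);
	} while (pool->alloc.count < PP_ALLOC_CACHE_REFILL);

	pool->alloc.refill = refill;

	/* One atomic update for the whole batch instead of one per page. */
	atomic_sub(pool->alloc.count, &pool->ring.count);

	/* Hand the most recently cached page back to the caller. */
	if (pool->alloc.count)
		netmem = __page_pool_consume_alloc(pool);

	return netmem;
}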