From patchwork Thu Jun 29 15:23:02 2023
From: Alexander Lobakin
To: "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Alexander Lobakin, Maciej Fijalkowski, Larysa Zaremba, Yunsheng Lin,
    Alexander Duyck, Jesper Dangaard Brouer, Ilias Apalodimas,
    netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH RFC net-next 1/4] net: skbuff: don't include <net/page_pool.h> to <linux/skbuff.h>
Date: Thu, 29 Jun 2023 17:23:02 +0200
Message-ID: <20230629152305.905962-2-aleksander.lobakin@intel.com>
In-Reply-To: <20230629152305.905962-1-aleksander.lobakin@intel.com>
References: <20230629152305.905962-1-aleksander.lobakin@intel.com>

Currently, touching <net/page_pool.h> triggers a rebuild of more than
half of the kernel. That's because it's included in <linux/skbuff.h>,
and each new include of page_pool.h adds more [useless] data for the
toolchain to process for each source file from that pile.

In commit 6a5bcd84e886 ("page_pool: Allow drivers to hint on SKB
recycling"), Matteo included it to be able to call a couple of functions
defined there. Then, in commit 57f05bc2ab24 ("page_pool: keep pp info as
long as page pool owns the page"), one of the calls was removed, so only
one is left. It's the call to page_pool_return_skb_page() in
napi_frag_unref(). The function is external and doesn't have any
dependencies. Having an include of the very niche page_pool.h only for
that looks like overkill.

As Alex noticed, the only thing that holds this function in page_pool.c
is %PP_SIGNATURE. By moving the check for the magic a couple of
functions up, the whole page_pool_return_skb_page() can be moved to
skbuff.c. The arguments for moving the check are the following:

1) It checks for a corner case that shouldn't ever happen when the code
   is sane, and execution speed doesn't matter on the exception path,
   so doing a few more calls before bailing out doesn't make any
   difference.
2) There are two users of the internal __page_pool_put_page(), where
   this check is moved: page_pool_put_defragged_page() and
   page_pool_put_page_bulk(). Both are exported and can be called from
   drivers, but previously only the skb recycling path of the former
   was protected with the magic check. So this also makes the code a
   bit more reliable.

After the check is moved, teleport page_pool_return_skb_page() to
skbuff.c, right next to its main consumer, skb_pp_recycle(). It's also
used in napi_frag_unref() -> {__,}skb_frag_unref(), so no `static`,
unfortunately. Maybe next time.

Now, after a few include fixes in the drivers, touching page_pool.h
only triggers rebuilding of the drivers using it and a couple of core
networking files.

Suggested-by: Jakub Kicinski # make skbuff.h less heavy
Suggested-by: Alexander Duyck # move to skbuff.c
Signed-off-by: Alexander Lobakin
---
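For illustration only, a minimal userspace model of the ownership test
being hoisted into __page_pool_put_page(); the signature value, struct
and function names below are stand-ins invented for this sketch, not
kernel code:

/* Userspace model of the pp_magic ownership test. 0x40 is only a
 * stand-in for the kernel's PP_SIGNATURE so the model compiles alone.
 */
#include <stdbool.h>
#include <stdio.h>

#define MODEL_PP_SIGNATURE 0x40UL /* stand-in, not the kernel constant */

struct model_page {
	unsigned long pp_magic;
};

/* Bits 0 and 1 of ->pp_magic may be reused by the mm layer (compound
 * head, pfmemalloc), so they must be masked out before comparing.
 */
static bool model_page_is_pp(const struct model_page *page)
{
	return (page->pp_magic & ~0x3UL) == MODEL_PP_SIGNATURE;
}

int main(void)
{
	struct model_page pp_page = { .pp_magic = MODEL_PP_SIGNATURE | 0x1 };
	struct model_page plain   = { .pp_magic = 0 };

	/* A page_pool page is recognized even with the low bits set... */
	printf("pp page:    %d\n", model_page_is_pp(&pp_page));
	/* ...while a regular page falls through to put_page() instead. */
	printf("plain page: %d\n", model_page_is_pp(&plain));
	return 0;
}

The only point being modelled is the masking of the two low bits before
comparing against the signature; anything that fails the check goes back
to the regular page allocator.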
 drivers/net/ethernet/engleder/tsnep_main.c      |  1 +
 drivers/net/ethernet/freescale/fec_main.c       |  1 +
 .../marvell/octeontx2/nic/otx2_common.c         |  1 +
 .../ethernet/marvell/octeontx2/nic/otx2_pf.c    |  1 +
 .../ethernet/mellanox/mlx5/core/en/params.c     |  1 +
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c    |  1 +
 drivers/net/wireless/mediatek/mt76/mt76.h       |  1 +
 include/linux/skbuff.h                          |  3 +-
 include/net/page_pool.h                         |  2 -
 net/core/page_pool.c                            | 52 +++++--------------
 net/core/skbuff.c                               | 28 ++++++++++
 11 files changed, 50 insertions(+), 42 deletions(-)

diff --git a/drivers/net/ethernet/engleder/tsnep_main.c b/drivers/net/ethernet/engleder/tsnep_main.c
index 84751bb303a6..6222aaa5157f 100644
--- a/drivers/net/ethernet/engleder/tsnep_main.c
+++ b/drivers/net/ethernet/engleder/tsnep_main.c
@@ -28,6 +28,7 @@
 #include
 #include
 #include
+#include <net/page_pool.h>
 #include

 #define TSNEP_RX_OFFSET (max(NET_SKB_PAD, XDP_PACKET_HEADROOM) + NET_IP_ALIGN)
diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 8fbe47703d47..99b3b4f79603 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -38,6 +38,7 @@
 #include
 #include
 #include
+#include <net/page_pool.h>
 #include
 #include
 #include
diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
index a5d03583bf79..d17a0ebc9036 100644
--- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
+++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
@@ -7,6 +7,7 @@
 #include
 #include
+#include <net/page_pool.h>
 #include
 #include
diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
index fe8ea4e531b7..7eca434a0550 100644
--- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
+++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
@@ -16,6 +16,7 @@
 #include
 #include
 #include
+#include <net/page_pool.h>

 #include "otx2_reg.h"
 #include "otx2_common.h"
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
index 5ce28ff7685f..0f152f14165b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
@@ -6,6 +6,7 @@
 #include "en/port.h"
 #include "en_accel/en_accel.h"
 #include "en_accel/ipsec.h"
+#include <net/page_pool.h>
 #include

 static u8 mlx5e_mpwrq_min_page_shift(struct mlx5_core_dev *mdev)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index f0e6095809fa..1bd91bc09eb8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -35,6 +35,7 @@
 #include "en/xdp.h"
 #include "en/params.h"
 #include
+#include <net/page_pool.h>

 int mlx5e_xdp_max_mtu(struct mlx5e_params *params, struct mlx5e_xsk_param *xsk)
 {
diff --git a/drivers/net/wireless/mediatek/mt76/mt76.h b/drivers/net/wireless/mediatek/mt76/mt76.h
index 6b07b8fafec2..95c16f11d156 100644
--- a/drivers/net/wireless/mediatek/mt76/mt76.h
+++ b/drivers/net/wireless/mediatek/mt76/mt76.h
@@ -15,6 +15,7 @@
 #include
 #include
 #include
+#include <net/page_pool.h>

 #include "util.h"
 #include "testmode.h"
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 91ed66952580..f76d172ed262 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -32,7 +32,6 @@
 #include
 #include
 #include
-#include <net/page_pool.h>
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
 #include
 #endif
@@ -3423,6 +3422,8 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f)
 	__skb_frag_ref(&skb_shinfo(skb)->frags[f]);
 }

+bool page_pool_return_skb_page(struct page *page, bool napi_safe);
+
 static inline void napi_frag_unref(skb_frag_t *frag, bool recycle,
 				   bool napi_safe)
 {
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 2b7db9992fc0..829dc1f8ba6b 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -307,8 +307,6 @@ inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool)
 	return pool->p.dma_dir;
 }

-bool page_pool_return_skb_page(struct page *page, bool napi_safe);
-
 struct page_pool *page_pool_create(const struct page_pool_params *params);

 struct xdp_mem_info;
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 985ccaffc06a..dff0b4fa2316 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -582,6 +582,19 @@ static __always_inline struct page *
 __page_pool_put_page(struct page_pool *pool, struct page *page,
 		     unsigned int dma_sync_size, bool allow_direct)
 {
+	/* Avoid recycling non-PP pages, give them back to the page allocator.
+	 * page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
+	 * in order to preserve any existing bits, such as bit 0 for the
+	 * head page of compound page and bit 1 for pfmemalloc page, so
+	 * mask those bits for freeing side when doing below checking,
+	 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
+	 * to avoid recycling the pfmemalloc page.
+	 */
+	if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE)) {
+		put_page(page);
+		return NULL;
+	}
+
 	/* This allocator is optimized for the XDP mode that uses
 	 * one-frame-per-page, but have fallbacks that act like the
 	 * regular page allocator APIs.
@@ -913,42 +926,3 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
 	}
 }
 EXPORT_SYMBOL(page_pool_update_nid);
-
-bool page_pool_return_skb_page(struct page *page, bool napi_safe)
-{
-	struct napi_struct *napi;
-	struct page_pool *pp;
-	bool allow_direct;
-
-	page = compound_head(page);
-
-	/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
-	 * in order to preserve any existing bits, such as bit 0 for the
-	 * head page of compound page and bit 1 for pfmemalloc page, so
-	 * mask those bits for freeing side when doing below checking,
-	 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
-	 * to avoid recycling the pfmemalloc page.
-	 */
-	if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
-		return false;
-
-	pp = page->pp;
-
-	/* Allow direct recycle if we have reasons to believe that we are
-	 * in the same context as the consumer would run, so there's
-	 * no possible race.
-	 */
-	napi = READ_ONCE(pp->p.napi);
-	allow_direct = napi_safe && napi &&
-		       READ_ONCE(napi->list_owner) == smp_processor_id();
-
-	/* Driver set this to memory recycling info. Reset it on recycle.
-	 * This will *not* work for NIC using a split-page memory model.
-	 * The page will be returned to the pool here regardless of the
-	 * 'flipped' fragment being in use or not.
-	 */
-	page_pool_put_full_page(pp, page, allow_direct);
-
-	return true;
-}
-EXPORT_SYMBOL(page_pool_return_skb_page);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7edabf17988a..4b7d00d5b5d7 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -879,6 +879,34 @@ static void skb_clone_fraglist(struct sk_buff *skb)
 		skb_get(list);
 }

+bool page_pool_return_skb_page(struct page *page, bool napi_safe)
+{
+	struct napi_struct *napi;
+	struct page_pool *pp;
+	bool allow_direct;
+
+	page = compound_head(page);
+	pp = page->pp;
+
+	/* Allow direct recycle if we have reasons to believe that we are
+	 * in the same context as the consumer would run, so there's
+	 * no possible race.
+	 */
+	napi = READ_ONCE(pp->p.napi);
+	allow_direct = napi_safe && napi &&
+		       READ_ONCE(napi->list_owner) == smp_processor_id();
+
+	/* Driver set this to memory recycling info. Reset it on recycle.
+	 * This will *not* work for NIC using a split-page memory model.
+	 * The page will be returned to the pool here regardless of the
+	 * 'flipped' fragment being in use or not.
+	 */
+	page_pool_put_full_page(pp, page, allow_direct);
+
+	return true;
+}
+EXPORT_SYMBOL(page_pool_return_skb_page);
+
 static bool skb_pp_recycle(struct sk_buff *skb, void *data, bool napi_safe)
 {
 	if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle)

From patchwork Thu Jun 29 15:23:03 2023
From: Alexander Lobakin
To: "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Alexander Lobakin, Maciej Fijalkowski, Larysa Zaremba, Yunsheng Lin,
    Alexander Duyck, Jesper Dangaard Brouer, Ilias Apalodimas,
    netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH RFC net-next 2/4] net: page_pool: avoid calling no-op externals when possible
Date: Thu, 29 Jun 2023 17:23:03 +0200
Message-ID: <20230629152305.905962-3-aleksander.lobakin@intel.com>
In-Reply-To: <20230629152305.905962-1-aleksander.lobakin@intel.com>
References: <20230629152305.905962-1-aleksander.lobakin@intel.com>

It turns out that page_pool_put{,_full}_page() can burn quite a few
cycles even on DMA-coherent platforms (like x86) with no active IOMMU
or swiotlb, just for the call ladder. Indeed, it's:

page_pool_put_page()
  page_pool_put_defragged_page()          <- external
    __page_pool_put_page()
      page_pool_dma_sync_for_device()     <- non-inline
        dma_sync_single_range_for_device()
          dma_sync_single_for_device()    <- external
            dma_direct_sync_single_for_device()
              dev_is_dma_coherent()       <- exit

For the inline functions, there are no guarantees the compiler won't
uninline them (they're clearly not one-liners and sometimes compilers
uninline even 2 + 2). The first external call is necessary, but the
remaining two or more are done for nothing each time, plus a bunch of
checks here and there.

Since Page Pool mappings are long-term and, for one "device + addr"
pair, dma_need_sync() will always return the same value (basically,
whether it belongs to a swiotlb pool), addresses can be tested once,
right after they're obtained, and the result can be reused until the
page is unmapped.

Define a new PP flag meaning "do DMA syncs for device, but only when
needed" and turn it on by default when the driver asks to sync pages.
When a page is mapped, check whether it needs syncs and, if so, switch
that "sync when needed" back to "always do syncs" globally for the
whole pool (better safe than sorry). As long as a pool has no pages
requiring DMA syncs, this cuts out a good chunk of calls and checks.
On my x86_64, this gives a 2-5% performance benefit with no negative
impact for cases when the IOMMU is on and the shortcut can't be used.

Signed-off-by: Alexander Lobakin
---
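For illustration only, a minimal userspace model of the "sync only when
needed" idea; the flag values and the needs_sync() stand-in for
dma_need_sync() are invented for this sketch, not the kernel API:

/* Userspace model: the pool starts in a "maybe sync" state and is
 * demoted to "always sync" the first time a mapped address actually
 * needs syncing. Names only loosely mirror the patch.
 */
#include <stdbool.h>
#include <stdio.h>

#define FLAG_DMA_SYNC_DEV    (1u << 1)
#define FLAG_DMA_MAYBE_SYNC  (1u << 2)

struct model_pool {
	unsigned int flags;
};

/* Stand-in for dma_need_sync(dev, dma): true e.g. for swiotlb addresses */
static bool needs_sync(unsigned long dma)
{
	return dma >= 0xffff0000UL; /* arbitrary threshold for the model */
}

static void model_map_page(struct model_pool *pool, unsigned long dma)
{
	if ((pool->flags & FLAG_DMA_MAYBE_SYNC) && needs_sync(dma)) {
		/* One page needs syncing -> be safe for the whole pool */
		pool->flags |= FLAG_DMA_SYNC_DEV;
		pool->flags &= ~FLAG_DMA_MAYBE_SYNC;
	}

	if (pool->flags & FLAG_DMA_SYNC_DEV)
		printf("dma %#lx: sync for device\n", dma);
	else
		printf("dma %#lx: sync skipped\n", dma);
}

int main(void)
{
	struct model_pool pool = { .flags = FLAG_DMA_MAYBE_SYNC };

	model_map_page(&pool, 0x1000);     /* coherent: no sync */
	model_map_page(&pool, 0xffff2000); /* needs sync: flips the pool */
	model_map_page(&pool, 0x2000);     /* now synced unconditionally */
	return 0;
}

Once a single mapped address turns out to need syncing, the whole pool
falls back to unconditional syncs, mirroring the "better safe than
sorry" behaviour described above.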
 include/net/page_pool.h |  3 +++
 net/core/page_pool.c    | 10 ++++++++++
 2 files changed, 13 insertions(+)

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 829dc1f8ba6b..ff3772fab707 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -23,6 +23,9 @@
 					* Please note DMA-sync-for-CPU is still
 					* device driver responsibility
 					*/
+#define PP_FLAG_DMA_MAYBE_SYNC	BIT(2)	/* Internal, should not be used in
+					 * drivers
+					 */
 #define PP_FLAG_ALL		(PP_FLAG_DMA_MAP |\
 				 PP_FLAG_DMA_SYNC_DEV)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index dff0b4fa2316..498e058140b3 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -197,6 +197,10 @@ static int page_pool_init(struct page_pool *pool,
 		/* pool->p.offset has to be set according to the address
 		 * offset used by the DMA engine to start copying rx data
 		 */
+
+		/* Try to avoid calling no-op syncs */
+		pool->p.flags |= PP_FLAG_DMA_MAYBE_SYNC;
+		pool->p.flags &= ~PP_FLAG_DMA_SYNC_DEV;
 	}

 #ifdef CONFIG_PAGE_POOL_STATS
@@ -341,6 +345,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)

 	page_pool_set_dma_addr(page, dma);

+	if ((pool->p.flags & PP_FLAG_DMA_MAYBE_SYNC) &&
+	    dma_need_sync(pool->p.dev, dma)) {
+		pool->p.flags |= PP_FLAG_DMA_SYNC_DEV;
+		pool->p.flags &= ~PP_FLAG_DMA_MAYBE_SYNC;
+	}
+
 	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
 		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);

From patchwork Thu Jun 29 15:23:04 2023
From: Alexander Lobakin
To: "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Alexander Lobakin, Maciej Fijalkowski, Larysa Zaremba, Yunsheng Lin,
    Alexander Duyck, Jesper Dangaard Brouer, Ilias Apalodimas,
    netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH RFC net-next 3/4] net: add flag to indicate NAPI/GRO is running right now
Date: Thu, 29 Jun 2023 17:23:04 +0200
Message-ID: <20230629152305.905962-4-aleksander.lobakin@intel.com>
In-Reply-To: <20230629152305.905962-1-aleksander.lobakin@intel.com>
References: <20230629152305.905962-1-aleksander.lobakin@intel.com>

Currently, there's no easy way to check if a NAPI polling cycle is
running and on which CPU, although this might come in very handy in
several cases. Commit 8c48eea3adf3 ("page_pool: allow caching from
safely localized NAPI") added napi_struct::list_owner, BUT it's set
when the NAPI is *scheduled*. `->list_owner == smp_processor_id()`
doesn't mean we're inside the corresponding polling loop.

Introduce a new NAPI state flag, NAPI{,F}_STATE_RUNNING. Set it right
before calling ->poll() and clear it after all napi_gro_flush() and
gro_normal_list() are done. They are run in the same context and, in
fact, are part of the receive routine. When `packets == budget`,
napi_complete_done() is not called, so in that case it's safe to clear
the flag after ->poll() ends. Otherwise, however, napi_complete_done()
can lead to reenabling interrupts and scheduling the NAPI already on
another CPU. In that case, clear the flag in napi_complete_done()
itself.

The main use case for the flag is as follows:

	if (test_bit(NAPI_STATE_RUNNING, &n->state) &&
	    READ_ONCE(n->list_owner) == smp_processor_id())
		/* you're inside n->poll() or the following GRO
		 * processing context
		 */

IOW, when the condition is true, feel free to use resources protected
by this NAPI, such as page_pools covered by it, percpu NAPI caches etc.
Just make sure interrupts are enabled.

Signed-off-by: Alexander Lobakin
---
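For illustration only, a minimal userspace model of that contract; all
names below are the model's own and not part of this series -- in the
kernel the check is exactly the test_bit()/READ_ONCE() pair above:

/* Userspace model of the NAPI_STATE_RUNNING contract: the bit is set
 * around ->poll() (plus the trailing GRO flush) and consumers combine
 * it with the owner-CPU check. this_cpu() stands in for
 * smp_processor_id().
 */
#include <stdbool.h>
#include <stdio.h>

#define MODEL_STATE_RUNNING (1u << 0)

struct model_napi {
	unsigned int state;
	int list_owner;
};

static int this_cpu(void) { return 1; } /* stand-in for smp_processor_id() */

static bool model_in_napi_context(const struct model_napi *n)
{
	return (n->state & MODEL_STATE_RUNNING) &&
	       n->list_owner == this_cpu();
}

static void model_poll(struct model_napi *n)
{
	n->state |= MODEL_STATE_RUNNING;  /* set right before ->poll() */
	printf("inside poll: safe=%d\n", model_in_napi_context(n));
	n->state &= ~MODEL_STATE_RUNNING; /* cleared after GRO is flushed */
}

int main(void)
{
	struct model_napi n = { .state = 0, .list_owner = 1 };

	printf("before poll: safe=%d\n", model_in_napi_context(&n));
	model_poll(&n);
	printf("after poll:  safe=%d\n", model_in_napi_context(&n));
	return 0;
}

The bit alone is not enough; it is only meaningful together with the
owner-CPU comparison, just as in the snippet above.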
 include/linux/netdevice.h |  2 ++
 net/core/dev.c            | 23 +++++++++++++++++------
 2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b828c7a75be2..db3aea863ea9 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -389,6 +389,7 @@ enum {
 	NAPI_STATE_PREFER_BUSY_POLL,	/* prefer busy-polling over softirq processing*/
 	NAPI_STATE_THREADED,		/* The poll is performed inside its own thread*/
 	NAPI_STATE_SCHED_THREADED,	/* Napi is currently scheduled in threaded mode */
+	NAPI_STATE_RUNNING,		/* This NAPI or GRO is running on ::list_owner */
 };

 enum {
@@ -402,6 +403,7 @@ enum {
 	NAPIF_STATE_PREFER_BUSY_POLL	= BIT(NAPI_STATE_PREFER_BUSY_POLL),
 	NAPIF_STATE_THREADED		= BIT(NAPI_STATE_THREADED),
 	NAPIF_STATE_SCHED_THREADED	= BIT(NAPI_STATE_SCHED_THREADED),
+	NAPIF_STATE_RUNNING		= BIT(NAPI_STATE_RUNNING),
 };

 enum gro_result {
diff --git a/net/core/dev.c b/net/core/dev.c
index 69a3e544676c..7f0d23c9e25e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6039,7 +6039,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)

 		new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED |
 			      NAPIF_STATE_SCHED_THREADED |
-			      NAPIF_STATE_PREFER_BUSY_POLL);
+			      NAPIF_STATE_PREFER_BUSY_POLL |
+			      NAPIF_STATE_RUNNING);

 		/* If STATE_MISSED was set, leave STATE_SCHED set,
 		 * because we will call napi->poll() one more time.
@@ -6128,6 +6129,7 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock, bool
 	/* All we really want here is to re-enable device interrupts.
 	 * Ideally, a new ndo_busy_poll_stop() could avoid another round.
 	 */
+	set_bit(NAPI_STATE_RUNNING, &napi->state);
 	rc = napi->poll(napi, budget);
 	/* We can't gro_normal_list() here, because napi->poll() might have
 	 * rearmed the napi (napi_complete_done()) in which case it could
@@ -6135,8 +6137,10 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock, bool
 	 */
 	trace_napi_poll(napi, rc, budget);
 	netpoll_poll_unlock(have_poll_lock);
-	if (rc == budget)
+	if (rc == budget) {
 		__busy_poll_stop(napi, skip_schedule);
+		clear_bit(NAPI_STATE_RUNNING, &napi->state);
+	}
 	local_bh_enable();
 }
@@ -6186,9 +6190,11 @@ void napi_busy_loop(unsigned int napi_id,
 			have_poll_lock = netpoll_poll_lock(napi);
 			napi_poll = napi->poll;
 		}
+		set_bit(NAPI_STATE_RUNNING, &napi->state);
 		work = napi_poll(napi, budget);
 		trace_napi_poll(napi, work, budget);
 		gro_normal_list(napi);
+		clear_bit(NAPI_STATE_RUNNING, &napi->state);
count:
 		if (work > 0)
 			__NET_ADD_STATS(dev_net(napi->dev),
@@ -6457,6 +6463,7 @@ static int __napi_poll(struct napi_struct *n, bool *repoll)
 	 */
 	work = 0;
 	if (test_bit(NAPI_STATE_SCHED, &n->state)) {
+		set_bit(NAPI_STATE_RUNNING, &n->state);
 		work = n->poll(n, weight);
 		trace_napi_poll(n, work, weight);
 	}
@@ -6466,7 +6473,7 @@ static int __napi_poll(struct napi_struct *n, bool *repoll)
 			    n->poll, work, weight);

 	if (likely(work < weight))
-		return work;
+		goto out;

 	/* Drivers must not modify the NAPI state if they
 	 * consume the entire weight. In such cases this code
@@ -6475,7 +6482,7 @@
 	 */
 	if (unlikely(napi_disable_pending(n))) {
 		napi_complete(n);
-		return work;
+		goto out;
 	}

 	/* The NAPI context has more processing work, but busy-polling
@@ -6488,7 +6495,7 @@
 		 */
 		napi_schedule(n);
 		}
-		return work;
+		goto out;
 	}

 	if (n->gro_bitmask) {
@@ -6506,11 +6513,15 @@
 	if (unlikely(!list_empty(&n->poll_list))) {
 		pr_warn_once("%s: Budget exhausted after napi rescheduled\n",
 			     n->dev ? n->dev->name : "backlog");
-		return work;
+		goto out;
 	}

 	*repoll = true;

+out:
+	if (READ_ONCE(n->list_owner) == smp_processor_id())
+		clear_bit(NAPI_STATE_RUNNING, &n->state);
+
 	return work;
 }

From patchwork Thu Jun 29 15:23:05 2023
From: Alexander Lobakin
Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni Cc: Alexander Lobakin , Maciej Fijalkowski , Larysa Zaremba , Yunsheng Lin , Alexander Duyck , Jesper Dangaard Brouer , Ilias Apalodimas , netdev@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH RFC net-next 4/4] net: skbuff: always recycle PP pages directly when inside a NAPI loop Date: Thu, 29 Jun 2023 17:23:05 +0200 Message-ID: <20230629152305.905962-5-aleksander.lobakin@intel.com> X-Mailer: git-send-email 2.41.0 In-Reply-To: <20230629152305.905962-1-aleksander.lobakin@intel.com> References: <20230629152305.905962-1-aleksander.lobakin@intel.com> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,SPF_HELO_NONE, SPF_NONE,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net X-Patchwork-Delegate: kuba@kernel.org X-Patchwork-State: RFC Commit 8c48eea3adf3 ("page_pool: allow caching from safely localized NAPI") allowed direct recycling of skb pages to their PP for some cases, but unfortunately missed a couple other majors. For example, %XDP_DROP in skb mode. The netstack just calls kfree_skb(), which unconditionally passes `false` as @napi_safe. Thus, all pages go through ptr_ring and locks, although most of times we're actually inside the NAPI polling this PP is linked with, so that it would be perfectly safe to recycle pages directly. Let's address such. If @napi_safe is true, we're fine, don't change anything for this path. But if it's false, test the introduced %NAPI_STATE_RUNNING. There's good probability it will be set and, if ->list_owner is our current CPU, we're good to use direct recycling, even though @napi_safe is false. For the mentioned xdp-drop-skb-mode case, the improvement I got is 3-4% in Mpps. As for page_pool stats, recycle_ring is now 0 and alloc_slow counter doesn't change most of times, which means the MM layer is not even called to allocate any new pages. Signed-off-by: Alexander Lobakin --- net/core/skbuff.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 4b7d00d5b5d7..931c83d7b251 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -893,7 +893,8 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe) * no possible race. */ napi = READ_ONCE(pp->p.napi); - allow_direct = napi_safe && napi && + allow_direct = napi && + (napi_safe || test_bit(NAPI_STATE_RUNNING, &napi->state)) && READ_ONCE(napi->list_owner) == smp_processor_id(); /* Driver set this to memory recycling info. Reset it on recycle.