From patchwork Wed Jan 15 15:18:53 2025
X-Patchwork-Submitter: Alexander Lobakin
X-Patchwork-Id: 13940537
From: Alexander Lobakin
To: Andrew Lunn, "David S. Miller", Eric Dumazet, Jakub Kicinski,
 Paolo Abeni
Cc: Alexander Lobakin, Lorenzo Bianconi, Daniel Xu, Alexei Starovoitov,
 Daniel Borkmann, Andrii Nakryiko, John Fastabend,
 Toke Høiland-Jørgensen, Jesper Dangaard Brouer, Martin KaFai Lau,
 netdev@vger.kernel.org, bpf@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH net-next v3 0/8] bpf: cpumap: enable GRO for XDP_PASS frames
Date: Wed, 15 Jan 2025 16:18:53 +0100
Message-ID: <20250115151901.2063909-1-aleksander.lobakin@intel.com>
X-Mailer: git-send-email 2.48.0
X-Patchwork-Delegate: kuba@kernel.org

Several months ago, I was looking through my old XDP hints tree[0] to
check whether some patches not directly related to hints could be sent
standalone. Roughly at the same time, Daniel appeared and asked[1] about
GRO for cpumap from that tree.

Currently, cpumap uses its own kthread which processes cpumap-redirected
frames in batches of 8, without any weighting (but with rescheduling
points). The resulting skbs get passed to the stack via
netif_receive_skb_list(), which means no GRO happens. Even though we
can't currently pass checksum status from the drivers, in many cases GRO
performs better than listified Rx without aggregation, as confirmed by
tests.

In order to enable GRO in cpumap, we need to do the following:

* patches 1-2: decouple the GRO struct from the NAPI struct and allow
  using it outside of a NAPI entity within the kernel core code;
* patch 3: switch cpumap from netif_receive_skb_list() to
  gro_receive_skb().

Additional improvements:

* patch 4: optimize XDP_PASS in cpumap by using arrays instead of
  linked lists;
* patches 5-6: introduce and use a function to get skbs from the NAPI
  percpu caches in bulk, rather than one at a time;
* patches 7-8: use that function in veth as well and remove the one it
  supersedes.
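The bulk-get idea behind patches 5-6 can be sketched in plain userspace C. This is a simplified simulation, not the kernel implementation: the names, the cache size, and the malloc() fallback are illustrative stand-ins for the NAPI percpu skb cache and its bulk allocator.

```c
/* Simplified userspace sketch of a napi_skb_cache_get_bulk()-style
 * helper: keep a small cache of preallocated buffers and hand out up
 * to n of them at once, refilling the cache in bulk when it runs dry,
 * instead of allocating one buffer per frame.
 */
#include <stddef.h>
#include <stdlib.h>

#define CACHE_SIZE	64
#define BULK_REFILL	16

struct buf_cache {
	size_t	count;
	void	*bufs[CACHE_SIZE];
};

/* Refill the cache with up to BULK_REFILL buffers in one go, the way
 * a bulk slab allocation would be used in the kernel. */
static void cache_refill(struct buf_cache *c)
{
	while (c->count < BULK_REFILL)
		c->bufs[c->count++] = malloc(256);
}

/* Hand out n buffers at once; returns how many were provided. */
static size_t cache_get_bulk(struct buf_cache *c, void **out, size_t n)
{
	size_t got = 0;

	if (c->count < n)
		cache_refill(c);

	while (got < n && c->count)
		out[got++] = c->bufs[--c->count];

	return got;
}
```

The point of the batching is that the refill cost is amortized over many frames: a batch of 8 XDP_PASS frames takes at most one refill instead of 8 independent allocations.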
My trafficgen UDP GRO tests, small frame sizes:

                GRO off   GRO on
 baseline       2.7       N/A      Mpps
 patch 3        2.3       4        Mpps
 patch 8        2.4       4.7      Mpps
 1...3 diff     -17       +48      %
 1...8 diff     -11       +74      %

Daniel reported from +14%[2] to +18%[3] of throughput in neper's TCP RR
tests. On my system, however, the same test gave me up to +100%.

Note that there's a series from Lorenzo[4] which achieves the same goal,
but in a different way. During the discussions, the approach using a
standalone GRO instance was preferred over the threaded NAPI.

[0] https://github.com/alobakin/linux/tree/xdp_hints
[1] https://lore.kernel.org/bpf/cadda351-6e93-4568-ba26-21a760bf9a57@app.fastmail.com
[2] https://lore.kernel.org/bpf/merfatcdvwpx2lj4j2pahhwp4vihstpidws3jwljwazhh76xkd@t5vsh4gvk4mh
[3] https://lore.kernel.org/bpf/yzda66wro5twmzpmjoxvy4si5zvkehlmgtpi6brheek3sj73tj@o7kd6nurr3o6
[4] https://lore.kernel.org/bpf/20241130-cpumap-gro-v1-0-c1180b1b5758@kernel.org

Alexander Lobakin (8):
  net: gro: decouple GRO from the NAPI layer
  net: gro: expose GRO init/cleanup to use outside of NAPI
  bpf: cpumap: switch to GRO from netif_receive_skb_list()
  bpf: cpumap: reuse skb array instead of a linked list to chain skbs
  net: skbuff: introduce napi_skb_cache_get_bulk()
  bpf: cpumap: switch to napi_skb_cache_get_bulk()
  veth: use napi_skb_cache_get_bulk() instead of xdp_alloc_skb_bulk()
  xdp: remove xdp_alloc_skb_bulk()

 include/linux/netdevice.h                  |  26 ++--
 include/linux/skbuff.h                     |   1 +
 include/net/busy_poll.h                    |  11 +-
 include/net/gro.h                          |  38 ++++--
 include/net/xdp.h                          |   1 -
 drivers/net/ethernet/brocade/bna/bnad.c    |   1 +
 drivers/net/ethernet/cortina/gemini.c      |   1 +
 drivers/net/veth.c                         |   3 +-
 drivers/net/wwan/t7xx/t7xx_hif_dpmaif_rx.c |   1 +
 kernel/bpf/cpumap.c                        | 145 +++++++++++++--------
 net/core/dev.c                             |  77 +++--------
 net/core/gro.c                             | 101 +++++++++-----
 net/core/skbuff.c                          |  62 +++++++++
 net/core/xdp.c                             |  10 --
 14 files changed, 298 insertions(+), 180 deletions(-)

---
From v2[5]:
* 1: remove the napi_id duplication in both &gro_node and &napi_struct
  by using a tagged struct group. The most efficient approach I've found
  so far: no additional branches, no inline expansion, no tail calls /
  double calls, saves 8 bytes of &napi_struct in comparison with v2
  (Jakub, Paolo, me);
* 4: improve and streamline the skb allocation failure handling
  (-1 branch per frame), skip more code for skb-only batches.

From v1[6]:
* use a standalone GRO instance instead of the threaded NAPI (Jakub);
* rebase and send to net-next as it's now more networking than BPF.

[5] https://lore.kernel.org/netdev/20250107152940.26530-1-aleksander.lobakin@intel.com
[6] https://lore.kernel.org/bpf/20240830162508.1009458-1-aleksander.lobakin@intel.com
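For readers who haven't met the kernel's struct_group_tagged() (from include/linux/stddef.h), the following userspace sketch shows the pattern the v2 change relies on: the shared members are declared once, yet are addressable both directly from the containing struct and as a standalone tagged struct, so no duplicated napi_id copy (and no extra branch or dereference) is needed. The macro here is a simplified reimplementation without attribute handling, and fake_napi/fake_gro and their members are made up for illustration, not the actual kernel layout.

```c
/* Minimal reimplementation of the struct_group_tagged() idea: an
 * anonymous union overlays an anonymous struct (giving direct member
 * access) with an identically-laid-out tagged struct (giving access
 * as a named sub-object). Both views share the same storage.
 */
#include <stddef.h>

#define struct_group_tagged(TAG, NAME, ...)		\
	union {						\
		struct { __VA_ARGS__ };			\
		struct TAG { __VA_ARGS__ } NAME;	\
	}

/* Illustrative stand-ins for napi_struct / gro_node. */
struct fake_napi {
	int weight;
	/* The GRO members live once: reachable as n.napi_id and also
	 * as n.gro.napi_id via the struct fake_gro view. */
	struct_group_tagged(fake_gro, gro,
		unsigned int	napi_id;
		unsigned long	bitmask;
	);
};
```

Because both views alias the same bytes, a function taking a `struct fake_gro *` can simply be handed `&n.gro`, while NAPI-side code keeps using `n.napi_id` directly with no copy to keep in sync.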