From patchwork Wed Jul 13 11:14:09 2022
X-Patchwork-Submitter: Toke Høiland-Jørgensen
X-Patchwork-Id: 12916569
X-Patchwork-Delegate: kuba@kernel.org
From: Toke Høiland-Jørgensen
To: "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Kumar Kartikeya Dwivedi, netdev@vger.kernel.org, bpf@vger.kernel.org, Freysteinn Alfredsson, Cong Wang, Toke Høiland-Jørgensen
Subject: [RFC PATCH 01/17] dev: Move received_rps counter next to RPS members in softnet data
Date: Wed, 13 Jul 2022 13:14:09 +0200
Message-Id: <20220713111430.134810-2-toke@redhat.com>
In-Reply-To: <20220713111430.134810-1-toke@redhat.com>
References: <20220713111430.134810-1-toke@redhat.com>
X-Patchwork-State: RFC

Move the received_rps counter value next to the other RPS-related members
in softnet_data. This closes two four-byte holes in the structure, making
room for another pointer in the first two cache lines without bumping the
xmit struct to its own line.

Signed-off-by: Toke Høiland-Jørgensen
---
 include/linux/netdevice.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1a3cb93c3dcc..fe9aeca2fce9 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3100,7 +3100,6 @@ struct softnet_data {
 	/* stats */
 	unsigned int		processed;
 	unsigned int		time_squeeze;
-	unsigned int		received_rps;
 #ifdef CONFIG_RPS
 	struct softnet_data	*rps_ipi_list;
 #endif
@@ -3133,6 +3132,7 @@ struct softnet_data {
 	unsigned int		cpu;
 	unsigned int		input_queue_tail;
 #endif
+	unsigned int		received_rps;
 	unsigned int		dropped;
 	struct sk_buff_head	input_pkt_queue;
 	struct napi_struct	backlog;

From patchwork Wed Jul 13 11:14:10 2022
X-Patchwork-Submitter: Toke Høiland-Jørgensen
X-Patchwork-Id: 12916566
X-Patchwork-Delegate: bpf@iogearbox.net
From: Toke Høiland-Jørgensen
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, "David S. Miller", Jakub Kicinski, Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon
Cc: Kumar Kartikeya Dwivedi, netdev@vger.kernel.org, bpf@vger.kernel.org, Freysteinn Alfredsson, Cong Wang, Toke Høiland-Jørgensen, Eric Dumazet, Paolo Abeni
Subject: [RFC PATCH 02/17] bpf: Expand map key argument of bpf_redirect_map to u64
Date: Wed, 13 Jul 2022 13:14:10 +0200
Message-Id: <20220713111430.134810-3-toke@redhat.com>
In-Reply-To: <20220713111430.134810-1-toke@redhat.com>
References: <20220713111430.134810-1-toke@redhat.com>
X-Patchwork-State: RFC

We want to be able to support 64-bit indexes for PIFO maps, so expand the
width of the 'key' argument to the bpf_redirect_map() helper. Since BPF
registers are always 64-bit, this should be safe to do after the fact.
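As an illustration only (not part of this patch or its selftests): with the
key widened to u64, an XDP program can hand a full 64-bit value to
bpf_redirect_map(), which is what the PIFO map type added later in this
series uses as the packet priority. The sketch below assumes the
BPF_MAP_TYPE_PIFO_XDP type from patch 4 and libbpf support for the
map_extra attribute in BTF-defined maps; names are invented.

/* Sketch: enqueue packets into a (hypothetical, from patch 4) PIFO map,
 * using a 64-bit priority as the bpf_redirect_map() key.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_PIFO_XDP);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
	__uint(max_entries, 4096);
	__uint(map_extra, 4096);  /* valid priority range; must be a power of two */
} pifo SEC(".maps");

SEC("xdp")
int enqueue_prio(struct xdp_md *ctx)
{
	__u64 prio = ctx->rx_queue_index;  /* any 64-bit value is accepted now */

	return bpf_redirect_map(&pifo, prio, 0);
}

char _license[] SEC("license") = "GPL";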
Signed-off-by: Toke Høiland-Jørgensen --- include/linux/bpf.h | 2 +- include/linux/filter.h | 12 ++++++------ include/uapi/linux/bpf.h | 2 +- kernel/bpf/cpumap.c | 4 ++-- kernel/bpf/devmap.c | 4 ++-- kernel/bpf/verifier.c | 2 +- net/core/filter.c | 4 ++-- net/xdp/xskmap.c | 4 ++-- 8 files changed, 17 insertions(+), 17 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 2b21f2a3452f..d877d9825e77 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -132,7 +132,7 @@ struct bpf_map_ops { struct bpf_local_storage __rcu ** (*map_owner_storage_ptr)(void *owner); /* Misc helpers.*/ - int (*map_redirect)(struct bpf_map *map, u32 ifindex, u64 flags); + int (*map_redirect)(struct bpf_map *map, u64 key, u64 flags); /* map_meta_equal must be implemented for maps that can be * used as an inner map. It is a runtime check to ensure diff --git a/include/linux/filter.h b/include/linux/filter.h index 4c1a8b247545..10167ab1ef95 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -637,13 +637,13 @@ struct bpf_nh_params { }; struct bpf_redirect_info { - u32 flags; - u32 tgt_index; + u64 tgt_index; void *tgt_value; struct bpf_map *map; + u32 flags; + u32 kern_flags; u32 map_id; enum bpf_map_type map_type; - u32 kern_flags; struct bpf_nh_params nh; }; @@ -1486,7 +1486,7 @@ static inline bool bpf_sk_lookup_run_v6(struct net *net, int protocol, } #endif /* IS_ENABLED(CONFIG_IPV6) */ -static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifindex, +static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u64 index, u64 flags, const u64 flag_mask, void *lookup_elem(struct bpf_map *map, u32 key)) { @@ -1497,7 +1497,7 @@ static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifind if (unlikely(flags & ~(action_mask | flag_mask))) return XDP_ABORTED; - ri->tgt_value = lookup_elem(map, ifindex); + ri->tgt_value = lookup_elem(map, index); if (unlikely(!ri->tgt_value) && !(flags & BPF_F_BROADCAST)) { /* If the lookup fails we want to clear out the state in the * redirect_info struct completely, so that if an eBPF program @@ -1509,7 +1509,7 @@ static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifind return flags & action_mask; } - ri->tgt_index = ifindex; + ri->tgt_index = index; ri->map_id = map->id; ri->map_type = map->map_type; diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 379e68fb866f..aec623f60048 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2607,7 +2607,7 @@ union bpf_attr { * Return * 0 on success, or a negative error in case of failure. * - * long bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags) + * long bpf_redirect_map(struct bpf_map *map, u64 key, u64 flags) * Description * Redirect the packet to the endpoint referenced by *map* at * index *key*. 
Depending on its type, this *map* can contain diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c index f4860ac756cd..2e7ee53ae3e4 100644 --- a/kernel/bpf/cpumap.c +++ b/kernel/bpf/cpumap.c @@ -668,9 +668,9 @@ static int cpu_map_get_next_key(struct bpf_map *map, void *key, void *next_key) return 0; } -static int cpu_map_redirect(struct bpf_map *map, u32 ifindex, u64 flags) +static int cpu_map_redirect(struct bpf_map *map, u64 index, u64 flags) { - return __bpf_xdp_redirect_map(map, ifindex, flags, 0, + return __bpf_xdp_redirect_map(map, index, flags, 0, __cpu_map_lookup_elem); } diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c index c2867068e5bd..980f8928e977 100644 --- a/kernel/bpf/devmap.c +++ b/kernel/bpf/devmap.c @@ -992,14 +992,14 @@ static int dev_map_hash_update_elem(struct bpf_map *map, void *key, void *value, map, key, value, map_flags); } -static int dev_map_redirect(struct bpf_map *map, u32 ifindex, u64 flags) +static int dev_map_redirect(struct bpf_map *map, u64 ifindex, u64 flags) { return __bpf_xdp_redirect_map(map, ifindex, flags, BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS, __dev_map_lookup_elem); } -static int dev_hash_map_redirect(struct bpf_map *map, u32 ifindex, u64 flags) +static int dev_hash_map_redirect(struct bpf_map *map, u64 ifindex, u64 flags) { return __bpf_xdp_redirect_map(map, ifindex, flags, BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS, diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 328cfab3af60..039f7b61c305 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -14183,7 +14183,7 @@ static int do_misc_fixups(struct bpf_verifier_env *env) BUILD_BUG_ON(!__same_type(ops->map_peek_elem, (int (*)(struct bpf_map *map, void *value))NULL)); BUILD_BUG_ON(!__same_type(ops->map_redirect, - (int (*)(struct bpf_map *map, u32 ifindex, u64 flags))NULL)); + (int (*)(struct bpf_map *map, u64 index, u64 flags))NULL)); BUILD_BUG_ON(!__same_type(ops->map_for_each_callback, (int (*)(struct bpf_map *map, bpf_callback_t callback_fn, diff --git a/net/core/filter.c b/net/core/filter.c index 4ef77ec5255e..e23e53ed1b04 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -4408,10 +4408,10 @@ static const struct bpf_func_proto bpf_xdp_redirect_proto = { .arg2_type = ARG_ANYTHING, }; -BPF_CALL_3(bpf_xdp_redirect_map, struct bpf_map *, map, u32, ifindex, +BPF_CALL_3(bpf_xdp_redirect_map, struct bpf_map *, map, u64, key, u64, flags) { - return map->ops->map_redirect(map, ifindex, flags); + return map->ops->map_redirect(map, key, flags); } static const struct bpf_func_proto bpf_xdp_redirect_map_proto = { diff --git a/net/xdp/xskmap.c b/net/xdp/xskmap.c index acc8e52a4f5f..771d0fa90ef5 100644 --- a/net/xdp/xskmap.c +++ b/net/xdp/xskmap.c @@ -231,9 +231,9 @@ static int xsk_map_delete_elem(struct bpf_map *map, void *key) return 0; } -static int xsk_map_redirect(struct bpf_map *map, u32 ifindex, u64 flags) +static int xsk_map_redirect(struct bpf_map *map, u64 index, u64 flags) { - return __bpf_xdp_redirect_map(map, ifindex, flags, 0, + return __bpf_xdp_redirect_map(map, index, flags, 0, __xsk_map_lookup_elem); } From patchwork Wed Jul 13 11:14:11 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= X-Patchwork-Id: 12916568 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by 
From: Toke Høiland-Jørgensen
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, "David S.
Miller" , Jakub Kicinski , Jesper Dangaard Brouer Cc: Kumar Kartikeya Dwivedi , netdev@vger.kernel.org, bpf@vger.kernel.org, Freysteinn Alfredsson , Cong Wang , =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rg?= =?utf-8?q?ensen?= , Eric Dumazet , Paolo Abeni Subject: [RFC PATCH 03/17] bpf: Use 64-bit return value for bpf_prog_run Date: Wed, 13 Jul 2022 13:14:11 +0200 Message-Id: <20220713111430.134810-4-toke@redhat.com> X-Mailer: git-send-email 2.37.0 In-Reply-To: <20220713111430.134810-1-toke@redhat.com> References: <20220713111430.134810-1-toke@redhat.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC From: Kumar Kartikeya Dwivedi BPF ABI always uses 64-bit return value, but so far __bpf_prog_run and higher level wrappers always truncated the return value to 32-bit. Future patches introducing a new BPF_PROG_TYPE_DEQUEUE return a PTR_TO_BTF_ID or NULL from the BPF program to the caller context in the kernel. The verifier is taught to enforce a successful return of such a referenced PTR_TO_BTF_ID, explicit release, or the return of a NULL pointer to indicate absence. To be able to use this returned pointer value, the bpf_prog_run invocation needs to be able to return a 64-bit value. To avoid code churn in the whole kernel, we let the compiler handle truncation normally, and allow new call sites to utilize the 64-bit return value, by receiving the return value as a u64. Signed-off-by: Kumar Kartikeya Dwivedi Signed-off-by: Toke Høiland-Jørgensen --- include/linux/bpf-cgroup.h | 12 ++++++------ include/linux/bpf.h | 14 +++++++------- include/linux/filter.h | 34 +++++++++++++++++----------------- kernel/bpf/cgroup.c | 12 ++++++------ kernel/bpf/core.c | 14 +++++++------- kernel/bpf/offload.c | 4 ++-- net/bpf/test_run.c | 21 ++++++++++++--------- net/packet/af_packet.c | 7 +++++-- 8 files changed, 62 insertions(+), 56 deletions(-) diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h index 2bd1b5f8de9b..e975f89c491b 100644 --- a/include/linux/bpf-cgroup.h +++ b/include/linux/bpf-cgroup.h @@ -23,12 +23,12 @@ struct ctl_table; struct ctl_table_header; struct task_struct; -unsigned int __cgroup_bpf_run_lsm_sock(const void *ctx, - const struct bpf_insn *insn); -unsigned int __cgroup_bpf_run_lsm_socket(const void *ctx, - const struct bpf_insn *insn); -unsigned int __cgroup_bpf_run_lsm_current(const void *ctx, - const struct bpf_insn *insn); +u64 __cgroup_bpf_run_lsm_sock(const void *ctx, + const struct bpf_insn *insn); +u64 __cgroup_bpf_run_lsm_socket(const void *ctx, + const struct bpf_insn *insn); +u64 __cgroup_bpf_run_lsm_current(const void *ctx, + const struct bpf_insn *insn); #ifdef CONFIG_CGROUP_BPF diff --git a/include/linux/bpf.h b/include/linux/bpf.h index d877d9825e77..ebe6f2d95182 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -56,8 +56,8 @@ typedef u64 (*bpf_callback_t)(u64, u64, u64, u64, u64); typedef int (*bpf_iter_init_seq_priv_t)(void *private_data, struct bpf_iter_aux_info *aux); typedef void (*bpf_iter_fini_seq_priv_t)(void *private_data); -typedef unsigned int (*bpf_func_t)(const void *, - const struct bpf_insn *); +typedef u64 (*bpf_func_t)(const void *, + const struct bpf_insn *); struct bpf_iter_seq_info { const struct seq_operations *seq_ops; bpf_iter_init_seq_priv_t init_seq_private; @@ -882,7 +882,7 @@ struct bpf_dispatcher { struct bpf_ksym ksym; }; -static __always_inline __nocfi unsigned int bpf_dispatcher_nop_func( +static __always_inline __nocfi u64 
bpf_dispatcher_nop_func( const void *ctx, const struct bpf_insn *insnsi, bpf_func_t bpf_func) @@ -911,7 +911,7 @@ int arch_prepare_bpf_dispatcher(void *image, s64 *funcs, int num_funcs); } #define DEFINE_BPF_DISPATCHER(name) \ - noinline __nocfi unsigned int bpf_dispatcher_##name##_func( \ + noinline __nocfi u64 bpf_dispatcher_##name##_func( \ const void *ctx, \ const struct bpf_insn *insnsi, \ bpf_func_t bpf_func) \ @@ -922,7 +922,7 @@ int arch_prepare_bpf_dispatcher(void *image, s64 *funcs, int num_funcs); struct bpf_dispatcher bpf_dispatcher_##name = \ BPF_DISPATCHER_INIT(bpf_dispatcher_##name); #define DECLARE_BPF_DISPATCHER(name) \ - unsigned int bpf_dispatcher_##name##_func( \ + u64 bpf_dispatcher_##name##_func( \ const void *ctx, \ const struct bpf_insn *insnsi, \ bpf_func_t bpf_func); \ @@ -1127,7 +1127,7 @@ struct bpf_prog { u8 tag[BPF_TAG_SIZE]; struct bpf_prog_stats __percpu *stats; int __percpu *active; - unsigned int (*bpf_func)(const void *ctx, + u64 (*bpf_func)(const void *ctx, const struct bpf_insn *insn); struct bpf_prog_aux *aux; /* Auxiliary fields */ struct sock_fprog_kern *orig_prog; /* Original BPF program */ @@ -1472,7 +1472,7 @@ static inline void bpf_reset_run_ctx(struct bpf_run_ctx *old_ctx) /* BPF program asks to set CN on the packet. */ #define BPF_RET_SET_CN (1 << 0) -typedef u32 (*bpf_prog_run_fn)(const struct bpf_prog *prog, const void *ctx); +typedef u64 (*bpf_prog_run_fn)(const struct bpf_prog *prog, const void *ctx); static __always_inline u32 bpf_prog_run_array(const struct bpf_prog_array *array, diff --git a/include/linux/filter.h b/include/linux/filter.h index 10167ab1ef95..b0ddb647d5f2 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -567,16 +567,16 @@ struct sk_filter { DECLARE_STATIC_KEY_FALSE(bpf_stats_enabled_key); -typedef unsigned int (*bpf_dispatcher_fn)(const void *ctx, - const struct bpf_insn *insnsi, - unsigned int (*bpf_func)(const void *, - const struct bpf_insn *)); +typedef u64 (*bpf_dispatcher_fn)(const void *ctx, + const struct bpf_insn *insnsi, + u64 (*bpf_func)(const void *, + const struct bpf_insn *)); -static __always_inline u32 __bpf_prog_run(const struct bpf_prog *prog, +static __always_inline u64 __bpf_prog_run(const struct bpf_prog *prog, const void *ctx, bpf_dispatcher_fn dfunc) { - u32 ret; + u64 ret; cant_migrate(); if (static_branch_unlikely(&bpf_stats_enabled_key)) { @@ -596,7 +596,7 @@ static __always_inline u32 __bpf_prog_run(const struct bpf_prog *prog, return ret; } -static __always_inline u32 bpf_prog_run(const struct bpf_prog *prog, const void *ctx) +static __always_inline u64 bpf_prog_run(const struct bpf_prog *prog, const void *ctx) { return __bpf_prog_run(prog, ctx, bpf_dispatcher_nop_func); } @@ -609,10 +609,10 @@ static __always_inline u32 bpf_prog_run(const struct bpf_prog *prog, const void * invocation of a BPF program does not require reentrancy protection * against a BPF program which is invoked from a preempting task. 
*/ -static inline u32 bpf_prog_run_pin_on_cpu(const struct bpf_prog *prog, +static inline u64 bpf_prog_run_pin_on_cpu(const struct bpf_prog *prog, const void *ctx) { - u32 ret; + u64 ret; migrate_disable(); ret = bpf_prog_run(prog, ctx); @@ -708,13 +708,13 @@ static inline u8 *bpf_skb_cb(const struct sk_buff *skb) } /* Must be invoked with migration disabled */ -static inline u32 __bpf_prog_run_save_cb(const struct bpf_prog *prog, +static inline u64 __bpf_prog_run_save_cb(const struct bpf_prog *prog, const void *ctx) { const struct sk_buff *skb = ctx; u8 *cb_data = bpf_skb_cb(skb); u8 cb_saved[BPF_SKB_CB_LEN]; - u32 res; + u64 res; if (unlikely(prog->cb_access)) { memcpy(cb_saved, cb_data, sizeof(cb_saved)); @@ -729,10 +729,10 @@ static inline u32 __bpf_prog_run_save_cb(const struct bpf_prog *prog, return res; } -static inline u32 bpf_prog_run_save_cb(const struct bpf_prog *prog, +static inline u64 bpf_prog_run_save_cb(const struct bpf_prog *prog, struct sk_buff *skb) { - u32 res; + u64 res; migrate_disable(); res = __bpf_prog_run_save_cb(prog, skb); @@ -740,11 +740,11 @@ static inline u32 bpf_prog_run_save_cb(const struct bpf_prog *prog, return res; } -static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog, +static inline u64 bpf_prog_run_clear_cb(const struct bpf_prog *prog, struct sk_buff *skb) { u8 *cb_data = bpf_skb_cb(skb); - u32 res; + u64 res; if (unlikely(prog->cb_access)) memset(cb_data, 0, BPF_SKB_CB_LEN); @@ -759,14 +759,14 @@ DECLARE_STATIC_KEY_FALSE(bpf_master_redirect_enabled_key); u32 xdp_master_redirect(struct xdp_buff *xdp); -static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog, +static __always_inline u64 bpf_prog_run_xdp(const struct bpf_prog *prog, struct xdp_buff *xdp) { /* Driver XDP hooks are invoked within a single NAPI poll cycle and thus * under local_bh_disable(), which provides the needed RCU protection * for accessing map entries. 
*/ - u32 act = __bpf_prog_run(prog, xdp, BPF_DISPATCHER_FUNC(xdp)); + u64 act = __bpf_prog_run(prog, xdp, BPF_DISPATCHER_FUNC(xdp)); if (static_branch_unlikely(&bpf_master_redirect_enabled_key)) { if (act == XDP_TX && netif_is_bond_slave(xdp->rxq->dev)) diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c index 59b7eb60d5b4..1721b09d0838 100644 --- a/kernel/bpf/cgroup.c +++ b/kernel/bpf/cgroup.c @@ -63,8 +63,8 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp, return run_ctx.retval; } -unsigned int __cgroup_bpf_run_lsm_sock(const void *ctx, - const struct bpf_insn *insn) +u64 __cgroup_bpf_run_lsm_sock(const void *ctx, + const struct bpf_insn *insn) { const struct bpf_prog *shim_prog; struct sock *sk; @@ -85,8 +85,8 @@ unsigned int __cgroup_bpf_run_lsm_sock(const void *ctx, return ret; } -unsigned int __cgroup_bpf_run_lsm_socket(const void *ctx, - const struct bpf_insn *insn) +u64 __cgroup_bpf_run_lsm_socket(const void *ctx, + const struct bpf_insn *insn) { const struct bpf_prog *shim_prog; struct socket *sock; @@ -107,8 +107,8 @@ unsigned int __cgroup_bpf_run_lsm_socket(const void *ctx, return ret; } -unsigned int __cgroup_bpf_run_lsm_current(const void *ctx, - const struct bpf_insn *insn) +u64 __cgroup_bpf_run_lsm_current(const void *ctx, + const struct bpf_insn *insn) { const struct bpf_prog *shim_prog; struct cgroup *cgrp; diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 805c2ad5c793..a94dbb822f11 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -2039,7 +2039,7 @@ static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn) #define PROG_NAME(stack_size) __bpf_prog_run##stack_size #define DEFINE_BPF_PROG_RUN(stack_size) \ -static unsigned int PROG_NAME(stack_size)(const void *ctx, const struct bpf_insn *insn) \ +static u64 PROG_NAME(stack_size)(const void *ctx, const struct bpf_insn *insn) \ { \ u64 stack[stack_size / sizeof(u64)]; \ u64 regs[MAX_BPF_EXT_REG]; \ @@ -2083,8 +2083,8 @@ EVAL4(DEFINE_BPF_PROG_RUN_ARGS, 416, 448, 480, 512); #define PROG_NAME_LIST(stack_size) PROG_NAME(stack_size), -static unsigned int (*interpreters[])(const void *ctx, - const struct bpf_insn *insn) = { +static u64 (*interpreters[])(const void *ctx, + const struct bpf_insn *insn) = { EVAL6(PROG_NAME_LIST, 32, 64, 96, 128, 160, 192) EVAL6(PROG_NAME_LIST, 224, 256, 288, 320, 352, 384) EVAL4(PROG_NAME_LIST, 416, 448, 480, 512) @@ -2109,8 +2109,8 @@ void bpf_patch_call_args(struct bpf_insn *insn, u32 stack_depth) } #else -static unsigned int __bpf_prog_ret0_warn(const void *ctx, - const struct bpf_insn *insn) +static u64 __bpf_prog_ret0_warn(const void *ctx, + const struct bpf_insn *insn) { /* If this handler ever gets executed, then BPF_JIT_ALWAYS_ON * is not working properly, so warn about it! 
@@ -2245,8 +2245,8 @@ struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err) } EXPORT_SYMBOL_GPL(bpf_prog_select_runtime); -static unsigned int __bpf_prog_ret1(const void *ctx, - const struct bpf_insn *insn) +static u64 __bpf_prog_ret1(const void *ctx, + const struct bpf_insn *insn) { return 1; } diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c index bd09290e3648..fabda7ed5dd0 100644 --- a/kernel/bpf/offload.c +++ b/kernel/bpf/offload.c @@ -246,8 +246,8 @@ static int bpf_prog_offload_translate(struct bpf_prog *prog) return ret; } -static unsigned int bpf_prog_warn_on_exec(const void *ctx, - const struct bpf_insn *insn) +static u64 bpf_prog_warn_on_exec(const void *ctx, + const struct bpf_insn *insn) { WARN(1, "attempt to execute device eBPF program on the host!"); return 0; diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c index 2ca96acbc50a..f05d13717430 100644 --- a/net/bpf/test_run.c +++ b/net/bpf/test_run.c @@ -370,7 +370,7 @@ static int bpf_test_run_xdp_live(struct bpf_prog *prog, struct xdp_buff *ctx, } static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat, - u32 *retval, u32 *time, bool xdp) + u64 *retval, u32 *time, bool xdp) { struct bpf_prog_array_item item = {.prog = prog}; struct bpf_run_ctx *old_ctx; @@ -769,7 +769,7 @@ int bpf_prog_test_run_tracing(struct bpf_prog *prog, struct bpf_fentry_test_t arg = {}; u16 side_effect = 0, ret = 0; int b = 2, err = -EFAULT; - u32 retval = 0; + u64 retval = 0; if (kattr->test.flags || kattr->test.cpu || kattr->test.batch_size) return -EINVAL; @@ -809,7 +809,7 @@ int bpf_prog_test_run_tracing(struct bpf_prog *prog, struct bpf_raw_tp_test_run_info { struct bpf_prog *prog; void *ctx; - u32 retval; + u64 retval; }; static void @@ -1054,15 +1054,15 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr, union bpf_attr __user *uattr) { bool is_l2 = false, is_direct_pkt_access = false; + u32 size = kattr->test.data_size_in, duration; struct net *net = current->nsproxy->net_ns; struct net_device *dev = net->loopback_dev; - u32 size = kattr->test.data_size_in; u32 repeat = kattr->test.repeat; struct __sk_buff *ctx = NULL; - u32 retval, duration; int hh_len = ETH_HLEN; struct sk_buff *skb; struct sock *sk; + u64 retval; void *data; int ret; @@ -1250,15 +1250,16 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr, bool do_live = (kattr->test.flags & BPF_F_TEST_XDP_LIVE_FRAMES); u32 tailroom = SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); u32 batch_size = kattr->test.batch_size; - u32 retval = 0, duration, max_data_sz; u32 size = kattr->test.data_size_in; u32 headroom = XDP_PACKET_HEADROOM; u32 repeat = kattr->test.repeat; struct netdev_rx_queue *rxqueue; struct skb_shared_info *sinfo; + u32 duration, max_data_sz; struct xdp_buff xdp = {}; int i, ret = -EINVAL; struct xdp_md *ctx; + u64 retval = 0; void *data; if (prog->expected_attach_type == BPF_XDP_DEVMAP || @@ -1416,7 +1417,8 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog, struct bpf_flow_keys flow_keys; const struct ethhdr *eth; unsigned int flags = 0; - u32 retval, duration; + u32 duration; + u64 retval; void *data; int ret; @@ -1481,8 +1483,9 @@ int bpf_prog_test_run_sk_lookup(struct bpf_prog *prog, const union bpf_attr *kat struct bpf_sk_lookup_kern ctx = {}; u32 repeat = kattr->test.repeat; struct bpf_sk_lookup *user_ctx; - u32 retval, duration; int ret = -EINVAL; + u32 duration; + u64 retval; if (kattr->test.flags || kattr->test.cpu || kattr->test.batch_size) return -EINVAL; @@ 
-1580,8 +1583,8 @@ int bpf_prog_test_run_syscall(struct bpf_prog *prog, void __user *ctx_in = u64_to_user_ptr(kattr->test.ctx_in); __u32 ctx_size_in = kattr->test.ctx_size_in; void *ctx = NULL; - u32 retval; int err = 0; + u64 retval; /* doesn't support data_in/out, ctx_out, duration, or repeat or flags */ if (kattr->test.data_in || kattr->test.data_out || diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index d08c4728523b..5b91f712d246 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -1444,8 +1444,11 @@ static unsigned int fanout_demux_bpf(struct packet_fanout *f, rcu_read_lock(); prog = rcu_dereference(f->bpf_prog); - if (prog) - ret = bpf_prog_run_clear_cb(prog, skb) % num; + if (prog) { + ret = bpf_prog_run_clear_cb(prog, skb); + /* For some architectures, we need to do modulus in 32-bit width */ + ret %= num; + } rcu_read_unlock(); return ret; From patchwork Wed Jul 13 11:14:12 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= X-Patchwork-Id: 12916570 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 72F74C43334 for ; Wed, 13 Jul 2022 11:14:58 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S236005AbiGMLO4 (ORCPT ); Wed, 13 Jul 2022 07:14:56 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38406 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235068AbiGMLOn (ORCPT ); Wed, 13 Jul 2022 07:14:43 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id D82C5AE553 for ; Wed, 13 Jul 2022 04:14:40 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1657710880; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=oNqiseiBBK1oSRPpUqEqndaBGP0dH/Mii8v0q7JzirA=; b=EMWqsGoqM9xYX0mA/99jmGMsg+TsRIYph9fOdcixtkeyGjOec35ISkSzPv7v3CrtMvs5l/ PLwzyyVPV6DfvjHNcBFF14L+xyWeWCza8ZJG5PuyeTfUd81tMKYrhgm5wE4CqoNEoq7Ho0 OPke7rQOEfjl80MZnustNZsBuPlXkV8= Received: from mail-ed1-f69.google.com (mail-ed1-f69.google.com [209.85.208.69]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-342-EnQkAyrwOaOHhMRwaDLSkg-1; Wed, 13 Jul 2022 07:14:38 -0400 X-MC-Unique: EnQkAyrwOaOHhMRwaDLSkg-1 Received: by mail-ed1-f69.google.com with SMTP id x8-20020a056402414800b0042d8498f50aso8222570eda.23 for ; Wed, 13 Jul 2022 04:14:38 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=oNqiseiBBK1oSRPpUqEqndaBGP0dH/Mii8v0q7JzirA=; b=1zdoHURFqrsEO6XdraadwdyxIhUwedwn+/XpykdT+SGXnAtQqq3/EWsF1ZxYBp4obS uZ4moDgvEyMProocb6d22IrdwWOUQh99N7jisKtmYeuEpc2Qg3jnaRfARC+i9of5BFNi RWJ7IyIgHYjBu25w2lg8VGu1KCGeQQbDBiBfzUZCxzpuR7HH2EzrH9Tfl59cmVmSZG5f FhXzepg0QSbcnOup5oCBOA2TCnYcNsGqMpLTHizbNEC6mhn6DVBMlG4YbFzVdS1bRTTC 
From: Toke Høiland-Jørgensen
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, "David S. Miller", Jakub Kicinski, Jesper Dangaard Brouer, Eric Dumazet, Paolo Abeni
Cc: Kumar Kartikeya Dwivedi, netdev@vger.kernel.org, bpf@vger.kernel.org, Freysteinn Alfredsson, Cong Wang, Toke Høiland-Jørgensen
Subject: [RFC PATCH 04/17] bpf: Add a PIFO priority queue map type
Date: Wed, 13 Jul 2022 13:14:12 +0200
Message-Id: <20220713111430.134810-5-toke@redhat.com>
In-Reply-To: <20220713111430.134810-1-toke@redhat.com>
References: <20220713111430.134810-1-toke@redhat.com>
X-Patchwork-State: RFC

The PIFO (Push-In First-Out) data structure is a priority queue where
entries can be added at any point in the queue, but dequeue is always
in-order. The primary application is packet queueing, but with arbitrary
data it can also be used as a generic priority queue data structure.

This patch implements two variants of the PIFO map. The first is
BPF_MAP_TYPE_PIFO_GENERIC, a generic priority queue that BPF programs can
use via the bpf_map_push_elem() and bpf_map_pop_elem() helpers to insert
and dequeue items. When pushing an item, the lower 60 bits of the helper's
flags argument are interpreted as the priority. The second is
BPF_MAP_TYPE_PIFO_XDP, a priority queue that stores XDP frames, which are
added to the map using the bpf_redirect_map() helper, where the map index
is used as the priority. Frames can be dequeued from a separate dequeue
program type, added in a later commit.

The two variants of the PIFO share most of their implementation. The user
selects the maximum number of entries stored in the map (using the regular
max_entries parameter), as well as the range of valid priorities, where the
latter is expressed in the lower 32 bits of the map_extra parameter; the
range must be a power of two. Each priority can have multiple entries
queued, in which case entries are stored in FIFO order of enqueue. The
implementation uses a tree of word-sized bitmaps as an index of which
buckets contain any items.
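To make the bitmap index concrete, here is a small standalone sketch of the
same idea (illustrative only, with invented names and a fixed two-level
tree; the map in this patch sizes the tree to the configured range and also
clears bits again as buckets drain):

/* Standalone illustration of a two-level "find first set" index over
 * buckets; not the kernel code from this patch.
 */
#include <stdio.h>

#define WORD_BITS ((int)(8 * sizeof(unsigned long)))

/* Level 1 has one bit per bucket; bit i of the single level-0 word is set
 * iff word i of level 1 has any bit set, so this covers
 * WORD_BITS * WORD_BITS buckets.
 */
struct bitmap_index {
	unsigned long l0;
	unsigned long l1[WORD_BITS];
};

static void index_set(struct bitmap_index *idx, unsigned int bucket)
{
	idx->l1[bucket / WORD_BITS] |= 1UL << (bucket % WORD_BITS);
	idx->l0 |= 1UL << (bucket / WORD_BITS);
}

/* Finding the lowest non-empty bucket costs one find-first-set per level. */
static int index_find_first(const struct bitmap_index *idx)
{
	int word, bit;

	if (!idx->l0)
		return -1;
	word = __builtin_ffsl(idx->l0) - 1;
	bit = __builtin_ffsl(idx->l1[word]) - 1;
	return word * WORD_BITS + bit;
}

int main(void)
{
	struct bitmap_index idx = { 0 };

	index_set(&idx, 300);
	index_set(&idx, 77);
	printf("first non-empty bucket: %d\n", index_find_first(&idx)); /* prints 77 */
	return 0;
}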
This allows fast lookups: finding the next item to dequeue requires log_k(N) "find first set" operations, where N is the range of the map (chosen at setup time), and k is the number of bits in the native 'unsigned long' type (so either 32 or 64). Signed-off-by: Toke Høiland-Jørgensen --- include/linux/bpf.h | 13 + include/linux/bpf_types.h | 2 + include/net/xdp.h | 5 +- include/uapi/linux/bpf.h | 13 + kernel/bpf/Makefile | 2 +- kernel/bpf/pifomap.c | 581 +++++++++++++++++++++++++++++++++ kernel/bpf/syscall.c | 2 + kernel/bpf/verifier.c | 10 +- net/core/filter.c | 7 + tools/include/uapi/linux/bpf.h | 13 + 10 files changed, 645 insertions(+), 3 deletions(-) create mode 100644 kernel/bpf/pifomap.c diff --git a/include/linux/bpf.h b/include/linux/bpf.h index ebe6f2d95182..ea994acebb81 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1849,6 +1849,9 @@ int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf, int cpu_map_generic_redirect(struct bpf_cpu_map_entry *rcpu, struct sk_buff *skb); +int pifo_map_enqueue(struct bpf_map *map, struct xdp_frame *xdpf, u32 index); +struct xdp_frame *pifo_map_dequeue(struct bpf_map *map, u64 flags, u64 *rank); + /* Return map's numa specified by userspace */ static inline int bpf_map_attr_numa_node(const union bpf_attr *attr) { @@ -2081,6 +2084,16 @@ static inline int cpu_map_generic_redirect(struct bpf_cpu_map_entry *rcpu, return -EOPNOTSUPP; } +static inline int pifo_map_enqueue(struct bpf_map *map, struct xdp_frame *xdp, u32 index) +{ + return 0; +} + +static inline struct xdp_frame *pifo_map_dequeue(struct bpf_map *map, u64 flags, u64 *rank) +{ + return NULL; +} + static inline struct bpf_prog *bpf_prog_get_type_path(const char *name, enum bpf_prog_type type) { diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h index 2b9112b80171..26ef981a8aa5 100644 --- a/include/linux/bpf_types.h +++ b/include/linux/bpf_types.h @@ -105,11 +105,13 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_HASH_OF_MAPS, htab_of_maps_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_INODE_STORAGE, inode_storage_map_ops) #endif BPF_MAP_TYPE(BPF_MAP_TYPE_TASK_STORAGE, task_storage_map_ops) +BPF_MAP_TYPE(BPF_MAP_TYPE_PIFO_GENERIC, pifo_generic_map_ops) #ifdef CONFIG_NET BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP_HASH, dev_map_hash_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_SK_STORAGE, sk_storage_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops) +BPF_MAP_TYPE(BPF_MAP_TYPE_PIFO_XDP, pifo_xdp_map_ops) #if defined(CONFIG_XDP_SOCKETS) BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops) #endif diff --git a/include/net/xdp.h b/include/net/xdp.h index 04c852c7a77f..7c694fb26f34 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -170,7 +170,10 @@ struct xdp_frame { * while mem info is valid on remote CPU. 
*/ struct xdp_mem_info mem; - struct net_device *dev_rx; /* used by cpumap */ + union { + struct net_device *dev_rx; /* used by cpumap */ + struct xdp_frame *next; /* used by pifomap */ + }; u32 flags; /* supported values defined in xdp_buff_flags */ }; diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index aec623f60048..f0947ddee784 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -909,6 +909,8 @@ enum bpf_map_type { BPF_MAP_TYPE_INODE_STORAGE, BPF_MAP_TYPE_TASK_STORAGE, BPF_MAP_TYPE_BLOOM_FILTER, + BPF_MAP_TYPE_PIFO_GENERIC, + BPF_MAP_TYPE_PIFO_XDP, }; /* Note that tracing related programs such as @@ -1244,6 +1246,13 @@ enum { /* If set, XDP frames will be transmitted after processing */ #define BPF_F_TEST_XDP_LIVE_FRAMES (1U << 1) +/* Flags for BPF_MAP_TYPE_PIFO_* */ + +/* Used for flags argument of bpf_map_push_elem(); reserve top four bits for + * actual flags, the rest is the enqueue priority + */ +#define BPF_PIFO_PRIO_MASK (~0ULL >> 4) + /* type for BPF_ENABLE_STATS */ enum bpf_stats_type { /* enabled run_time_ns and run_cnt */ @@ -1298,6 +1307,10 @@ union bpf_attr { * BPF_MAP_TYPE_BLOOM_FILTER - the lowest 4 bits indicate the * number of hash functions (if 0, the bloom filter will default * to using 5 hash functions). + * + * BPF_MAP_TYPE_PIFO_* - the lower 32 bits indicate the valid + * range of priorities for entries enqueued in the map. Must be + * a power of two. */ __u64 map_extra; }; diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile index 057ba8e01e70..e66b4d0d3135 100644 --- a/kernel/bpf/Makefile +++ b/kernel/bpf/Makefile @@ -7,7 +7,7 @@ endif CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy) obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o link_iter.o -obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o +obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o pifomap.o obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o obj-${CONFIG_BPF_LSM} += bpf_inode_storage.o diff --git a/kernel/bpf/pifomap.c b/kernel/bpf/pifomap.c new file mode 100644 index 000000000000..5040f532e5d8 --- /dev/null +++ b/kernel/bpf/pifomap.c @@ -0,0 +1,581 @@ +// SPDX-License-Identifier: GPL-2.0-only + +/* Pifomaps queue packets + */ +#include +#include +#include +#include +#include +#include +#include + +#define PIFO_CREATE_FLAG_MASK \ + (BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY) + +struct bpf_pifo_element { + struct bpf_pifo_element *next; + char data[]; +}; + +union bpf_pifo_item { + struct bpf_pifo_element elem; + struct xdp_frame frame; +}; + +struct bpf_pifo_element_cache { + u32 free_elems; + struct bpf_pifo_element *elements[]; +}; + +struct bpf_pifo_bucket { + union bpf_pifo_item *head, *tail; + u32 elem_count; +}; + +struct bpf_pifo_queue { + struct bpf_pifo_bucket *buckets; + unsigned long *bitmap; + unsigned long **lvl_bitmap; + u64 min_rank; + u32 range; + u32 levels; +}; + +struct bpf_pifo_map { + struct bpf_map map; + struct bpf_pifo_queue *queue; + unsigned long num_queued; + spinlock_t lock; /* protects enqueue / dequeue */ + + size_t elem_size; + struct bpf_pifo_element_cache *elem_cache; + char elements[] __aligned(8); +}; + +static struct bpf_pifo_element *elem_cache_get(struct bpf_pifo_element_cache *cache) +{ + 
if (unlikely(!cache->free_elems)) + return NULL; + return cache->elements[--cache->free_elems]; +} + +static void elem_cache_put(struct bpf_pifo_element_cache *cache, + struct bpf_pifo_element *elem) +{ + cache->elements[cache->free_elems++] = elem; +} + +static bool pifo_map_is_full(struct bpf_pifo_map *pifo) +{ + return pifo->num_queued >= pifo->map.max_entries; +} + +static void pifo_queue_free(struct bpf_pifo_queue *q) +{ + bpf_map_area_free(q->buckets); + bpf_map_area_free(q->bitmap); + bpf_map_area_free(q->lvl_bitmap); + kfree(q); +} + +static struct bpf_pifo_queue *pifo_queue_alloc(u32 range, int numa_node) +{ + u32 num_longs = 0, offset = 0, i, lvl, levels; + struct bpf_pifo_queue *q; + + levels = __KERNEL_DIV_ROUND_UP(ilog2(range), ilog2(BITS_PER_TYPE(long))); + for (i = 0, lvl = 1; i < levels; i++) { + num_longs += lvl; + lvl *= BITS_PER_TYPE(long); + } + + q = kzalloc(sizeof(*q), GFP_USER | __GFP_ACCOUNT); + if (!q) + return NULL; + q->buckets = bpf_map_area_alloc(sizeof(struct bpf_pifo_bucket) * range, + numa_node); + if (!q->buckets) + goto err; + + q->bitmap = bpf_map_area_alloc(sizeof(unsigned long) * num_longs, + numa_node); + if (!q->bitmap) + goto err; + + q->lvl_bitmap = bpf_map_area_alloc(sizeof(unsigned long *) * levels, + numa_node); + for (i = 0, lvl = 1; i < levels; i++) { + q->lvl_bitmap[i] = &q->bitmap[offset]; + offset += lvl; + lvl *= BITS_PER_TYPE(long); + } + q->levels = levels; + q->range = range; + return q; + +err: + pifo_queue_free(q); + return NULL; +} + +static int pifo_map_init_map(struct bpf_pifo_map *pifo, union bpf_attr *attr, + size_t elem_size, u32 range) +{ + int err = -ENOMEM; + + /* Packet map is special, we don't want BPF writing straight to it + */ + if (attr->map_type != BPF_MAP_TYPE_PIFO_GENERIC) + attr->map_flags |= BPF_F_RDONLY_PROG; + + bpf_map_init_from_attr(&pifo->map, attr); + + pifo->queue = pifo_queue_alloc(range, pifo->map.numa_node); + if (!pifo->queue) + return -ENOMEM; + + if (attr->map_type == BPF_MAP_TYPE_PIFO_GENERIC) { + size_t cache_size; + int i; + + cache_size = sizeof(void *) * attr->max_entries + + sizeof(struct bpf_pifo_element_cache); + pifo->elem_cache = bpf_map_area_alloc(cache_size, + pifo->map.numa_node); + if (!pifo->elem_cache) + goto err_queue; + + for (i = 0; i < attr->max_entries; i++) + pifo->elem_cache->elements[i] = (void *)&pifo->elements[i * elem_size]; + pifo->elem_cache->free_elems = attr->max_entries; + } + + return 0; + +err_queue: + pifo_queue_free(pifo->queue); + return err; +} + +static struct bpf_map *pifo_map_alloc(union bpf_attr *attr) +{ + int numa_node = bpf_map_attr_numa_node(attr); + size_t size, elem_size = 0; + struct bpf_pifo_map *pifo; + u32 range; + int err; + + if (!capable(CAP_NET_ADMIN)) + return ERR_PTR(-EPERM); + + if ((attr->map_type == BPF_MAP_TYPE_PIFO_XDP && attr->value_size != 4) || + attr->key_size != 4 || attr->map_extra & ~0xFFFFFFFFULL || + attr->map_flags & ~PIFO_CREATE_FLAG_MASK) + return ERR_PTR(-EINVAL); + + range = attr->map_extra; + if (!range || !is_power_of_2(range)) + return ERR_PTR(-EINVAL); + + if (attr->map_type == BPF_MAP_TYPE_PIFO_GENERIC) { + elem_size = (attr->value_size + sizeof(struct bpf_pifo_element)); + if (elem_size > U32_MAX / attr->max_entries) + return ERR_PTR(-E2BIG); + } + + size = sizeof(*pifo) + attr->max_entries * elem_size; + pifo = bpf_map_area_alloc(size, numa_node); + if (!pifo) + return ERR_PTR(-ENOMEM); + + err = pifo_map_init_map(pifo, attr, elem_size, range); + if (err) { + bpf_map_area_free(pifo); + return ERR_PTR(err); + } + + 
spin_lock_init(&pifo->lock); + return &pifo->map; +} + +static void pifo_queue_flush(struct bpf_pifo_queue *queue) +{ +#ifdef CONFIG_NET + unsigned long *bitmap = queue->lvl_bitmap[queue->levels - 1]; + int i = 0; + + /* this is only ever called in the RCU callback when freeing the map, so + * no need for locking + */ + while (i < queue->range) { + struct bpf_pifo_bucket *bucket = &queue->buckets[i]; + struct xdp_frame *frame = &bucket->head->frame, *next; + + while (frame) { + next = frame->next; + xdp_return_frame(frame); + frame = next; + } + i = find_next_bit(bitmap, queue->range, i + 1); + } +#endif +} + +static void pifo_map_free(struct bpf_map *map) +{ + struct bpf_pifo_map *pifo = container_of(map, struct bpf_pifo_map, map); + + /* At this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0, + * so the programs (can be more than one that used this map) were + * disconnected from events. The following synchronize_rcu() guarantees + * both rcu read critical sections complete and waits for + * preempt-disable regions (NAPI being the relevant context here) so we + * are certain there will be no further reads against the netdev_map and + * all flush operations are complete. Flush operations can only be done + * from NAPI context for this reason. + */ + + synchronize_rcu(); + + if (map->map_type == BPF_MAP_TYPE_PIFO_XDP) + pifo_queue_flush(pifo->queue); + pifo_queue_free(pifo->queue); + bpf_map_area_free(pifo->elem_cache); + bpf_map_area_free(pifo); +} + +static int pifo_map_get_next_key(struct bpf_map *map, void *key, void *next_key) +{ + struct bpf_pifo_map *pifo = container_of(map, struct bpf_pifo_map, map); + u32 index = key ? *(u32 *)key : U32_MAX, offset; + struct bpf_pifo_queue *queue = pifo->queue; + unsigned long idx, flags; + u32 *next = next_key; + int ret = -ENOENT; + + spin_lock_irqsave(&pifo->lock, flags); + + if (index == U32_MAX || index < queue->min_rank) + offset = 0; + else + offset = index - queue->min_rank + 1; + + if (offset >= queue->range) + goto out; + + idx = find_next_bit(queue->lvl_bitmap[queue->levels - 1], + queue->range, offset); + if (idx == queue->range) + goto out; + + *next = idx; + ret = 0; +out: + spin_unlock_irqrestore(&pifo->lock, flags); + return ret; +} + +static void pifo_set_bit(struct bpf_pifo_queue *queue, u32 rank) +{ + u32 i; + + for (i = queue->levels; i > 0; i--) { + unsigned long *bitmap = queue->lvl_bitmap[i - 1]; + + set_bit(rank, bitmap); + rank /= BITS_PER_TYPE(long); + } +} + +static void pifo_clear_bit(struct bpf_pifo_queue *queue, u32 rank) +{ + u32 i; + + for (i = queue->levels; i > 0; i--) { + unsigned long *bitmap = queue->lvl_bitmap[i - 1]; + + clear_bit(rank, bitmap); + rank /= BITS_PER_TYPE(long); + + // another bit is set in this word, don't clear bit in higher + // level + if (*(bitmap + rank)) + break; + } +} + +static void pifo_item_set_next(union bpf_pifo_item *item, void *next, bool xdp) +{ + if (xdp) + item->frame.next = next; + else + item->elem.next = next; +} + +static int __pifo_map_enqueue(struct bpf_pifo_map *pifo, union bpf_pifo_item *item, + u64 rank, bool xdp) +{ + struct bpf_pifo_queue *queue = pifo->queue; + struct bpf_pifo_bucket *bucket; + u64 q_index; + + lockdep_assert_held(&pifo->lock); + + if (unlikely(pifo_map_is_full(pifo))) + return -EOVERFLOW; + + if (rank < queue->min_rank) + return -ERANGE; + + pifo_item_set_next(item, NULL, xdp); + + q_index = rank - queue->min_rank; + if (unlikely(q_index >= queue->range)) + q_index = queue->range - 1; + + bucket = &queue->buckets[q_index]; + if 
(likely(!bucket->head)) { + bucket->head = item; + bucket->tail = item; + pifo_set_bit(queue, q_index); + } else { + pifo_item_set_next(bucket->tail, item, xdp); + bucket->tail = item; + } + + pifo->num_queued++; + bucket->elem_count++; + return 0; +} + +int pifo_map_enqueue(struct bpf_map *map, struct xdp_frame *xdpf, u32 index) +{ + struct bpf_pifo_map *pifo = container_of(map, struct bpf_pifo_map, map); + int ret; + + /* called under local_bh_disable() so no need to use irqsave variant */ + spin_lock(&pifo->lock); + ret = __pifo_map_enqueue(pifo, (union bpf_pifo_item *)xdpf, index, true); + spin_unlock(&pifo->lock); + + return ret; +} + +static unsigned long pifo_find_first_bucket(struct bpf_pifo_queue *queue) +{ + unsigned long *bitmap, bit = 0, offset = 0; + int i; + + for (i = 0; i < queue->levels; i++) { + bitmap = queue->lvl_bitmap[i] + offset; + if (!*bitmap) + return -1; + bit = __ffs(*bitmap); + offset = offset * BITS_PER_TYPE(long) + bit; + } + return offset; +} + +static union bpf_pifo_item *__pifo_map_dequeue(struct bpf_pifo_map *pifo, + u64 flags, u64 *rank, bool xdp) +{ + struct bpf_pifo_queue *queue = pifo->queue; + struct bpf_pifo_bucket *bucket; + union bpf_pifo_item *item; + unsigned long bucket_idx; + + lockdep_assert_held(&pifo->lock); + + if (flags) { + *rank = -EINVAL; + return NULL; + } + + bucket_idx = pifo_find_first_bucket(queue); + if (bucket_idx == -1) { + *rank = -ENOENT; + return NULL; + } + bucket = &queue->buckets[bucket_idx]; + + if (WARN_ON_ONCE(!bucket->tail)) { + *rank = -EFAULT; + return NULL; + } + + item = bucket->head; + if (xdp) + bucket->head = (union bpf_pifo_item *)item->frame.next; + else + bucket->head = (union bpf_pifo_item *)item->elem.next; + + if (!bucket->head) { + bucket->tail = NULL; + pifo_clear_bit(queue, bucket_idx); + } + pifo->num_queued--; + bucket->elem_count--; + + *rank = bucket_idx + queue->min_rank; + return item; +} + +struct xdp_frame *pifo_map_dequeue(struct bpf_map *map, u64 flags, u64 *rank) +{ + struct bpf_pifo_map *pifo = container_of(map, struct bpf_pifo_map, map); + union bpf_pifo_item *item; + unsigned long lflags; + + spin_lock_irqsave(&pifo->lock, lflags); + item = __pifo_map_dequeue(pifo, flags, rank, true); + spin_unlock_irqrestore(&pifo->lock, lflags); + + return item ? 
&item->frame : NULL; +} + +static void *pifo_map_lookup_elem(struct bpf_map *map, void *key) +{ + struct bpf_pifo_map *pifo = container_of(map, struct bpf_pifo_map, map); + struct bpf_pifo_queue *queue = pifo->queue; + struct bpf_pifo_bucket *bucket; + u32 rank = *(u32 *)key, idx; + + if (rank < queue->min_rank) + return NULL; + + idx = rank - queue->min_rank; + if (idx >= queue->range) + return NULL; + + bucket = &queue->buckets[idx]; + /* FIXME: what happens if this changes while userspace is reading the + * value + */ + return &bucket->elem_count; +} + +static int pifo_map_push_elem(struct bpf_map *map, void *value, u64 flags) +{ + struct bpf_pifo_map *pifo = container_of(map, struct bpf_pifo_map, map); + struct bpf_pifo_element *dst; + unsigned long irq_flags; + u64 prio; + int ret; + + /* Check if any of the actual flag bits are set */ + if (flags & ~BPF_PIFO_PRIO_MASK) + return -EINVAL; + + prio = flags & BPF_PIFO_PRIO_MASK; + + spin_lock_irqsave(&pifo->lock, irq_flags); + + dst = elem_cache_get(pifo->elem_cache); + if (!dst) { + ret = -EOVERFLOW; + goto out; + } + + memcpy(&dst->data, value, pifo->map.value_size); + + ret = __pifo_map_enqueue(pifo, (union bpf_pifo_item *)dst, prio, false); + if (ret) + elem_cache_put(pifo->elem_cache, dst); + +out: + spin_unlock_irqrestore(&pifo->lock, irq_flags); + return ret; +} + +static int pifo_map_pop_elem(struct bpf_map *map, void *value) +{ + struct bpf_pifo_map *pifo = container_of(map, struct bpf_pifo_map, map); + union bpf_pifo_item *item; + unsigned long flags; + int err = 0; + u64 rank; + + spin_lock_irqsave(&pifo->lock, flags); + + item = __pifo_map_dequeue(pifo, 0, &rank, false); + if (!item) { + err = rank; + goto out; + } + + memcpy(value, &item->elem.data, pifo->map.value_size); + elem_cache_put(pifo->elem_cache, &item->elem); + +out: + spin_unlock_irqrestore(&pifo->lock, flags); + return err; +} + +static int pifo_map_update_elem(struct bpf_map *map, void *key, void *value, + u64 map_flags) +{ + return -EINVAL; +} + +static int pifo_map_delete_elem(struct bpf_map *map, void *key) +{ + return -EINVAL; +} + +static int pifo_map_peek_elem(struct bpf_map *map, void *value) +{ + return -EINVAL; +} + +static int pifo_map_redirect(struct bpf_map *map, u64 index, u64 flags) +{ +#ifdef CONFIG_NET + struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); + const u64 action_mask = XDP_ABORTED | XDP_DROP | XDP_PASS | XDP_TX; + + /* Lower bits of the flags are used as return code on lookup failure */ + if (unlikely(flags & ~action_mask)) + return XDP_ABORTED; + + ri->tgt_value = NULL; + ri->tgt_index = index; + ri->map_id = map->id; + ri->map_type = map->map_type; + ri->flags = flags; + WRITE_ONCE(ri->map, map); + return XDP_REDIRECT; +#else + return XDP_ABORTED; +#endif +} + +BTF_ID_LIST_SINGLE(pifo_xdp_map_btf_ids, struct, bpf_pifo_map); +const struct bpf_map_ops pifo_xdp_map_ops = { + .map_meta_equal = bpf_map_meta_equal, + .map_alloc = pifo_map_alloc, + .map_free = pifo_map_free, + .map_get_next_key = pifo_map_get_next_key, + .map_lookup_elem = pifo_map_lookup_elem, + .map_update_elem = pifo_map_update_elem, + .map_delete_elem = pifo_map_delete_elem, + .map_check_btf = map_check_no_btf, + .map_btf_id = &pifo_xdp_map_btf_ids[0], + .map_redirect = pifo_map_redirect, +}; + +BTF_ID_LIST_SINGLE(pifo_generic_map_btf_ids, struct, bpf_pifo_map); +const struct bpf_map_ops pifo_generic_map_ops = { + .map_meta_equal = bpf_map_meta_equal, + .map_alloc = pifo_map_alloc, + .map_free = pifo_map_free, + .map_get_next_key = pifo_map_get_next_key, 
+ .map_lookup_elem = pifo_map_lookup_elem, + .map_update_elem = pifo_map_update_elem, + .map_delete_elem = pifo_map_delete_elem, + .map_push_elem = pifo_map_push_elem, + .map_pop_elem = pifo_map_pop_elem, + .map_peek_elem = pifo_map_peek_elem, + .map_check_btf = map_check_no_btf, + .map_btf_id = &pifo_generic_map_btf_ids[0], +}; diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index ab688d85b2c6..31899882e513 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1066,6 +1066,8 @@ static int map_create(union bpf_attr *attr) } if (attr->map_type != BPF_MAP_TYPE_BLOOM_FILTER && + attr->map_type != BPF_MAP_TYPE_PIFO_XDP && + attr->map_type != BPF_MAP_TYPE_PIFO_GENERIC && attr->map_extra != 0) return -EINVAL; diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 039f7b61c305..489ea3f368a1 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -6249,6 +6249,7 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, break; case BPF_MAP_TYPE_QUEUE: case BPF_MAP_TYPE_STACK: + case BPF_MAP_TYPE_PIFO_GENERIC: if (func_id != BPF_FUNC_map_peek_elem && func_id != BPF_FUNC_map_pop_elem && func_id != BPF_FUNC_map_push_elem) @@ -6274,6 +6275,10 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, func_id != BPF_FUNC_map_push_elem) goto error; break; + case BPF_MAP_TYPE_PIFO_XDP: + if (func_id != BPF_FUNC_redirect_map) + goto error; + break; default: break; } @@ -6318,6 +6323,7 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, if (map->map_type != BPF_MAP_TYPE_DEVMAP && map->map_type != BPF_MAP_TYPE_DEVMAP_HASH && map->map_type != BPF_MAP_TYPE_CPUMAP && + map->map_type != BPF_MAP_TYPE_PIFO_XDP && map->map_type != BPF_MAP_TYPE_XSKMAP) goto error; break; @@ -6346,13 +6352,15 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, break; case BPF_FUNC_map_pop_elem: if (map->map_type != BPF_MAP_TYPE_QUEUE && - map->map_type != BPF_MAP_TYPE_STACK) + map->map_type != BPF_MAP_TYPE_STACK && + map->map_type != BPF_MAP_TYPE_PIFO_GENERIC) goto error; break; case BPF_FUNC_map_peek_elem: case BPF_FUNC_map_push_elem: if (map->map_type != BPF_MAP_TYPE_QUEUE && map->map_type != BPF_MAP_TYPE_STACK && + map->map_type != BPF_MAP_TYPE_PIFO_GENERIC && map->map_type != BPF_MAP_TYPE_BLOOM_FILTER) goto error; break; diff --git a/net/core/filter.c b/net/core/filter.c index e23e53ed1b04..8e6ea17a29db 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -4236,6 +4236,13 @@ static __always_inline int __xdp_do_redirect_frame(struct bpf_redirect_info *ri, err = dev_map_enqueue(fwd, xdpf, dev); } break; + case BPF_MAP_TYPE_PIFO_XDP: + map = READ_ONCE(ri->map); + if (unlikely(!map)) + err = -EINVAL; + else + err = pifo_map_enqueue(map, xdpf, ri->tgt_index); + break; case BPF_MAP_TYPE_CPUMAP: err = cpu_map_enqueue(fwd, xdpf, dev); break; diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 379e68fb866f..623421377f6e 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -909,6 +909,8 @@ enum bpf_map_type { BPF_MAP_TYPE_INODE_STORAGE, BPF_MAP_TYPE_TASK_STORAGE, BPF_MAP_TYPE_BLOOM_FILTER, + BPF_MAP_TYPE_PIFO_GENERIC, + BPF_MAP_TYPE_PIFO_XDP, }; /* Note that tracing related programs such as @@ -1244,6 +1246,13 @@ enum { /* If set, XDP frames will be transmitted after processing */ #define BPF_F_TEST_XDP_LIVE_FRAMES (1U << 1) +/* Flags for BPF_MAP_TYPE_PIFO_* */ + +/* Used for flags argument of bpf_map_push_elem(); reserve top four bits for + * actual flags, the 
rest is the enqueue priority + */ +#define BPF_PIFO_PRIO_MASK (~0ULL >> 4) + /* type for BPF_ENABLE_STATS */ enum bpf_stats_type { /* enabled run_time_ns and run_cnt */ @@ -1298,6 +1307,10 @@ union bpf_attr { * BPF_MAP_TYPE_BLOOM_FILTER - the lowest 4 bits indicate the * number of hash functions (if 0, the bloom filter will default * to using 5 hash functions). + * + * BPF_MAP_TYPE_PIFO_* - the lower 32 bits indicate the valid + * range of priorities for entries enqueued in the map. Must be + * a power of two. */ __u64 map_extra; }; From patchwork Wed Jul 13 11:14:13 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= X-Patchwork-Id: 12916572 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id DE2A6C433EF for ; Wed, 13 Jul 2022 11:15:02 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S236003AbiGMLO7 (ORCPT ); Wed, 13 Jul 2022 07:14:59 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38908 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235974AbiGMLOz (ORCPT ); Wed, 13 Jul 2022 07:14:55 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id 6106BF273D for ; Wed, 13 Jul 2022 04:14:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1657710884; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=5uTN0iYXNQJ821H3571k1jB4mSJSAImIFfOFFCQD7EA=; b=K9HLUn3ztQ9qX1TjXccc5+CExGKfhaFWLijrt07asGmmzU9CwHKAzxH2PMd53webk4EEhs RyaEql17RYJvvdg3Y2uthSyj6sdkBOaYKotZVZayv2ahAweQesFKYXt8qUgyJPJ1IuAkqb GxmLTSJGURRIgjyVAqk97Dmv6YtAazw= Received: from mail-ed1-f71.google.com (mail-ed1-f71.google.com [209.85.208.71]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-587-o_Bn3WeyPpycxCAhvmgjJQ-1; Wed, 13 Jul 2022 07:14:42 -0400 X-MC-Unique: o_Bn3WeyPpycxCAhvmgjJQ-1 Received: by mail-ed1-f71.google.com with SMTP id o13-20020a056402438d00b0043aa846b2d2so8095991edc.8 for ; Wed, 13 Jul 2022 04:14:42 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=5uTN0iYXNQJ821H3571k1jB4mSJSAImIFfOFFCQD7EA=; b=hLYoTXnuhd4noHSiOZvDgK+29eCqqroKwN/kL+s0LCaXzHTt3JX3FyYWU1zvo4J9jq R5xGUBhQXude94kYlMETFF9FwedfWUhbjfhkLMlBa2O0DUR8etxaSDQ+Gk8IoFpLCNKB 810gKFsYcE2C2peHA3RUIagZd3O8xZTeWSF8BEQms9NA2/J5gzkYLcApit3dxuxqzga1 qgfSJfXY313WitF+uDJ/+H1V2ZyyvZqlMKS0QR94/Tad1sARQNQbLVQQGFDEOsw7sD7/ vEYFtVAcJqDMwisDnBEuE1Sxa5v8IkvQpRMhM1XNvavD4x5M6usEjMTE09qq6naCjcqe Rpkw== X-Gm-Message-State: AJIora9D7eGlLS5YR+GX6KKU98Z0JEZzs8d5WYfo0boBmS2cwB/H73zL 23uMRV+fywePQi2GNHlCS7CtuGARrT97GVyjvdPLporJEnn/2p2ns9ACOKAsxxvj+2eYMi1ySKO j+yhuOF724HccFxWj X-Received: by 2002:a17:907:96a4:b0:72b:647e:30fd with SMTP id 
hd36-20020a17090796a400b0072b647e30fdmr2783285ejc.723.1657710879630; Wed, 13 Jul 2022 04:14:39 -0700 (PDT) X-Google-Smtp-Source: AGRyM1s+UMwr1jDtKqGwgGJgq+K7hLtOvFBF7ZuxlDiEr1YBHhUDN+xlRVuVdg/F1JB6u+6X7oIh0g== X-Received: by 2002:a17:907:96a4:b0:72b:647e:30fd with SMTP id hd36-20020a17090796a400b0072b647e30fdmr2783103ejc.723.1657710877318; Wed, 13 Jul 2022 04:14:37 -0700 (PDT) Received: from alrua-x1.borgediget.toke.dk ([2a0c:4d80:42:443::2]) by smtp.gmail.com with ESMTPSA id p20-20020a056402155400b0043a896048basm7787960edx.85.2022.07.13.04.14.35 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 13 Jul 2022 04:14:36 -0700 (PDT) Received: by alrua-x1.borgediget.toke.dk (Postfix, from userid 1000) id 7A6D74D9905; Wed, 13 Jul 2022 13:14:35 +0200 (CEST) From: =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= To: Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau , Song Liu , Yonghong Song , John Fastabend , KP Singh , Stanislav Fomichev , Hao Luo , Jiri Olsa Cc: Kumar Kartikeya Dwivedi , netdev@vger.kernel.org, bpf@vger.kernel.org, Freysteinn Alfredsson , Cong Wang , =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rg?= =?utf-8?q?ensen?= Subject: [RFC PATCH 05/17] pifomap: Add queue rotation for continuously increasing rank mode Date: Wed, 13 Jul 2022 13:14:13 +0200 Message-Id: <20220713111430.134810-6-toke@redhat.com> X-Mailer: git-send-email 2.37.0 In-Reply-To: <20220713111430.134810-1-toke@redhat.com> References: <20220713111430.134810-1-toke@redhat.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC Amend the PIFO map so it can operate in a mode that allows the range to increase continuously. This works by allocating two underlying queues, and queueing entries into the second one if the first one overflows. When the primary queue runs empty, if there are entries in the secondary queue, swap the two queues and shift the operating range of the new secondary queue to be after the (new) primary. This way the queue can support a continuously increasing rank, for instance to index by timestamps. 
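For illustration only (not part of this patch), an XDP program could lean on this mode to schedule packets by a timestamp-derived rank. The sketch below is a rough, hypothetical example: it assumes headers updated with the new BPF_MAP_TYPE_PIFO_XDP type, a map layout along the lines of what this series uses, and a time_base global populated from userspace so the first ranks start near zero; all names and sizes are made up.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical map declaration; map_extra sets the valid rank range (and
 * must be a power of two), max_entries bounds the number of queued packets.
 */
struct {
	__uint(type, BPF_MAP_TYPE_PIFO_XDP);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
	__uint(max_entries, 4096);
	__uint(map_extra, 4096);
} pifo_map SEC(".maps");

__u64 time_base; /* set from userspace so the first ranks start near zero */

SEC("xdp")
int enqueue_by_time(struct xdp_md *ctx)
{
	/* Roughly 1ms rank buckets; as the rank keeps growing past the
	 * configured range, the map rotates its two internal queues.
	 */
	__u64 rank = (bpf_ktime_get_ns() - time_base) >> 20;

	return bpf_redirect_map(&pifo_map, rank, 0);
}

Packets whose rank overflows the primary queue land in the secondary queue; once the primary runs dry the two are swapped and the operating range shifts forward, as described above.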
Signed-off-by: Toke Høiland-Jørgensen --- kernel/bpf/pifomap.c | 96 ++++++++++++++++++++++++++++++++++---------- 1 file changed, 75 insertions(+), 21 deletions(-) diff --git a/kernel/bpf/pifomap.c b/kernel/bpf/pifomap.c index 5040f532e5d8..62633c2c7419 100644 --- a/kernel/bpf/pifomap.c +++ b/kernel/bpf/pifomap.c @@ -6,6 +6,7 @@ #include #include #include +#include #include #include #include @@ -44,7 +45,8 @@ struct bpf_pifo_queue { struct bpf_pifo_map { struct bpf_map map; - struct bpf_pifo_queue *queue; + struct bpf_pifo_queue *q_primary; + struct bpf_pifo_queue *q_secondary; unsigned long num_queued; spinlock_t lock; /* protects enqueue / dequeue */ @@ -71,6 +73,12 @@ static bool pifo_map_is_full(struct bpf_pifo_map *pifo) return pifo->num_queued >= pifo->map.max_entries; } +static bool pifo_queue_is_empty(struct bpf_pifo_queue *queue) +{ + /* first word in bitmap is always the top-level map */ + return !queue->bitmap[0]; +} + static void pifo_queue_free(struct bpf_pifo_queue *q) { bpf_map_area_free(q->buckets); @@ -79,7 +87,7 @@ static void pifo_queue_free(struct bpf_pifo_queue *q) kfree(q); } -static struct bpf_pifo_queue *pifo_queue_alloc(u32 range, int numa_node) +static struct bpf_pifo_queue *pifo_queue_alloc(u32 range, u32 min_rank, int numa_node) { u32 num_longs = 0, offset = 0, i, lvl, levels; struct bpf_pifo_queue *q; @@ -112,6 +120,7 @@ static struct bpf_pifo_queue *pifo_queue_alloc(u32 range, int numa_node) } q->levels = levels; q->range = range; + q->min_rank = min_rank; return q; err: @@ -131,10 +140,14 @@ static int pifo_map_init_map(struct bpf_pifo_map *pifo, union bpf_attr *attr, bpf_map_init_from_attr(&pifo->map, attr); - pifo->queue = pifo_queue_alloc(range, pifo->map.numa_node); - if (!pifo->queue) + pifo->q_primary = pifo_queue_alloc(range, 0, pifo->map.numa_node); + if (!pifo->q_primary) return -ENOMEM; + pifo->q_secondary = pifo_queue_alloc(range, range, pifo->map.numa_node); + if (!pifo->q_secondary) + goto err_queue; + if (attr->map_type == BPF_MAP_TYPE_PIFO_GENERIC) { size_t cache_size; int i; @@ -144,7 +157,7 @@ static int pifo_map_init_map(struct bpf_pifo_map *pifo, union bpf_attr *attr, pifo->elem_cache = bpf_map_area_alloc(cache_size, pifo->map.numa_node); if (!pifo->elem_cache) - goto err_queue; + goto err; for (i = 0; i < attr->max_entries; i++) pifo->elem_cache->elements[i] = (void *)&pifo->elements[i * elem_size]; @@ -153,8 +166,10 @@ static int pifo_map_init_map(struct bpf_pifo_map *pifo, union bpf_attr *attr, return 0; +err: + pifo_queue_free(pifo->q_secondary); err_queue: - pifo_queue_free(pifo->queue); + pifo_queue_free(pifo->q_primary); return err; } @@ -238,9 +253,12 @@ static void pifo_map_free(struct bpf_map *map) synchronize_rcu(); - if (map->map_type == BPF_MAP_TYPE_PIFO_XDP) - pifo_queue_flush(pifo->queue); - pifo_queue_free(pifo->queue); + if (map->map_type == BPF_MAP_TYPE_PIFO_XDP) { + pifo_queue_flush(pifo->q_primary); + pifo_queue_flush(pifo->q_secondary); + } + pifo_queue_free(pifo->q_primary); + pifo_queue_free(pifo->q_secondary); bpf_map_area_free(pifo->elem_cache); bpf_map_area_free(pifo); } @@ -249,7 +267,7 @@ static int pifo_map_get_next_key(struct bpf_map *map, void *key, void *next_key) { struct bpf_pifo_map *pifo = container_of(map, struct bpf_pifo_map, map); u32 index = key ? 
*(u32 *)key : U32_MAX, offset; - struct bpf_pifo_queue *queue = pifo->queue; + struct bpf_pifo_queue *queue = pifo->q_primary; unsigned long idx, flags; u32 *next = next_key; int ret = -ENOENT; @@ -261,15 +279,27 @@ static int pifo_map_get_next_key(struct bpf_map *map, void *key, void *next_key) else offset = index - queue->min_rank + 1; - if (offset >= queue->range) - goto out; + if (offset >= queue->range) { + offset -= queue->range; + queue = pifo->q_secondary; + + if (offset >= queue->range) + goto out; + } +search: idx = find_next_bit(queue->lvl_bitmap[queue->levels - 1], queue->range, offset); - if (idx == queue->range) + if (idx == queue->range) { + if (queue == pifo->q_primary) { + queue = pifo->q_secondary; + offset = 0; + goto search; + } goto out; + } - *next = idx; + *next = idx + queue->min_rank; ret = 0; out: spin_unlock_irqrestore(&pifo->lock, flags); @@ -316,7 +346,7 @@ static void pifo_item_set_next(union bpf_pifo_item *item, void *next, bool xdp) static int __pifo_map_enqueue(struct bpf_pifo_map *pifo, union bpf_pifo_item *item, u64 rank, bool xdp) { - struct bpf_pifo_queue *queue = pifo->queue; + struct bpf_pifo_queue *queue = pifo->q_primary; struct bpf_pifo_bucket *bucket; u64 q_index; @@ -331,8 +361,16 @@ static int __pifo_map_enqueue(struct bpf_pifo_map *pifo, union bpf_pifo_item *it pifo_item_set_next(item, NULL, xdp); q_index = rank - queue->min_rank; - if (unlikely(q_index >= queue->range)) - q_index = queue->range - 1; + if (unlikely(q_index >= queue->range)) { + /* If we overflow the primary queue, enqueue into secondary, and + * if we overflow that enqueue as the last item + */ + q_index -= queue->range; + queue = pifo->q_secondary; + + if (q_index >= queue->range) + q_index = queue->range - 1; + } bucket = &queue->buckets[q_index]; if (likely(!bucket->head)) { @@ -380,7 +418,7 @@ static unsigned long pifo_find_first_bucket(struct bpf_pifo_queue *queue) static union bpf_pifo_item *__pifo_map_dequeue(struct bpf_pifo_map *pifo, u64 flags, u64 *rank, bool xdp) { - struct bpf_pifo_queue *queue = pifo->queue; + struct bpf_pifo_queue *queue = pifo->q_primary; struct bpf_pifo_bucket *bucket; union bpf_pifo_item *item; unsigned long bucket_idx; @@ -392,6 +430,17 @@ static union bpf_pifo_item *__pifo_map_dequeue(struct bpf_pifo_map *pifo, return NULL; } + if (!pifo->num_queued) { + *rank = -ENOENT; + return NULL; + } + + if (unlikely(pifo_queue_is_empty(queue))) { + swap(pifo->q_primary, pifo->q_secondary); + pifo->q_secondary->min_rank = pifo->q_primary->min_rank + pifo->q_primary->range; + queue = pifo->q_primary; + } + bucket_idx = pifo_find_first_bucket(queue); if (bucket_idx == -1) { *rank = -ENOENT; @@ -437,7 +486,7 @@ struct xdp_frame *pifo_map_dequeue(struct bpf_map *map, u64 flags, u64 *rank) static void *pifo_map_lookup_elem(struct bpf_map *map, void *key) { struct bpf_pifo_map *pifo = container_of(map, struct bpf_pifo_map, map); - struct bpf_pifo_queue *queue = pifo->queue; + struct bpf_pifo_queue *queue = pifo->q_primary; struct bpf_pifo_bucket *bucket; u32 rank = *(u32 *)key, idx; @@ -445,8 +494,13 @@ static void *pifo_map_lookup_elem(struct bpf_map *map, void *key) return NULL; idx = rank - queue->min_rank; - if (idx >= queue->range) - return NULL; + if (idx >= queue->range) { + idx -= queue->range; + queue = pifo->q_secondary; + + if (idx >= queue->range) + return NULL; + } bucket = &queue->buckets[idx]; /* FIXME: what happens if this changes while userspace is reading the From patchwork Wed Jul 13 11:14:14 2022 Content-Type: text/plain; charset="utf-8" 
MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= X-Patchwork-Id: 12916571 X-Patchwork-Delegate: bpf@iogearbox.net Received: by 

alrua-x1.borgediget.toke.dk (Postfix, from userid 1000) id 3A14F4D9907; Wed, 13 Jul 2022 13:14:36 +0200 (CEST) From: =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= To: Alexei Starovoitov , Daniel Borkmann , John Fastabend , Andrii Nakryiko , Martin KaFai Lau , Song Liu , Yonghong Song , KP Singh , Stanislav Fomichev , Hao Luo , Jiri Olsa , "David S. Miller" , Jakub Kicinski , Jesper Dangaard Brouer , Eric Dumazet , Paolo Abeni Cc: Kumar Kartikeya Dwivedi , netdev@vger.kernel.org, bpf@vger.kernel.org, Freysteinn Alfredsson , Cong Wang , =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rg?= =?utf-8?q?ensen?= Subject: [RFC PATCH 06/17] xdp: Add dequeue program type for getting packets from a PIFO Date: Wed, 13 Jul 2022 13:14:14 +0200 Message-Id: <20220713111430.134810-7-toke@redhat.com> X-Mailer: git-send-email 2.37.0 In-Reply-To: <20220713111430.134810-1-toke@redhat.com> References: <20220713111430.134810-1-toke@redhat.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC Add a new BPF_PROG_TYPE_DEQUEUE, which will be executed by a new device hook to retrieve queued packets for transmission. The API of the dequeue program is simple: it takes a context object containing as its sole member the ifindex of the device it is being executed on. The program can return a pointer to a packet, or NULL to indicate it has nothing to transmit at this time. Packet pointers are obtained by dequeueing them from a PIFO map (using a helper added in a subsequent commit). This commit adds dequeue program type and the ability to run it using the bpf_prog_run() syscall (returning the dequeued packet to userspace); a subsequent commit introduces the network stack hook to attach and execute dequeue programs. 
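To make the shape of the new program type concrete, a minimal dequeue program could look like the sketch below. The SEC("dequeue") section name and the pointer return type are assumptions about the libbpf plumbing added later in this series, and since the PIFO dequeue helper only appears in a subsequent patch, this sketch always reports that it has nothing to transmit.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("dequeue")
void *dequeue_nothing(struct dequeue_ctx *ctx)
{
	/* The context exposes only the egress ifindex; a real scheduler
	 * would use it to decide which queue(s) to serve.
	 */
	if (!ctx->egress_ifindex)
		return NULL;

	/* The helper for pulling packets out of a PIFO map is added in a
	 * later patch, so this sketch always reports "nothing to send".
	 */
	return NULL;
}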
Signed-off-by: Toke Høiland-Jørgensen --- include/linux/bpf.h | 9 ++++++ include/linux/bpf_types.h | 2 ++ include/net/xdp.h | 4 +++ include/uapi/linux/bpf.h | 5 ++++ kernel/bpf/syscall.c | 1 + net/bpf/test_run.c | 33 +++++++++++++++++++++ net/core/filter.c | 53 ++++++++++++++++++++++++++++++++++ tools/include/uapi/linux/bpf.h | 5 ++++ 8 files changed, 112 insertions(+) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index ea994acebb81..6ea5d6d188cf 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1864,6 +1864,8 @@ int array_map_alloc_check(union bpf_attr *attr); int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr, union bpf_attr __user *uattr); +int bpf_prog_test_run_dequeue(struct bpf_prog *prog, const union bpf_attr *kattr, + union bpf_attr __user *uattr); int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr, union bpf_attr __user *uattr); int bpf_prog_test_run_tracing(struct bpf_prog *prog, @@ -2107,6 +2109,13 @@ static inline int bpf_prog_test_run_xdp(struct bpf_prog *prog, return -ENOTSUPP; } +static inline int bpf_prog_test_run_dequeue(struct bpf_prog *prog, + const union bpf_attr *kattr, + union bpf_attr __user *uattr) +{ + return -ENOTSUPP; +} + static inline int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr, union bpf_attr __user *uattr) diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h index 26ef981a8aa5..e6bc962befb7 100644 --- a/include/linux/bpf_types.h +++ b/include/linux/bpf_types.h @@ -10,6 +10,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_SCHED_ACT, tc_cls_act, struct __sk_buff, struct sk_buff) BPF_PROG_TYPE(BPF_PROG_TYPE_XDP, xdp, struct xdp_md, struct xdp_buff) +BPF_PROG_TYPE(BPF_PROG_TYPE_DEQUEUE, dequeue, + struct dequeue_ctx, struct dequeue_data) #ifdef CONFIG_CGROUP_BPF BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SKB, cg_skb, struct __sk_buff, struct sk_buff) diff --git a/include/net/xdp.h b/include/net/xdp.h index 7c694fb26f34..728ce943d352 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -85,6 +85,10 @@ struct xdp_buff { u32 flags; /* supported values defined in xdp_buff_flags */ }; +struct dequeue_data { + struct xdp_txq_info *txq; +}; + static __always_inline bool xdp_buff_has_frags(struct xdp_buff *xdp) { return !!(xdp->flags & XDP_FLAGS_HAS_FRAGS); diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index f0947ddee784..974fb5882305 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -954,6 +954,7 @@ enum bpf_prog_type { BPF_PROG_TYPE_LSM, BPF_PROG_TYPE_SK_LOOKUP, BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */ + BPF_PROG_TYPE_DEQUEUE, }; enum bpf_attach_type { @@ -5961,6 +5962,10 @@ struct xdp_md { __u32 egress_ifindex; /* txq->dev->ifindex */ }; +struct dequeue_ctx { + __u32 egress_ifindex; +}; + /* DEVMAP map-value layout * * The struct data-layout of map-value is a configuration interface. 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 31899882e513..c4af9119b68a 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -2370,6 +2370,7 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type, default: return -EINVAL; } + case BPF_PROG_TYPE_DEQUEUE: case BPF_PROG_TYPE_SYSCALL: case BPF_PROG_TYPE_EXT: if (expected_attach_type) diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c index f05d13717430..a7f479a19fe0 100644 --- a/net/bpf/test_run.c +++ b/net/bpf/test_run.c @@ -1390,6 +1390,39 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr, return ret; } +int bpf_prog_test_run_dequeue(struct bpf_prog *prog, const union bpf_attr *kattr, + union bpf_attr __user *uattr) +{ + struct xdp_txq_info txq = { .dev = current->nsproxy->net_ns->loopback_dev }; + u32 repeat = kattr->test.repeat, duration, size; + struct dequeue_data ctx = { .txq = &txq }; + struct xdp_buff xdp = {}; + struct xdp_frame *pkt; + int ret = -EINVAL; + u64 retval; + + if (prog->expected_attach_type) + return -EINVAL; + + if (kattr->test.data_in || kattr->test.data_size_in || + kattr->test.ctx_in || kattr->test.ctx_out || repeat > 1) + return -EINVAL; + + ret = bpf_test_run(prog, &ctx, repeat, &retval, &duration, false); + if (ret) + return ret; + if (!retval) + return bpf_test_finish(kattr, uattr, NULL, NULL, 0, retval, duration); + + pkt = (void *)(unsigned long)retval; + xdp_convert_frame_to_buff(pkt, &xdp); + size = xdp.data_end - xdp.data_meta; + /* We set retval == 1 if pkt != NULL, otherwise 0 */ + ret = bpf_test_finish(kattr, uattr, xdp.data_meta, NULL, size, !!retval, duration); + xdp_return_frame(pkt); + return ret; +} + static int verify_user_bpf_flow_keys(struct bpf_flow_keys *ctx) { /* make sure the fields we don't use are zeroed */ diff --git a/net/core/filter.c b/net/core/filter.c index 8e6ea17a29db..30bd3a6aedab 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -8062,6 +8062,12 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) } } +static const struct bpf_func_proto * +dequeue_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) +{ + return bpf_base_func_proto(func_id); +} + const struct bpf_func_proto bpf_sock_map_update_proto __weak; const struct bpf_func_proto bpf_sock_hash_update_proto __weak; @@ -8776,6 +8782,20 @@ void bpf_warn_invalid_xdp_action(struct net_device *dev, struct bpf_prog *prog, } EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action); +static bool dequeue_is_valid_access(int off, int size, + enum bpf_access_type type, + const struct bpf_prog *prog, + struct bpf_insn_access_aux *info) +{ + if (type == BPF_WRITE) + return false; + switch (off) { + case offsetof(struct dequeue_ctx, egress_ifindex): + return true; + } + return false; +} + static bool sock_addr_is_valid_access(int off, int size, enum bpf_access_type type, const struct bpf_prog *prog, @@ -9835,6 +9855,28 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type, return insn - insn_buf; } +static u32 dequeue_convert_ctx_access(enum bpf_access_type type, + const struct bpf_insn *si, + struct bpf_insn *insn_buf, + struct bpf_prog *prog, u32 *target_size) +{ + struct bpf_insn *insn = insn_buf; + + switch (si->off) { + case offsetof(struct dequeue_ctx, egress_ifindex): + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct dequeue_data, txq), + si->dst_reg, si->src_reg, + offsetof(struct dequeue_data, txq)); + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_txq_info, dev), + si->dst_reg, si->dst_reg, + offsetof(struct xdp_txq_info, dev)); 
+ *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg, + offsetof(struct net_device, ifindex)); + break; + } + return insn - insn_buf; +} + /* SOCK_ADDR_LOAD_NESTED_FIELD() loads Nested Field S.F.NF where S is type of * context Structure, F is Field in context structure that contains a pointer * to Nested Structure of type NS that has the field NF. @@ -10687,6 +10729,17 @@ const struct bpf_prog_ops xdp_prog_ops = { .test_run = bpf_prog_test_run_xdp, }; +const struct bpf_verifier_ops dequeue_verifier_ops = { + .get_func_proto = dequeue_func_proto, + .is_valid_access = dequeue_is_valid_access, + .convert_ctx_access = dequeue_convert_ctx_access, + .gen_prologue = bpf_noop_prologue, +}; + +const struct bpf_prog_ops dequeue_prog_ops = { + .test_run = bpf_prog_test_run_dequeue, +}; + const struct bpf_verifier_ops cg_skb_verifier_ops = { .get_func_proto = cg_skb_func_proto, .is_valid_access = cg_skb_is_valid_access, diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 623421377f6e..4dd8a563f85d 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -954,6 +954,7 @@ enum bpf_prog_type { BPF_PROG_TYPE_LSM, BPF_PROG_TYPE_SK_LOOKUP, BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */ + BPF_PROG_TYPE_DEQUEUE, }; enum bpf_attach_type { @@ -5961,6 +5962,10 @@ struct xdp_md { __u32 egress_ifindex; /* txq->dev->ifindex */ }; +struct dequeue_ctx { + __u32 egress_ifindex; +}; + /* DEVMAP map-value layout * * The struct data-layout of map-value is a configuration interface. From patchwork Wed Jul 13 11:14:15 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= X-Patchwork-Id: 12916575 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 09999C433EF for ; Wed, 13 Jul 2022 11:15:21 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S236255AbiGMLPS (ORCPT ); Wed, 13 Jul 2022 07:15:18 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38916 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235969AbiGMLO6 (ORCPT ); Wed, 13 Jul 2022 07:14:58 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id 253F0101480 for ; Wed, 13 Jul 2022 04:14:48 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1657710887; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=wIG1zNsh8lsCTnIOSJjvN9N6MeUmyuIluq3oahpmRcc=; b=E4eoDZB+gGk87B3q+QVsuPIW7NAOt0X+I7gN/CaRRgQ54WkLf35RGmlHiAmFqc1sBlWKln E0KXR5ylveW2m0h2YnNL7TtZGZQ6wQOzJ4U29SCNYoNhjQPQF8MxwWK8rp1wFDZvx0fB7T 4HWI7NqQsKjxDr4xrkzhjq+YvytH5p4= Received: from mail-ed1-f69.google.com (mail-ed1-f69.google.com [209.85.208.69]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-78-v2mW3qc0MgqrtWIWj0bwVQ-1; Wed, 13 Jul 2022 07:14:43 -0400 X-MC-Unique: v2mW3qc0MgqrtWIWj0bwVQ-1 Received: by 
mail-ed1-f69.google.com with SMTP id f13-20020a0564021e8d00b00437a2acb543so8084931edf.7 for ; Wed, 13 Jul 2022 04:14:43 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=wIG1zNsh8lsCTnIOSJjvN9N6MeUmyuIluq3oahpmRcc=; b=mh7mK8Pz7vbYXgCWmL7X+s4oV2kiI8TgEGSMkSRtnr8f85ZHq7BLPySA6/sGQf8his /Bf4xRsPpDZ46w6uTHM44v3b5aNeBWbmzGMr7S4xJpzTzHnPU1Zsb0Ix4vSS4qWYIuVv YrheHa3Ca3gKxz3hSHpYvN5Xc9/NpLR7rtFPEEziP7IdxX+wtbvDZ7fj4IrwOeWVF/0B Q0FFOVDxmPF8Vmr4YYNq7ENDRKqg6U+HotcyQWkfzMOSdEw4orvb0LZCvn25OpGfGuuV B/4+ZlJYgsLgxNVRZFaa/5ExSNpp714zlnj86hSG1W4eGlW5iTxE8+1hXcqPqOa848ZW Gm7w== X-Gm-Message-State: AJIora/ccevXU5zUtdZrzILQA975pegK5SSuSAXVwZNoNvInqw9WAiFY DbqR+qe8B4FY42bbey/X4bnZSuIVu5JdplufB/Xx86R1LjW/bSyRXtNJ3nY6VMAmcb8yvfZxzjX MVD8SsKLVEdhmp9/G X-Received: by 2002:a17:907:28c9:b0:72b:7165:20c2 with SMTP id en9-20020a17090728c900b0072b716520c2mr2891714ejc.120.1657710881128; Wed, 13 Jul 2022 04:14:41 -0700 (PDT) X-Google-Smtp-Source: AGRyM1s16+3AyfkzBbcXq34X0y++PLKuDiMja5ZSlf2hxVY8NWSB4DRX/1or5m4FigZlda0Vp75obg== X-Received: by 2002:a17:907:28c9:b0:72b:7165:20c2 with SMTP id en9-20020a17090728c900b0072b716520c2mr2891611ejc.120.1657710879788; Wed, 13 Jul 2022 04:14:39 -0700 (PDT) Received: from alrua-x1.borgediget.toke.dk ([2a0c:4d80:42:443::2]) by smtp.gmail.com with ESMTPSA id hy10-20020a1709068a6a00b00704757b1debsm4839880ejc.9.2022.07.13.04.14.36 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 13 Jul 2022 04:14:36 -0700 (PDT) Received: by alrua-x1.borgediget.toke.dk (Postfix, from userid 1000) id 710C94D9909; Wed, 13 Jul 2022 13:14:36 +0200 (CEST) From: =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= To: Alexei Starovoitov , Daniel Borkmann , John Fastabend , Andrii Nakryiko , Martin KaFai Lau , Song Liu , Yonghong Song , KP Singh , Stanislav Fomichev , Hao Luo , Jiri Olsa , "David S. Miller" , Jakub Kicinski , Jesper Dangaard Brouer Cc: Kumar Kartikeya Dwivedi , netdev@vger.kernel.org, bpf@vger.kernel.org, Freysteinn Alfredsson , Cong Wang , =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rg?= =?utf-8?q?ensen?= Subject: [RFC PATCH 07/17] bpf: Teach the verifier about referenced packets returned from dequeue programs Date: Wed, 13 Jul 2022 13:14:15 +0200 Message-Id: <20220713111430.134810-8-toke@redhat.com> X-Mailer: git-send-email 2.37.0 In-Reply-To: <20220713111430.134810-1-toke@redhat.com> References: <20220713111430.134810-1-toke@redhat.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC From: Kumar Kartikeya Dwivedi The usecase is to allow returning a dequeued packet, or NULL directly from the BPF program. Shift the check_reference_leak call after check_return_code, since the return is reference release (the reference is transferred to the caller of the BPF program), hence a reference leak check before check_return_code would always fail verification. 
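For illustration, the kind of dequeue program this change enables looks roughly like the sketch below; bpf_packet_dequeue()/bpf_packet_drop() are only added in the next patch, and the map declaration and rank threshold are hypothetical. The acquired reference is either released through bpf_packet_drop() or handed back to the kernel by returning the pointer, which is exactly the exit state the reordered checks now accept.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical PIFO map; layout assumptions as in the earlier sketch */
struct {
	__uint(type, BPF_MAP_TYPE_PIFO_XDP);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
	__uint(max_entries, 4096);
	__uint(map_extra, 4096);
} pifo_map SEC(".maps");

SEC("dequeue")
void *dequeue_one(struct dequeue_ctx *ctx)
{
	__u64 rank = 0;
	void *pkt;

	pkt = (void *)bpf_packet_dequeue(ctx, &pifo_map, 0, &rank);
	if (!pkt)
		return NULL;		/* nothing queued, no reference held */

	if (rank > 1000) {		/* hypothetical drop policy */
		bpf_packet_drop(ctx, pkt);	/* reference released here */
		return NULL;
	}

	/* Reference transferred to the caller by returning the pointer */
	return pkt;
}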
Signed-off-by: Kumar Kartikeya Dwivedi Signed-off-by: Toke Høiland-Jørgensen --- kernel/bpf/verifier.c | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 489ea3f368a1..e3662460a095 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -10421,6 +10421,9 @@ static int check_ld_abs(struct bpf_verifier_env *env, struct bpf_insn *insn) return 0; } +BTF_ID_LIST(dequeue_btf_ids) +BTF_ID(struct, xdp_md) + static int check_return_code(struct bpf_verifier_env *env) { struct tnum enforce_attach_type_range = tnum_unknown; @@ -10554,6 +10557,17 @@ static int check_return_code(struct bpf_verifier_env *env) } break; + case BPF_PROG_TYPE_DEQUEUE: + if (register_is_null(reg)) + return 0; + if ((reg->type == PTR_TO_BTF_ID || reg->type == PTR_TO_BTF_ID_OR_NULL) && + reg->btf == btf_vmlinux && reg->btf_id == dequeue_btf_ids[0] && + reg->ref_obj_id != 0) + return release_reference(env, reg->ref_obj_id); + verbose(env, "At program exit the register R0 must be NULL or referenced %s%s\n", + reg_type_str(env, PTR_TO_BTF_ID), + kernel_type_name(btf_vmlinux, dequeue_btf_ids[0])); + return -EINVAL; case BPF_PROG_TYPE_EXT: /* freplace program can return anything as its return value * depends on the to-be-replaced kernel func or bpf program. @@ -12339,11 +12353,11 @@ static int do_check(struct bpf_verifier_env *env) continue; } - err = check_reference_leak(env); + err = check_return_code(env); if (err) return err; - err = check_return_code(env); + err = check_reference_leak(env); if (err) return err; process_bpf_exit: From patchwork Wed Jul 13 11:14:16 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= X-Patchwork-Id: 12916576 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0160BC43334 for ; Wed, 13 Jul 2022 11:15:23 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235969AbiGMLPT (ORCPT ); Wed, 13 Jul 2022 07:15:19 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38982 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235968AbiGMLO6 (ORCPT ); Wed, 13 Jul 2022 07:14:58 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id BD6F6F5116 for ; Wed, 13 Jul 2022 04:14:48 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1657710887; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=wuR8FXFo94ahyoYDajY486Whyu+YAqAq0MCiFtSIImw=; b=i5RfDA6C7l1ZFfb3IqN164f+1mwYhLmpmCrxQVx9v3uiU5NLrIUIW4bmNayjk8fTMQMTh3 bTtNPbzA8NVbZ68Jylwn7/U99u3mI93Vg1mBukAS74uJzLciifu3CPNx3airNRLLjGMUrT 9s+0W2+FG2I9N2dLqT2avgf2FT332HA= Received: from mail-ed1-f72.google.com (mail-ed1-f72.google.com [209.85.208.72]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-85-4cjoD0X0Mu6qbaljJRCrUw-1; Wed, 13 Jul 2022 07:14:43 -0400 
X-MC-Unique: 4cjoD0X0Mu6qbaljJRCrUw-1 Received: by mail-ed1-f72.google.com with SMTP id c9-20020a05640227c900b0043ad14b1fa0so6003658ede.1 for ; Wed, 13 Jul 2022 04:14:43 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=wuR8FXFo94ahyoYDajY486Whyu+YAqAq0MCiFtSIImw=; b=N5aqQcuLZmmoXmmZ2NpWPNWOI3BVcZANpvbcEuQl8O3XphbJPjW4rYSYphZ3NojJ9e bqWYk6QW/WQKu9i/GUb7m3DlTtkmk5ay0CC3jbHAG2TvlT4TRXHOBSkbteu1SmH2jlrW F+6R4pERhh0PAZfqYEQn4reAeHsB6SA6+jU8pR2oTnTLi31iX/B6hKpbC53cQbeoe1/3 99hTFDNe1ofKStaYm7DoaFKt8/kemrO0R1yrG3W+yEB1ojmtyvSCMAKo+8ZxOgJzBtC6 61AyfHgGedWw5sJvJQhYWR7uH/nzxn07Ph4ID7gyJZQS3wnV8peL9S5D2qELH6xcinLf 7MtQ== X-Gm-Message-State: AJIora8Yj68daDlVukc1iWAHiqfX5EsMXf2Men2McwaqimXs4P2iAGoU LiS4c3DMwe7fSHhb1qSbRJ6L6LV4kbxeLX5j1jIOzIzjByXGIaqSiOZ41czY0EtB2kFM6rHQ+kA PWDKIO160++SDIPrY X-Received: by 2002:a05:6402:1bda:b0:43a:55d7:9f2f with SMTP id ch26-20020a0564021bda00b0043a55d79f2fmr4005449edb.360.1657710881961; Wed, 13 Jul 2022 04:14:41 -0700 (PDT) X-Google-Smtp-Source: AGRyM1vGpIBjNSgB/AtD/xtNF900ucY1jjJYDN6RHNmEioUCGKOiES+fu5fHCZFnL6rPgI98Qa+g/A== X-Received: by 2002:a05:6402:1bda:b0:43a:55d7:9f2f with SMTP id ch26-20020a0564021bda00b0043a55d79f2fmr4005396edb.360.1657710881587; Wed, 13 Jul 2022 04:14:41 -0700 (PDT) Received: from alrua-x1.borgediget.toke.dk ([45.145.92.2]) by smtp.gmail.com with ESMTPSA id j20-20020a170906411400b00726e108b566sm4872706ejk.173.2022.07.13.04.14.37 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 13 Jul 2022 04:14:39 -0700 (PDT) Received: by alrua-x1.borgediget.toke.dk (Postfix, from userid 1000) id E35B34D990B; Wed, 13 Jul 2022 13:14:36 +0200 (CEST) From: =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= To: Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau , Song Liu , Yonghong Song , John Fastabend , KP Singh , Stanislav Fomichev , Hao Luo , Jiri Olsa , "David S. Miller" , Jakub Kicinski , Jesper Dangaard Brouer Cc: Kumar Kartikeya Dwivedi , netdev@vger.kernel.org, bpf@vger.kernel.org, Freysteinn Alfredsson , Cong Wang , =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rg?= =?utf-8?q?ensen?= , Eric Dumazet , Paolo Abeni Subject: [RFC PATCH 08/17] bpf: Add helpers to dequeue from a PIFO map Date: Wed, 13 Jul 2022 13:14:16 +0200 Message-Id: <20220713111430.134810-9-toke@redhat.com> X-Mailer: git-send-email 2.37.0 In-Reply-To: <20220713111430.134810-1-toke@redhat.com> References: <20220713111430.134810-1-toke@redhat.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC This adds a new helper to dequeue a packet from a PIFO map, bpf_packet_dequeue(). The helper returns a refcounted pointer to the packet dequeued from the map; the reference must be released either by dropping the packet (using bpf_packet_drop()), or by returning it to the caller. Signed-off-by: Toke Høiland-Jørgensen --- include/uapi/linux/bpf.h | 19 +++++++++++++++ kernel/bpf/verifier.c | 13 +++++++--- net/core/filter.c | 43 +++++++++++++++++++++++++++++++++- tools/include/uapi/linux/bpf.h | 19 +++++++++++++++ 4 files changed, 90 insertions(+), 4 deletions(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 974fb5882305..d44382644391 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -5341,6 +5341,23 @@ union bpf_attr { * **-EACCES** if the SYN cookie is not valid. 
* * **-EPROTONOSUPPORT** if CONFIG_IPV6 is not builtin. + * + * long bpf_packet_dequeue(void *ctx, struct bpf_map *map, u64 flags, u64 *rank) + * Description + * Dequeue the packet at the head of the PIFO in *map* and return a pointer + * to the packet (or NULL if the PIFO is empty). + * Return + * On success, a pointer to the packet, or NULL if the PIFO is empty. The + * packet pointer must be freed using *bpf_packet_drop()* or returning + * the packet pointer. The *rank* pointer will be set to the rank of + * the dequeued packet on success, or a negative error code on error. + * + * long bpf_packet_drop(void *ctx, void *pkt) + * Description + * Drop *pkt*, which must be a reference previously returned by + * *bpf_packet_dequeue()* (and checked to not be NULL). + * Return + * This always succeeds and returns zero. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -5551,6 +5568,8 @@ union bpf_attr { FN(tcp_raw_gen_syncookie_ipv6), \ FN(tcp_raw_check_syncookie_ipv4), \ FN(tcp_raw_check_syncookie_ipv6), \ + FN(packet_dequeue), \ + FN(packet_drop), \ /* */ /* integer value in 'imm' field of BPF_CALL instruction selects which helper diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index e3662460a095..68f98d76bc78 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -483,7 +483,8 @@ static bool may_be_acquire_function(enum bpf_func_id func_id) func_id == BPF_FUNC_sk_lookup_udp || func_id == BPF_FUNC_skc_lookup_tcp || func_id == BPF_FUNC_map_lookup_elem || - func_id == BPF_FUNC_ringbuf_reserve; + func_id == BPF_FUNC_ringbuf_reserve || + func_id == BPF_FUNC_packet_dequeue; } static bool is_acquire_function(enum bpf_func_id func_id, @@ -495,7 +496,8 @@ static bool is_acquire_function(enum bpf_func_id func_id, func_id == BPF_FUNC_sk_lookup_udp || func_id == BPF_FUNC_skc_lookup_tcp || func_id == BPF_FUNC_ringbuf_reserve || - func_id == BPF_FUNC_kptr_xchg) + func_id == BPF_FUNC_kptr_xchg || + func_id == BPF_FUNC_packet_dequeue) return true; if (func_id == BPF_FUNC_map_lookup_elem && @@ -6276,7 +6278,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, goto error; break; case BPF_MAP_TYPE_PIFO_XDP: - if (func_id != BPF_FUNC_redirect_map) + if (func_id != BPF_FUNC_redirect_map && + func_id != BPF_FUNC_packet_dequeue) goto error; break; default: @@ -6385,6 +6388,10 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE) goto error; break; + case BPF_FUNC_packet_dequeue: + if (map->map_type != BPF_MAP_TYPE_PIFO_XDP) + goto error; + break; default: break; } diff --git a/net/core/filter.c b/net/core/filter.c index 30bd3a6aedab..893b75515859 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -4430,6 +4430,40 @@ static const struct bpf_func_proto bpf_xdp_redirect_map_proto = { .arg3_type = ARG_ANYTHING, }; +BTF_ID_LIST_SINGLE(xdp_md_btf_ids, struct, xdp_md) + +BPF_CALL_4(bpf_packet_dequeue, struct dequeue_data *, ctx, struct bpf_map *, map, + u64, flags, u64 *, rank) +{ + return (unsigned long)pifo_map_dequeue(map, flags, rank); +} + +static const struct bpf_func_proto bpf_packet_dequeue_proto = { + .func = bpf_packet_dequeue, + .gpl_only = false, + .ret_type = RET_PTR_TO_BTF_ID_OR_NULL, + .ret_btf_id = xdp_md_btf_ids, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_CONST_MAP_PTR, + .arg3_type = ARG_ANYTHING, + .arg4_type = ARG_PTR_TO_LONG, +}; + +BPF_CALL_2(bpf_packet_drop, struct dequeue_data *, ctx, struct xdp_frame *, pkt) +{ + xdp_return_frame(pkt); + return 0; +} + +static const 
struct bpf_func_proto bpf_packet_drop_proto = { + .func = bpf_packet_drop, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_PTR_TO_BTF_ID | OBJ_RELEASE, + .arg2_btf_id = xdp_md_btf_ids, +}; + static unsigned long bpf_skb_copy(void *dst_buff, const void *skb, unsigned long off, unsigned long len) { @@ -8065,7 +8099,14 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) static const struct bpf_func_proto * dequeue_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) { - return bpf_base_func_proto(func_id); + switch (func_id) { + case BPF_FUNC_packet_dequeue: + return &bpf_packet_dequeue_proto; + case BPF_FUNC_packet_drop: + return &bpf_packet_drop_proto; + default: + return bpf_base_func_proto(func_id); + } } const struct bpf_func_proto bpf_sock_map_update_proto __weak; diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 4dd8a563f85d..1dab68a89e18 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -5341,6 +5341,23 @@ union bpf_attr { * **-EACCES** if the SYN cookie is not valid. * * **-EPROTONOSUPPORT** if CONFIG_IPV6 is not builtin. + * + * long bpf_packet_dequeue(void *ctx, struct bpf_map *map, u64 flags, u64 *rank) + * Description + * Dequeue the packet at the head of the PIFO in *map* and return a pointer + * to the packet (or NULL if the PIFO is empty). + * Return + * On success, a pointer to the packet, or NULL if the PIFO is empty. The + * packet pointer must be freed using *bpf_packet_drop()* or returning + * the packet pointer. The *rank* pointer will be set to the rank of + * the dequeued packet on success, or a negative error code on error. + * + * long bpf_packet_drop(void *ctx, void *pkt) + * Description + * Drop *pkt*, which must be a reference previously returned by + * *bpf_packet_dequeue()* (and checked to not be NULL). + * Return + * This always succeeds and returns zero. 
*/ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -5551,6 +5568,8 @@ union bpf_attr { FN(tcp_raw_gen_syncookie_ipv6), \ FN(tcp_raw_check_syncookie_ipv4), \ FN(tcp_raw_check_syncookie_ipv6), \ + FN(packet_dequeue), \ + FN(packet_drop), \ /* */ /* integer value in 'imm' field of BPF_CALL instruction selects which helper From patchwork Wed Jul 13 11:14:17 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= X-Patchwork-Id: 12916584 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id CF92BC43334 for ; Wed, 13 Jul 2022 11:18:38 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S236112AbiGMLSe (ORCPT ); Wed, 13 Jul 2022 07:18:34 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:42852 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S236095AbiGMLSc (ORCPT ); Wed, 13 Jul 2022 07:18:32 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id 7B533100CEE for ; Wed, 13 Jul 2022 04:18:30 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1657711109; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=m9jsaBAn6wtBTsWqUwNNR/1YQePbJoUZIk07ulDT2Uw=; b=YnyGIMYx1Q5u7omYXZfXVxvXkw7Bn0clBeasN2lbcbwcpOi7vHn5ESdLVfZqQdIneOqMi5 LzDpo8NXUpJ6e4DMljYDjDFwIYw0FifTXa1T6LhXDlQeYXZIoS33fKJo3VMK6gankZIKx0 Yu+wQXb1chcaT5YVka64zICqBQ9D07Y= Received: from mail-ed1-f70.google.com (mail-ed1-f70.google.com [209.85.208.70]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-484-b6sfjqDmOa2iJ7rl0yoXRA-1; Wed, 13 Jul 2022 07:18:28 -0400 X-MC-Unique: b6sfjqDmOa2iJ7rl0yoXRA-1 Received: by mail-ed1-f70.google.com with SMTP id o13-20020a056402438d00b0043aa846b2d2so8101506edc.8 for ; Wed, 13 Jul 2022 04:18:28 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=m9jsaBAn6wtBTsWqUwNNR/1YQePbJoUZIk07ulDT2Uw=; b=gAibdA1HT7YtsWoMAeCKD5+Ywy0Wch6nnT1H8iRyNjOO7dgLSR+x76pQM84QKQaQiN s2TWHP6f70m85Dp/UDqvb+rFJNVTCMIMalyAOOzT8dzUdDX7QLEsAxX3Bi72rYmnLwSv 44t3fVOS/BNRPSt5DdAQsmsrTceSw+8l+WkwaRmCm0FKM26hawfFbjSdogH3arU3Mu4w 2uZzY/wMXC1gzw/VrWT51nsEVQ6hPZ0y29YhWd8z2r2790PboydTcgf4/1X052Lthytl ufDi9byAngqOyMV1d1dDuD7GhHu7q5MIP4OjANqQocjut84uok8WkEBYt7viL97ysV+n yrXw== X-Gm-Message-State: AJIora94THZWV3iD+gMGZDZunAxcgmmDdqrjOLnAHUELwWZofSAanwO6 XkMO3OWqOqUvrrqayjjWbrqopncAhTnUZqnM3Vgh4NE0rCTtk8F6acnNpU2sAVT2u5Oh++H6rEH Zt5zLuICTSbnRjhz3 X-Received: by 2002:a17:906:149:b0:712:c9:7981 with SMTP id 9-20020a170906014900b0071200c97981mr2887994ejh.218.1657711105359; Wed, 13 Jul 2022 04:18:25 -0700 (PDT) X-Google-Smtp-Source: AGRyM1te/ajeNjxiFGlAqADULantLtM8t2h0ppxDuqVUdOm+9mP2gmmnpL+ygeIfuvRq4o5c+z1fyA== X-Received: by 2002:a17:906:149:b0:712:c9:7981 with SMTP id 
9-20020a170906014900b0071200c97981mr2887900ejh.218.1657711104296; Wed, 13 Jul 2022 04:18:24 -0700 (PDT) Received: from alrua-x1.borgediget.toke.dk ([2a0c:4d80:42:443::2]) by smtp.gmail.com with ESMTPSA id fd9-20020a1709072a0900b006fed062c68esm4807773ejc.182.2022.07.13.04.18.22 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 13 Jul 2022 04:18:22 -0700 (PDT) Received: by alrua-x1.borgediget.toke.dk (Postfix, from userid 1000) id 3E4794D990D; Wed, 13 Jul 2022 13:14:37 +0200 (CEST) From: =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= To: Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau , Song Liu , Yonghong Song , John Fastabend , KP Singh , Stanislav Fomichev , Hao Luo , Jiri Olsa , "David S. Miller" , Jakub Kicinski , Jesper Dangaard Brouer Cc: Kumar Kartikeya Dwivedi , netdev@vger.kernel.org, bpf@vger.kernel.org, Freysteinn Alfredsson , Cong Wang , =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rg?= =?utf-8?q?ensen?= Subject: [RFC PATCH 09/17] bpf: Introduce pkt_uid member for PTR_TO_PACKET Date: Wed, 13 Jul 2022 13:14:17 +0200 Message-Id: <20220713111430.134810-10-toke@redhat.com> X-Mailer: git-send-email 2.37.0 In-Reply-To: <20220713111430.134810-1-toke@redhat.com> References: <20220713111430.134810-1-toke@redhat.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC From: Kumar Kartikeya Dwivedi Add a new member in PTR_TO_PACKET specific register state, namely pkt_uid. This is used to classify packet pointers into different sets, and the invariant is that any pkt pointers not belonging to the same set, i.e. not sharing same pkt_uid, won't be allowed for comparison with each other. During range propagation in __find_good_pkt_pointers, we now need to take care to skip packet pointers with a different pkt_uid. This change is necessary so that we can dequeue multiple XDP frames in a single program, obtain packet pointers using their xdp_md fake struct, and prevent confusion wrt comparison of packet pointers pointing into different frames. Attaching a pkt_uid to the PTR_TO_PACKET type prevents these, and also allows user to see which frame a packet pointer belongs to in the verbose verifier log (by matching pkt_uid and ref_obj_id of the referenced xdp_md obtained from bpf_packet_dequeue). regsafe is updated to match non-zero pkt_uid using the idmap to ensure it rejects distinct pkt_uid pkt pointers. We also replace memset of reg->raw to set range to 0. In commit 0962590e5533 ("bpf: fix partial copy of map_ptr when dst is scalar"), the copying was changed to use raw so that all possible members of type specific register state are copied, since at that point the type of register is not known. But inside the reg_is_pkt_pointer block, there is no need to memset the whole 'raw' struct, since we also have a pkt_uid member that we now want to preserve after copying from one register to another, for pkt pointers. A test for this case has been included to prevent regressions. 
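As an example of what this prevents, consider the sketch below; it relies on the dequeue helpers from this series and on the later patches that expose packet pointers through the returned xdp_md, and the map and program names are made up. A program holding two dequeued packets now gets distinct pkt_uid values for their respective packet pointers, so cross-frame pointer comparisons are rejected.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical PIFO map as in the earlier sketches */
struct {
	__uint(type, BPF_MAP_TYPE_PIFO_XDP);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
	__uint(max_entries, 4096);
	__uint(map_extra, 4096);
} pifo_map SEC(".maps");

SEC("dequeue")
void *dequeue_two(struct dequeue_ctx *ctx)
{
	__u64 rank1 = 0, rank2 = 0;
	struct xdp_md *p1, *p2;

	p1 = (void *)bpf_packet_dequeue(ctx, &pifo_map, 0, &rank1);
	if (!p1)
		return NULL;

	p2 = (void *)bpf_packet_dequeue(ctx, &pifo_map, 0, &rank2);
	if (!p2)
		return p1;

	/* Packet pointers derived from p1 and p2 carry different pkt_uid
	 * values (matching the ref_obj_id of the xdp_md they came from),
	 * so a comparison such as
	 *
	 *	if (p1->data_end > p2->data)
	 *		...
	 *
	 * is rejected with "pkt pointer comparison prohibited" instead of
	 * propagating range information from one frame to the other.
	 */
	bpf_packet_drop(ctx, p2);
	return p1;
}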
Signed-off-by: Kumar Kartikeya Dwivedi Signed-off-by: Toke Høiland-Jørgensen --- include/linux/bpf_verifier.h | 8 ++++- kernel/bpf/verifier.c | 59 +++++++++++++++++++++++++++--------- 2 files changed, 52 insertions(+), 15 deletions(-) diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index 2e3bad8640dc..93b69dbf3d19 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -50,7 +50,13 @@ struct bpf_reg_state { s32 off; union { /* valid when type == PTR_TO_PACKET */ - int range; + struct { + int range; + /* To distinguish packet pointers backed by different + * packets, to prevent pkt pointer comparisons. + */ + u32 pkt_uid; + }; /* valid when type == CONST_PTR_TO_MAP | PTR_TO_MAP_VALUE | * PTR_TO_MAP_VALUE_OR_NULL diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 68f98d76bc78..f319e9392587 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -431,6 +431,12 @@ static bool type_is_pkt_pointer(enum bpf_reg_type type) type == PTR_TO_PACKET_META; } +static bool type_is_pkt_pointer_any(enum bpf_reg_type type) +{ + return type_is_pkt_pointer(type) || + type == PTR_TO_PACKET_END; +} + static bool type_is_sk_pointer(enum bpf_reg_type type) { return type == PTR_TO_SOCKET || @@ -861,6 +867,8 @@ static void print_verifier_state(struct bpf_verifier_env *env, verbose_a("off=%d", reg->off); if (type_is_pkt_pointer(t)) verbose_a("r=%d", reg->range); + if (type_is_pkt_pointer_any(t) && reg->pkt_uid) + verbose_a("pkt_uid=%d", reg->pkt_uid); else if (base_type(t) == CONST_PTR_TO_MAP || base_type(t) == PTR_TO_MAP_KEY || base_type(t) == PTR_TO_MAP_VALUE) @@ -1394,8 +1402,7 @@ static bool reg_is_pkt_pointer(const struct bpf_reg_state *reg) static bool reg_is_pkt_pointer_any(const struct bpf_reg_state *reg) { - return reg_is_pkt_pointer(reg) || - reg->type == PTR_TO_PACKET_END; + return type_is_pkt_pointer_any(reg->type); } /* Unmodified PTR_TO_PACKET[_META,_END] register from ctx access. 
*/ @@ -6575,14 +6582,17 @@ static void release_reg_references(struct bpf_verifier_env *env, struct bpf_reg_state *regs = state->regs, *reg; int i; - for (i = 0; i < MAX_BPF_REG; i++) - if (regs[i].ref_obj_id == ref_obj_id) + for (i = 0; i < MAX_BPF_REG; i++) { + if (regs[i].ref_obj_id == ref_obj_id || + (reg_is_pkt_pointer_any(®s[i]) && regs[i].pkt_uid == ref_obj_id)) mark_reg_unknown(env, regs, i); + } bpf_for_each_spilled_reg(i, state, reg) { if (!reg) continue; - if (reg->ref_obj_id == ref_obj_id) + if (reg->ref_obj_id == ref_obj_id || + (reg_is_pkt_pointer_any(reg) && reg->pkt_uid == ref_obj_id)) __mark_reg_unknown(env, reg); } } @@ -8200,7 +8210,7 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env, if (reg_is_pkt_pointer(ptr_reg)) { dst_reg->id = ++env->id_gen; /* something was added to pkt_ptr, set range to zero */ - memset(&dst_reg->raw, 0, sizeof(dst_reg->raw)); + dst_reg->range = 0; } break; case BPF_SUB: @@ -8260,7 +8270,7 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env, dst_reg->id = ++env->id_gen; /* something was added to pkt_ptr, set range to zero */ if (smin_val < 0) - memset(&dst_reg->raw, 0, sizeof(dst_reg->raw)); + dst_reg->range = 0; } break; case BPF_AND: @@ -9287,7 +9297,8 @@ static void __find_good_pkt_pointers(struct bpf_func_state *state, for (i = 0; i < MAX_BPF_REG; i++) { reg = &state->regs[i]; - if (reg->type == type && reg->id == dst_reg->id) + if (reg->type == type && reg->id == dst_reg->id && + reg->pkt_uid == dst_reg->pkt_uid) /* keep the maximum range already checked */ reg->range = max(reg->range, new_range); } @@ -9295,7 +9306,8 @@ static void __find_good_pkt_pointers(struct bpf_func_state *state, bpf_for_each_spilled_reg(i, state, reg) { if (!reg) continue; - if (reg->type == type && reg->id == dst_reg->id) + if (reg->type == type && reg->id == dst_reg->id && + reg->pkt_uid == dst_reg->pkt_uid) reg->range = max(reg->range, new_range); } } @@ -9910,6 +9922,14 @@ static void mark_ptr_or_null_regs(struct bpf_verifier_state *vstate, u32 regno, __mark_ptr_or_null_regs(vstate->frame[i], id, is_null); } +static bool is_bad_pkt_comparison(const struct bpf_reg_state *dst_reg, + const struct bpf_reg_state *src_reg) +{ + if (!reg_is_pkt_pointer_any(dst_reg) || !reg_is_pkt_pointer_any(src_reg)) + return false; + return dst_reg->pkt_uid != src_reg->pkt_uid; +} + static bool try_match_pkt_pointers(const struct bpf_insn *insn, struct bpf_reg_state *dst_reg, struct bpf_reg_state *src_reg, @@ -9923,6 +9943,9 @@ static bool try_match_pkt_pointers(const struct bpf_insn *insn, if (BPF_CLASS(insn->code) == BPF_JMP32) return false; + if (is_bad_pkt_comparison(dst_reg, src_reg)) + return false; + switch (BPF_OP(insn->code)) { case BPF_JGT: if ((dst_reg->type == PTR_TO_PACKET && @@ -10220,11 +10243,17 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env, mark_ptr_or_null_regs(other_branch, insn->dst_reg, opcode == BPF_JEQ); } else if (!try_match_pkt_pointers(insn, dst_reg, ®s[insn->src_reg], - this_branch, other_branch) && - is_pointer_value(env, insn->dst_reg)) { - verbose(env, "R%d pointer comparison prohibited\n", - insn->dst_reg); - return -EACCES; + this_branch, other_branch)) { + if (is_pointer_value(env, insn->dst_reg)) { + verbose(env, "R%d pointer comparison prohibited\n", + insn->dst_reg); + return -EACCES; + } + if (is_bad_pkt_comparison(dst_reg, ®s[insn->src_reg])) { + verbose(env, "R%d, R%d pkt pointer comparison prohibited\n", + insn->dst_reg, insn->src_reg); + return -EACCES; + } } if (env->log.level & BPF_LOG_LEVEL) 
print_insn_state(env, this_branch->frame[this_branch->curframe]); @@ -11514,6 +11543,8 @@ static bool regsafe(struct bpf_verifier_env *env, struct bpf_reg_state *rold, /* id relations must be preserved */ if (rold->id && !check_ids(rold->id, rcur->id, idmap)) return false; + if (rold->pkt_uid && !check_ids(rold->pkt_uid, rcur->pkt_uid, idmap)) + return false; /* new val must satisfy old val knowledge */ return range_within(rold, rcur) && tnum_in(rold->var_off, rcur->var_off); From patchwork Wed Jul 13 11:14:18 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= X-Patchwork-Id: 12916582 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 08E08C43334 for ; Wed, 13 Jul 2022 11:18:34 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S236097AbiGMLSc (ORCPT ); Wed, 13 Jul 2022 07:18:32 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:42700 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235161AbiGMLS2 (ORCPT ); Wed, 13 Jul 2022 07:18:28 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id 6C013100CF0 for ; Wed, 13 Jul 2022 04:18:27 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1657711106; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=FxD0L2l12FH/a0DfiV9dWaAAQqWE+U6eSaGfcg0Nk74=; b=HgNnjXR6k2aVlD+FEm3gkbnmkmR6w6VghQq6U6F5gj8T63ulrA6rS512oOaycDcdGFzOyt UoRecI7jHf+XmJPsksug6Ss8uFZqg4OV1ywmjFZsm7vIlHF8+KV5LZRvsjc+VZdENCmWbY nXoL9QOvF/1xihiY9Vqyv18PrHzcVBc= Received: from mail-ed1-f71.google.com (mail-ed1-f71.google.com [209.85.208.71]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-32-s9spLwBCPkm9zU6HpFxhTw-1; Wed, 13 Jul 2022 07:18:25 -0400 X-MC-Unique: s9spLwBCPkm9zU6HpFxhTw-1 Received: by mail-ed1-f71.google.com with SMTP id z5-20020a05640235c500b0043ae18edeeeso4514739edc.5 for ; Wed, 13 Jul 2022 04:18:25 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=FxD0L2l12FH/a0DfiV9dWaAAQqWE+U6eSaGfcg0Nk74=; b=VUVc3D67a8r3zp2Rws7c2tmSOpoLX0eHZ9H94kpgK8FDm0UdbGAyQ+dwaiDve+FjSU wfhPGs1+lV27d6Q3VwM1Ie87uZvobPa7ulRk7UFzzTcmCXbh5VTKRm+ABs1Mth2+drzj A5hF4L8ecvxHcNR+OW9ahetCNYzvCSn96a/ufdy/a6RL0/VDHkA1YGyNCN9JmPSPBUY+ 9W5iHLZRfqDf4yuvp3GMdVo6G571/lbjhnKU5SxPIOXDZMhqT14lJGqL2A2sv7Os4Jte LiXuzUXDMOrY/2LZ1Gda0yq3zUEGLWdJmRCIvJTa0fLNf18sFEdH7xkVCX858kLzkDIy Fa2A== X-Gm-Message-State: AJIora+Ys97TTjOLyLqM6JJAsKoRjkcH9UO6pOnOAJx6vJENZ92VwpeN Fb+81COyKmtxd4xo/RCiQRA4etmB6FjUoAiNj52kecaZ+kmYvD6/I7oKRe0LWyAtwacsyGhWmlF VwFrjOj6ocXpY+84u X-Received: by 2002:a17:906:8448:b0:72b:5659:9873 with SMTP id e8-20020a170906844800b0072b56599873mr2932620ejy.117.1657711103803; Wed, 13 Jul 2022 04:18:23 -0700 (PDT) 
X-Google-Smtp-Source: AGRyM1s3NAhQCtaD2tuuQb7KOxl0OQCSd+4E01o5/hEQ2yTHvV//3Nd2VaTRKAyDHWZZSxHLCskNsw== X-Received: by 2002:a17:906:8448:b0:72b:5659:9873 with SMTP id e8-20020a170906844800b0072b56599873mr2932584ejy.117.1657711103429; Wed, 13 Jul 2022 04:18:23 -0700 (PDT) Received: from alrua-x1.borgediget.toke.dk ([45.145.92.2]) by smtp.gmail.com with ESMTPSA id ks6-20020a170906f84600b0072ae8fb13e6sm4808330ejb.126.2022.07.13.04.18.22 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 13 Jul 2022 04:18:22 -0700 (PDT) Received: by alrua-x1.borgediget.toke.dk (Postfix, from userid 1000) id BBC984D990F; Wed, 13 Jul 2022 13:14:37 +0200 (CEST) From: =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= To: Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau , Song Liu , Yonghong Song , John Fastabend , KP Singh , Stanislav Fomichev , Hao Luo , Jiri Olsa , "David S. Miller" , Jakub Kicinski , Jesper Dangaard Brouer Cc: Kumar Kartikeya Dwivedi , netdev@vger.kernel.org, bpf@vger.kernel.org, Freysteinn Alfredsson , Cong Wang , =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rg?= =?utf-8?q?ensen?= , Eric Dumazet , Paolo Abeni Subject: [RFC PATCH 10/17] bpf: Implement direct packet access in dequeue progs Date: Wed, 13 Jul 2022 13:14:18 +0200 Message-Id: <20220713111430.134810-11-toke@redhat.com> X-Mailer: git-send-email 2.37.0 In-Reply-To: <20220713111430.134810-1-toke@redhat.com> References: <20220713111430.134810-1-toke@redhat.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC From: Kumar Kartikeya Dwivedi Allow users to obtain packet pointers from the dequeued xdp_md BTF pointer, by allowing convert_ctx_access implementation for PTR_TO_BTF_ID, and then tagging loads as packet pointers in verifier context. Previously, convert_ctx_access was limited to just PTR_TO_CTX, but now it will also be used to translate access into PTR_TO_BTF_ID of xdp_md obtained from bpf_packet_dequeue, so it works like the xdp_md ctx in XDP programs. We must also remember that while xdp_buff backs ctx in XDP programs, xdp_frame backs xdp_md in dequeue programs. Next, we use pkt_uid support and transfer ref_obj_id on loads of the data, data_end, and data_meta fields, to make the verifier aware of the provenance of these packet pointers, so that comparisons can be rejected for unsafe cases. In the end, users can reuse their code meant for the XDP ctx in dequeue programs as well, and don't have to do things differently. Once packet pointers are obtained, regular verifier logic kicks in where pointers from the same xdp_frame can be compared to modify the range and perform access into the packet.
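
As a sketch of what this enables for program authors (reusing the assumed PIFO map, ctx type and helper signatures from the sketch under the previous patch), a dequeue program can now parse the dequeued frame with the same bounds-check pattern normally used against the XDP ctx:

#include <linux/if_ether.h>

SEC("dequeue")
void *deq_parse_eth(void *ctx)
{
	void *data, *data_end;
	struct xdp_md *pkt;
	struct ethhdr *eth;
	__u64 prio = 0;

	pkt = (void *)bpf_packet_dequeue(ctx, &pifo_map, 0, &prio);
	if (!pkt)
		return NULL;

	/* These loads are rewritten through the new convert_ctx_access
	 * path to read from the underlying xdp_frame, and the resulting
	 * packet pointers inherit pkt's pkt_uid.
	 */
	data = (void *)(long)pkt->data;
	data_end = (void *)(long)pkt->data_end;

	eth = data;
	if (data + sizeof(*eth) > data_end) {
		/* Too short for an Ethernet header: drop the frame */
		bpf_packet_drop(ctx, pkt);
		return NULL;
	}

	/* eth->h_proto etc. can be read here, as in an XDP program */
	return pkt; /* hand the frame back for transmission */
}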
Signed-off-by: Kumar Kartikeya Dwivedi Signed-off-by: Toke Høiland-Jørgensen --- include/linux/bpf.h | 26 +++++-- include/linux/bpf_verifier.h | 6 ++ kernel/bpf/verifier.c | 48 +++++++++--- net/core/filter.c | 143 +++++++++++++++++++++++++++++++++++ 4 files changed, 206 insertions(+), 17 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 6ea5d6d188cf..a568ddc1f1ea 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -653,6 +653,12 @@ struct bpf_prog_ops { union bpf_attr __user *uattr); }; +typedef u32 (*bpf_convert_ctx_access_t)(enum bpf_access_type type, + const struct bpf_insn *src, + struct bpf_insn *dst, + struct bpf_prog *prog, + u32 *target_size); + struct bpf_verifier_ops { /* return eBPF function prototype for verification */ const struct bpf_func_proto * @@ -678,6 +684,9 @@ struct bpf_verifier_ops { const struct btf_type *t, int off, int size, enum bpf_access_type atype, u32 *next_btf_id, enum bpf_type_flag *flag); + bpf_convert_ctx_access_t (*get_convert_ctx_access)(struct bpf_verifier_log *log, + const struct btf *btf, + u32 btf_id); }; struct bpf_prog_offload_ops { @@ -1360,11 +1369,6 @@ const struct bpf_func_proto *bpf_get_trace_vprintk_proto(void); typedef unsigned long (*bpf_ctx_copy_t)(void *dst, const void *src, unsigned long off, unsigned long len); -typedef u32 (*bpf_convert_ctx_access_t)(enum bpf_access_type type, - const struct bpf_insn *src, - struct bpf_insn *dst, - struct bpf_prog *prog, - u32 *target_size); u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size, void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy); @@ -2180,6 +2184,18 @@ static inline bool unprivileged_ebpf_enabled(void) return false; } +static inline struct btf *bpf_get_btf_vmlinux(void) +{ + return ERR_PTR(-EINVAL); +} + +static inline int btf_struct_access(struct bpf_verifier_log *log, const struct btf *btf, + const struct btf_type *t, int off, int size, + enum bpf_access_type atype __maybe_unused, + u32 *next_btf_id, enum bpf_type_flag *flag) +{ + return -EINVAL; +} #endif /* CONFIG_BPF_SYSCALL */ void __bpf_free_used_btfs(struct bpf_prog_aux *aux, diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index 93b69dbf3d19..640f92fece12 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -532,8 +532,14 @@ __printf(2, 0) void bpf_verifier_vlog(struct bpf_verifier_log *log, const char *fmt, va_list args); __printf(2, 3) void bpf_verifier_log_write(struct bpf_verifier_env *env, const char *fmt, ...); +#ifdef CONFIG_BPF_SYSCALL __printf(2, 3) void bpf_log(struct bpf_verifier_log *log, const char *fmt, ...); +#else +static inline void bpf_log(struct bpf_verifier_log *log, const char *fmt, ...) 
+{ +} +#endif static inline struct bpf_func_state *cur_func(struct bpf_verifier_env *env) { diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index f319e9392587..7edc2b834d9b 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1707,7 +1707,7 @@ static void mark_reg_not_init(struct bpf_verifier_env *env, static void mark_btf_ld_reg(struct bpf_verifier_env *env, struct bpf_reg_state *regs, u32 regno, enum bpf_reg_type reg_type, - struct btf *btf, u32 btf_id, + struct btf *btf, u32 reg_id, enum bpf_type_flag flag) { if (reg_type == SCALAR_VALUE) { @@ -1715,9 +1715,14 @@ static void mark_btf_ld_reg(struct bpf_verifier_env *env, return; } mark_reg_known_zero(env, regs, regno); - regs[regno].type = PTR_TO_BTF_ID | flag; + regs[regno].type = (int)reg_type | flag; + if (type_is_pkt_pointer_any(reg_type)) { + regs[regno].pkt_uid = reg_id; + return; + } + WARN_ON_ONCE(base_type(reg_type) != PTR_TO_BTF_ID); regs[regno].btf = btf; - regs[regno].btf_id = btf_id; + regs[regno].btf_id = reg_id; } #define DEF_NOT_SUBREG (0) @@ -4479,13 +4484,14 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env, struct bpf_reg_state *regs, int regno, int off, int size, enum bpf_access_type atype, - int value_regno) + int value_regno, int insn_idx) { struct bpf_reg_state *reg = regs + regno; const struct btf_type *t = btf_type_by_id(reg->btf, reg->btf_id); const char *tname = btf_name_by_offset(reg->btf, t->name_off); + struct bpf_insn_aux_data *aux = &env->insn_aux_data[insn_idx]; enum bpf_type_flag flag = 0; - u32 btf_id; + u32 reg_id; int ret; if (off < 0) { @@ -4520,7 +4526,7 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env, if (env->ops->btf_struct_access) { ret = env->ops->btf_struct_access(&env->log, reg->btf, t, - off, size, atype, &btf_id, &flag); + off, size, atype, ®_id, &flag); } else { if (atype != BPF_READ) { verbose(env, "only read is supported\n"); @@ -4528,7 +4534,7 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env, } ret = btf_struct_access(&env->log, reg->btf, t, off, size, - atype, &btf_id, &flag); + atype, ®_id, &flag); } if (ret < 0) @@ -4540,8 +4546,19 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env, if (type_flag(reg->type) & PTR_UNTRUSTED) flag |= PTR_UNTRUSTED; - if (atype == BPF_READ && value_regno >= 0) - mark_btf_ld_reg(env, regs, value_regno, ret, reg->btf, btf_id, flag); + /* Remember the BTF ID for later use in convert_ctx_accesses */ + aux->btf_var.btf_id = reg->btf_id; + aux->btf_var.btf = reg->btf; + + if (atype == BPF_READ && value_regno >= 0) { + /* For pkt pointers, reg_id is set to pkt_uid, which must be the + * ref_obj_id of the referenced register from which they are + * obtained, denoting different packets e.g. in dequeue progs. 
+ */ + if (type_is_pkt_pointer_any(ret)) + reg_id = reg->ref_obj_id; + mark_btf_ld_reg(env, regs, value_regno, ret, reg->btf, reg_id, flag); + } return 0; } @@ -4896,7 +4913,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn } else if (base_type(reg->type) == PTR_TO_BTF_ID && !type_may_be_null(reg->type)) { err = check_ptr_to_btf_access(env, regs, regno, off, size, t, - value_regno); + value_regno, insn_idx); } else if (reg->type == CONST_PTR_TO_MAP) { err = check_ptr_to_map_access(env, regs, regno, off, size, t, value_regno); @@ -13515,8 +13532,15 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env) case PTR_TO_BTF_ID: case PTR_TO_BTF_ID | PTR_UNTRUSTED: if (type == BPF_READ) { - insn->code = BPF_LDX | BPF_PROBE_MEM | - BPF_SIZE((insn)->code); + if (env->ops->get_convert_ctx_access) { + struct btf *btf = env->insn_aux_data[i + delta].btf_var.btf; + u32 btf_id = env->insn_aux_data[i + delta].btf_var.btf_id; + + convert_ctx_access = env->ops->get_convert_ctx_access(&env->log, btf, btf_id); + if (convert_ctx_access) + break; + } + insn->code = BPF_LDX | BPF_PROBE_MEM | BPF_SIZE((insn)->code); env->prog->aux->num_exentries++; } else if (resolve_prog_type(env->prog) != BPF_PROG_TYPE_STRUCT_OPS) { verbose(env, "Writes through BTF pointers are not allowed\n"); diff --git a/net/core/filter.c b/net/core/filter.c index 893b75515859..6a4881739e9b 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -79,6 +79,7 @@ #include #include #include +#include static const struct bpf_func_proto * bpf_sk_base_func_proto(enum bpf_func_id func_id); @@ -9918,6 +9919,146 @@ static u32 dequeue_convert_ctx_access(enum bpf_access_type type, return insn - insn_buf; } +static int dequeue_btf_struct_access(struct bpf_verifier_log *log, + const struct btf *btf, + const struct btf_type *t, int off, int size, + enum bpf_access_type atype, + u32 *next_btf_id, enum bpf_type_flag *flag) +{ + const struct btf_type *pkt_type; + enum bpf_reg_type reg_type; + struct btf *btf_vmlinux; + + btf_vmlinux = bpf_get_btf_vmlinux(); + if (IS_ERR_OR_NULL(btf_vmlinux) || btf != btf_vmlinux) + return -EINVAL; + + if (atype != BPF_READ) + return -EACCES; + + pkt_type = btf_type_by_id(btf_vmlinux, xdp_md_btf_ids[0]); + if (!pkt_type) + return -EINVAL; + if (t != pkt_type) + return btf_struct_access(log, btf, t, off, size, atype, + next_btf_id, flag); + + switch (off) { + case offsetof(struct xdp_md, data): + reg_type = PTR_TO_PACKET; + break; + case offsetof(struct xdp_md, data_meta): + reg_type = PTR_TO_PACKET_META; + break; + case offsetof(struct xdp_md, data_end): + reg_type = PTR_TO_PACKET_END; + break; + default: + bpf_log(log, "no read support for xdp_md at off %d\n", off); + return -EACCES; + } + + if (!__is_valid_xdp_access(off, size)) + return -EINVAL; + return reg_type; +} + +static u32 +dequeue_convert_xdp_md_access(enum bpf_access_type type, + const struct bpf_insn *si, struct bpf_insn *insn_buf, + struct bpf_prog *prog, u32 *target_size) +{ + struct bpf_insn *insn = insn_buf; + int src_reg; + + switch (si->off) { + case offsetof(struct xdp_md, data): + /* dst_reg = *(src_reg + off(xdp_frame, data)) */ + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_frame, data), + si->dst_reg, si->src_reg, + offsetof(struct xdp_frame, data)); + break; + case offsetof(struct xdp_md, data_meta): + if (si->dst_reg == si->src_reg) { + src_reg = BPF_REG_9; + if (si->dst_reg == src_reg) + src_reg--; + *insn++ = BPF_STX_MEM(BPF_DW, si->src_reg, src_reg, + offsetof(struct xdp_frame, next)); + *insn++ 
= BPF_MOV64_REG(src_reg, si->src_reg); + } else { + src_reg = si->src_reg; + } + /* AX = src_reg + * dst_reg = *(src_reg + off(xdp_frame, data)) + * src_reg = *(src_reg + off(xdp_frame, metasize)) + * dst_reg -= src_reg + * src_reg = AX + */ + *insn++ = BPF_MOV64_REG(BPF_REG_AX, src_reg); + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_frame, data), + si->dst_reg, src_reg, + offsetof(struct xdp_frame, data)); + *insn++ = BPF_LDX_MEM(BPF_B, /* metasize == 8 bits */ + src_reg, src_reg, +#if defined(__LITTLE_ENDIAN_BITFIELD) + offsetofend(struct xdp_frame, headroom) + 3); +#elif defined(__BIG_ENDIAN_BITFIELD) + offsetofend(struct xdp_frame, headroom)); +#endif + *insn++ = BPF_ALU64_REG(BPF_SUB, si->dst_reg, src_reg); + *insn++ = BPF_MOV64_REG(src_reg, BPF_REG_AX); + if (si->dst_reg == si->src_reg) + *insn++ = BPF_LDX_MEM(BPF_DW, src_reg, si->src_reg, + offsetof(struct xdp_frame, next)); + break; + case offsetof(struct xdp_md, data_end): + if (si->dst_reg == si->src_reg) { + src_reg = BPF_REG_9; + if (si->dst_reg == src_reg) + src_reg--; + *insn++ = BPF_STX_MEM(BPF_DW, si->src_reg, src_reg, + offsetof(struct xdp_frame, next)); + *insn++ = BPF_MOV64_REG(src_reg, si->src_reg); + } else { + src_reg = si->src_reg; + } + /* AX = src_reg + * dst_reg = *(src_reg + off(xdp_frame, data)) + * src_reg = *(src_reg + off(xdp_frame, len)) + * dst_reg += src_reg + * src_reg = AX + */ + *insn++ = BPF_MOV64_REG(BPF_REG_AX, src_reg); + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_frame, data), + si->dst_reg, src_reg, + offsetof(struct xdp_frame, data)); + *insn++ = BPF_LDX_MEM(BPF_H, src_reg, src_reg, + offsetof(struct xdp_frame, len)); + *insn++ = BPF_ALU64_REG(BPF_ADD, si->dst_reg, src_reg); + *insn++ = BPF_MOV64_REG(src_reg, BPF_REG_AX); + if (si->dst_reg == si->src_reg) + *insn++ = BPF_LDX_MEM(BPF_DW, src_reg, si->src_reg, + offsetof(struct xdp_frame, next)); + break; + } + return insn - insn_buf; +} + +static bpf_convert_ctx_access_t +dequeue_get_convert_ctx_access(struct bpf_verifier_log *log, + const struct btf *btf, u32 btf_id) +{ + struct btf *btf_vmlinux; + + btf_vmlinux = bpf_get_btf_vmlinux(); + if (IS_ERR_OR_NULL(btf_vmlinux) || btf != btf_vmlinux) + return NULL; + if (btf_id != xdp_md_btf_ids[0]) + return NULL; + return dequeue_convert_xdp_md_access; +} + /* SOCK_ADDR_LOAD_NESTED_FIELD() loads Nested Field S.F.NF where S is type of * context Structure, F is Field in context structure that contains a pointer * to Nested Structure of type NS that has the field NF. 
@@ -10775,6 +10916,8 @@ const struct bpf_verifier_ops dequeue_verifier_ops = { .is_valid_access = dequeue_is_valid_access, .convert_ctx_access = dequeue_convert_ctx_access, .gen_prologue = bpf_noop_prologue, + .btf_struct_access = dequeue_btf_struct_access, + .get_convert_ctx_access = dequeue_get_convert_ctx_access, }; const struct bpf_prog_ops dequeue_prog_ops = { From patchwork Wed Jul 13 11:14:19 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= X-Patchwork-Id: 12916578 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 23011C43334 for ; Wed, 13 Jul 2022 11:15:27 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235849AbiGMLPZ (ORCPT ); Wed, 13 Jul 2022 07:15:25 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39346 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S236202AbiGMLPM (ORCPT ); Wed, 13 Jul 2022 07:15:12 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id 2376C1014AD for ; Wed, 13 Jul 2022 04:14:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1657710890; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=rXFurxawVFtuft5lsS/J8gtr6dC7kcFoAk62FiTlSIM=; b=ZLxSOXV1GzESEj6XogzVWNY3PzRU/z6w/EnQ1xa4BVK/Dg0VRWSIm0Qu2jT6a65KtumVo2 uZtfp/dKH19s/Is9wqJgGQ9o0yDRLBNIXdwa8Ljn+TWAUKITBfLmay73F6tSrGGCM3PmGG Tj5iZmA+VvU5navx+2vC0kv5A/Jb4eQ= Received: from mail-ed1-f70.google.com (mail-ed1-f70.google.com [209.85.208.70]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-543-ym9-BGmDNISUvtmoshiASQ-1; Wed, 13 Jul 2022 07:14:49 -0400 X-MC-Unique: ym9-BGmDNISUvtmoshiASQ-1 Received: by mail-ed1-f70.google.com with SMTP id m10-20020a056402510a00b0043a93d807ffso8134991edd.12 for ; Wed, 13 Jul 2022 04:14:49 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=rXFurxawVFtuft5lsS/J8gtr6dC7kcFoAk62FiTlSIM=; b=0o0o7WDrMsqKSY/GsN+X3eklNpbKtWNTn/iEEuSy2qxzlV8bfkYhvT30qMWAoVksQ6 wutxvbGV7i1E79Z5Ja75W11lsAzb86qnqSyTycE+VKixaBzfxKCN2TLjjlx3RwTuDGUq abC8pimwQZ1d6Uzpe8HupQVWLKVziONEoCazkfYf0g1Yx5UDoCFnCOxQlqKypNPAJj/T T449K7bSJ1b/FW80ahmse+imcC8dMk6LDrdup/AfDCYFJYQWkkx8Rud0s7s5knptd+5B m+wAiDHBlQjF37pREyUe3JRi047wOtCMhvIJ9Aoz1KEfsTclRe2NnQ82CzWeJ46q2GCH tIiA== X-Gm-Message-State: AJIora/XwVpRht7T9YWr6/5vGWFbEpQbMvWNlQ9O9FToLd8kOjLbqihH 2c4qKCPFuo6M9ycq2iulOMrWw24MUpz+zc4Ah7twm7U7Ooo5+Vvku0cSFh7Skev+cCOf0SzbY6c O5tp9FH9jGvwe3do7 X-Received: by 2002:a05:6402:5186:b0:43a:b43a:40bc with SMTP id q6-20020a056402518600b0043ab43a40bcmr4119183edd.388.1657710887103; Wed, 13 Jul 2022 04:14:47 -0700 (PDT) X-Google-Smtp-Source: AGRyM1vB/c1K7yaxap/ZUVfOVeBbQ2B1yduKwzNKz+NkrC/B2K0VzhPyb14RkH+qwdDnq/e4IDFFSw== X-Received: by 
2002:a05:6402:5186:b0:43a:b43a:40bc with SMTP id q6-20020a056402518600b0043ab43a40bcmr4119000edd.388.1657710885403; Wed, 13 Jul 2022 04:14:45 -0700 (PDT) Received: from alrua-x1.borgediget.toke.dk ([2a0c:4d80:42:443::2]) by smtp.gmail.com with ESMTPSA id f19-20020a170906139300b00722e52d043dsm4844725ejc.114.2022.07.13.04.14.40 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 13 Jul 2022 04:14:44 -0700 (PDT) Received: by alrua-x1.borgediget.toke.dk (Postfix, from userid 1000) id B32554D9911; Wed, 13 Jul 2022 13:14:38 +0200 (CEST) From: =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= To: Alexei Starovoitov , Daniel Borkmann , John Fastabend , Andrii Nakryiko , Martin KaFai Lau , Song Liu , Yonghong Song , KP Singh , Stanislav Fomichev , Hao Luo , Jiri Olsa , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jesper Dangaard Brouer Cc: Kumar Kartikeya Dwivedi , netdev@vger.kernel.org, bpf@vger.kernel.org, Freysteinn Alfredsson , Cong Wang , =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rg?= =?utf-8?q?ensen?= Subject: [RFC PATCH 11/17] dev: Add XDP dequeue hook Date: Wed, 13 Jul 2022 13:14:19 +0200 Message-Id: <20220713111430.134810-12-toke@redhat.com> X-Mailer: git-send-email 2.37.0 In-Reply-To: <20220713111430.134810-1-toke@redhat.com> References: <20220713111430.134810-1-toke@redhat.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC Add a second per-interface XDP hook for dequeueing packets. This hook allows attaching programs of the dequeue type, which will be executed by the stack in the TX softirq. Packets returned by the dequeue hook are subsequently transmitted on the interface using the ndo_xdp_xmit() driver function. The code to do this is added to devmap.c to be able to reuse the existing bulking mechanism from there. To actually schedule a device for transmission, a BPF program needs to call a helper that is added in the next commit. 
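
In other words, the contract for the new hook is roughly: the stack runs the attached dequeue program from the TX softirq, bulks every non-NULL frame it returns and sends the bulk through ndo_xdp_xmit(), and stops when the program returns NULL or the quota runs out (rescheduling itself in the latter case); after a NULL return, nothing more is sent until the interface is scheduled again. A minimal program for this hook, together with the XDP-side call into the scheduling helper mentioned above, could look like the sketch below. As in the earlier sketches, the map definition, ctx type and helper signatures are assumptions based on the rest of the series; in particular, queueing frames into the PIFO map via bpf_redirect_map() is assumed here, not something this patch defines:

/* Minimal dequeue program: hand back the next queued frame, or NULL
 * when the queue is empty (which ends this TX softirq run).
 */
SEC("dequeue")
void *deq_next(void *ctx)
{
	__u64 prio = 0;

	return (void *)bpf_packet_dequeue(ctx, &pifo_map, 0, &prio);
}

/* XDP side: queue the incoming frame and schedule this interface so the
 * dequeue program above gets run (bpf_schedule_iface_dequeue() is the
 * helper added in the next commit).
 */
SEC("xdp")
int xdp_enqueue(struct xdp_md *ctx)
{
	/* Priority selection is out of scope here; 0 is used as the key */
	int ret = bpf_redirect_map(&pifo_map, 0, 0);

	if (ret == XDP_REDIRECT)
		bpf_schedule_iface_dequeue(ctx, ctx->ingress_ifindex, 0);

	return ret;
}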
Signed-off-by: Toke Høiland-Jørgensen --- include/linux/filter.h | 17 +++++ include/linux/netdevice.h | 6 ++ include/net/xdp.h | 7 ++ include/uapi/linux/if_link.h | 4 +- kernel/bpf/devmap.c | 88 ++++++++++++++++++++--- net/core/dev.c | 109 +++++++++++++++++++++++++++++ net/core/dev.h | 2 + net/core/filter.c | 7 ++ net/core/rtnetlink.c | 30 ++++++-- tools/include/uapi/linux/if_link.h | 4 +- 10 files changed, 256 insertions(+), 18 deletions(-) diff --git a/include/linux/filter.h b/include/linux/filter.h index b0ddb647d5f2..0f1570daaa52 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -778,6 +778,23 @@ static __always_inline u64 bpf_prog_run_xdp(const struct bpf_prog *prog, void bpf_prog_change_xdp(struct bpf_prog *prev_prog, struct bpf_prog *prog); +DECLARE_BPF_DISPATCHER(xdp_dequeue) + +static __always_inline struct xdp_frame *bpf_prog_run_xdp_dequeue(const struct bpf_prog *prog, + struct dequeue_data *ctx) +{ + struct xdp_frame *frm = NULL; + u64 ret; + + ret = __bpf_prog_run(prog, ctx, BPF_DISPATCHER_FUNC(xdp_dequeue)); + if (ret) + frm = (struct xdp_frame *)(unsigned long)ret; + + return frm; +} + +void bpf_prog_change_xdp_dequeue(struct bpf_prog *prev_prog, struct bpf_prog *prog); + static inline u32 bpf_prog_insn_size(const struct bpf_prog *prog) { return prog->len * sizeof(struct bpf_insn); diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index fe9aeca2fce9..4096caac5a2a 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -74,6 +74,7 @@ struct udp_tunnel_nic_info; struct udp_tunnel_nic; struct bpf_prog; struct xdp_buff; +struct xdp_dequeue; void synchronize_net(void); void netdev_set_default_ethtool_ops(struct net_device *dev, @@ -2326,6 +2327,7 @@ struct net_device { /* protected by rtnl_lock */ struct bpf_xdp_entity xdp_state[__MAX_XDP_MODE]; + struct bpf_prog __rcu *xdp_dequeue_prog; u8 dev_addr_shadow[MAX_ADDR_LEN]; netdevice_tracker linkwatch_dev_tracker; @@ -3109,6 +3111,7 @@ struct softnet_data { struct Qdisc *output_queue; struct Qdisc **output_queue_tailp; struct sk_buff *completion_queue; + struct xdp_dequeue *xdp_dequeue; #ifdef CONFIG_XFRM_OFFLOAD struct sk_buff_head xfrm_backlog; #endif @@ -3143,6 +3146,7 @@ struct softnet_data { int defer_ipi_scheduled; struct sk_buff *defer_list; call_single_data_t defer_csd; + }; static inline void input_queue_head_incr(struct softnet_data *sd) @@ -3222,6 +3226,7 @@ static inline void netif_tx_start_all_queues(struct net_device *dev) } void netif_tx_wake_queue(struct netdev_queue *dev_queue); +void netif_tx_schedule_xdp(struct xdp_dequeue *deq); /** * netif_wake_queue - restart transmit @@ -3851,6 +3856,7 @@ struct sk_buff *dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, int bpf_xdp_link_attach(const union bpf_attr *attr, struct bpf_prog *prog); u8 dev_xdp_prog_count(struct net_device *dev); u32 dev_xdp_prog_id(struct net_device *dev, enum bpf_xdp_mode mode); +u32 dev_xdp_dequeue_prog_id(struct net_device *dev); int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb); int dev_forward_skb(struct net_device *dev, struct sk_buff *skb); diff --git a/include/net/xdp.h b/include/net/xdp.h index 728ce943d352..e06b340132dd 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -89,6 +89,13 @@ struct dequeue_data { struct xdp_txq_info *txq; }; +struct xdp_dequeue { + struct xdp_dequeue *next; +}; + +void dev_run_xdp_dequeue(struct xdp_dequeue *deq); +void dev_schedule_xdp_dequeue(struct net_device *dev); + static __always_inline bool 
xdp_buff_has_frags(struct xdp_buff *xdp) { return !!(xdp->flags & XDP_FLAGS_HAS_FRAGS); diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h index e36d9d2c65a7..fb8ab1796cd2 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -1283,9 +1283,10 @@ enum { #define XDP_FLAGS_DRV_MODE (1U << 2) #define XDP_FLAGS_HW_MODE (1U << 3) #define XDP_FLAGS_REPLACE (1U << 4) +#define XDP_FLAGS_DEQUEUE_MODE (1U << 5) #define XDP_FLAGS_MODES (XDP_FLAGS_SKB_MODE | \ XDP_FLAGS_DRV_MODE | \ - XDP_FLAGS_HW_MODE) + XDP_FLAGS_HW_MODE | XDP_FLAGS_DEQUEUE_MODE) #define XDP_FLAGS_MASK (XDP_FLAGS_UPDATE_IF_NOEXIST | \ XDP_FLAGS_MODES | XDP_FLAGS_REPLACE) @@ -1308,6 +1309,7 @@ enum { IFLA_XDP_SKB_PROG_ID, IFLA_XDP_HW_PROG_ID, IFLA_XDP_EXPECTED_FD, + IFLA_XDP_DEQUEUE_PROG_ID, __IFLA_XDP_MAX, }; diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c index 980f8928e977..949a60f06d24 100644 --- a/kernel/bpf/devmap.c +++ b/kernel/bpf/devmap.c @@ -59,6 +59,7 @@ struct xdp_dev_bulk_queue { struct net_device *dev; struct net_device *dev_rx; struct bpf_prog *xdp_prog; + struct xdp_dequeue deq; unsigned int count; }; @@ -362,16 +363,17 @@ static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog, return nframes; /* sent frames count */ } -static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags) +static bool bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags, bool keep) { struct net_device *dev = bq->dev; unsigned int cnt = bq->count; int sent = 0, err = 0; int to_send = cnt; - int i; + bool ret = true; + int i, kept = 0; if (unlikely(!cnt)) - return; + return true; for (i = 0; i < cnt; i++) { struct xdp_frame *xdpf = bq->q[i]; @@ -394,15 +396,29 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags) sent = 0; } - /* If not all frames have been transmitted, it is our - * responsibility to free them + /* If not all frames have been transmitted, it is our responsibility to + * free them, unless the caller asked for them to be kept, in which case + * we'll move them to the head of the queue */ - for (i = sent; unlikely(i < to_send); i++) - xdp_return_frame_rx_napi(bq->q[i]); + if (unlikely(sent < to_send)) { + ret = false; + if (keep) { + if (!sent) { + kept = to_send; + goto out; + } + for (i = sent; i < to_send; i++) + bq->q[kept++] = bq->q[i]; + } else { + for (i = sent; i < to_send; i++) + xdp_return_frame_rx_napi(bq->q[i]); + } + } out: - bq->count = 0; - trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, cnt - sent, err); + bq->count = kept; + trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, cnt - sent - kept, err); + return ret; } /* __dev_flush is called from xdp_do_flush() which _must_ be signalled from the @@ -415,13 +431,63 @@ void __dev_flush(void) struct xdp_dev_bulk_queue *bq, *tmp; list_for_each_entry_safe(bq, tmp, flush_list, flush_node) { - bq_xmit_all(bq, XDP_XMIT_FLUSH); + bq_xmit_all(bq, XDP_XMIT_FLUSH, false); bq->dev_rx = NULL; bq->xdp_prog = NULL; __list_del_clearprev(&bq->flush_node); } } +void dev_schedule_xdp_dequeue(struct net_device *dev) +{ + struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq); + + netif_tx_schedule_xdp(&bq->deq); +} + +void dev_run_xdp_dequeue(struct xdp_dequeue *deq) +{ + while (deq) { + struct xdp_dev_bulk_queue *bq = container_of(deq, struct xdp_dev_bulk_queue, deq); + struct xdp_txq_info txqi = { .dev = bq->dev }; + struct dequeue_data ctx = { .txq = &txqi }; + struct xdp_dequeue *nxt = deq->next; + int quota = dev_tx_weight; + struct xdp_frame *xdpf; + struct bpf_prog *prog; + bool ret = true; + + 
local_bh_disable(); + + prog = rcu_dereference(bq->dev->xdp_dequeue_prog); + if (likely(prog)) { + do { + if (unlikely(bq->count == DEV_MAP_BULK_SIZE)) { + ret = bq_xmit_all(bq, 0, true); + if (!ret) + break; + } + xdpf = bpf_prog_run_xdp_dequeue(prog, &ctx); + if (xdpf) + bq->q[bq->count++] = xdpf; + + } while (xdpf && --quota); + + if (ret) + ret = bq_xmit_all(bq, XDP_XMIT_FLUSH, true); + + if (!ret || !quota) + /* out of space, reschedule */ + netif_tx_schedule_xdp(deq); + } + + deq->next = NULL; + deq = nxt; + + local_bh_enable(); + } +} + /* Elements are kept alive by RCU; either by rcu_read_lock() (from syscall) or * by local_bh_disable() (from XDP calls inside NAPI). The * rcu_read_lock_bh_held() below makes lockdep accept both. @@ -450,7 +516,7 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf, struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq); if (unlikely(bq->count == DEV_MAP_BULK_SIZE)) - bq_xmit_all(bq, 0); + bq_xmit_all(bq, 0, false); /* Ingress dev_rx will be the same for all xdp_frame's in * bulk_queue, because bq stored per-CPU and must be flushed diff --git a/net/core/dev.c b/net/core/dev.c index 978ed0622d8f..07505c88117a 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3120,6 +3120,22 @@ void netif_tx_wake_queue(struct netdev_queue *dev_queue) } EXPORT_SYMBOL(netif_tx_wake_queue); +void netif_tx_schedule_xdp(struct xdp_dequeue *deq) +{ + bool need_bh_off = !(hardirq_count() | softirq_count()); + + WARN_ON_ONCE(need_bh_off); + + if (!deq->next) { + struct softnet_data *sd = this_cpu_ptr(&softnet_data); + + deq->next = sd->xdp_dequeue; + sd->xdp_dequeue = deq; + raise_softirq_irqoff(NET_TX_SOFTIRQ); + } +} +EXPORT_SYMBOL(netif_tx_schedule_xdp); + void __dev_kfree_skb_irq(struct sk_buff *skb, enum skb_free_reason reason) { unsigned long flags; @@ -5011,6 +5027,17 @@ static __latent_entropy void net_tx_action(struct softirq_action *h) { struct softnet_data *sd = this_cpu_ptr(&softnet_data); + if (sd->xdp_dequeue) { + struct xdp_dequeue *deq; + + local_irq_disable(); + deq = sd->xdp_dequeue; + sd->xdp_dequeue = NULL; + local_irq_enable(); + + dev_run_xdp_dequeue(deq); + } + if (sd->completion_queue) { struct sk_buff *clist; @@ -9522,6 +9549,88 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack, return err; } +u32 dev_xdp_dequeue_prog_id(struct net_device *dev) +{ + struct bpf_prog *prog = rtnl_dereference(dev->xdp_dequeue_prog); + + return prog ? 
prog->aux->id : 0; +} + +static int dev_xdp_dequeue_attach(struct net_device *dev, struct netlink_ext_ack *extack, + struct bpf_prog *new_prog, struct bpf_prog *old_prog, u32 flags) +{ + struct bpf_prog *cur_prog; + + ASSERT_RTNL(); + + if (!(flags & XDP_FLAGS_REPLACE) || (flags & XDP_FLAGS_UPDATE_IF_NOEXIST)) { + NL_SET_ERR_MSG(extack, "Dequeue prog must use XDP_FLAGS_REPLACE"); + return -EINVAL; + } + + cur_prog = rcu_dereference(dev->xdp_dequeue_prog); + + if (cur_prog != old_prog) { + NL_SET_ERR_MSG(extack, "Active program does not match expected"); + return -EEXIST; + } + + if (cur_prog != new_prog) { + rcu_assign_pointer(dev->xdp_dequeue_prog, new_prog); + bpf_prog_change_xdp_dequeue(cur_prog, new_prog); + } + + if (cur_prog) + bpf_prog_put(cur_prog); + + return 0; +} + +/** + * dev_change_xdp_dequeue_fd - set or clear a bpf program for a XDP dequeue + * @dev: device + * @extack: netlink extended ack + * @fd: new program fd or negative value to clear + * @expected_fd: old program fd that userspace expects to replace or clear + * @flags: xdp dequeue-related flags + * + * Set or clear an XDP dequeue program for a device + */ +int dev_change_xdp_dequeue_fd(struct net_device *dev, struct netlink_ext_ack *extack, + int fd, int expected_fd, u32 flags) +{ + struct bpf_prog *new_prog = NULL, *old_prog = NULL; + int err; + + ASSERT_RTNL(); + + if (fd >= 0) { + new_prog = bpf_prog_get_type_dev(fd, BPF_PROG_TYPE_DEQUEUE, false); + if (IS_ERR(new_prog)) + return PTR_ERR(new_prog); + } + + if (expected_fd >= 0) { + old_prog = bpf_prog_get_type_dev(expected_fd, + BPF_PROG_TYPE_DEQUEUE, + false); + if (IS_ERR(old_prog)) { + err = PTR_ERR(old_prog); + old_prog = NULL; + goto err_out; + } + } + + err = dev_xdp_dequeue_attach(dev, extack, new_prog, old_prog, flags); + +err_out: + if (err && new_prog) + bpf_prog_put(new_prog); + if (old_prog) + bpf_prog_put(old_prog); + return err; +} + /** * dev_new_index - allocate an ifindex * @net: the applicable net namespace diff --git a/net/core/dev.h b/net/core/dev.h index cbb8a925175a..fe598287f786 100644 --- a/net/core/dev.h +++ b/net/core/dev.h @@ -81,6 +81,8 @@ void dev_change_proto_down_reason(struct net_device *dev, unsigned long mask, typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf); int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack, int fd, int expected_fd, u32 flags); +int dev_change_xdp_dequeue_fd(struct net_device *dev, struct netlink_ext_ack *extack, + int fd, int expected_fd, u32 flags); int dev_change_tx_queue_len(struct net_device *dev, unsigned long new_len); void dev_set_group(struct net_device *dev, int new_group); diff --git a/net/core/filter.c b/net/core/filter.c index 6a4881739e9b..7c89eaa01c29 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -11584,6 +11584,13 @@ void bpf_prog_change_xdp(struct bpf_prog *prev_prog, struct bpf_prog *prog) bpf_dispatcher_change_prog(BPF_DISPATCHER_PTR(xdp), prev_prog, prog); } +DEFINE_BPF_DISPATCHER(xdp_dequeue) + +void bpf_prog_change_xdp_dequeue(struct bpf_prog *prev_prog, struct bpf_prog *prog) +{ + bpf_dispatcher_change_prog(BPF_DISPATCHER_PTR(xdp_dequeue), prev_prog, prog); +} + BTF_ID_LIST_GLOBAL(btf_sock_ids, MAX_BTF_SOCK_TYPE) #define BTF_SOCK_TYPE(name, type) BTF_ID(struct, type) BTF_SOCK_TYPE_xxx diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index ac45328607f7..495acb5a6616 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -1012,7 +1012,8 @@ static size_t rtnl_xdp_size(void) size_t xdp_size = nla_total_size(0) + /* 
nest IFLA_XDP */ nla_total_size(1) + /* XDP_ATTACHED */ nla_total_size(4) + /* XDP_PROG_ID (or 1st mode) */ - nla_total_size(4); /* XDP__PROG_ID */ + nla_total_size(4) + /* XDP__PROG_ID */ + nla_total_size(4); /* XDP_DEQUEUE_PROG_ID */ return xdp_size; } @@ -1467,6 +1468,11 @@ static u32 rtnl_xdp_prog_hw(struct net_device *dev) return dev_xdp_prog_id(dev, XDP_MODE_HW); } +static u32 rtnl_xdp_dequeue_prog(struct net_device *dev) +{ + return dev_xdp_dequeue_prog_id(dev); +} + static int rtnl_xdp_report_one(struct sk_buff *skb, struct net_device *dev, u32 *prog_id, u8 *mode, u8 tgt_mode, u32 attr, u32 (*get_prog_id)(struct net_device *dev)) @@ -1527,6 +1533,13 @@ static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev) goto err_cancel; } + prog_id = rtnl_xdp_dequeue_prog(dev); + if (prog_id) { + err = nla_put_u32(skb, IFLA_XDP_DEQUEUE_PROG_ID, prog_id); + if (err) + goto err_cancel; + } + nla_nest_end(skb, xdp); return 0; @@ -1979,6 +1992,7 @@ static const struct nla_policy ifla_xdp_policy[IFLA_XDP_MAX + 1] = { [IFLA_XDP_ATTACHED] = { .type = NLA_U8 }, [IFLA_XDP_FLAGS] = { .type = NLA_U32 }, [IFLA_XDP_PROG_ID] = { .type = NLA_U32 }, + [IFLA_XDP_DEQUEUE_PROG_ID] = { .type = NLA_U32 }, }; static const struct rtnl_link_ops *linkinfo_to_kind_ops(const struct nlattr *nla) @@ -2998,10 +3012,16 @@ static int do_setlink(const struct sk_buff *skb, nla_get_s32(xdp[IFLA_XDP_EXPECTED_FD]); } - err = dev_change_xdp_fd(dev, extack, - nla_get_s32(xdp[IFLA_XDP_FD]), - expected_fd, - xdp_flags); + if (xdp_flags & XDP_FLAGS_DEQUEUE_MODE) + err = dev_change_xdp_dequeue_fd(dev, extack, + nla_get_s32(xdp[IFLA_XDP_FD]), + expected_fd, + xdp_flags); + else + err = dev_change_xdp_fd(dev, extack, + nla_get_s32(xdp[IFLA_XDP_FD]), + expected_fd, + xdp_flags); if (err) goto errout; status |= DO_SETLINK_NOTIFY; diff --git a/tools/include/uapi/linux/if_link.h b/tools/include/uapi/linux/if_link.h index 0242f31e339c..f40ad0db46b7 100644 --- a/tools/include/uapi/linux/if_link.h +++ b/tools/include/uapi/linux/if_link.h @@ -1188,9 +1188,10 @@ enum { #define XDP_FLAGS_DRV_MODE (1U << 2) #define XDP_FLAGS_HW_MODE (1U << 3) #define XDP_FLAGS_REPLACE (1U << 4) +#define XDP_FLAGS_DEQUEUE_MODE (1U << 5) #define XDP_FLAGS_MODES (XDP_FLAGS_SKB_MODE | \ XDP_FLAGS_DRV_MODE | \ - XDP_FLAGS_HW_MODE) + XDP_FLAGS_HW_MODE | XDP_FLAGS_DEQUEUE_MODE) #define XDP_FLAGS_MASK (XDP_FLAGS_UPDATE_IF_NOEXIST | \ XDP_FLAGS_MODES | XDP_FLAGS_REPLACE) @@ -1213,6 +1214,7 @@ enum { IFLA_XDP_SKB_PROG_ID, IFLA_XDP_HW_PROG_ID, IFLA_XDP_EXPECTED_FD, + IFLA_XDP_DEQUEUE_PROG_ID, __IFLA_XDP_MAX, }; From patchwork Wed Jul 13 11:14:20 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= X-Patchwork-Id: 12916585 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3F70CC43334 for ; Wed, 13 Jul 2022 11:18:43 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S236084AbiGMLSi (ORCPT ); Wed, 13 Jul 2022 07:18:38 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:42750 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S236003AbiGMLSb (ORCPT ); Wed, 13 Jul 2022 07:18:31 -0400 Received: from us-smtp-delivery-124.mimecast.com 
(us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id D0A63101482 for ; Wed, 13 Jul 2022 04:18:29 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1657711108; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=yLRYZFEE5WheSOr0ZueH2oIEz+AIlR3j+gOBT1CAlzo=; b=LFQUcT2cetqSE77FnnU7LFhoypDLxqiMiqAK4o3LS4eezMMHW1+47ubA4xNDuuX0ZcQPyt WkN3MCxMqSnJC7WF7oL9hIViFGgiP8K6Lowmt5l6I8NhRL7i0LL++OjcOXKUpEpWXyhKVB b5QIMlC3/pr1ITzpmZVChXQ7A+pfzAg= Received: from mail-ej1-f71.google.com (mail-ej1-f71.google.com [209.85.218.71]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-316-p4U2AZYUPGORMRTy5VqRAQ-1; Wed, 13 Jul 2022 07:18:28 -0400 X-MC-Unique: p4U2AZYUPGORMRTy5VqRAQ-1 Received: by mail-ej1-f71.google.com with SMTP id hq20-20020a1709073f1400b0072b9824f0a2so580223ejc.23 for ; Wed, 13 Jul 2022 04:18:27 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=yLRYZFEE5WheSOr0ZueH2oIEz+AIlR3j+gOBT1CAlzo=; b=vmpCg51pIHGljCSxg0b6uV+YHkmJGTkN+VCDv5QJWNHmSNn+XAQZPG/EsQl9sUcixv Cr51njjk/caSCVZkXo/moVzr0xWibfdE63Hi443NyGXgXRHYAV36XD6P8reiBSzidsbs VYpNv1LP1U1tJo9dRlUCd+zCmDp1gUR7+p7XpLNWJzdgilkVcXqvPt5UCfYgLFjzBLmv 0NJ4b1rY/sPVxfPRmCz9OHEmvA6Wbx6PAZe2dY7u/1kpBfQWyRt/mMxYqEbI26jy3d+R SSc8MHlBRoCVSAR15BTjLB7KMKAwWVWT1f+eKAEeLkVSbJdh2/JfjJI6ZtTiq+MG7xFu DWfQ== X-Gm-Message-State: AJIora+CzhIo3kNjSMvw6sdTzirnbvBBCLXgXh1+66V1Gv/eqP9zr+Ho 28pDodg9bW+YbzKTkIGZqjC9EmfSTHDT6CD7azpw/0fonxKbhP/4ZpnItzrVbUgL73WH8gXEiG5 nSP/0Ecq6JxuJNFE1 X-Received: by 2002:a17:906:cc12:b0:72b:67bb:80c3 with SMTP id ml18-20020a170906cc1200b0072b67bb80c3mr2868835ejb.668.1657711105122; Wed, 13 Jul 2022 04:18:25 -0700 (PDT) X-Google-Smtp-Source: AGRyM1uBFxK6DbwgDOXIMNl+wGvmseBfFYC5zNP3nMDjTbia6fZ+NjHseC/3kcVLQORZRD33bQ8oSQ== X-Received: by 2002:a17:906:cc12:b0:72b:67bb:80c3 with SMTP id ml18-20020a170906cc1200b0072b67bb80c3mr2868756ejb.668.1657711104006; Wed, 13 Jul 2022 04:18:24 -0700 (PDT) Received: from alrua-x1.borgediget.toke.dk ([2a0c:4d80:42:443::2]) by smtp.gmail.com with ESMTPSA id kv21-20020a17090778d500b0070abf371274sm4814528ejc.136.2022.07.13.04.18.22 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 13 Jul 2022 04:18:22 -0700 (PDT) Received: by alrua-x1.borgediget.toke.dk (Postfix, from userid 1000) id 1C6074D9914; Wed, 13 Jul 2022 13:14:39 +0200 (CEST) From: =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= To: Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau , Song Liu , Yonghong Song , John Fastabend , KP Singh , Stanislav Fomichev , Hao Luo , Jiri Olsa , "David S. 
Miller" , Jakub Kicinski , Jesper Dangaard Brouer Cc: Kumar Kartikeya Dwivedi , netdev@vger.kernel.org, bpf@vger.kernel.org, Freysteinn Alfredsson , Cong Wang , =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rg?= =?utf-8?q?ensen?= , Eric Dumazet , Paolo Abeni Subject: [RFC PATCH 12/17] bpf: Add helper to schedule an interface for TX dequeue Date: Wed, 13 Jul 2022 13:14:20 +0200 Message-Id: <20220713111430.134810-13-toke@redhat.com> X-Mailer: git-send-email 2.37.0 In-Reply-To: <20220713111430.134810-1-toke@redhat.com> References: <20220713111430.134810-1-toke@redhat.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC This adds a helper that a BPF program can call to schedule an interface for transmission. The helper can be used from both a regular XDP program (to schedule transmission after queueing a packet), and from a dequeue program to (re-)schedule transmission after a dequeue operation. In particular, the latter use can be combined with BPF timers to schedule delayed transmission, for instance to implement traffic shaping. The helper always schedules transmission on the interface on the current CPU. For cross-CPU operation, it is up to the BPF program to arrange for the helper to be called on the appropriate CPU, either by configuring hardware RSS appropriately, or by using a cpumap. Likewise, it is up to the BPF programs to decide whether to use separate queues per CPU (by using multiple maps to queue packets in), or accept the lock contention of using a single map across CPUs. Signed-off-by: Toke Høiland-Jørgensen --- include/uapi/linux/bpf.h | 11 +++++++ net/core/filter.c | 52 ++++++++++++++++++++++++++++++++++ tools/include/uapi/linux/bpf.h | 11 +++++++ 3 files changed, 74 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index d44382644391..b352ecc280f4 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -5358,6 +5358,16 @@ union bpf_attr { * *bpf_packet_dequeue()* (and checked to not be NULL). * Return * This always succeeds and returns zero. + * + * long bpf_schedule_iface_dequeue(void *ctx, int ifindex, int flags) + * Description + * Schedule the interface with index *ifindex* for transmission from + * its dequeue program as soon as possible. The *flags* argument + * must be zero. + * + * Return + * Returns zero on success, or -ENOENT if no dequeue program is + * loaded on the interface. 
*/ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -5570,6 +5580,7 @@ union bpf_attr { FN(tcp_raw_check_syncookie_ipv6), \ FN(packet_dequeue), \ FN(packet_drop), \ + FN(schedule_iface_dequeue), \ /* */ /* integer value in 'imm' field of BPF_CALL instruction selects which helper diff --git a/net/core/filter.c b/net/core/filter.c index 7c89eaa01c29..bb556d873b52 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -4431,6 +4431,54 @@ static const struct bpf_func_proto bpf_xdp_redirect_map_proto = { .arg3_type = ARG_ANYTHING, }; +static int bpf_schedule_iface_dequeue(struct net *net, int ifindex, int flags) +{ + struct net_device *dev; + struct bpf_prog *prog; + + if (flags) + return -EINVAL; + + dev = dev_get_by_index_rcu(net, ifindex); + if (!dev) + return -ENODEV; + + prog = rcu_dereference(dev->xdp_dequeue_prog); + if (!prog) + return -ENOENT; + + dev_schedule_xdp_dequeue(dev); + return 0; +} + +BPF_CALL_3(bpf_xdp_schedule_iface_dequeue, struct xdp_buff *, ctx, int, ifindex, int, flags) +{ + return bpf_schedule_iface_dequeue(dev_net(ctx->rxq->dev), ifindex, flags); +} + +static const struct bpf_func_proto bpf_xdp_schedule_iface_dequeue_proto = { + .func = bpf_xdp_schedule_iface_dequeue, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_ANYTHING, +}; + +BPF_CALL_3(bpf_dequeue_schedule_iface_dequeue, struct dequeue_data *, ctx, int, ifindex, int, flags) +{ + return bpf_schedule_iface_dequeue(dev_net(ctx->txq->dev), ifindex, flags); +} + +static const struct bpf_func_proto bpf_dequeue_schedule_iface_dequeue_proto = { + .func = bpf_dequeue_schedule_iface_dequeue, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_ANYTHING, +}; + BTF_ID_LIST_SINGLE(xdp_md_btf_ids, struct, xdp_md) BPF_CALL_4(bpf_packet_dequeue, struct dequeue_data *, ctx, struct bpf_map *, map, @@ -8068,6 +8116,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_xdp_fib_lookup_proto; case BPF_FUNC_check_mtu: return &bpf_xdp_check_mtu_proto; + case BPF_FUNC_schedule_iface_dequeue: + return &bpf_xdp_schedule_iface_dequeue_proto; #ifdef CONFIG_INET case BPF_FUNC_sk_lookup_udp: return &bpf_xdp_sk_lookup_udp_proto; @@ -8105,6 +8155,8 @@ dequeue_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_packet_dequeue_proto; case BPF_FUNC_packet_drop: return &bpf_packet_drop_proto; + case BPF_FUNC_schedule_iface_dequeue: + return &bpf_dequeue_schedule_iface_dequeue_proto; default: return bpf_base_func_proto(func_id); } diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 1dab68a89e18..9eb9a5b52c76 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -5358,6 +5358,16 @@ union bpf_attr { * *bpf_packet_dequeue()* (and checked to not be NULL). * Return * This always succeeds and returns zero. + * + * long bpf_schedule_iface_dequeue(void *ctx, int ifindex, int flags) + * Description + * Schedule the interface with index *ifindex* for transmission from + * its dequeue program as soon as possible. The *flags* argument + * must be zero. + * + * Return + * Returns zero on success, or -ENOENT if no dequeue program is + * loaded on the interface. 
*/ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -5570,6 +5580,7 @@ union bpf_attr { FN(tcp_raw_check_syncookie_ipv6), \ FN(packet_dequeue), \ FN(packet_drop), \ + FN(schedule_iface_dequeue), \ /* */ /* integer value in 'imm' field of BPF_CALL instruction selects which helper From patchwork Wed Jul 13 11:14:21 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= X-Patchwork-Id: 12916574 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8B5A3C433EF for ; Wed, 13 Jul 2022 11:15:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235974AbiGMLPR (ORCPT ); Wed, 13 Jul 2022 07:15:17 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38910 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S236059AbiGMLO5 (ORCPT ); Wed, 13 Jul 2022 07:14:57 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id 04C95100CE6 for ; Wed, 13 Jul 2022 04:14:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1657710885; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Ef9GXgQuBeTFQaGFOOxbyhfbiGrCHyDokmB4U/+rt5g=; b=YikGjR1OYF+luf+qMT6dw0/qaVTR58go7BahLEEfDGldCeCngBm9WEDQpaYj+F7SVMj5CM /fEfzO1KQmw6IxZ9aDG4MOw53e04rYa2fzdnhp+T7XFOpz4OlBmzuWZbX8i1EaHnR62Cbi lEfFV0ktSEvk5hNgC/FVOAkN89g594w= Received: from mail-ed1-f69.google.com (mail-ed1-f69.google.com [209.85.208.69]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-649-9M1ZEFw8PFOuOC8ZGI9GLA-1; Wed, 13 Jul 2022 07:14:44 -0400 X-MC-Unique: 9M1ZEFw8PFOuOC8ZGI9GLA-1 Received: by mail-ed1-f69.google.com with SMTP id z20-20020a05640240d400b0043a82d9d65fso8060626edb.0 for ; Wed, 13 Jul 2022 04:14:44 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=Ef9GXgQuBeTFQaGFOOxbyhfbiGrCHyDokmB4U/+rt5g=; b=ndi6Mvn/HJyJ5UyQBj6kcnUfQoqDJQ6OiMkFXk2IsfWH4rSHFTEyXiSe0sewF8yFX9 yvLrLRr8rKFPzZQSClL75yQTDIBncjrL8BjiY50sz1qCnsbQnFoUHTpybVGp5Bwa65V9 OHWM9XQqOzgosECtfk3o3xqKGt00iU/HKfdAnJ4NnM8WfY8LkY2cljI6HJDmz2OPAgJK a0U+Qhn1o4aXlj0SlQOyuj9qBoKAKYMcChhGNiYs/XbiFbKWYyXNrWwGtTKrkohg3q8d m9jD403g6mCogot/GudiRzWVKktAC2WeyHlmZmaEltMnTHW5k22i59rDJdbAsri+ntLp SS4A== X-Gm-Message-State: AJIora9+IvW+NbBfHJcodmNmwchE879pHhx8hVnjjn2FxHYnkaNd95Ws QC8HTCMvXEggeNPHT7pbM3ZXe/WRG6XuXecQCahwqylXxXaPGFjKmVimDjnNwF5pbEM3cSHQH/B I93x/vd+CEorz3hik X-Received: by 2002:a17:907:3f04:b0:6e8:4b0e:438d with SMTP id hq4-20020a1709073f0400b006e84b0e438dmr2910868ejc.391.1657710883346; Wed, 13 Jul 2022 04:14:43 -0700 (PDT) X-Google-Smtp-Source: AGRyM1u/+Z9t8vIbQv0pG9ukm9WXsOBRvLEg5K6MZXKEs01Vcfh3GIy0UlwIw32bqf1J7qA2+eRI/Q== X-Received: by 2002:a17:907:3f04:b0:6e8:4b0e:438d with SMTP id 
hq4-20020a1709073f0400b006e84b0e438dmr2910845ejc.391.1657710883090; Wed, 13 Jul 2022 04:14:43 -0700 (PDT) Received: from alrua-x1.borgediget.toke.dk ([45.145.92.2]) by smtp.gmail.com with ESMTPSA id k19-20020a05640212d300b0043a8f5ad272sm7781074edx.49.2022.07.13.04.14.40 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 13 Jul 2022 04:14:42 -0700 (PDT) Received: by alrua-x1.borgediget.toke.dk (Postfix, from userid 1000) id 603734D9916; Wed, 13 Jul 2022 13:14:39 +0200 (CEST) From: =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= To: Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau , Song Liu , Yonghong Song , John Fastabend , KP Singh , Stanislav Fomichev , Hao Luo , Jiri Olsa Cc: Kumar Kartikeya Dwivedi , netdev@vger.kernel.org, bpf@vger.kernel.org, Freysteinn Alfredsson , Cong Wang , =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rg?= =?utf-8?q?ensen?= Subject: [RFC PATCH 13/17] libbpf: Add support for dequeue program type and PIFO map type Date: Wed, 13 Jul 2022 13:14:21 +0200 Message-Id: <20220713111430.134810-14-toke@redhat.com> X-Mailer: git-send-email 2.37.0 In-Reply-To: <20220713111430.134810-1-toke@redhat.com> References: <20220713111430.134810-1-toke@redhat.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC Add support for a 'dequeue' section type to specify dequeue type programs and add support for dequeue program and PIFO map to probing code. Signed-off-by: Toke Høiland-Jørgensen --- tools/lib/bpf/libbpf.c | 1 + tools/lib/bpf/libbpf_probes.c | 5 +++++ 2 files changed, 6 insertions(+) diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index cb49408eb298..8553bb8369e0 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -8431,6 +8431,7 @@ static const struct bpf_sec_def section_defs[] = { SEC_DEF("xdp/cpumap", XDP, BPF_XDP_CPUMAP, SEC_ATTACHABLE), SEC_DEF("xdp.frags", XDP, BPF_XDP, SEC_XDP_FRAGS), SEC_DEF("xdp", XDP, BPF_XDP, SEC_ATTACHABLE_OPT), + SEC_DEF("dequeue", DEQUEUE, 0, SEC_NONE), SEC_DEF("perf_event", PERF_EVENT, 0, SEC_NONE), SEC_DEF("lwt_in", LWT_IN, 0, SEC_NONE), SEC_DEF("lwt_out", LWT_OUT, 0, SEC_NONE), diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c index 0b5398786bf3..a9ead2d55264 100644 --- a/tools/lib/bpf/libbpf_probes.c +++ b/tools/lib/bpf/libbpf_probes.c @@ -97,6 +97,7 @@ static int probe_prog_load(enum bpf_prog_type prog_type, case BPF_PROG_TYPE_SK_REUSEPORT: case BPF_PROG_TYPE_FLOW_DISSECTOR: case BPF_PROG_TYPE_CGROUP_SYSCTL: + case BPF_PROG_TYPE_DEQUEUE: break; default: return -EOPNOTSUPP; @@ -244,6 +245,10 @@ static int probe_map_create(enum bpf_map_type map_type) key_size = 0; max_entries = 1; break; + case BPF_MAP_TYPE_PIFO_GENERIC: + case BPF_MAP_TYPE_PIFO_XDP: + opts.map_extra = 8; + break; case BPF_MAP_TYPE_HASH: case BPF_MAP_TYPE_ARRAY: case BPF_MAP_TYPE_PROG_ARRAY: From patchwork Wed Jul 13 11:14:22 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= X-Patchwork-Id: 12916579 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id F37EDC433EF for ; Wed, 13 Jul 2022 11:15:30 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id 
S236265AbiGMLP2 (ORCPT ); Wed, 13 Jul 2022 07:15:28 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39352 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S236219AbiGMLPN (ORCPT ); Wed, 13 Jul 2022 07:15:13 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id 081F31014B2 for ; Wed, 13 Jul 2022 04:14:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1657710891; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=gCuW9cbcf8hMLILwpOQETuB1FTQ/X1nboKBZtIGXp00=; b=fXVbWyeWoKYkAp5t1ZvgaYwj6QbOyZ0htf8QR3LxAouiBfUMJum1u5SMivcX1nJW7AfFw9 U4dWPZbDI0kEZtZvZseVnqZ+QOcaf9Hh0IPSR3jjYp77QUwlgQl6sRRN81afuB21rPuPTN UYSlZe1wc7VeeoovDejHST4COCOqxsE= Received: from mail-ej1-f72.google.com (mail-ej1-f72.google.com [209.85.218.72]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-549-I8Xw42TzMRy49x9NVTXJUw-1; Wed, 13 Jul 2022 07:14:50 -0400 X-MC-Unique: I8Xw42TzMRy49x9NVTXJUw-1 Received: by mail-ej1-f72.google.com with SMTP id nc23-20020a1709071c1700b0072b94109144so762321ejc.2 for ; Wed, 13 Jul 2022 04:14:50 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=gCuW9cbcf8hMLILwpOQETuB1FTQ/X1nboKBZtIGXp00=; b=Xph0HoOA1yTnenuBOF7z8B3fzyr6HE5ltcWc6xg8Ou7QFS82YDp7p9SVyC7EOhwsJY tLrYWRI4dyTe0xGFbkA/L+nTo/HeE89uPzx/T2bVoOc7xi0jvcefv6+CtnL4k5O2kNxc 8dGN9DY89efjFYHQCc4QzDghNNloSsWOo3xGKray4Ofo+RBUp2x3joi7VPlziGrP28PO agc32ewZY1cLOE66l5aMdRSvUfMvCIQZaGEmS+OB+ckEG93oOUs8pp/FI8R7eot+augf CUvFSx7HstDM4U4ERlOIokiSUBNEU5v0viCaT5tt7xJ6RespjezQwBDXIyUkWyMunpIb dChg== X-Gm-Message-State: AJIora9O7exUbKSVqqlykR3fa3CLCnwWEevoUyAMUxqu5Sh7RV6TJT97 nyMZz2wU+1wPrelgqYGwA/I6ln+5Cj7WQH4ejy9uzNtCIxwguKpUV3XzCEDT7DQg1H9/8gRGjq5 ScSOFP2dpfAuEULAh X-Received: by 2002:aa7:c2d7:0:b0:43a:78af:6e57 with SMTP id m23-20020aa7c2d7000000b0043a78af6e57mr4117060edp.163.1657710885916; Wed, 13 Jul 2022 04:14:45 -0700 (PDT) X-Google-Smtp-Source: AGRyM1sxEwR3OSiEpZ8MGJOFm3lT1P38sF2u7mDclfnoS5W7ITmvEBxkx/lZGY0yG0BKt6kuLj2+OQ== X-Received: by 2002:aa7:c2d7:0:b0:43a:78af:6e57 with SMTP id m23-20020aa7c2d7000000b0043a78af6e57mr4116953edp.163.1657710884805; Wed, 13 Jul 2022 04:14:44 -0700 (PDT) Received: from alrua-x1.borgediget.toke.dk ([2a0c:4d80:42:443::2]) by smtp.gmail.com with ESMTPSA id h5-20020a0564020e8500b0043a7404314csm7653673eda.8.2022.07.13.04.14.40 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 13 Jul 2022 04:14:44 -0700 (PDT) Received: by alrua-x1.borgediget.toke.dk (Postfix, from userid 1000) id A7BF84D9919; Wed, 13 Jul 2022 13:14:39 +0200 (CEST) From: =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= To: Andrii Nakryiko , Alexei Starovoitov , Daniel Borkmann , Martin KaFai Lau , Song Liu , Yonghong Song , John Fastabend , KP Singh , Stanislav Fomichev , Hao Luo , Jiri Olsa , "David S. 
Miller" , Jakub Kicinski , Jesper Dangaard Brouer Cc: Kumar Kartikeya Dwivedi , netdev@vger.kernel.org, bpf@vger.kernel.org, Freysteinn Alfredsson , Cong Wang , =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rg?= =?utf-8?q?ensen?= Subject: [RFC PATCH 14/17] libbpf: Add support for querying dequeue programs Date: Wed, 13 Jul 2022 13:14:22 +0200 Message-Id: <20220713111430.134810-15-toke@redhat.com> X-Mailer: git-send-email 2.37.0 In-Reply-To: <20220713111430.134810-1-toke@redhat.com> References: <20220713111430.134810-1-toke@redhat.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC Add support to libbpf for reading the dequeue program ID from netlink when querying for installed XDP programs. No additional support is needed to install dequeue programs, as they are just using a new mode flag for the regular XDP program installation mechanism. Signed-off-by: Toke Høiland-Jørgensen --- tools/lib/bpf/libbpf.h | 1 + tools/lib/bpf/netlink.c | 8 ++++++++ 2 files changed, 9 insertions(+) diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h index e4d5353f757b..b15ff90279cb 100644 --- a/tools/lib/bpf/libbpf.h +++ b/tools/lib/bpf/libbpf.h @@ -906,6 +906,7 @@ struct bpf_xdp_query_opts { __u32 drv_prog_id; /* output */ __u32 hw_prog_id; /* output */ __u32 skb_prog_id; /* output */ + __u32 dequeue_prog_id; /* output */ __u8 attach_mode; /* output */ size_t :0; }; diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c index 6c013168032d..64a9aceb9c9c 100644 --- a/tools/lib/bpf/netlink.c +++ b/tools/lib/bpf/netlink.c @@ -32,6 +32,7 @@ struct xdp_link_info { __u32 drv_prog_id; __u32 hw_prog_id; __u32 skb_prog_id; + __u32 dequeue_prog_id; __u8 attach_mode; }; @@ -354,6 +355,10 @@ static int get_xdp_info(void *cookie, void *msg, struct nlattr **tb) xdp_id->info.hw_prog_id = libbpf_nla_getattr_u32( xdp_tb[IFLA_XDP_HW_PROG_ID]); + if (xdp_tb[IFLA_XDP_DEQUEUE_PROG_ID]) + xdp_id->info.dequeue_prog_id = libbpf_nla_getattr_u32( + xdp_tb[IFLA_XDP_DEQUEUE_PROG_ID]); + return 0; } @@ -391,6 +396,7 @@ int bpf_xdp_query(int ifindex, int xdp_flags, struct bpf_xdp_query_opts *opts) OPTS_SET(opts, drv_prog_id, xdp_id.info.drv_prog_id); OPTS_SET(opts, hw_prog_id, xdp_id.info.hw_prog_id); OPTS_SET(opts, skb_prog_id, xdp_id.info.skb_prog_id); + OPTS_SET(opts, dequeue_prog_id, xdp_id.info.dequeue_prog_id); OPTS_SET(opts, attach_mode, xdp_id.info.attach_mode); return 0; @@ -415,6 +421,8 @@ int bpf_xdp_query_id(int ifindex, int flags, __u32 *prog_id) *prog_id = opts.hw_prog_id; else if (flags & XDP_FLAGS_SKB_MODE) *prog_id = opts.skb_prog_id; + else if (flags & XDP_FLAGS_DEQUEUE_MODE) + *prog_id = opts.dequeue_prog_id; else *prog_id = 0; From patchwork Wed Jul 13 11:14:23 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= X-Patchwork-Id: 12916573 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 46DABC43334 for ; Wed, 13 Jul 2022 11:15:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S236083AbiGMLPP (ORCPT ); Wed, 13 Jul 2022 07:15:15 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38956 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) 
by vger.kernel.org with ESMTP id S236060AbiGMLO5 (ORCPT ); Wed, 13 Jul 2022 07:14:57 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id B4EDB100CF5 for ; Wed, 13 Jul 2022 04:14:47 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1657710886; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=a+aE4ElkiLVX6Bh75bMcr/iPsxW7Ryr6p7928iEMT6Q=; b=FtzANAOJdFUuR/tc0LJMW9YJCHUVhWQztAN2DWmo716BDMbR//y0Y3WIYzWLkr/W3vjFUT xbyXBIw3NuVRcXUZfEDaMcpDQs6tMfj2snMc096GE+xyF8VTLVn1IHVN9YkHFpAO+v/mYu pTGOKmDZoUT8A5RBmSJcyNKqxDf1Yww= Received: from mail-ej1-f70.google.com (mail-ej1-f70.google.com [209.85.218.70]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-151-ZYHSxVi8NUOICqsQF9SsAg-1; Wed, 13 Jul 2022 07:14:45 -0400 X-MC-Unique: ZYHSxVi8NUOICqsQF9SsAg-1 Received: by mail-ej1-f70.google.com with SMTP id hq20-20020a1709073f1400b0072b9824f0a2so576536ejc.23 for ; Wed, 13 Jul 2022 04:14:45 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=a+aE4ElkiLVX6Bh75bMcr/iPsxW7Ryr6p7928iEMT6Q=; b=WTEms5DhU4YYbOIC/tvLaJDOBM/y3ws1pzZW1BIBOJQz8tyZpf141H3fDpIwlSM+mw A6LJA5SfgDPYisOL+27ggtIKwdof7m6N44BU51U9pTIHlgKeakjRRGuaYNpVFRZ+6od6 7sHT1xkCyBIpLidP9BwqmnFv2lj3JyUSoTSJZC8LX1JnA70BsXaGCleXiCxTL94LW5uG Ow3KmDk1CORcpUM/t1nasVsQ+cKNczAGkRcfOpSJPFSLiVY1ncEdo+741V62E1Yi4Lyh O0ZpJQ3GL9IMfWYcjG++aiTM6U4E0ZNVZWxoXUiRCa6r+6nxxLx+TEUmP8DqKOynIi+c I8Hg== X-Gm-Message-State: AJIora+3hnFc/jIAYkMysK2wneWAypJMKeQA/lHYvlLb/H5dw7uc+y4D apkLHybMol3wI07VbQzSFGazH1GzA1pFfxjh1aBPoyaDAv2FWeYDx7fhaY2MwJK6ucRHwYEN7Zc RN9RKIcIZJON6u8xX X-Received: by 2002:a17:907:2c54:b0:72b:64bd:cbf7 with SMTP id hf20-20020a1709072c5400b0072b64bdcbf7mr2931134ejc.116.1657710884106; Wed, 13 Jul 2022 04:14:44 -0700 (PDT) X-Google-Smtp-Source: AGRyM1v/qsgYjx2KM8l53apx0996JUN9j69dzlnwkriCAi9iDR9M0YcdcMfUlfqbDuOTpWhT5b+BxA== X-Received: by 2002:a17:907:2c54:b0:72b:64bd:cbf7 with SMTP id hf20-20020a1709072c5400b0072b64bdcbf7mr2931091ejc.116.1657710883803; Wed, 13 Jul 2022 04:14:43 -0700 (PDT) Received: from alrua-x1.borgediget.toke.dk ([45.145.92.2]) by smtp.gmail.com with ESMTPSA id e23-20020a170906315700b00726c0e60940sm4878256eje.100.2022.07.13.04.14.40 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 13 Jul 2022 04:14:43 -0700 (PDT) Received: by alrua-x1.borgediget.toke.dk (Postfix, from userid 1000) id 1BB2D4D991C; Wed, 13 Jul 2022 13:14:40 +0200 (CEST) From: =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= To: Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau , Song Liu , Yonghong Song , John Fastabend , KP Singh , Stanislav Fomichev , Hao Luo , Jiri Olsa , Mykola Lysenko , "David S. 
Miller" , Jakub Kicinski , Jesper Dangaard Brouer Cc: Kumar Kartikeya Dwivedi , netdev@vger.kernel.org, bpf@vger.kernel.org, Freysteinn Alfredsson , Cong Wang , =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rg?= =?utf-8?q?ensen?= , Shuah Khan Subject: [RFC PATCH 15/17] selftests/bpf: Add verifier tests for dequeue prog Date: Wed, 13 Jul 2022 13:14:23 +0200 Message-Id: <20220713111430.134810-16-toke@redhat.com> X-Mailer: git-send-email 2.37.0 In-Reply-To: <20220713111430.134810-1-toke@redhat.com> References: <20220713111430.134810-1-toke@redhat.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC From: Kumar Kartikeya Dwivedi Test various cases of direct packet access (proper range propagation, comparison of packet pointers pointing into separate xdp_frames, and correct invalidation on packet drop (so that multiple packet pointers are usable safely in a dequeue program)). Signed-off-by: Kumar Kartikeya Dwivedi Signed-off-by: Toke Høiland-Jørgensen --- tools/testing/selftests/bpf/test_verifier.c | 29 +++- .../testing/selftests/bpf/verifier/dequeue.c | 160 ++++++++++++++++++ 2 files changed, 180 insertions(+), 9 deletions(-) create mode 100644 tools/testing/selftests/bpf/verifier/dequeue.c diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c index f9d553fbf68a..8d26ca96520b 100644 --- a/tools/testing/selftests/bpf/test_verifier.c +++ b/tools/testing/selftests/bpf/test_verifier.c @@ -55,7 +55,7 @@ #define MAX_UNEXPECTED_INSNS 32 #define MAX_TEST_INSNS 1000000 #define MAX_FIXUPS 8 -#define MAX_NR_MAPS 23 +#define MAX_NR_MAPS 24 #define MAX_TEST_RUNS 8 #define POINTER_VALUE 0xcafe4all #define TEST_DATA_LEN 64 @@ -131,6 +131,7 @@ struct bpf_test { int fixup_map_ringbuf[MAX_FIXUPS]; int fixup_map_timer[MAX_FIXUPS]; int fixup_map_kptr[MAX_FIXUPS]; + int fixup_map_pifo[MAX_FIXUPS]; struct kfunc_btf_id_pair fixup_kfunc_btf_id[MAX_FIXUPS]; /* Expected verifier log output for result REJECT or VERBOSE_ACCEPT. * Can be a tab-separated sequence of expected strings. An empty string @@ -145,6 +146,7 @@ struct bpf_test { ACCEPT, REJECT, VERBOSE_ACCEPT, + VERBOSE_REJECT, } result, result_unpriv; enum bpf_prog_type prog_type; uint8_t flags; @@ -546,11 +548,12 @@ static bool skip_unsupported_map(enum bpf_map_type map_type) static int __create_map(uint32_t type, uint32_t size_key, uint32_t size_value, uint32_t max_elem, - uint32_t extra_flags) + uint32_t extra_flags, uint64_t map_extra) { LIBBPF_OPTS(bpf_map_create_opts, opts); int fd; + opts.map_extra = map_extra; opts.map_flags = (type == BPF_MAP_TYPE_HASH ? 
BPF_F_NO_PREALLOC : 0) | extra_flags; fd = bpf_map_create(type, NULL, size_key, size_value, max_elem, &opts); if (fd < 0) { @@ -565,7 +568,7 @@ static int __create_map(uint32_t type, uint32_t size_key, static int create_map(uint32_t type, uint32_t size_key, uint32_t size_value, uint32_t max_elem) { - return __create_map(type, size_key, size_value, max_elem, 0); + return __create_map(type, size_key, size_value, max_elem, 0, 0); } static void update_map(int fd, int index) @@ -904,6 +907,7 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type, int *fixup_map_ringbuf = test->fixup_map_ringbuf; int *fixup_map_timer = test->fixup_map_timer; int *fixup_map_kptr = test->fixup_map_kptr; + int *fixup_map_pifo = test->fixup_map_pifo; struct kfunc_btf_id_pair *fixup_kfunc_btf_id = test->fixup_kfunc_btf_id; if (test->fill_helper) { @@ -1033,7 +1037,7 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type, if (*fixup_map_array_ro) { map_fds[14] = __create_map(BPF_MAP_TYPE_ARRAY, sizeof(int), sizeof(struct test_val), 1, - BPF_F_RDONLY_PROG); + BPF_F_RDONLY_PROG, 0); update_map(map_fds[14], 0); do { prog[*fixup_map_array_ro].imm = map_fds[14]; @@ -1043,7 +1047,7 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type, if (*fixup_map_array_wo) { map_fds[15] = __create_map(BPF_MAP_TYPE_ARRAY, sizeof(int), sizeof(struct test_val), 1, - BPF_F_WRONLY_PROG); + BPF_F_WRONLY_PROG, 0); update_map(map_fds[15], 0); do { prog[*fixup_map_array_wo].imm = map_fds[15]; @@ -1052,7 +1056,7 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type, } if (*fixup_map_array_small) { map_fds[16] = __create_map(BPF_MAP_TYPE_ARRAY, sizeof(int), - 1, 1, 0); + 1, 1, 0, 0); update_map(map_fds[16], 0); do { prog[*fixup_map_array_small].imm = map_fds[16]; @@ -1068,7 +1072,7 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type, } if (*fixup_map_event_output) { map_fds[18] = __create_map(BPF_MAP_TYPE_PERF_EVENT_ARRAY, - sizeof(int), sizeof(int), 1, 0); + sizeof(int), sizeof(int), 1, 0, 0); do { prog[*fixup_map_event_output].imm = map_fds[18]; fixup_map_event_output++; @@ -1076,7 +1080,7 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type, } if (*fixup_map_reuseport_array) { map_fds[19] = __create_map(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, - sizeof(u32), sizeof(u64), 1, 0); + sizeof(u32), sizeof(u64), 1, 0, 0); do { prog[*fixup_map_reuseport_array].imm = map_fds[19]; fixup_map_reuseport_array++; @@ -1104,6 +1108,13 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type, fixup_map_kptr++; } while (*fixup_map_kptr); } + if (*fixup_map_pifo) { + map_fds[23] = __create_map(BPF_MAP_TYPE_PIFO_XDP, sizeof(u32), sizeof(u32), 1, 0, 8); + do { + prog[*fixup_map_pifo].imm = map_fds[23]; + fixup_map_pifo++; + } while (*fixup_map_pifo); + } /* Patch in kfunc BTF IDs */ if (fixup_kfunc_btf_id->kfunc) { @@ -1490,7 +1501,7 @@ static void do_test_single(struct bpf_test *test, bool unpriv, test->errstr_unpriv : test->errstr; opts.expected_attach_type = test->expected_attach_type; - if (verbose) + if (verbose || expected_ret == VERBOSE_REJECT) opts.log_level = VERBOSE_LIBBPF_LOG_LEVEL; else if (expected_ret == VERBOSE_ACCEPT) opts.log_level = 2; diff --git a/tools/testing/selftests/bpf/verifier/dequeue.c b/tools/testing/selftests/bpf/verifier/dequeue.c new file mode 100644 index 000000000000..730f14395bcc --- /dev/null +++ b/tools/testing/selftests/bpf/verifier/dequeue.c @@ -0,0 
+1,160 @@ +{ + "dequeue: non-xdp_md retval", + .insns = { + BPF_MOV64_REG(BPF_REG_6, BPF_REG_1), + BPF_LD_MAP_FD(BPF_REG_2, 0), + BPF_MOV64_IMM(BPF_REG_3, 0), + BPF_MOV64_REG(BPF_REG_4, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, -8), + BPF_ST_MEM(BPF_DW, BPF_REG_4, 0, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_packet_dequeue), + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetof(struct xdp_md, data)), + BPF_EXIT_INSN(), + }, + .prog_type = BPF_PROG_TYPE_DEQUEUE, + .result = REJECT, + .errstr = "At program exit the register R0 must be NULL or referenced ptr_xdp_md", + .fixup_map_pifo = { 1 }, +}, +{ + "dequeue: NULL retval", + .insns = { + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .runs = -1, + .prog_type = BPF_PROG_TYPE_DEQUEUE, + .result = ACCEPT, +}, +{ + "dequeue: cannot access except data, data_end, data_meta", + .insns = { + BPF_MOV64_REG(BPF_REG_6, BPF_REG_1), + BPF_LD_MAP_FD(BPF_REG_2, 0), + BPF_MOV64_IMM(BPF_REG_3, 0), + BPF_MOV64_REG(BPF_REG_4, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, -8), + BPF_ST_MEM(BPF_DW, BPF_REG_4, 0, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_packet_dequeue), + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, offsetof(struct xdp_md, data)), + BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, offsetof(struct xdp_md, data_end)), + BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, offsetof(struct xdp_md, data_meta)), + BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, offsetof(struct xdp_md, ingress_ifindex)), + BPF_EXIT_INSN(), + }, + .prog_type = BPF_PROG_TYPE_DEQUEUE, + .result = REJECT, + .errstr = "no read support for xdp_md at off 12", + .fixup_map_pifo = { 1 }, +}, +{ + "dequeue: pkt_uid preserved when resetting range on rX += var", + .insns = { + BPF_MOV64_REG(BPF_REG_6, BPF_REG_1), + BPF_LD_MAP_FD(BPF_REG_2, 0), + BPF_MOV64_IMM(BPF_REG_3, 0), + BPF_MOV64_REG(BPF_REG_4, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, -8), + BPF_ST_MEM(BPF_DW, BPF_REG_4, 0, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_packet_dequeue), + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_6, offsetof(struct dequeue_ctx, egress_ifindex)), + BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetof(struct xdp_md, data)), + BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1), + BPF_EXIT_INSN(), + }, + .prog_type = BPF_PROG_TYPE_DEQUEUE, + .result = VERBOSE_REJECT, + .errstr = "13: (0f) r0 += r1 ; R0_w=pkt(id=3,off=0,r=0,pkt_uid=2", + .fixup_map_pifo = { 1 }, +}, +{ + "dequeue: dpa bad comparison", + .insns = { + BPF_MOV64_REG(BPF_REG_6, BPF_REG_1), + BPF_LD_MAP_FD(BPF_REG_2, 0), + BPF_MOV64_IMM(BPF_REG_3, 0), + BPF_MOV64_REG(BPF_REG_4, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, -8), + BPF_ST_MEM(BPF_DW, BPF_REG_4, 0, 0), + BPF_MOV64_REG(BPF_REG_8, BPF_REG_4), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_packet_dequeue), + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + BPF_MOV64_REG(BPF_REG_7, BPF_REG_0), + BPF_MOV64_REG(BPF_REG_1, BPF_REG_6), + BPF_LD_MAP_FD(BPF_REG_2, 0), + BPF_MOV64_IMM(BPF_REG_3, 0), + BPF_MOV64_REG(BPF_REG_4, BPF_REG_8), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_packet_dequeue), + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2), + BPF_MOV64_REG(BPF_REG_0, BPF_REG_7), + BPF_EXIT_INSN(), + BPF_MOV64_REG(BPF_REG_8, BPF_REG_0), 
+ BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_8, offsetof(struct xdp_md, data)), + BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_7, offsetof(struct xdp_md, data_end)), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8), + BPF_JMP_REG(BPF_JGE, BPF_REG_0, BPF_REG_1, 1), + BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0), + BPF_MOV64_REG(BPF_REG_1, BPF_REG_6), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_7), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_packet_drop), + BPF_MOV64_REG(BPF_REG_0, BPF_REG_8), + BPF_EXIT_INSN(), + }, + .prog_type = BPF_PROG_TYPE_DEQUEUE, + .result = REJECT, + .errstr = "R0, R1 pkt pointer comparison prohibited", + .fixup_map_pifo = { 1, 14 }, +}, +{ + "dequeue: dpa scoped range propagation", + .insns = { + BPF_MOV64_REG(BPF_REG_6, BPF_REG_1), + BPF_LD_MAP_FD(BPF_REG_2, 0), + BPF_MOV64_IMM(BPF_REG_3, 0), + BPF_MOV64_REG(BPF_REG_4, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, -8), + BPF_ST_MEM(BPF_DW, BPF_REG_4, 0, 0), + BPF_MOV64_REG(BPF_REG_8, BPF_REG_4), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_packet_dequeue), + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + BPF_MOV64_REG(BPF_REG_7, BPF_REG_0), + BPF_MOV64_REG(BPF_REG_1, BPF_REG_6), + BPF_LD_MAP_FD(BPF_REG_2, 0), + BPF_MOV64_IMM(BPF_REG_3, 0), + BPF_MOV64_REG(BPF_REG_4, BPF_REG_8), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_packet_dequeue), + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2), + BPF_MOV64_REG(BPF_REG_0, BPF_REG_7), + BPF_EXIT_INSN(), + BPF_MOV64_REG(BPF_REG_8, BPF_REG_0), + BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_8, offsetof(struct xdp_md, data)), + BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_8, offsetof(struct xdp_md, data_end)), + BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_7, offsetof(struct xdp_md, data)), + BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_7, offsetof(struct xdp_md, data_end)), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8), + BPF_JMP_REG(BPF_JGE, BPF_REG_0, BPF_REG_1, 1), + BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_2, 0), + BPF_MOV64_REG(BPF_REG_1, BPF_REG_6), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_7), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_packet_drop), + BPF_MOV64_REG(BPF_REG_0, BPF_REG_8), + BPF_EXIT_INSN(), + }, + .prog_type = BPF_PROG_TYPE_DEQUEUE, + .result = REJECT, + .errstr = "invalid access to packet, off=0 size=4, R2(id=0,off=0,r=0)", + .fixup_map_pifo = { 1, 14 }, +}, From patchwork Wed Jul 13 11:14:24 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= X-Patchwork-Id: 12916577 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8EB1DC433EF for ; Wed, 13 Jul 2022 11:15:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S236197AbiGMLPX (ORCPT ); Wed, 13 Jul 2022 07:15:23 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38956 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S236192AbiGMLPM (ORCPT ); Wed, 13 Jul 2022 07:15:12 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id 10D371014A1 for ; Wed, 13 Jul 2022 04:14:50 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1657710889; 
h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=JTualzM4oL2vgENjqe1wfSZOmJQzk4aA4xXaa8p3Kw8=; b=DVMhJ6bNITrKchWvmFOEl2vAsvDPkJgtrHgpyAfPRGN1UltaMeAvOuomJSpUPF8pCn4X7h 9pnGo5/wmGwlZQLqhUC15n50yyy0AWGF+pdwjxADQBI/kn5MWBuX1GhjQSR7TJf0Di/E5R DZZnz8sypanvQ6Y/sNnRAFDyT992q4c= Received: from mail-ed1-f72.google.com (mail-ed1-f72.google.com [209.85.208.72]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-513-iuVBm8WlPZug1oUhOWhI-g-1; Wed, 13 Jul 2022 07:14:48 -0400 X-MC-Unique: iuVBm8WlPZug1oUhOWhI-g-1 Received: by mail-ed1-f72.google.com with SMTP id t5-20020a056402524500b0043a923324b2so8239103edd.22 for ; Wed, 13 Jul 2022 04:14:48 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=JTualzM4oL2vgENjqe1wfSZOmJQzk4aA4xXaa8p3Kw8=; b=qwyt65WvHQGMzPyp/G/EROF+/DrnUEFJsFxvEAB1nlS+KitIR6Mx+8QE4FTS1ivev0 q1IYnEdzdBUhYfFHUdDbB9sRxedaApL+kJ/mjI6htkUUeBQWDtqIOTPZW/KLTWBe5rTy f+0IEvVZcvK8mLGKFTBdiTG1+CsBPGbJhCKwJbfSEB/N2iu4Pw1xz7T3zkCD58lmencl TGzmFKHZOJMEdbKjmmWBG/bqsoRAMyEHTNzwwuUlPs92cOBtBPxoosCp6YpaaUQvboJI qsKv8DMxxSdZhrZq24IXEnQEn09nJoSHhDI+jsj2unTeq3U07nbUspqzE+jBGJOOy6zK O7BA== X-Gm-Message-State: AJIora8oZ1mYUrF7+MzpibmsBD/olJ14gFf2clBmRfNVVSNgTopP14NC iPlmTeZd4pgbVjeRHnyYoWAhPXJYSjDFVsix4KbMK8lsFa/8nxtMohtfMyQPGcI3H4cqSOV+alI EWIVuOk/njTwyfiEu X-Received: by 2002:a05:6402:782:b0:43a:7387:39df with SMTP id d2-20020a056402078200b0043a738739dfmr4211345edy.251.1657710886100; Wed, 13 Jul 2022 04:14:46 -0700 (PDT) X-Google-Smtp-Source: AGRyM1vXnNWI44L6kygD3dzfFuwC7JWimjzIW9xbUM6ULQgOQCzHJd4S0QCisMYM3N+I+KwcoXga2Q== X-Received: by 2002:a05:6402:782:b0:43a:7387:39df with SMTP id d2-20020a056402078200b0043a738739dfmr4211225edy.251.1657710885100; Wed, 13 Jul 2022 04:14:45 -0700 (PDT) Received: from alrua-x1.borgediget.toke.dk ([2a0c:4d80:42:443::2]) by smtp.gmail.com with ESMTPSA id fs13-20020a170907600d00b0072b2f95d5d1sm4938507ejc.170.2022.07.13.04.14.41 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 13 Jul 2022 04:14:44 -0700 (PDT) Received: by alrua-x1.borgediget.toke.dk (Postfix, from userid 1000) id B1C0D4D991E; Wed, 13 Jul 2022 13:14:40 +0200 (CEST) From: =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= To: Alexei Starovoitov , Daniel Borkmann , "David S. 
Miller" , Jakub Kicinski , Jesper Dangaard Brouer , John Fastabend Cc: Kumar Kartikeya Dwivedi , netdev@vger.kernel.org, bpf@vger.kernel.org, Freysteinn Alfredsson , Cong Wang , =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rg?= =?utf-8?q?ensen?= , Andrii Nakryiko , Martin KaFai Lau , Song Liu , Yonghong Song , KP Singh , Stanislav Fomichev , Hao Luo , Jiri Olsa , Mykola Lysenko , Shuah Khan Subject: [RFC PATCH 16/17] selftests/bpf: Add test for XDP queueing through PIFO maps Date: Wed, 13 Jul 2022 13:14:24 +0200 Message-Id: <20220713111430.134810-17-toke@redhat.com> X-Mailer: git-send-email 2.37.0 In-Reply-To: <20220713111430.134810-1-toke@redhat.com> References: <20220713111430.134810-1-toke@redhat.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC This adds selftests for both variants of the generic PIFO map type, and for the dequeue program type. The XDP test uses bpf_prog_run() to run an XDP program that puts packets into a PIFO map, and then adds tests that pull them back out again through bpf_prog_run() of a dequeue program, as well as by attaching a dequeue program to a veth device and scheduling transmission there. Signed-off-by: Toke Høiland-Jørgensen --- .../selftests/bpf/prog_tests/pifo_map.c | 125 ++++++++++++++ .../bpf/prog_tests/xdp_pifo_test_run.c | 154 ++++++++++++++++++ tools/testing/selftests/bpf/progs/pifo_map.c | 54 ++++++ .../selftests/bpf/progs/test_xdp_pifo.c | 110 +++++++++++++ 4 files changed, 443 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/pifo_map.c create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_pifo_test_run.c create mode 100644 tools/testing/selftests/bpf/progs/pifo_map.c create mode 100644 tools/testing/selftests/bpf/progs/test_xdp_pifo.c diff --git a/tools/testing/selftests/bpf/prog_tests/pifo_map.c b/tools/testing/selftests/bpf/prog_tests/pifo_map.c new file mode 100644 index 000000000000..ae23bcc0683f --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/pifo_map.c @@ -0,0 +1,125 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include "pifo_map.skel.h" + +static int run_prog(int prog_fd, __u32 exp_retval) +{ + struct xdp_md ctx_in = {}; + char data[10] = {}; + DECLARE_LIBBPF_OPTS(bpf_test_run_opts, opts, + .data_in = data, + .data_size_in = sizeof(data), + .ctx_in = &ctx_in, + .ctx_size_in = sizeof(ctx_in), + .repeat = 1, + ); + int err; + + ctx_in.data_end = sizeof(data); + err = bpf_prog_test_run_opts(prog_fd, &opts); + if (!ASSERT_OK(err, "bpf_prog_test_run(valid)")) + return -1; + if (!ASSERT_EQ(opts.retval, exp_retval, "prog retval")) + return -1; + + return 0; +} + +static void check_map_counts(int map_fd, int start, int interval, int num, int exp_val) +{ + __u32 val, key, next_key, *kptr = NULL; + int i, err; + + for (i = 0; i < num; i++) { + err = bpf_map_get_next_key(map_fd, kptr, &next_key); + if (!ASSERT_OK(err, "bpf_map_get_next_key()")) + return; + + key = next_key; + kptr = &key; + + if (!ASSERT_EQ(key, start + i * interval, "expected key")) + break; + err = bpf_map_lookup_elem(map_fd, &key, &val); + if (!ASSERT_OK(err, "bpf_map_lookup_elem()")) + break; + if (!ASSERT_EQ(val, exp_val, "map value")) + break; + } +} + +static void run_enqueue_fail(struct pifo_map *skel, int start, int interval, __u32 exp_retval) +{ + int enqueue_fd; + + skel->bss->start = start; + skel->data->interval = interval; + + enqueue_fd = bpf_program__fd(skel->progs.pifo_enqueue); + + if (run_prog(enqueue_fd, exp_retval)) + return; 
+} + +static void run_test(struct pifo_map *skel, int start, int interval) +{ + int enqueue_fd, dequeue_fd; + + skel->bss->start = start; + skel->data->interval = interval; + + enqueue_fd = bpf_program__fd(skel->progs.pifo_enqueue); + dequeue_fd = bpf_program__fd(skel->progs.pifo_dequeue); + + if (run_prog(enqueue_fd, 0)) + return; + check_map_counts(bpf_map__fd(skel->maps.pifo_map), + skel->bss->start, skel->data->interval, + skel->rodata->num_entries, 1); + run_prog(dequeue_fd, 0); +} + +void test_pifo_map(void) +{ + struct pifo_map *skel = NULL; + int err; + + skel = pifo_map__open_and_load(); + if (!ASSERT_OK_PTR(skel, "skel")) + return; + + run_test(skel, 0, 1); + run_test(skel, 0, 10); + run_test(skel, 0, 100); + + /* do a series of runs that keep advancing the priority, to check that + * we can keep rorating the two internal maps + */ + run_test(skel, 0, 125); + run_test(skel, 1250, 1); + run_test(skel, 1250, 125); + + /* after rotating, starting enqueue at prio 0 will now fail */ + run_enqueue_fail(skel, 0, 1, -ERANGE); + + run_test(skel, 2500, 125); + run_test(skel, 3750, 125); + run_test(skel, 5000, 125); + + pifo_map__destroy(skel); + + /* reopen but change rodata */ + skel = pifo_map__open(); + if (!ASSERT_OK_PTR(skel, "open skel")) + return; + + skel->rodata->num_entries = 12; + err = pifo_map__load(skel); + if (!ASSERT_OK(err, "load skel")) + goto out; + + /* fails because the map is too small */ + run_enqueue_fail(skel, 0, 1, -EOVERFLOW); +out: + pifo_map__destroy(skel); +} diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_pifo_test_run.c b/tools/testing/selftests/bpf/prog_tests/xdp_pifo_test_run.c new file mode 100644 index 000000000000..bac029731eee --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/xdp_pifo_test_run.c @@ -0,0 +1,154 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include +#include + +#include "test_xdp_pifo.skel.h" + +#define SYS(fmt, ...) \ + ({ \ + char cmd[1024]; \ + snprintf(cmd, sizeof(cmd), fmt, ##__VA_ARGS__); \ + if (!ASSERT_OK(system(cmd), cmd)) \ + goto out; \ + }) + +static void run_xdp_prog(int prog_fd, void *data, size_t data_size, int repeat) +{ + struct xdp_md ctx_in = {}; + DECLARE_LIBBPF_OPTS(bpf_test_run_opts, opts, + .data_in = data, + .data_size_in = data_size, + .ctx_in = &ctx_in, + .ctx_size_in = sizeof(ctx_in), + .repeat = repeat, + .flags = BPF_F_TEST_XDP_LIVE_FRAMES, + ); + int err; + + ctx_in.data_end = ctx_in.data + sizeof(pkt_v4); + err = bpf_prog_test_run_opts(prog_fd, &opts); + ASSERT_OK(err, "bpf_prog_test_run(valid)"); +} + +static void run_dequeue_prog(int prog_fd, int exp_proto) +{ + struct ipv4_packet data_out; + DECLARE_LIBBPF_OPTS(bpf_test_run_opts, opts, + .data_out = &data_out, + .data_size_out = sizeof(data_out), + .repeat = 1, + ); + int err; + + err = bpf_prog_test_run_opts(prog_fd, &opts); + ASSERT_OK(err, "bpf_prog_test_run(valid)"); + ASSERT_EQ(opts.retval, exp_proto == -1 ? 
0 : 1, "valid-retval"); + if (exp_proto >= 0) { + ASSERT_EQ(opts.data_size_out, sizeof(pkt_v4), "valid-datasize"); + ASSERT_EQ(data_out.eth.h_proto, exp_proto, "valid-pkt"); + } else { + ASSERT_EQ(opts.data_size_out, 0, "no-pkt-returned"); + } +} + +void test_xdp_pifo(void) +{ + int xdp_prog_fd, dequeue_prog_fd, i; + struct test_xdp_pifo *skel = NULL; + struct ipv4_packet data; + + skel = test_xdp_pifo__open_and_load(); + if (!ASSERT_OK_PTR(skel, "skel")) + return; + + xdp_prog_fd = bpf_program__fd(skel->progs.xdp_pifo); + dequeue_prog_fd = bpf_program__fd(skel->progs.dequeue_pifo); + data = pkt_v4; + + run_xdp_prog(xdp_prog_fd, &data, sizeof(data), 3); + + /* kernel program queues packets with prio 2, 1, 0 (in that order), we + * should get back 0 and 1, and 2 should get dropped on dequeue + */ + run_dequeue_prog(dequeue_prog_fd, 0); + run_dequeue_prog(dequeue_prog_fd, 1); + run_dequeue_prog(dequeue_prog_fd, -1); + + xdp_prog_fd = bpf_program__fd(skel->progs.xdp_pifo_inc); + run_xdp_prog(xdp_prog_fd, &data, sizeof(data), 1024); + + skel->bss->pkt_count = 0; + skel->data->prio = 0; + skel->data->drop_above = 1024; + for (i = 0; i < 1024; i++) + run_dequeue_prog(dequeue_prog_fd, i*10); + + test_xdp_pifo__destroy(skel); +} + +void test_xdp_pifo_live(void) +{ + struct test_xdp_pifo *skel = NULL; + int err, ifindex_src, ifindex_dst; + int xdp_prog_fd, dequeue_prog_fd; + struct nstoken *nstoken = NULL; + struct ipv4_packet data; + struct bpf_link *link; + __u32 xdp_flags = XDP_FLAGS_DEQUEUE_MODE; + LIBBPF_OPTS(bpf_xdp_attach_opts, opts, + .old_prog_fd = -1); + + skel = test_xdp_pifo__open(); + if (!ASSERT_OK_PTR(skel, "skel")) + return; + + SYS("ip netns add testns"); + nstoken = open_netns("testns"); + if (!ASSERT_OK_PTR(nstoken, "setns")) + goto out; + + SYS("ip link add veth_src type veth peer name veth_dst"); + SYS("ip link set dev veth_src up"); + SYS("ip link set dev veth_dst up"); + + ifindex_src = if_nametoindex("veth_src"); + ifindex_dst = if_nametoindex("veth_dst"); + if (!ASSERT_NEQ(ifindex_src, 0, "ifindex_src") || + !ASSERT_NEQ(ifindex_dst, 0, "ifindex_dst")) + goto out; + + skel->bss->tgt_ifindex = ifindex_src; + skel->data->drop_above = 3; + + err = test_xdp_pifo__load(skel); + ASSERT_OK(err, "load skel"); + + link = bpf_program__attach_xdp(skel->progs.xdp_check_pkt, ifindex_dst); + if (!ASSERT_OK_PTR(link, "prog_attach")) + goto out; + skel->links.xdp_check_pkt = link; + + xdp_prog_fd = bpf_program__fd(skel->progs.xdp_pifo); + dequeue_prog_fd = bpf_program__fd(skel->progs.dequeue_pifo); + data = pkt_v4; + + err = bpf_xdp_attach(ifindex_src, dequeue_prog_fd, xdp_flags, &opts); + if (!ASSERT_OK(err, "attach-dequeue")) + goto out; + + run_xdp_prog(xdp_prog_fd, &data, sizeof(data), 3); + + /* wait for the packets to be flushed */ + kern_sync_rcu(); + + ASSERT_EQ(skel->bss->seen_good_pkts, 3, "live packets OK"); + + opts.old_prog_fd = dequeue_prog_fd; + err = bpf_xdp_attach(ifindex_src, -1, xdp_flags, &opts); + ASSERT_OK(err, "dequeue-detach"); + +out: + test_xdp_pifo__destroy(skel); +} diff --git a/tools/testing/selftests/bpf/progs/pifo_map.c b/tools/testing/selftests/bpf/progs/pifo_map.c new file mode 100644 index 000000000000..b27bc2d0de03 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/pifo_map.c @@ -0,0 +1,54 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include + +struct { + __uint(type, BPF_MAP_TYPE_PIFO_GENERIC); + __uint(key_size, sizeof(__u32)); + __uint(value_size, sizeof(__u32)); + __uint(max_entries, 10); + __uint(map_extra, 1024); /* range */ +} 
pifo_map SEC(".maps"); + +const volatile int num_entries = 10; +volatile int interval = 10; +volatile int start = 0; + +SEC("xdp") +int pifo_dequeue(struct xdp_md *xdp) +{ + __u32 val, exp; + int i, ret; + + for (i = 0; i < num_entries; i++) { + exp = start + i * interval; + ret = bpf_map_pop_elem(&pifo_map, &val); + if (ret) + return ret; + if (val != exp) + return 1; + } + + return 0; +} + +SEC("xdp") +int pifo_enqueue(struct xdp_md *xdp) +{ + __u64 flags; + __u32 val; + int i, ret; + + for (i = num_entries - 1; i >= 0; i--) { + val = start + i * interval; + flags = val; + ret = bpf_map_push_elem(&pifo_map, &val, flags); + if (ret) + return ret; + } + + return 0; +} + +char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/test_xdp_pifo.c b/tools/testing/selftests/bpf/progs/test_xdp_pifo.c new file mode 100644 index 000000000000..702611e0cd1a --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_xdp_pifo.c @@ -0,0 +1,110 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include + +struct { + __uint(type, BPF_MAP_TYPE_PIFO_XDP); + __uint(key_size, sizeof(__u32)); + __uint(value_size, sizeof(__u32)); + __uint(max_entries, 1024); + __uint(map_extra, 8192); /* range */ +} pifo_map SEC(".maps"); + +__u16 prio = 3; +int tgt_ifindex = 0; + +SEC("xdp") +int xdp_pifo(struct xdp_md *xdp) +{ + void *data = (void *)(long)xdp->data; + void *data_end = (void *)(long)xdp->data_end; + struct ethhdr *eth = data; + int ret; + + if (eth + 1 > data_end) + return XDP_DROP; + + /* We write the priority into the ethernet proto field so userspace can + * pick it back out and confirm that it's correct + */ + eth->h_proto = --prio; + ret = bpf_redirect_map(&pifo_map, prio, 0); + if (tgt_ifindex && ret == XDP_REDIRECT) + bpf_schedule_iface_dequeue(xdp, tgt_ifindex, 0); + return ret; +} + +__u16 check_prio = 0; +__u16 seen_good_pkts = 0; + +SEC("xdp") +int xdp_check_pkt(struct xdp_md *xdp) +{ + void *data = (void *)(long)xdp->data; + void *data_end = (void *)(long)xdp->data_end; + struct ethhdr *eth = data; + + if (eth + 1 > data_end) + return XDP_DROP; + + if (eth->h_proto == check_prio) { + check_prio++; + seen_good_pkts++; + return XDP_DROP; + } + + return XDP_PASS; +} + +SEC("xdp") +int xdp_pifo_inc(struct xdp_md *xdp) +{ + void *data = (void *)(long)xdp->data; + void *data_end = (void *)(long)xdp->data_end; + struct ethhdr *eth = data; + int ret; + + if (eth + 1 > data_end) + return XDP_DROP; + + /* We write the priority into the ethernet proto field so userspace can + * pick it back out and confirm that it's correct + */ + eth->h_proto = prio; + ret = bpf_redirect_map(&pifo_map, prio, 0); + prio += 10; + return ret; +} + +__u16 pkt_count = 0; +__u16 drop_above = 2; + +SEC("dequeue") +void *dequeue_pifo(struct dequeue_ctx *ctx) +{ + __u64 prio = 0, pkt_prio = 0; + void *data, *data_end; + struct xdp_md *pkt; + struct ethhdr *eth; + + pkt = (void *)bpf_packet_dequeue(ctx, &pifo_map, 0, &prio); + if (!pkt) + return NULL; + + data = (void *)(long)pkt->data; + data_end = (void *)(long)pkt->data_end; + eth = data; + + if (eth + 1 <= data_end) + pkt_prio = eth->h_proto; + + if (pkt_prio != prio || ++pkt_count > drop_above) { + bpf_packet_drop(ctx, pkt); + return NULL; + } + + return pkt; +} + +char _license[] SEC("license") = "GPL"; From patchwork Wed Jul 13 11:14:25 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= X-Patchwork-Id: 12916583 
X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 222CDCCA485 for ; Wed, 13 Jul 2022 11:18:35 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S236196AbiGMLSd (ORCPT ); Wed, 13 Jul 2022 07:18:33 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:42730 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S236060AbiGMLS3 (ORCPT ); Wed, 13 Jul 2022 07:18:29 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id 9180BF5116 for ; Wed, 13 Jul 2022 04:18:27 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1657711106; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=BDuTtgPYXERcYe6YSrw64YmlYChWKjHDpeR4it4TBh0=; b=cfYIunHcx/lRCCUjPuBoQ1mDXnmTKFwRtT9S9lV+fUiAUfNr5N3d6XxyiR7wIFNN7vfYI1 RpDwcIu6kG95FjdOLkXUy2ftJxMKo0MQq4l+YjhO+yWuujxKONc3HIt22KLkXo/tzpDxy4 pAKmCJxANIzmY5qYwGeZA/3oC3PS8pE= Received: from mail-ej1-f71.google.com (mail-ej1-f71.google.com [209.85.218.71]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-639-cGCtARgkM3SO2IOeOwHF1g-1; Wed, 13 Jul 2022 07:18:25 -0400 X-MC-Unique: cGCtARgkM3SO2IOeOwHF1g-1 Received: by mail-ej1-f71.google.com with SMTP id nc23-20020a1709071c1700b0072b94109144so765856ejc.2 for ; Wed, 13 Jul 2022 04:18:25 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=BDuTtgPYXERcYe6YSrw64YmlYChWKjHDpeR4it4TBh0=; b=OXqplW+rZubiiWcLlutHKe/CMJmPhEtpO4mIGgTsiOeDLct0Vnf2mbFWwfoqu7g2GM kFtvcSR8mHd/8fkJKJfMerw93hVmvQEgjjxh2uQmevLy1TEpLNMnNONjVRAbx/j6vvlK AREpk0q8A1iScBNziFC1kp09fQbv8QKqCGsnnQn69Pi8HwKqRmQuMOaVK2cyLHVmnV0Z 1zKDhsjEdbnWJanWLw0ZtSNdsL1zFFSIA9YPFhGva4VfHQTXuMbCk/O/QakpQp2eCZ3t LZ5Xwav+9MVLdYRFiA3sgwwR2gETfIgPi3QGsMt5OxTrTp3HkGOFZLzd1CM5cY36J6d0 IbGg== X-Gm-Message-State: AJIora+Rcem/om8F2ao0CNpQec+FfU4xreovOFY5Dchwtjww2ncq/n/X VZT0lm33azbaCFTGGAa7d800LnWTrfZoKSVoQZ8bsZpNwSteI6aQlVRMS7+liruQnEnoQ5tNNeG lZN3x/lVxUTU+PtJy X-Received: by 2002:a17:907:7b92:b0:72b:67fb:8985 with SMTP id ne18-20020a1709077b9200b0072b67fb8985mr2760321ejc.569.1657711104138; Wed, 13 Jul 2022 04:18:24 -0700 (PDT) X-Google-Smtp-Source: AGRyM1sXNO5rg/tDHkU+OLM52NxMejHM8BDcFzfxdTh7py1+Z9YBRZeZ/+C2bPXp9t6qVrBQAFx22g== X-Received: by 2002:a17:907:7b92:b0:72b:67fb:8985 with SMTP id ne18-20020a1709077b9200b0072b67fb8985mr2760280ejc.569.1657711103712; Wed, 13 Jul 2022 04:18:23 -0700 (PDT) Received: from alrua-x1.borgediget.toke.dk ([45.145.92.2]) by smtp.gmail.com with ESMTPSA id b4-20020a17090630c400b006fe0abb00f0sm4839488ejb.209.2022.07.13.04.18.22 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 13 Jul 2022 04:18:22 -0700 (PDT) Received: by alrua-x1.borgediget.toke.dk (Postfix, from userid 1000) id 210254D9920; Wed, 13 Jul 2022 13:14:41 +0200 (CEST) From: =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rgensen?= 
To: Alexei Starovoitov , Daniel Borkmann , "David S. Miller" , Jakub Kicinski , Jesper Dangaard Brouer , John Fastabend , Andrii Nakryiko , Martin KaFai Lau , Song Liu , Yonghong Song , KP Singh , Stanislav Fomichev , Hao Luo , Jiri Olsa Cc: Kumar Kartikeya Dwivedi , netdev@vger.kernel.org, bpf@vger.kernel.org, Freysteinn Alfredsson , Cong Wang , =?utf-8?q?Toke_H=C3=B8iland-J=C3=B8rg?= =?utf-8?q?ensen?= Subject: [RFC PATCH 17/17] samples/bpf: Add queueing support to xdp_fwd sample Date: Wed, 13 Jul 2022 13:14:25 +0200 Message-Id: <20220713111430.134810-18-toke@redhat.com> X-Mailer: git-send-email 2.37.0 In-Reply-To: <20220713111430.134810-1-toke@redhat.com> References: <20220713111430.134810-1-toke@redhat.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC Add support for queueing packets before forwarding them to the xdp_fwd sample. This is meant to serve as an example (for the RFC series) of how one could add queueing to a forwarding application. It doesn't actually implement any fancy queueing algorithms, it just uses the queue maps to do simple FIFO queueing, instantiating one queue map per interface. Signed-off-by: Toke Høiland-Jørgensen --- samples/bpf/xdp_fwd_kern.c | 65 +++++++++++- samples/bpf/xdp_fwd_user.c | 200 +++++++++++++++++++++++++++---------- 2 files changed, 205 insertions(+), 60 deletions(-) diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c index 54c099cbd639..125adb02c658 100644 --- a/samples/bpf/xdp_fwd_kern.c +++ b/samples/bpf/xdp_fwd_kern.c @@ -23,6 +23,14 @@ #define IPV6_FLOWINFO_MASK cpu_to_be32(0x0FFFFFFF) +struct pifo_map { + __uint(type, BPF_MAP_TYPE_PIFO_XDP); + __uint(key_size, sizeof(__u32)); + __uint(value_size, sizeof(__u32)); + __uint(max_entries, 1024); + __uint(map_extra, 8192); /* range */ +} pmap SEC(".maps"); + struct { __uint(type, BPF_MAP_TYPE_DEVMAP); __uint(key_size, sizeof(int)); @@ -30,6 +38,13 @@ struct { __uint(max_entries, 64); } xdp_tx_ports SEC(".maps"); +struct { + __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS); + __uint(key_size, sizeof(__u32)); + __uint(max_entries, 64); + __array(values, struct pifo_map); +} pifo_maps SEC(".maps"); + /* from include/net/ip.h */ static __always_inline int ip_decrease_ttl(struct iphdr *iph) { @@ -40,7 +55,7 @@ static __always_inline int ip_decrease_ttl(struct iphdr *iph) return --iph->ttl; } -static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags) +static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags, bool queue) { void *data_end = (void *)(long)ctx->data_end; void *data = (void *)(long)ctx->data; @@ -137,22 +152,62 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags) memcpy(eth->h_dest, fib_params.dmac, ETH_ALEN); memcpy(eth->h_source, fib_params.smac, ETH_ALEN); + + if (queue) { + void *ptr; + int ret; + + ptr = bpf_map_lookup_elem(&pifo_maps, &fib_params.ifindex); + if (!ptr) + return XDP_DROP; + + ret = bpf_redirect_map(ptr, 0, 0); + if (ret == XDP_REDIRECT) + bpf_schedule_iface_dequeue(ctx, fib_params.ifindex, 0); + return ret; + } + return bpf_redirect_map(&xdp_tx_ports, fib_params.ifindex, 0); } return XDP_PASS; } -SEC("xdp_fwd") +SEC("xdp") int xdp_fwd_prog(struct xdp_md *ctx) { - return xdp_fwd_flags(ctx, 0); + return xdp_fwd_flags(ctx, 0, false); } -SEC("xdp_fwd_direct") +SEC("xdp") int xdp_fwd_direct_prog(struct xdp_md *ctx) { - return xdp_fwd_flags(ctx, BPF_FIB_LOOKUP_DIRECT); + return xdp_fwd_flags(ctx, BPF_FIB_LOOKUP_DIRECT, 
false); +} + +SEC("xdp") +int xdp_fwd_queue(struct xdp_md *ctx) +{ + return xdp_fwd_flags(ctx, 0, true); +} + +SEC("dequeue") +void *xdp_dequeue(struct dequeue_ctx *ctx) +{ + __u32 ifindex = ctx->egress_ifindex; + struct xdp_md *pkt; + __u64 prio = 0; + void *pifo_ptr; + + pifo_ptr = bpf_map_lookup_elem(&pifo_maps, &ifindex); + if (!pifo_ptr) + return NULL; + + pkt = (void *)bpf_packet_dequeue(ctx, pifo_ptr, 0, &prio); + if (!pkt) + return NULL; + + return pkt; } char _license[] SEC("license") = "GPL"; diff --git a/samples/bpf/xdp_fwd_user.c b/samples/bpf/xdp_fwd_user.c index 84f57f1209ce..ec3f29d0babe 100644 --- a/samples/bpf/xdp_fwd_user.c +++ b/samples/bpf/xdp_fwd_user.c @@ -11,6 +11,7 @@ * General Public License for more details. */ +#include "linux/if_link.h" #include #include #include @@ -29,66 +30,122 @@ static __u32 xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST; -static int do_attach(int idx, int prog_fd, int map_fd, const char *name) +#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0])) + +const char *redir_prog_names[] = { + "xdp_fwd_prog", + "xdp_fwd_direct_", /* name truncated to BPF_OBJ_NAME_LEN */ + "xdp_fwd_queue", +}; + +const char *dequeue_prog_names[] = { + "xdp_dequeue" +}; + +static int do_attach(int idx, int redir_prog_fd, int dequeue_prog_fd, + int redir_map_fd, int pifos_map_fd, const char *name) { int err; - err = bpf_xdp_attach(idx, prog_fd, xdp_flags, NULL); + if (pifos_map_fd > -1) { + LIBBPF_OPTS(bpf_map_create_opts, map_opts, .map_extra = 8192); + char map_name[BPF_OBJ_NAME_LEN]; + int pifo_fd; + + snprintf(map_name, sizeof(map_name), "pifo_%d", idx); + map_name[BPF_OBJ_NAME_LEN - 1] = '\0'; + + pifo_fd = bpf_map_create(BPF_MAP_TYPE_PIFO_XDP, map_name, + sizeof(__u32), sizeof(__u32), 10240, &map_opts); + if (pifo_fd < 0) { + err = -errno; + printf("ERROR: Couldn't create PIFO map: %s\n", strerror(-err)); + return err; + } + + err = bpf_map_update_elem(pifos_map_fd, &idx, &pifo_fd, 0); + if (err) + printf("ERROR: failed adding PIFO map for device %s\n", name); + } + + if (dequeue_prog_fd > -1) { + LIBBPF_OPTS(bpf_xdp_attach_opts, prog_opts, .old_prog_fd = -1); + + err = bpf_xdp_attach(idx, dequeue_prog_fd, + (XDP_FLAGS_DEQUEUE_MODE | XDP_FLAGS_REPLACE), + &prog_opts); + if (err < 0) { + printf("ERROR: failed to attach dequeue program to %s\n", name); + return err; + } + } + + err = bpf_xdp_attach(idx, redir_prog_fd, xdp_flags, NULL); if (err < 0) { - printf("ERROR: failed to attach program to %s\n", name); + printf("ERROR: failed to attach redir program to %s\n", name); return err; } /* Adding ifindex as a possible egress TX port */ - err = bpf_map_update_elem(map_fd, &idx, &idx, 0); + err = bpf_map_update_elem(redir_map_fd, &idx, &idx, 0); if (err) printf("ERROR: failed using device %s as TX-port\n", name); return err; } +static bool should_detach(__u32 prog_fd, const char **prog_names, int num_prog_names) +{ + struct bpf_prog_info prog_info = {}; + __u32 info_len = sizeof(prog_info); + int err, i; + + err = bpf_obj_get_info_by_fd(prog_fd, &prog_info, &info_len); + if (err) { + printf("ERROR: bpf_obj_get_info_by_fd failed (%s)\n", + strerror(errno)); + return false; + } + + for (i = 0; i < num_prog_names; i++) + if (!strcmp(prog_info.name, prog_names[i])) + return true; + + return false; +} + static int do_detach(int ifindex, const char *ifname, const char *app_name) { LIBBPF_OPTS(bpf_xdp_attach_opts, opts); - struct bpf_prog_info prog_info = {}; - char prog_name[BPF_OBJ_NAME_LEN]; - __u32 info_len, curr_prog_id; - int prog_fd; - int err = 1; + 
LIBBPF_OPTS(bpf_xdp_query_opts, query_opts); + int prog_fd, err = 1; + __u32 curr_prog_id; - if (bpf_xdp_query_id(ifindex, xdp_flags, &curr_prog_id)) { + if (bpf_xdp_query(ifindex, xdp_flags, &query_opts)) { printf("ERROR: bpf_xdp_query_id failed (%s)\n", strerror(errno)); return err; } + curr_prog_id = (xdp_flags & XDP_FLAGS_SKB_MODE) ? query_opts.skb_prog_id + : query_opts.drv_prog_id; if (!curr_prog_id) { printf("ERROR: flags(0x%x) xdp prog is not attached to %s\n", xdp_flags, ifname); return err; } - info_len = sizeof(prog_info); prog_fd = bpf_prog_get_fd_by_id(curr_prog_id); if (prog_fd < 0) { printf("ERROR: bpf_prog_get_fd_by_id failed (%s)\n", strerror(errno)); - return prog_fd; - } - - err = bpf_obj_get_info_by_fd(prog_fd, &prog_info, &info_len); - if (err) { - printf("ERROR: bpf_obj_get_info_by_fd failed (%s)\n", - strerror(errno)); - goto close_out; + return err; } - snprintf(prog_name, sizeof(prog_name), "%s_prog", app_name); - prog_name[BPF_OBJ_NAME_LEN - 1] = '\0'; - if (strcmp(prog_info.name, prog_name)) { + if (!should_detach(prog_fd, redir_prog_names, ARRAY_SIZE(redir_prog_names))) { printf("ERROR: %s isn't attached to %s\n", app_name, ifname); - err = 1; - goto close_out; + close(prog_fd); + return 1; } opts.old_prog_fd = prog_fd; @@ -96,11 +153,34 @@ static int do_detach(int ifindex, const char *ifname, const char *app_name) if (err < 0) printf("ERROR: failed to detach program from %s (%s)\n", ifname, strerror(errno)); - /* TODO: Remember to cleanup map, when adding use of shared map + + close(prog_fd); + + if (query_opts.dequeue_prog_id) { + prog_fd = bpf_prog_get_fd_by_id(query_opts.dequeue_prog_id); + if (prog_fd < 0) { + printf("ERROR: bpf_prog_get_fd_by_id failed (%s)\n", + strerror(errno)); + return err; + } + + if (!should_detach(prog_fd, dequeue_prog_names, ARRAY_SIZE(dequeue_prog_names))) { + close(prog_fd); + return err; + } + + opts.old_prog_fd = prog_fd; + err = bpf_xdp_detach(ifindex, + (XDP_FLAGS_DEQUEUE_MODE | XDP_FLAGS_REPLACE), + &opts); + if (err < 0) + printf("ERROR: failed to detach dequeue program from %s (%s)\n", + ifname, strerror(errno)); + } + + /* todo: Remember to cleanup map, when adding use of shared map * bpf_map_delete_elem((map_fd, &idx); */ -close_out: - close(prog_fd); return err; } @@ -112,24 +192,23 @@ static void usage(const char *prog) " -d detach program\n" " -S use skb-mode\n" " -F force loading prog\n" - " -D direct table lookups (skip fib rules)\n", + " -D direct table lookups (skip fib rules)\n" + " -Q direct table lookups (skip fib rules)\n", prog); } int main(int argc, char **argv) { - const char *prog_name = "xdp_fwd"; - struct bpf_program *prog = NULL; - struct bpf_program *pos; - const char *sec_name; - int prog_fd = -1, map_fd = -1; + int redir_prog_fd = -1, dequeue_prog_fd = -1, redir_map_fd = -1, pifos_map_fd = -1; + const char *prog_name = "xdp_fwd_prog"; char filename[PATH_MAX]; struct bpf_object *obj; int opt, i, idx, err; + bool queue = false; int attach = 1; int ret = 0; - while ((opt = getopt(argc, argv, ":dDSF")) != -1) { + while ((opt = getopt(argc, argv, ":dDQSF")) != -1) { switch (opt) { case 'd': attach = 0; @@ -141,7 +220,11 @@ int main(int argc, char **argv) xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST; break; case 'D': - prog_name = "xdp_fwd_direct"; + prog_name = "xdp_fwd_direct_prog"; + break; + case 'Q': + prog_name = "xdp_fwd_queue"; + queue = true; break; default: usage(basename(argv[0])); @@ -170,9 +253,6 @@ int main(int argc, char **argv) if (libbpf_get_error(obj)) return 1; - prog = 
bpf_object__next_program(obj, NULL); - bpf_program__set_type(prog, BPF_PROG_TYPE_XDP); - err = bpf_object__load(obj); if (err) { printf("Does kernel support devmap lookup?\n"); @@ -181,25 +261,34 @@ int main(int argc, char **argv) */ return 1; } - - bpf_object__for_each_program(pos, obj) { - sec_name = bpf_program__section_name(pos); - if (sec_name && !strcmp(sec_name, prog_name)) { - prog = pos; - break; - } - } - prog_fd = bpf_program__fd(prog); - if (prog_fd < 0) { - printf("program not found: %s\n", strerror(prog_fd)); + redir_prog_fd = bpf_program__fd(bpf_object__find_program_by_name(obj, + prog_name)); + if (redir_prog_fd < 0) { + printf("program not found: %s\n", strerror(redir_prog_fd)); return 1; } - map_fd = bpf_map__fd(bpf_object__find_map_by_name(obj, - "xdp_tx_ports")); - if (map_fd < 0) { - printf("map not found: %s\n", strerror(map_fd)); + + redir_map_fd = bpf_map__fd(bpf_object__find_map_by_name(obj, + "xdp_tx_ports")); + if (redir_map_fd < 0) { + printf("map not found: %s\n", strerror(redir_map_fd)); return 1; } + + if (queue) { + dequeue_prog_fd = bpf_program__fd(bpf_object__find_program_by_name(obj, + "xdp_dequeue")); + if (dequeue_prog_fd < 0) { + printf("dequeue program not found: %s\n", + strerror(-dequeue_prog_fd)); + return 1; + } + pifos_map_fd = bpf_map__fd(bpf_object__find_map_by_name(obj, "pifo_maps")); + if (pifos_map_fd < 0) { + printf("map not found: %s\n", strerror(-pifos_map_fd)); + return 1; + } + } } for (i = optind; i < argc; ++i) { @@ -212,11 +301,12 @@ int main(int argc, char **argv) return 1; } if (!attach) { - err = do_detach(idx, argv[i], prog_name); + err = do_detach(idx, argv[i], argv[0]); if (err) ret = err; } else { - err = do_attach(idx, prog_fd, map_fd, argv[i]); + err = do_attach(idx, redir_prog_fd, dequeue_prog_fd, + redir_map_fd, pifos_map_fd, argv[i]); if (err) ret = err; }
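A usage note on the sample, based on the code above rather than the patch description: the new -Q switch selects the xdp_fwd_queue program and enables the queueing path (the help string still repeats the -D description). With -Q, the loader creates one PIFO map per listed interface, inserts it into the pifo_maps map-of-maps keyed by ifindex, and attaches the xdp_dequeue program in dequeue mode alongside the regular redirect program. Assuming two hypothetical interfaces eth0 and eth1, an invocation would look like "./xdp_fwd -Q eth0 eth1", and "./xdp_fwd -d eth0 eth1" detaches both the redirect and the dequeue programs again.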