From patchwork Fri Feb 11 07:12:38 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12742935 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C775AC433EF for ; Fri, 11 Feb 2022 07:13:05 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1347597AbiBKHNE (ORCPT ); Fri, 11 Feb 2022 02:13:04 -0500 Received: from mxb-00190b01.gslb.pphosted.com ([23.128.96.19]:36250 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1347594AbiBKHM5 (ORCPT ); Fri, 11 Feb 2022 02:12:57 -0500 Received: from mx0b-00082601.pphosted.com (mx0b-00082601.pphosted.com [67.231.153.30]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 93B8710BE for ; Thu, 10 Feb 2022 23:12:56 -0800 (PST) Received: from pps.filterd (m0109331.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.1.2/8.16.1.2) with ESMTP id 21ANrYmP018278 for ; Thu, 10 Feb 2022 23:12:55 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : content-type : content-transfer-encoding : mime-version; s=facebook; bh=vgpYGO1YkEQ34WOQoYqpLWFVMysT+QGf6gsCwqC096M=; b=HGSpabQa6/1NuCUEOn62l6jmuJVoCU7QRlQQzDbPFYm2klhs/9PFuzG3Rn54gxCaCzS2 i6cSa3YXMovcehb/iFpAKP/JxyT9FNtOh+Pm9MAyoK8dBnRuqLTHzpSgUMAQONJpMmon GvW8yvO/wBNz8BRC/iqJONyVwfO7d4dIbE8= Received: from mail.thefacebook.com ([163.114.132.120]) by mx0a-00082601.pphosted.com (PPS) with ESMTPS id 3e592ukfrk-5 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Thu, 10 Feb 2022 23:12:55 -0800 Received: from twshared0983.05.ash8.facebook.com (2620:10d:c085:108::4) by mail.thefacebook.com (2620:10d:c085:11d::5) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2308.21; Thu, 10 Feb 2022 23:12:43 -0800 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 973146C756E9; Thu, 10 Feb 2022 23:12:38 -0800 (PST) From: Martin KaFai Lau To: , CC: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , David Miller , Eric Dumazet , Jakub Kicinski , , Willem de Bruijn Subject: [PATCH v4 net-next 1/8] net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp Date: Thu, 10 Feb 2022 23:12:38 -0800 Message-ID: <20220211071238.885669-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220211071232.885225-1-kafai@fb.com> References: <20220211071232.885225-1-kafai@fb.com> X-FB-Internal: Safe X-Proofpoint-GUID: K0e099wPFfQbYOKnggL8W-pDysOAj84T X-Proofpoint-ORIG-GUID: K0e099wPFfQbYOKnggL8W-pDysOAj84T X-Proofpoint-UnRewURL: 0 URL was un-rewritten MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.816,Hydra:6.0.425,FMLib:17.11.62.513 definitions=2022-02-11_02,2022-02-09_01,2021-12-02_01 X-Proofpoint-Spam-Details: rule=fb_outbound_notspam policy=fb_outbound score=0 malwarescore=0 suspectscore=0 bulkscore=0 mlxlogscore=999 clxscore=1015 priorityscore=1501 lowpriorityscore=0 spamscore=0 impostorscore=0 phishscore=0 adultscore=0 mlxscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2201110000 definitions=main-2202110040 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org skb->tstamp was first used as the (rcv) timestamp. The major usage is to report it to the user (e.g. SO_TIMESTAMP). Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP) during egress and used by the qdisc (e.g. sch_fq) to make decision on when the skb can be passed to the dev. Currently, there is no way to tell skb->tstamp having the (rcv) timestamp or the delivery_time, so it is always reset to 0 whenever forwarded between egress and ingress. While it makes sense to always clear the (rcv) timestamp in skb->tstamp to avoid confusing sch_fq that expects the delivery_time, it is a performance issue [0] to clear the delivery_time if the skb finally egress to a fq@phy-dev. For example, when forwarding from egress to ingress and then finally back to egress: tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns ^ ^ reset rest This patch adds one bit skb->mono_delivery_time to flag the skb->tstamp is storing the mono delivery_time (EDT) instead of the (rcv) timestamp. The current use case is to keep the TCP mono delivery_time (EDT) and to be used with sch_fq. A later patch will also allow tc-bpf@ingress to read and change the mono delivery_time. In the future, another bit (e.g. skb->user_delivery_time) can be added for the SCM_TXTIME where the clock base is tracked by sk->sk_clockid. [ This patch is a prep work. The following patch will get the other parts of the stack ready first. Then another patch after that will finally set the skb->mono_delivery_time. ] skb_set_delivery_time() function is added. It is used by the tcp_output.c and during ip[6] fragmentation to assign the delivery_time to the skb->tstamp and also set the skb->mono_delivery_time. A note on the change in ip_send_unicast_reply() in ip_output.c. It is only used by TCP to send reset/ack out of a ctl_sk. Like the new skb_set_delivery_time(), this patch sets the skb->mono_delivery_time to 0 for now as a place holder. It will be enabled in a later patch. A similar case in tcp_ipv6 can be done with skb_set_delivery_time() in tcp_v6_send_response(). [0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf Signed-off-by: Martin KaFai Lau --- include/linux/skbuff.h | 13 +++++++++++++ net/bridge/netfilter/nf_conntrack_bridge.c | 5 +++-- net/ipv4/ip_output.c | 7 +++++-- net/ipv4/tcp_output.c | 16 +++++++++------- net/ipv6/ip6_output.c | 5 +++-- net/ipv6/netfilter.c | 5 +++-- net/ipv6/tcp_ipv6.c | 2 +- 7 files changed, 37 insertions(+), 16 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index a5adbf6b51e8..32c793de3801 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -747,6 +747,10 @@ typedef unsigned char *sk_buff_data_t; * @dst_pending_confirm: need to confirm neighbour * @decrypted: Decrypted SKB * @slow_gro: state present at GRO time, slower prepare step required + * @mono_delivery_time: When set, skb->tstamp has the + * delivery_time in mono clock base (i.e. EDT). Otherwise, the + * skb->tstamp has the (rcv) timestamp at ingress and + * delivery_time at egress. * @napi_id: id of the NAPI struct this skb came from * @sender_cpu: (aka @napi_id) source CPU in XPS * @secmark: security marking @@ -917,6 +921,7 @@ struct sk_buff { __u8 decrypted:1; #endif __u8 slow_gro:1; + __u8 mono_delivery_time:1; #ifdef CONFIG_NET_SCHED __u16 tc_index; /* traffic control index */ @@ -3925,6 +3930,14 @@ static inline ktime_t net_timedelta(ktime_t t) return ktime_sub(ktime_get_real(), t); } +static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt, + bool mono) +{ + skb->tstamp = kt; + /* Setting mono_delivery_time will be enabled later */ + skb->mono_delivery_time = 0; +} + static inline u8 skb_metadata_len(const struct sk_buff *skb) { return skb_shinfo(skb)->meta_len; diff --git a/net/bridge/netfilter/nf_conntrack_bridge.c b/net/bridge/netfilter/nf_conntrack_bridge.c index fdbed3158555..ebfb2a5c59e4 100644 --- a/net/bridge/netfilter/nf_conntrack_bridge.c +++ b/net/bridge/netfilter/nf_conntrack_bridge.c @@ -32,6 +32,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk, struct sk_buff *)) { int frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size; + bool mono_delivery_time = skb->mono_delivery_time; unsigned int hlen, ll_rs, mtu; ktime_t tstamp = skb->tstamp; struct ip_frag_state state; @@ -81,7 +82,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk, if (iter.frag) ip_fraglist_prepare(skb, &iter); - skb->tstamp = tstamp; + skb_set_delivery_time(skb, tstamp, mono_delivery_time); err = output(net, sk, data, skb); if (err || !iter.frag) break; @@ -112,7 +113,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk, goto blackhole; } - skb2->tstamp = tstamp; + skb_set_delivery_time(skb2, tstamp, mono_delivery_time); err = output(net, sk, data, skb2); if (err) goto blackhole; diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 0c0574eb5f5b..7af5d1849bc9 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -761,6 +761,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, { struct iphdr *iph; struct sk_buff *skb2; + bool mono_delivery_time = skb->mono_delivery_time; struct rtable *rt = skb_rtable(skb); unsigned int mtu, hlen, ll_rs; struct ip_fraglist_iter iter; @@ -852,7 +853,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, } } - skb->tstamp = tstamp; + skb_set_delivery_time(skb, tstamp, mono_delivery_time); err = output(net, sk, skb); if (!err) @@ -908,7 +909,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, /* * Put this fragment into the sending queue. */ - skb2->tstamp = tstamp; + skb_set_delivery_time(skb2, tstamp, mono_delivery_time); err = output(net, sk, skb2); if (err) goto fail; @@ -1727,6 +1728,8 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb, arg->csumoffset) = csum_fold(csum_add(nskb->csum, arg->csum)); nskb->ip_summed = CHECKSUM_NONE; + /* Setting mono_delivery_time will be enabled later */ + nskb->mono_delivery_time = 0; ip_push_pending_frames(sk, &fl4); } out: diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index e76bf1e9251e..2319531267c6 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1253,7 +1253,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, tp = tcp_sk(sk); prior_wstamp = tp->tcp_wstamp_ns; tp->tcp_wstamp_ns = max(tp->tcp_wstamp_ns, tp->tcp_clock_cache); - skb->skb_mstamp_ns = tp->tcp_wstamp_ns; + skb_set_delivery_time(skb, tp->tcp_wstamp_ns, true); if (clone_it) { oskb = skb; @@ -1589,7 +1589,7 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue, skb_split(skb, buff, len); - buff->tstamp = skb->tstamp; + skb_set_delivery_time(buff, skb->tstamp, true); tcp_fragment_tstamp(skb, buff); old_factor = tcp_skb_pcount(skb); @@ -2616,7 +2616,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle, if (unlikely(tp->repair) && tp->repair_queue == TCP_SEND_QUEUE) { /* "skb_mstamp_ns" is used as a start point for the retransmit timer */ - skb->skb_mstamp_ns = tp->tcp_wstamp_ns = tp->tcp_clock_cache; + tp->tcp_wstamp_ns = tp->tcp_clock_cache; + skb_set_delivery_time(skb, tp->tcp_wstamp_ns, true); list_move_tail(&skb->tcp_tsorted_anchor, &tp->tsorted_sent_queue); tcp_init_tso_segs(skb, mss_now); goto repair; /* Skip network transmission */ @@ -3541,11 +3542,12 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst, now = tcp_clock_ns(); #ifdef CONFIG_SYN_COOKIES if (unlikely(synack_type == TCP_SYNACK_COOKIE && ireq->tstamp_ok)) - skb->skb_mstamp_ns = cookie_init_timestamp(req, now); + skb_set_delivery_time(skb, cookie_init_timestamp(req, now), + true); else #endif { - skb->skb_mstamp_ns = now; + skb_set_delivery_time(skb, now, true); if (!tcp_rsk(req)->snt_synack) /* Timestamp first SYNACK */ tcp_rsk(req)->snt_synack = tcp_skb_timestamp_us(skb); } @@ -3594,7 +3596,7 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst, bpf_skops_write_hdr_opt((struct sock *)sk, skb, req, syn_skb, synack_type, &opts); - skb->skb_mstamp_ns = now; + skb_set_delivery_time(skb, now, true); tcp_add_tx_delay(skb, tp); return skb; @@ -3771,7 +3773,7 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn) err = tcp_transmit_skb(sk, syn_data, 1, sk->sk_allocation); - syn->skb_mstamp_ns = syn_data->skb_mstamp_ns; + skb_set_delivery_time(syn, syn_data->skb_mstamp_ns, true); /* Now full SYN+DATA was cloned and sent (or not), * remove the SYN from the original skb (syn_data) diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 0c6c971ce0a5..55665f3f7a77 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -813,6 +813,7 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, struct rt6_info *rt = (struct rt6_info *)skb_dst(skb); struct ipv6_pinfo *np = skb->sk && !dev_recursion_level() ? inet6_sk(skb->sk) : NULL; + bool mono_delivery_time = skb->mono_delivery_time; struct ip6_frag_state state; unsigned int mtu, hlen, nexthdr_offset; ktime_t tstamp = skb->tstamp; @@ -903,7 +904,7 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, if (iter.frag) ip6_fraglist_prepare(skb, &iter); - skb->tstamp = tstamp; + skb_set_delivery_time(skb, tstamp, mono_delivery_time); err = output(net, sk, skb); if (!err) IP6_INC_STATS(net, ip6_dst_idev(&rt->dst), @@ -962,7 +963,7 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, /* * Put this fragment into the sending queue. */ - frag->tstamp = tstamp; + skb_set_delivery_time(frag, tstamp, mono_delivery_time); err = output(net, sk, frag); if (err) goto fail; diff --git a/net/ipv6/netfilter.c b/net/ipv6/netfilter.c index 6ab710b5a1a8..1da332450d98 100644 --- a/net/ipv6/netfilter.c +++ b/net/ipv6/netfilter.c @@ -121,6 +121,7 @@ int br_ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, struct sk_buff *)) { int frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size; + bool mono_delivery_time = skb->mono_delivery_time; ktime_t tstamp = skb->tstamp; struct ip6_frag_state state; u8 *prevhdr, nexthdr = 0; @@ -186,7 +187,7 @@ int br_ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, if (iter.frag) ip6_fraglist_prepare(skb, &iter); - skb->tstamp = tstamp; + skb_set_delivery_time(skb, tstamp, mono_delivery_time); err = output(net, sk, data, skb); if (err || !iter.frag) break; @@ -219,7 +220,7 @@ int br_ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, goto blackhole; } - skb2->tstamp = tstamp; + skb_set_delivery_time(skb2, tstamp, mono_delivery_time); err = output(net, sk, data, skb2); if (err) goto blackhole; diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 0c648bf07f39..d4fc6641afa4 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -992,7 +992,7 @@ static void tcp_v6_send_response(const struct sock *sk, struct sk_buff *skb, u32 } else { mark = sk->sk_mark; } - buff->tstamp = tcp_transmit_time(sk); + skb_set_delivery_time(buff, tcp_transmit_time(sk), true); } fl6.flowi6_mark = IP6_REPLY_MARK(net, skb->mark) ?: mark; fl6.fl6_dport = t1->dest; From patchwork Fri Feb 11 07:12:44 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12742934 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 49B50C433F5 for ; Fri, 11 Feb 2022 07:12:58 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1347616AbiBKHM4 (ORCPT ); Fri, 11 Feb 2022 02:12:56 -0500 Received: from mxb-00190b01.gslb.pphosted.com ([23.128.96.19]:36240 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1347594AbiBKHM4 (ORCPT ); Fri, 11 Feb 2022 02:12:56 -0500 Received: from mx0a-00082601.pphosted.com (mx0a-00082601.pphosted.com [67.231.145.42]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0E88910A4 for ; Thu, 10 Feb 2022 23:12:56 -0800 (PST) Received: from pps.filterd (m0109334.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.1.2/8.16.1.2) with ESMTP id 21ANrS2k025979 for ; Thu, 10 Feb 2022 23:12:55 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=HwB36yUzMIbq/eK7sxBobUtZt2V8qtQlQAh/HWHriW8=; b=Z2IcAksZXAxAqkCvW+8yqwSPE3RBib9tuhx/QGBwLi8Ohkso17Z/ZbxAK41OpkhhY1// H/Itzo9RjsDO6vNy3/+MbXRuie0IDF/xPtqciSoutxDb0JL+Yv6mO+EfEcobP60iK1sW xHGmgONhSUwzTiT22EoREDoO94ni1jZ7VXU= Received: from maileast.thefacebook.com ([163.114.130.16]) by mx0a-00082601.pphosted.com (PPS) with ESMTPS id 3e59rpu6sk-2 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Thu, 10 Feb 2022 23:12:55 -0800 Received: from twshared22811.39.frc1.facebook.com (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:83::5) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2308.21; Thu, 10 Feb 2022 23:12:53 -0800 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id DA06E6C7572C; Thu, 10 Feb 2022 23:12:44 -0800 (PST) From: Martin KaFai Lau To: , CC: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , David Miller , Eric Dumazet , Jakub Kicinski , , Willem de Bruijn Subject: [PATCH v4 net-next 2/8] net: Add skb_clear_tstamp() to keep the mono delivery_time Date: Thu, 10 Feb 2022 23:12:44 -0800 Message-ID: <20220211071244.886294-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220211071232.885225-1-kafai@fb.com> References: <20220211071232.885225-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-ORIG-GUID: jUp8dwXeM9ctf2jvOGyqMP1dWd58kPLy X-Proofpoint-GUID: jUp8dwXeM9ctf2jvOGyqMP1dWd58kPLy X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.816,Hydra:6.0.425,FMLib:17.11.62.513 definitions=2022-02-11_02,2022-02-09_01,2021-12-02_01 X-Proofpoint-Spam-Details: rule=fb_outbound_notspam policy=fb_outbound score=0 malwarescore=0 spamscore=0 mlxlogscore=999 clxscore=1015 priorityscore=1501 suspectscore=0 mlxscore=0 impostorscore=0 bulkscore=0 phishscore=0 lowpriorityscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2201110000 definitions=main-2202110040 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org Right now, skb->tstamp is reset to 0 whenever the skb is forwarded. If skb->tstamp has the mono delivery_time, clearing it can hurt the performance when it finally transmits out to fq@phy-dev. The earlier patch added a skb->mono_delivery_time bit to flag the skb->tstamp carrying the mono delivery_time. This patch adds skb_clear_tstamp() helper which keeps the mono delivery_time and clear everything else. The delivery_time clearing is postponed until the stack knows the skb will be delivered locally. It will be done in a later patch. Signed-off-by: Martin KaFai Lau --- drivers/net/loopback.c | 2 +- include/linux/skbuff.h | 10 +++++++++- net/bridge/br_forward.c | 2 +- net/core/filter.c | 6 +++--- net/core/skbuff.c | 2 +- net/ipv4/ip_forward.c | 2 +- net/ipv6/ip6_output.c | 2 +- net/netfilter/ipvs/ip_vs_xmit.c | 6 +++--- net/netfilter/nf_dup_netdev.c | 2 +- net/netfilter/nf_flow_table_ip.c | 4 ++-- net/netfilter/nft_fwd_netdev.c | 2 +- net/openvswitch/vport.c | 2 +- net/xfrm/xfrm_interface.c | 2 +- 13 files changed, 26 insertions(+), 18 deletions(-) diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c index ed0edf5884ef..70a38fa09299 100644 --- a/drivers/net/loopback.c +++ b/drivers/net/loopback.c @@ -74,7 +74,7 @@ static netdev_tx_t loopback_xmit(struct sk_buff *skb, skb_tx_timestamp(skb); /* do not fool net_timestamp_check() with various clock bases */ - skb->tstamp = 0; + skb_clear_tstamp(skb); skb_orphan(skb); diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 32c793de3801..07e618f8b41a 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -3938,6 +3938,14 @@ static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt, skb->mono_delivery_time = 0; } +static inline void skb_clear_tstamp(struct sk_buff *skb) +{ + if (skb->mono_delivery_time) + return; + + skb->tstamp = 0; +} + static inline u8 skb_metadata_len(const struct sk_buff *skb) { return skb_shinfo(skb)->meta_len; @@ -4794,7 +4802,7 @@ static inline void skb_set_redirected(struct sk_buff *skb, bool from_ingress) #ifdef CONFIG_NET_REDIRECT skb->from_ingress = from_ingress; if (skb->from_ingress) - skb->tstamp = 0; + skb_clear_tstamp(skb); #endif } diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c index ec646656dbf1..02bb620d3b8d 100644 --- a/net/bridge/br_forward.c +++ b/net/bridge/br_forward.c @@ -62,7 +62,7 @@ EXPORT_SYMBOL_GPL(br_dev_queue_push_xmit); int br_forward_finish(struct net *net, struct sock *sk, struct sk_buff *skb) { - skb->tstamp = 0; + skb_clear_tstamp(skb); return NF_HOOK(NFPROTO_BRIDGE, NF_BR_POST_ROUTING, net, sk, skb, NULL, skb->dev, br_dev_queue_push_xmit); diff --git a/net/core/filter.c b/net/core/filter.c index f497ca7a16d2..a2d712be4985 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2107,7 +2107,7 @@ static inline int __bpf_tx_skb(struct net_device *dev, struct sk_buff *skb) } skb->dev = dev; - skb->tstamp = 0; + skb_clear_tstamp(skb); dev_xmit_recursion_inc(); ret = dev_queue_xmit(skb); @@ -2176,7 +2176,7 @@ static int bpf_out_neigh_v6(struct net *net, struct sk_buff *skb, } skb->dev = dev; - skb->tstamp = 0; + skb_clear_tstamp(skb); if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) { skb = skb_expand_head(skb, hh_len); @@ -2274,7 +2274,7 @@ static int bpf_out_neigh_v4(struct net *net, struct sk_buff *skb, } skb->dev = dev; - skb->tstamp = 0; + skb_clear_tstamp(skb); if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) { skb = skb_expand_head(skb, hh_len); diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 0118f0afaa4f..3e3da8fdf8f5 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -5350,7 +5350,7 @@ void skb_scrub_packet(struct sk_buff *skb, bool xnet) ipvs_reset(skb); skb->mark = 0; - skb->tstamp = 0; + skb_clear_tstamp(skb); } EXPORT_SYMBOL_GPL(skb_scrub_packet); diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c index 00ec819f949b..92ba3350274b 100644 --- a/net/ipv4/ip_forward.c +++ b/net/ipv4/ip_forward.c @@ -79,7 +79,7 @@ static int ip_forward_finish(struct net *net, struct sock *sk, struct sk_buff *s if (unlikely(opt->optlen)) ip_forward_options(skb); - skb->tstamp = 0; + skb_clear_tstamp(skb); return dst_output(net, sk, skb); } diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 55665f3f7a77..9fc1b08cf622 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -440,7 +440,7 @@ static inline int ip6_forward_finish(struct net *net, struct sock *sk, } #endif - skb->tstamp = 0; + skb_clear_tstamp(skb); return dst_output(net, sk, skb); } diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c index d2e5a8f644b8..029171379884 100644 --- a/net/netfilter/ipvs/ip_vs_xmit.c +++ b/net/netfilter/ipvs/ip_vs_xmit.c @@ -610,7 +610,7 @@ static inline int ip_vs_tunnel_xmit_prepare(struct sk_buff *skb, nf_reset_ct(skb); skb_forward_csum(skb); if (skb->dev) - skb->tstamp = 0; + skb_clear_tstamp(skb); } return ret; } @@ -652,7 +652,7 @@ static inline int ip_vs_nat_send_or_cont(int pf, struct sk_buff *skb, if (!local) { skb_forward_csum(skb); if (skb->dev) - skb->tstamp = 0; + skb_clear_tstamp(skb); NF_HOOK(pf, NF_INET_LOCAL_OUT, cp->ipvs->net, NULL, skb, NULL, skb_dst(skb)->dev, dst_output); } else @@ -674,7 +674,7 @@ static inline int ip_vs_send_or_cont(int pf, struct sk_buff *skb, ip_vs_drop_early_demux_sk(skb); skb_forward_csum(skb); if (skb->dev) - skb->tstamp = 0; + skb_clear_tstamp(skb); NF_HOOK(pf, NF_INET_LOCAL_OUT, cp->ipvs->net, NULL, skb, NULL, skb_dst(skb)->dev, dst_output); } else diff --git a/net/netfilter/nf_dup_netdev.c b/net/netfilter/nf_dup_netdev.c index a579e59ee5c5..7873bd1389c3 100644 --- a/net/netfilter/nf_dup_netdev.c +++ b/net/netfilter/nf_dup_netdev.c @@ -19,7 +19,7 @@ static void nf_do_netdev_egress(struct sk_buff *skb, struct net_device *dev) skb_push(skb, skb->mac_len); skb->dev = dev; - skb->tstamp = 0; + skb_clear_tstamp(skb); dev_queue_xmit(skb); } diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c index 889cf88d3dba..f1d387129f02 100644 --- a/net/netfilter/nf_flow_table_ip.c +++ b/net/netfilter/nf_flow_table_ip.c @@ -376,7 +376,7 @@ nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb, nf_flow_nat_ip(flow, skb, thoff, dir, iph); ip_decrease_ttl(iph); - skb->tstamp = 0; + skb_clear_tstamp(skb); if (flow_table->flags & NF_FLOWTABLE_COUNTER) nf_ct_acct_update(flow->ct, tuplehash->tuple.dir, skb->len); @@ -611,7 +611,7 @@ nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb, nf_flow_nat_ipv6(flow, skb, dir, ip6h); ip6h->hop_limit--; - skb->tstamp = 0; + skb_clear_tstamp(skb); if (flow_table->flags & NF_FLOWTABLE_COUNTER) nf_ct_acct_update(flow->ct, tuplehash->tuple.dir, skb->len); diff --git a/net/netfilter/nft_fwd_netdev.c b/net/netfilter/nft_fwd_netdev.c index fa9301ca6033..4b2b0946c0b6 100644 --- a/net/netfilter/nft_fwd_netdev.c +++ b/net/netfilter/nft_fwd_netdev.c @@ -140,7 +140,7 @@ static void nft_fwd_neigh_eval(const struct nft_expr *expr, return; skb->dev = dev; - skb->tstamp = 0; + skb_clear_tstamp(skb); neigh_xmit(neigh_table, dev, addr, skb); out: regs->verdict.code = verdict; diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c index cf2ce5812489..82a74f998966 100644 --- a/net/openvswitch/vport.c +++ b/net/openvswitch/vport.c @@ -507,7 +507,7 @@ void ovs_vport_send(struct vport *vport, struct sk_buff *skb, u8 mac_proto) } skb->dev = vport->dev; - skb->tstamp = 0; + skb_clear_tstamp(skb); vport->ops->send(skb); return; diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c index 57448fc519fc..4991e99ced9a 100644 --- a/net/xfrm/xfrm_interface.c +++ b/net/xfrm/xfrm_interface.c @@ -190,7 +190,7 @@ static void xfrmi_dev_uninit(struct net_device *dev) static void xfrmi_scrub_packet(struct sk_buff *skb, bool xnet) { - skb->tstamp = 0; + skb_clear_tstamp(skb); skb->pkt_type = PACKET_HOST; skb->skb_iif = 0; skb->ignore_df = 0; From patchwork Fri Feb 11 07:12:51 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12742936 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D6557C433EF for ; Fri, 11 Feb 2022 07:13:11 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1347591AbiBKHNK (ORCPT ); Fri, 11 Feb 2022 02:13:10 -0500 Received: from mxb-00190b01.gslb.pphosted.com ([23.128.96.19]:36366 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233247AbiBKHNF (ORCPT ); Fri, 11 Feb 2022 02:13:05 -0500 Received: from mx0b-00082601.pphosted.com (mx0b-00082601.pphosted.com [67.231.153.30]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 515B51156 for ; Thu, 10 Feb 2022 23:13:03 -0800 (PST) Received: from pps.filterd (m0109332.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.1.2/8.16.1.2) with ESMTP id 21ANrJOQ013600 for ; Thu, 10 Feb 2022 23:13:02 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=wf2YkDyqpkLJmezwybTz4VHJZySTcf25JSmYTP2ZY0U=; b=gk3s4M76SUl7WJY3GugdbfQOL2vGlyY+GykJsDpOIO2Bx+h+2Fr8oOBMSHl5oUtmHR8j swsqcuPZEPdVHbJCSZnVBBRJ4STWiWecl9M0Cea3UMO7LcHNsJvtYWeLK5HwWA5tMPSR hnjCEX8dKlkphsE7YMrGb9q6zQEZcHKKup4= Received: from maileast.thefacebook.com ([163.114.130.16]) by mx0a-00082601.pphosted.com (PPS) with ESMTPS id 3e58e1kxhw-3 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Thu, 10 Feb 2022 23:13:02 -0800 Received: from twshared18912.14.frc2.facebook.com (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:83::4) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2308.21; Thu, 10 Feb 2022 23:13:01 -0800 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 32B266C757E3; Thu, 10 Feb 2022 23:12:51 -0800 (PST) From: Martin KaFai Lau To: , CC: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , David Miller , Eric Dumazet , Jakub Kicinski , , Willem de Bruijn Subject: [PATCH v4 net-next 3/8] net: Set skb->mono_delivery_time and clear it after sch_handle_ingress() Date: Thu, 10 Feb 2022 23:12:51 -0800 Message-ID: <20220211071251.887078-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220211071232.885225-1-kafai@fb.com> References: <20220211071232.885225-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-GUID: zNx-Alf1VoBtrTVSmfFi4YYAUf7bf4nY X-Proofpoint-ORIG-GUID: zNx-Alf1VoBtrTVSmfFi4YYAUf7bf4nY X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.816,Hydra:6.0.425,FMLib:17.11.62.513 definitions=2022-02-11_02,2022-02-09_01,2021-12-02_01 X-Proofpoint-Spam-Details: rule=fb_outbound_notspam policy=fb_outbound score=0 phishscore=0 malwarescore=0 clxscore=1015 bulkscore=0 mlxlogscore=999 impostorscore=0 mlxscore=0 lowpriorityscore=0 adultscore=0 suspectscore=0 priorityscore=1501 spamscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2201110000 definitions=main-2202110039 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org This patch sets the skb->mono_delivery_time to flag the skb->tstamp is used as the mono delivery_time (EDT) instead of the (rcv) timestamp. skb_clear_delivery_time() is added to clear the delivery_time and set back to the (rcv) timestamp after sch_handle_ingress() such that the tc-bpf prog can use bpf_redirect_*() to forward it to the egress of another iface and keep the EDT delivery_time. The next patch will postpone the skb_clear_delivery_time() until the stack learns that the skb is being delivered locally and that will make other kernel forwarding paths (ip[6]_forward) able to keep the delivery_time also. Thus, like the previous patches on using the skb->mono_delivery_time bit, calling skb_clear_delivery_time() is not done within the CONFIG_NET_INGRESS to avoid too many code churns among this set. Before sch_handle_ingress(), another case needs to clear the delivery_time is the network tapping (e.g. af_packet by tcpdump). Regardless of tapping at the ingress or egress, the tapped skb is received by the af_packet socket, so it is ingress to the af_packet socket and it expects the (rcv) timestamp. When tapping at egress, dev_queue_xmit_nit() is used. It has already expected skb->tstamp may have delivery_time, so it does skb_clone()+net_timestamp_set() to ensure the cloned skb has the (rcv) timestamp before passing to the af_packet sk. This patch only adds to clear the skb->mono_delivery_time bit in net_timestamp_set(). When tapping at ingress, it currently expects the skb->tstamp is either 0 or has the (rcv) timestamp. Meaning, the tapping at ingress path has already expected the skb->tstamp could be 0 and it will get the (rcv) timestamp by ktime_get_real() when needed. There are two cases for tapping at ingress: One case is af_packet queues the skb to its sk_receive_queue. The skb is either not shared or new clone created. The skb_clear_delivery_time() is called to clear the delivery_time (if any) before it is queued to the sk_receive_queue. Another case, the ingress skb is directly copied to the rx_ring and tpacket_get_timestamp() is used to get the (rcv) timestamp. skb_tstamp() is used in tpacket_get_timestamp() to check the skb->mono_delivery_time bit before returning skb->tstamp. As mentioned earlier, the tapping@ingress has already expected the skb may not have the (rcv) timestamp (because no sk has asked for it) and has handled this case by directly calling ktime_get_real(). In __skb_tstamp_tx, it clones the egress skb and queues the clone to the sk_error_queue. The outgoing skb may have the mono delivery_time while the (rcv) timestamp is expected for the clone, so the skb->mono_delivery_time bit is also cleared from the clone. Signed-off-by: Martin KaFai Lau --- include/linux/skbuff.h | 27 +++++++++++++++++++++++++-- net/core/dev.c | 5 ++++- net/core/skbuff.c | 6 ++++-- net/ipv4/ip_output.c | 3 +-- net/packet/af_packet.c | 4 +++- 5 files changed, 37 insertions(+), 8 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 07e618f8b41a..0e09e75fa787 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -3934,8 +3934,23 @@ static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt, bool mono) { skb->tstamp = kt; - /* Setting mono_delivery_time will be enabled later */ - skb->mono_delivery_time = 0; + skb->mono_delivery_time = kt && mono; +} + +DECLARE_STATIC_KEY_FALSE(netstamp_needed_key); + +/* It is used in the ingress path to clear the delivery_time. + * If needed, set the skb->tstamp to the (rcv) timestamp. + */ +static inline void skb_clear_delivery_time(struct sk_buff *skb) +{ + if (unlikely(skb->mono_delivery_time)) { + skb->mono_delivery_time = 0; + if (static_branch_unlikely(&netstamp_needed_key)) + skb->tstamp = ktime_get_real(); + else + skb->tstamp = 0; + } } static inline void skb_clear_tstamp(struct sk_buff *skb) @@ -3946,6 +3961,14 @@ static inline void skb_clear_tstamp(struct sk_buff *skb) skb->tstamp = 0; } +static inline ktime_t skb_tstamp(const struct sk_buff *skb) +{ + if (unlikely(skb->mono_delivery_time)) + return 0; + + return skb->tstamp; +} + static inline u8 skb_metadata_len(const struct sk_buff *skb) { return skb_shinfo(skb)->meta_len; diff --git a/net/core/dev.c b/net/core/dev.c index f5ef51601081..f41707ab2fb9 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2000,7 +2000,8 @@ void net_dec_egress_queue(void) EXPORT_SYMBOL_GPL(net_dec_egress_queue); #endif -static DEFINE_STATIC_KEY_FALSE(netstamp_needed_key); +DEFINE_STATIC_KEY_FALSE(netstamp_needed_key); +EXPORT_SYMBOL(netstamp_needed_key); #ifdef CONFIG_JUMP_LABEL static atomic_t netstamp_needed_deferred; static atomic_t netstamp_wanted; @@ -2061,6 +2062,7 @@ EXPORT_SYMBOL(net_disable_timestamp); static inline void net_timestamp_set(struct sk_buff *skb) { skb->tstamp = 0; + skb->mono_delivery_time = 0; if (static_branch_unlikely(&netstamp_needed_key)) __net_timestamp(skb); } @@ -5220,6 +5222,7 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc, goto out; } #endif + skb_clear_delivery_time(skb); skb_reset_redirect(skb); skip_classify: if (pfmemalloc && !skb_pfmemalloc_protocol(skb)) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 3e3da8fdf8f5..93dc763da8cb 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -4817,10 +4817,12 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb, skb_shinfo(skb)->tskey = skb_shinfo(orig_skb)->tskey; } - if (hwtstamps) + if (hwtstamps) { *skb_hwtstamps(skb) = *hwtstamps; - else + } else { skb->tstamp = ktime_get_real(); + skb->mono_delivery_time = 0; + } __skb_complete_tx_timestamp(skb, sk, tstype, opt_stats); } diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 7af5d1849bc9..bfe08feb5d82 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -1728,8 +1728,7 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb, arg->csumoffset) = csum_fold(csum_add(nskb->csum, arg->csum)); nskb->ip_summed = CHECKSUM_NONE; - /* Setting mono_delivery_time will be enabled later */ - nskb->mono_delivery_time = 0; + nskb->mono_delivery_time = !!transmit_time; ip_push_pending_frames(sk, &fl4); } out: diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index ab87f22cc7ec..1b93ce1a5600 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -460,7 +460,7 @@ static __u32 tpacket_get_timestamp(struct sk_buff *skb, struct timespec64 *ts, return TP_STATUS_TS_RAW_HARDWARE; if ((flags & SOF_TIMESTAMPING_SOFTWARE) && - ktime_to_timespec64_cond(skb->tstamp, ts)) + ktime_to_timespec64_cond(skb_tstamp(skb), ts)) return TP_STATUS_TS_SOFTWARE; return 0; @@ -2199,6 +2199,7 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev, spin_lock(&sk->sk_receive_queue.lock); po->stats.stats1.tp_packets++; sock_skb_set_dropcount(sk, skb); + skb_clear_delivery_time(skb); __skb_queue_tail(&sk->sk_receive_queue, skb); spin_unlock(&sk->sk_receive_queue.lock); sk->sk_data_ready(sk); @@ -2377,6 +2378,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, po->stats.stats1.tp_packets++; if (copy_skb) { status |= TP_STATUS_COPY; + skb_clear_delivery_time(copy_skb); __skb_queue_tail(&sk->sk_receive_queue, copy_skb); } spin_unlock(&sk->sk_receive_queue.lock); From patchwork Fri Feb 11 07:12:57 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12742937 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 236C5C433F5 for ; Fri, 11 Feb 2022 07:13:13 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1347600AbiBKHNL (ORCPT ); Fri, 11 Feb 2022 02:13:11 -0500 Received: from mxb-00190b01.gslb.pphosted.com ([23.128.96.19]:36416 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1347598AbiBKHNJ (ORCPT ); Fri, 11 Feb 2022 02:13:09 -0500 Received: from mx0b-00082601.pphosted.com (mx0b-00082601.pphosted.com [67.231.153.30]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 808F11111 for ; Thu, 10 Feb 2022 23:13:06 -0800 (PST) Received: from pps.filterd (m0109332.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.1.2/8.16.1.2) with ESMTP id 21ANrI5h013553 for ; Thu, 10 Feb 2022 23:13:05 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : content-type : content-transfer-encoding : mime-version; s=facebook; bh=NkytTpRrUG56ut/EYoku5/SNA6tMzAT7dNYLvc4tjn0=; b=cPcXD175TTVNoUBzfADiQ2vHrsqNZMuVYDE4p5KM4JHQh176hNnLm9MQZy9t3VtDqhU2 0KBxdbRYpySazcM3J7qydycJQznyK/uV2cU3X/4qjJbRes6GsolVTZK3oV+PrejO8cUD 0wltAGiLIhJLKwFD2LrwBfROGDwlMwimIvo= Received: from mail.thefacebook.com ([163.114.132.120]) by mx0a-00082601.pphosted.com (PPS) with ESMTPS id 3e58e1kxj6-2 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Thu, 10 Feb 2022 23:13:05 -0800 Received: from twshared12416.02.ash9.facebook.com (2620:10d:c085:108::4) by mail.thefacebook.com (2620:10d:c085:21d::4) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2308.21; Thu, 10 Feb 2022 23:13:04 -0800 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 7575D6C75963; Thu, 10 Feb 2022 23:12:57 -0800 (PST) From: Martin KaFai Lau To: , CC: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , David Miller , Eric Dumazet , Jakub Kicinski , , Willem de Bruijn Subject: [PATCH v4 net-next 4/8] net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally Date: Thu, 10 Feb 2022 23:12:57 -0800 Message-ID: <20220211071257.888905-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220211071232.885225-1-kafai@fb.com> References: <20220211071232.885225-1-kafai@fb.com> X-FB-Internal: Safe X-Proofpoint-GUID: -jRb_ll-KXHlM7IFSwR-Si_ynJjtXLLH X-Proofpoint-ORIG-GUID: -jRb_ll-KXHlM7IFSwR-Si_ynJjtXLLH X-Proofpoint-UnRewURL: 0 URL was un-rewritten MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.816,Hydra:6.0.425,FMLib:17.11.62.513 definitions=2022-02-11_02,2022-02-09_01,2021-12-02_01 X-Proofpoint-Spam-Details: rule=fb_outbound_notspam policy=fb_outbound score=0 phishscore=0 malwarescore=0 clxscore=1015 bulkscore=0 mlxlogscore=999 impostorscore=0 mlxscore=0 lowpriorityscore=0 adultscore=0 suspectscore=0 priorityscore=1501 spamscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2201110000 definitions=main-2202110039 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org This patch postpones the delivery_time clearing until the stack knows the skb is being delivered locally. That will allow other kernel forwarding path (e.g. ip[6]_forward) to keep the delivery_time also. An earlier attempt was to do skb_clear_delivery_time() in ip_local_deliver() and ip6_input(). The discussion [0] requested to move it one step later into ip_local_deliver_finish() and ip6_input_finish() so that the delivery_time can be kept for the ip_vs forwarding path also. To do that, this patch also needs to take care of the (rcv) timestamp usecase in ip_is_fragment(). It needs to expect delivery_time in the skb->tstamp, so it needs to save the mono_delivery_time bit in inet_frag_queue such that the delivery_time (if any) can be restored in the final defragmented skb. The ipv6 defrag is done in ip6_protocol_deliver_rcu() when figuring out how to handle nexthdr and IPPROTO_FRAGMENT (44) is one of the ipv6 extension header. ip6_protocol_deliver_rcu() is after ip6_input_finish() where the skb_clear_delivery_time() has already been done, so change is not needed. [0]: https://lore.kernel.org/netdev/ca728d81-80e8-3767-d5e-d44f6ad96e43@ssi.bg/ Signed-off-by: Martin KaFai Lau --- include/net/inet_frag.h | 1 + net/core/dev.c | 1 - net/ipv4/inet_fragment.c | 1 + net/ipv4/ip_fragment.c | 1 + net/ipv4/ip_input.c | 1 + net/ipv6/ip6_input.c | 1 + 6 files changed, 5 insertions(+), 1 deletion(-) diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h index 63540be0fc34..c0e517f31d82 100644 --- a/include/net/inet_frag.h +++ b/include/net/inet_frag.h @@ -90,6 +90,7 @@ struct inet_frag_queue { ktime_t stamp; int len; int meat; + bool mono_delivery_time; __u8 flags; u16 max_size; struct fqdir *fqdir; diff --git a/net/core/dev.c b/net/core/dev.c index f41707ab2fb9..041cef7473fd 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -5222,7 +5222,6 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc, goto out; } #endif - skb_clear_delivery_time(skb); skb_reset_redirect(skb); skip_classify: if (pfmemalloc && !skb_pfmemalloc_protocol(skb)) diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c index 341096807100..63948f6aeca0 100644 --- a/net/ipv4/inet_fragment.c +++ b/net/ipv4/inet_fragment.c @@ -572,6 +572,7 @@ void inet_frag_reasm_finish(struct inet_frag_queue *q, struct sk_buff *head, skb_mark_not_on_list(head); head->prev = NULL; head->tstamp = q->stamp; + head->mono_delivery_time = q->mono_delivery_time; } EXPORT_SYMBOL(inet_frag_reasm_finish); diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c index fad803d2d711..fb153569889e 100644 --- a/net/ipv4/ip_fragment.c +++ b/net/ipv4/ip_fragment.c @@ -349,6 +349,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb) qp->iif = dev->ifindex; qp->q.stamp = skb->tstamp; + qp->q.mono_delivery_time = skb->mono_delivery_time; qp->q.meat += skb->len; qp->ecn |= ecn; add_frag_mem_limit(qp->q.fqdir, skb->truesize); diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c index d94f9f7e60c3..95f7bb052784 100644 --- a/net/ipv4/ip_input.c +++ b/net/ipv4/ip_input.c @@ -226,6 +226,7 @@ void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int protocol) static int ip_local_deliver_finish(struct net *net, struct sock *sk, struct sk_buff *skb) { + skb_clear_delivery_time(skb); __skb_pull(skb, skb_network_header_len(skb)); rcu_read_lock(); diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c index d4b1e2c5aa76..5b5ea35635f9 100644 --- a/net/ipv6/ip6_input.c +++ b/net/ipv6/ip6_input.c @@ -459,6 +459,7 @@ void ip6_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int nexthdr, static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *skb) { + skb_clear_delivery_time(skb); rcu_read_lock(); ip6_protocol_deliver_rcu(net, skb, 0, false); rcu_read_unlock(); From patchwork Fri Feb 11 07:13:03 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12742938 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0A843C4332F for ; Fri, 11 Feb 2022 07:13:14 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1347598AbiBKHNM (ORCPT ); Fri, 11 Feb 2022 02:13:12 -0500 Received: from mxb-00190b01.gslb.pphosted.com ([23.128.96.19]:36416 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1347593AbiBKHNL (ORCPT ); Fri, 11 Feb 2022 02:13:11 -0500 Received: from mx0a-00082601.pphosted.com (mx0a-00082601.pphosted.com [67.231.145.42]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B9A9D10EA for ; Thu, 10 Feb 2022 23:13:10 -0800 (PST) Received: from pps.filterd (m0109333.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.1.2/8.16.1.2) with ESMTP id 21ANrRaP018514 for ; Thu, 10 Feb 2022 23:13:10 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=DgvDXN9XBlObkLmDDnSGd2RQgPd6++UhlTWkwCnDYGw=; b=nfVCSjrojoOR+8JKaOc3eipBHeGkUTnEMycqDcpYruwpFEx4lQspRjnwfCS98fT9GO+p pw1F2WWsq3yWkbHwFn2u/ldvk5YrfZOEYeQkINfg5SqQUmKQdQKOwKXpZ8yybBUhsGj1 hOquHbBb1AK4qr8/f2ugUVo2H+tfiCExshI= Received: from mail.thefacebook.com ([163.114.132.120]) by mx0a-00082601.pphosted.com (PPS) with ESMTPS id 3e5866v2ye-2 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Thu, 10 Feb 2022 23:13:10 -0800 Received: from twshared6457.05.ash9.facebook.com (2620:10d:c085:208::11) by mail.thefacebook.com (2620:10d:c085:21d::5) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2308.21; Thu, 10 Feb 2022 23:13:09 -0800 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id B82616C75A26; Thu, 10 Feb 2022 23:13:03 -0800 (PST) From: Martin KaFai Lau To: , CC: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , David Miller , Eric Dumazet , Jakub Kicinski , , Willem de Bruijn Subject: [PATCH v4 net-next 5/8] bpf: Keep the (rcv) timestamp behavior for the existing tc-bpf@ingress Date: Thu, 10 Feb 2022 23:13:03 -0800 Message-ID: <20220211071303.890169-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220211071232.885225-1-kafai@fb.com> References: <20220211071232.885225-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-GUID: Mh_4PREj25pFdEWt-DN0oqGihBjbogch X-Proofpoint-ORIG-GUID: Mh_4PREj25pFdEWt-DN0oqGihBjbogch X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.816,Hydra:6.0.425,FMLib:17.11.62.513 definitions=2022-02-11_02,2022-02-09_01,2021-12-02_01 X-Proofpoint-Spam-Details: rule=fb_outbound_notspam policy=fb_outbound score=0 priorityscore=1501 clxscore=1015 suspectscore=0 mlxscore=0 malwarescore=0 bulkscore=0 impostorscore=0 lowpriorityscore=0 spamscore=0 phishscore=0 adultscore=0 mlxlogscore=999 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2201110000 definitions=main-2202110040 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org The current tc-bpf@ingress reads and writes the __sk_buff->tstamp as a (rcv) timestamp. This patch is to backward compatible with the (rcv) timestamp expectation when the skb->tstamp has a mono delivery_time. If needed, the patch first saves the mono delivery_time. Depending on the static key "netstamp_needed_key", it then resets the skb->tstamp to either 0 or ktime_get_real() before running the tc-bpf@ingress. After the tc-bpf prog returns, if the (rcv) timestamp in skb->tstamp has not been changed, it will restore the earlier saved mono delivery_time. The current logic to run tc-bpf@ingress is refactored to a new bpf_prog_run_at_ingress() function and shared between cls_bpf and act_bpf. The above new delivery_time save/restore logic is also done together in this function. Signed-off-by: Martin KaFai Lau --- include/linux/filter.h | 28 ++++++++++++++++++++++++++++ net/sched/act_bpf.c | 5 +---- net/sched/cls_bpf.c | 6 +----- 3 files changed, 30 insertions(+), 9 deletions(-) diff --git a/include/linux/filter.h b/include/linux/filter.h index d23e999dc032..e43e1701a80e 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -699,6 +699,34 @@ static inline void bpf_compute_data_pointers(struct sk_buff *skb) cb->data_end = skb->data + skb_headlen(skb); } +static __always_inline u32 bpf_prog_run_at_ingress(const struct bpf_prog *prog, + struct sk_buff *skb) +{ + ktime_t tstamp, saved_mono_dtime = 0; + int filter_res; + + if (unlikely(skb->mono_delivery_time)) { + saved_mono_dtime = skb->tstamp; + skb->mono_delivery_time = 0; + if (static_branch_unlikely(&netstamp_needed_key)) + skb->tstamp = tstamp = ktime_get_real(); + else + skb->tstamp = tstamp = 0; + } + + /* It is safe to push/pull even if skb_shared() */ + __skb_push(skb, skb->mac_len); + bpf_compute_data_pointers(skb); + filter_res = bpf_prog_run(prog, skb); + __skb_pull(skb, skb->mac_len); + + /* __sk_buff->tstamp was not changed, restore the delivery_time */ + if (unlikely(saved_mono_dtime) && skb_tstamp(skb) == tstamp) + skb_set_delivery_time(skb, saved_mono_dtime, true); + + return filter_res; +} + /* Similar to bpf_compute_data_pointers(), except that save orginal * data in cb->data and cb->meta_data for restore. */ diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c index a77d8908e737..14c3bd0a5088 100644 --- a/net/sched/act_bpf.c +++ b/net/sched/act_bpf.c @@ -45,10 +45,7 @@ static int tcf_bpf_act(struct sk_buff *skb, const struct tc_action *act, filter = rcu_dereference(prog->filter); if (at_ingress) { - __skb_push(skb, skb->mac_len); - bpf_compute_data_pointers(skb); - filter_res = bpf_prog_run(filter, skb); - __skb_pull(skb, skb->mac_len); + filter_res = bpf_prog_run_at_ingress(filter, skb); } else { bpf_compute_data_pointers(skb); filter_res = bpf_prog_run(filter, skb); diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c index df19a847829e..036b2e1f74af 100644 --- a/net/sched/cls_bpf.c +++ b/net/sched/cls_bpf.c @@ -93,11 +93,7 @@ static int cls_bpf_classify(struct sk_buff *skb, const struct tcf_proto *tp, if (tc_skip_sw(prog->gen_flags)) { filter_res = prog->exts_integrated ? TC_ACT_UNSPEC : 0; } else if (at_ingress) { - /* It is safe to push/pull even if skb_shared() */ - __skb_push(skb, skb->mac_len); - bpf_compute_data_pointers(skb); - filter_res = bpf_prog_run(prog->filter, skb); - __skb_pull(skb, skb->mac_len); + filter_res = bpf_prog_run_at_ingress(prog->filter, skb); } else { bpf_compute_data_pointers(skb); filter_res = bpf_prog_run(prog->filter, skb); From patchwork Fri Feb 11 07:13:10 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12742939 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C9B0AC433EF for ; Fri, 11 Feb 2022 07:13:19 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1347594AbiBKHNS (ORCPT ); Fri, 11 Feb 2022 02:13:18 -0500 Received: from mxb-00190b01.gslb.pphosted.com ([23.128.96.19]:36478 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1347617AbiBKHNP (ORCPT ); Fri, 11 Feb 2022 02:13:15 -0500 Received: from mx0a-00082601.pphosted.com (mx0a-00082601.pphosted.com [67.231.145.42]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6772F110E for ; Thu, 10 Feb 2022 23:13:14 -0800 (PST) Received: from pps.filterd (m0109334.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.1.2/8.16.1.2) with ESMTP id 21ANrRRS025901 for ; Thu, 10 Feb 2022 23:13:14 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=EVTUWhfTh9Xw+oE/92HMbTSBKTlmpsbLoRRswODvVV0=; b=rTOGxSa5gt1HExPduCmJy/ErfM0blQ5dgfZVlzN1+UxmRYUq/OR0hjrySp7tJzqsvdGT UZSpndqFdLf5FF7q36DYetg/eBKv49Njdv1JIjHMdjE5/p9r8Frd1PF45rVpueHxtcP9 C5uu1nP3BD1DkUTp2utvb64nbGt/AFMYmpE= Received: from maileast.thefacebook.com ([163.114.130.16]) by mx0a-00082601.pphosted.com (PPS) with ESMTPS id 3e59rpu6tx-2 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Thu, 10 Feb 2022 23:13:14 -0800 Received: from twshared18912.14.frc2.facebook.com (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:82::f) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2308.21; Thu, 10 Feb 2022 23:13:12 -0800 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 06CE86C75AEE; Thu, 10 Feb 2022 23:13:10 -0800 (PST) From: Martin KaFai Lau To: , CC: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , David Miller , Eric Dumazet , Jakub Kicinski , , Willem de Bruijn Subject: [PATCH v4 net-next 6/8] bpf: Clear skb->mono_delivery_time bit if needed after running tc-bpf@egress Date: Thu, 10 Feb 2022 23:13:10 -0800 Message-ID: <20220211071310.892007-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220211071232.885225-1-kafai@fb.com> References: <20220211071232.885225-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-ORIG-GUID: QE6Bu_TU9jJA-5bPPTSdAffRNpDo3RL4 X-Proofpoint-GUID: QE6Bu_TU9jJA-5bPPTSdAffRNpDo3RL4 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.816,Hydra:6.0.425,FMLib:17.11.62.513 definitions=2022-02-11_02,2022-02-09_01,2021-12-02_01 X-Proofpoint-Spam-Details: rule=fb_outbound_notspam policy=fb_outbound score=0 malwarescore=0 spamscore=0 mlxlogscore=999 clxscore=1015 priorityscore=1501 suspectscore=0 mlxscore=0 impostorscore=0 bulkscore=0 phishscore=0 lowpriorityscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2201110000 definitions=main-2202110040 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org The tc-bpf@egress reads and writes the skb->tstamp as delivery_time. If tc-bpf@egress sets skb->tstamp to 0, skb->mono_delivery_time should also be cleared. It is done in cls_bpf.c and act_bpf.c after running the tc-bpf@egress. Signed-off-by: Martin KaFai Lau --- net/sched/act_bpf.c | 2 ++ net/sched/cls_bpf.c | 2 ++ 2 files changed, 4 insertions(+) diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c index 14c3bd0a5088..d89470976500 100644 --- a/net/sched/act_bpf.c +++ b/net/sched/act_bpf.c @@ -49,6 +49,8 @@ static int tcf_bpf_act(struct sk_buff *skb, const struct tc_action *act, } else { bpf_compute_data_pointers(skb); filter_res = bpf_prog_run(filter, skb); + if (unlikely(skb->mono_delivery_time && !skb->tstamp)) + skb->mono_delivery_time = 0; } if (skb_sk_is_prefetched(skb) && filter_res != TC_ACT_OK) skb_orphan(skb); diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c index 036b2e1f74af..c251d1df1fa3 100644 --- a/net/sched/cls_bpf.c +++ b/net/sched/cls_bpf.c @@ -97,6 +97,8 @@ static int cls_bpf_classify(struct sk_buff *skb, const struct tcf_proto *tp, } else { bpf_compute_data_pointers(skb); filter_res = bpf_prog_run(prog->filter, skb); + if (unlikely(skb->mono_delivery_time && !skb->tstamp)) + skb->mono_delivery_time = 0; } if (prog->exts_integrated) { From patchwork Fri Feb 11 07:13:16 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12742940 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id B2125C433F5 for ; Fri, 11 Feb 2022 07:13:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1347624AbiBKHNX (ORCPT ); Fri, 11 Feb 2022 02:13:23 -0500 Received: from mxb-00190b01.gslb.pphosted.com ([23.128.96.19]:36606 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1347621AbiBKHNU (ORCPT ); Fri, 11 Feb 2022 02:13:20 -0500 Received: from mx0a-00082601.pphosted.com (mx0a-00082601.pphosted.com [67.231.145.42]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BD68D110E for ; Thu, 10 Feb 2022 23:13:19 -0800 (PST) Received: from pps.filterd (m0109333.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.1.2/8.16.1.2) with ESMTP id 21ANrSNj018649 for ; Thu, 10 Feb 2022 23:13:19 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=LEZ8scgy1qDfbqzhAvnlUfuwJes0fc2MM9TH2kqtX14=; b=hNtQu47+Ti+nEpkvJqku9Ce6KsM/SGVGyrvQbYw+/MtYtaVnofqy0OBLqnpdWxHNNtAr b8uET2fv1q+tl+Kqx7S8K98k0xex6U61BPlrgXbQ7eOivNgkRZU4tMl/j25Z9zKmkYPH dV/1dlWwSLYaz7G6yi1hHvFaVr1CV5iUGyI= Received: from mail.thefacebook.com ([163.114.132.120]) by mx0a-00082601.pphosted.com (PPS) with ESMTPS id 3e5866v2yv-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Thu, 10 Feb 2022 23:13:19 -0800 Received: from twshared14630.35.frc1.facebook.com (2620:10d:c085:208::11) by mail.thefacebook.com (2620:10d:c085:21d::5) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2308.21; Thu, 10 Feb 2022 23:13:18 -0800 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 596CB6C75B6D; Thu, 10 Feb 2022 23:13:16 -0800 (PST) From: Martin KaFai Lau To: , CC: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , David Miller , Eric Dumazet , Jakub Kicinski , , Willem de Bruijn Subject: [PATCH v4 net-next 7/8] bpf: Add __sk_buff->delivery_time_type and bpf_skb_set_delivery_time() Date: Thu, 10 Feb 2022 23:13:16 -0800 Message-ID: <20220211071316.892630-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220211071232.885225-1-kafai@fb.com> References: <20220211071232.885225-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-GUID: nNw0gq8i3Rbw1DBZiWoEmMoZJBUyGccE X-Proofpoint-ORIG-GUID: nNw0gq8i3Rbw1DBZiWoEmMoZJBUyGccE X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.816,Hydra:6.0.425,FMLib:17.11.62.513 definitions=2022-02-11_02,2022-02-09_01,2021-12-02_01 X-Proofpoint-Spam-Details: rule=fb_outbound_notspam policy=fb_outbound score=0 priorityscore=1501 clxscore=1015 suspectscore=0 mlxscore=0 malwarescore=0 bulkscore=0 impostorscore=0 lowpriorityscore=0 spamscore=0 phishscore=0 adultscore=0 mlxlogscore=999 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2201110000 definitions=main-2202110040 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org * __sk_buff->delivery_time_type: This patch adds __sk_buff->delivery_time_type. It tells if the delivery_time is stored in __sk_buff->tstamp or not. It will be most useful for ingress to tell if the __sk_buff->tstamp has the (rcv) timestamp or delivery_time. If delivery_time_type is 0 (BPF_SKB_DELIVERY_TIME_NONE), it has the (rcv) timestamp. Two non-zero types are defined for the delivery_time_type, BPF_SKB_DELIVERY_TIME_MONO and BPF_SKB_DELIVERY_TIME_UNSPEC. For UNSPEC, it can only happen in egress because only mono delivery_time can be forwarded to ingress now. The clock of UNSPEC delivery_time can be deduced from the skb->sk->sk_clockid which is how the sch_etf doing it also. Thus, while delivery_time_type provides (rcv) timestamp vs delivery_time signal to tc-bpf@ingress, it should not change the existing way of doing thing for tc-bpf@egress other than spelling out more explicitly in the new __sk_buff->delivery_time_type instead of having the tc-bpf to deduce it by checking the sk is tcp (mono EDT) or by checking sk->sk_clockid for non-tcp. delivery_time_type is read only. Its convert_ctx_access() requires the skb's mono_delivery_time bit and tc_at_ingress bit. They are moved up in sk_buff so that bpf rewrite can be done at a fixed offset. tc_skip_classify is moved together with tc_at_ingress. To get one bit for mono_delivery_time, csum_not_inet is moved down and this bit is currently used by sctp. * Provide forwarded delivery_time to tc-bpf@ingress: With the help of the new delivery_time_type, the tc-bpf has a way to tell if the __sk_buff->tstamp has the (rcv) timestamp or the delivery_time. During bpf load time, the verifier will learn if the bpf prog has accessed the new __sk_buff->delivery_time_type. If it does, it means the tc-bpf@ingress is expecting the skb->tstamp could have the delivery_time. The kernel will then keep the forwarded delivery_time in skb->tstamp. This is done by adding a new prog->delivery_time_access bit. Since the tc-bpf@ingress can access the delivery_time, it also needs to clear the skb->mono_delivery_time after running the bpf if 0 has been written to skb->tstamp. This is the same as the tc-bpf@egress in the previous patch. For tail call, the callee will follow the __sk_buff->tstamp expectation of its caller at ingress. If caller does not have its prog->delivery_time_access set, the callee prog will not have the forwarded delivery_time in __sk_buff->tstamp and will have the (rcv) timestamp instead. If needed, in the future, a new attach_type can be added to allow the tc-bpf to explicitly specify its expectation on the __sk_buff->tstamp. * bpf_skb_set_delivery_time(): The bpf_skb_set_delivery_time() helper is added to allow setting both delivery_time and the delivery_time_type at the same time. If the tc-bpf does not need to change the delivery_time_type, it can directly write to the __sk_buff->tstamp as the existing tc-bpf has already been doing. It will be most useful at ingress to change the __sk_buff->tstamp from the (rcv) timestamp to a mono delivery_time and then bpf_redirect_*(). bpf only has mono clock helper (bpf_ktime_get_ns), and the current known use case is the mono EDT for fq, and only mono delivery time can be kept during forward now, so bpf_skb_set_delivery_time() only supports setting BPF_SKB_DELIVERY_TIME_MONO. It can be extended later when use cases come up and the forwarding path also supports other clock bases. This function could be inline and is left as a future exercise. Signed-off-by: Martin KaFai Lau --- include/linux/filter.h | 7 ++- include/linux/skbuff.h | 20 ++++++--- include/uapi/linux/bpf.h | 35 ++++++++++++++- net/core/filter.c | 79 +++++++++++++++++++++++++++++++++- tools/include/uapi/linux/bpf.h | 35 ++++++++++++++- 5 files changed, 164 insertions(+), 12 deletions(-) diff --git a/include/linux/filter.h b/include/linux/filter.h index e43e1701a80e..00bbde352ad0 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -572,7 +572,8 @@ struct bpf_prog { has_callchain_buf:1, /* callchain buffer allocated? */ enforce_expected_attach_type:1, /* Enforce expected_attach_type checking at attach time */ call_get_stack:1, /* Do we call bpf_get_stack() or bpf_get_stackid() */ - call_get_func_ip:1; /* Do we call get_func_ip() */ + call_get_func_ip:1, /* Do we call get_func_ip() */ + delivery_time_access:1; /* Accessed __sk_buff->delivery_time_type */ enum bpf_prog_type type; /* Type of BPF program */ enum bpf_attach_type expected_attach_type; /* For some prog types */ u32 len; /* Number of filter blocks */ @@ -705,7 +706,7 @@ static __always_inline u32 bpf_prog_run_at_ingress(const struct bpf_prog *prog, ktime_t tstamp, saved_mono_dtime = 0; int filter_res; - if (unlikely(skb->mono_delivery_time)) { + if (unlikely(skb->mono_delivery_time) && !prog->delivery_time_access) { saved_mono_dtime = skb->tstamp; skb->mono_delivery_time = 0; if (static_branch_unlikely(&netstamp_needed_key)) @@ -723,6 +724,8 @@ static __always_inline u32 bpf_prog_run_at_ingress(const struct bpf_prog *prog, /* __sk_buff->tstamp was not changed, restore the delivery_time */ if (unlikely(saved_mono_dtime) && skb_tstamp(skb) == tstamp) skb_set_delivery_time(skb, saved_mono_dtime, true); + if (unlikely(skb->mono_delivery_time && !skb->tstamp)) + skb->mono_delivery_time = 0; return filter_res; } diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 0e09e75fa787..fb7146be48f7 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -893,22 +893,23 @@ struct sk_buff { __u8 vlan_present:1; /* See PKT_VLAN_PRESENT_BIT */ __u8 csum_complete_sw:1; __u8 csum_level:2; - __u8 csum_not_inet:1; __u8 dst_pending_confirm:1; + __u8 mono_delivery_time:1; + +#ifdef CONFIG_NET_CLS_ACT + __u8 tc_skip_classify:1; + __u8 tc_at_ingress:1; +#endif #ifdef CONFIG_IPV6_NDISC_NODETYPE __u8 ndisc_nodetype:2; #endif - + __u8 csum_not_inet:1; __u8 ipvs_property:1; __u8 inner_protocol_type:1; __u8 remcsum_offload:1; #ifdef CONFIG_NET_SWITCHDEV __u8 offload_fwd_mark:1; __u8 offload_l3_fwd_mark:1; -#endif -#ifdef CONFIG_NET_CLS_ACT - __u8 tc_skip_classify:1; - __u8 tc_at_ingress:1; #endif __u8 redirected:1; #ifdef CONFIG_NET_REDIRECT @@ -921,7 +922,6 @@ struct sk_buff { __u8 decrypted:1; #endif __u8 slow_gro:1; - __u8 mono_delivery_time:1; #ifdef CONFIG_NET_SCHED __u16 tc_index; /* traffic control index */ @@ -999,10 +999,16 @@ struct sk_buff { /* if you move pkt_vlan_present around you also must adapt these constants */ #ifdef __BIG_ENDIAN_BITFIELD #define PKT_VLAN_PRESENT_BIT 7 +#define TC_AT_INGRESS_MASK (1 << 0) +#define SKB_MONO_DELIVERY_TIME_MASK (1 << 2) #else #define PKT_VLAN_PRESENT_BIT 0 +#define TC_AT_INGRESS_MASK (1 << 7) +#define SKB_MONO_DELIVERY_TIME_MASK (1 << 5) #endif #define PKT_VLAN_PRESENT_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset) +#define TC_AT_INGRESS_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset) +#define SKB_MONO_DELIVERY_TIME_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset) #ifdef __KERNEL__ /* diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 16a7574292a5..b36771b1bfa1 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -5076,6 +5076,31 @@ union bpf_attr { * associated to *xdp_md*, at *offset*. * Return * 0 on success, or a negative error in case of failure. + * + * long bpf_skb_set_delivery_time(struct sk_buff *skb, u64 dtime, u32 dtime_type) + * Description + * Set a *dtime* (delivery time) to the __sk_buff->tstamp and also + * change the __sk_buff->delivery_time_type to *dtime_type*. + * + * Only BPF_SKB_DELIVERY_TIME_MONO is supported in *dtime_type* + * and it is the only delivery_time_type that will be kept + * after bpf_redirect_*(). + * Only ipv4 and ipv6 skb->protocol is supported. + * + * If there is no need to change the __sk_buff->delivery_time_type, + * the delivery_time can be directly written to __sk_buff->tstamp + * instead. + * + * This function is most useful when it needs to set a + * mono delivery_time to __sk_buff->tstamp and then + * bpf_redirect_*() to the egress of an iface. For example, + * changing the (rcv) timestamp in __sk_buff->tstamp at + * ingress to a mono delivery time and then bpf_redirect_*() + * to sch_fq@phy-dev. + * Return + * 0 on success. + * **-EINVAL** for invalid input + * **-EOPNOTSUPP** for unsupported delivery_time_type and protocol */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -5269,6 +5294,7 @@ union bpf_attr { FN(xdp_get_buff_len), \ FN(xdp_load_bytes), \ FN(xdp_store_bytes), \ + FN(skb_set_delivery_time), \ /* */ /* integer value in 'imm' field of BPF_CALL instruction selects which helper @@ -5458,6 +5484,12 @@ union { \ __u64 :64; \ } __attribute__((aligned(8))) +enum { + BPF_SKB_DELIVERY_TIME_NONE, + BPF_SKB_DELIVERY_TIME_UNSPEC, + BPF_SKB_DELIVERY_TIME_MONO, +}; + /* user accessible mirror of in-kernel sk_buff. * new fields can only be added to the end of this structure */ @@ -5498,7 +5530,8 @@ struct __sk_buff { __u32 gso_segs; __bpf_md_ptr(struct bpf_sock *, sk); __u32 gso_size; - __u32 :32; /* Padding, future use. */ + __u8 delivery_time_type; + __u32 :24; /* Padding, future use. */ __u64 hwtstamp; }; diff --git a/net/core/filter.c b/net/core/filter.c index a2d712be4985..0e79a6ca4a95 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -7159,6 +7159,33 @@ static const struct bpf_func_proto bpf_sk_assign_proto = { .arg3_type = ARG_ANYTHING, }; +BPF_CALL_3(bpf_skb_set_delivery_time, struct sk_buff *, skb, + u64, dtime, u32, dtime_type) +{ + if (!dtime) + return -EINVAL; + + /* skb_clear_delivery_time() is done for inet protocol */ + if (dtime_type != BPF_SKB_DELIVERY_TIME_MONO || + (skb->protocol != htons(ETH_P_IP) && + skb->protocol != htons(ETH_P_IPV6))) + return -EOPNOTSUPP; + + skb->mono_delivery_time = 1; + skb->tstamp = dtime; + + return 0; +} + +static const struct bpf_func_proto bpf_skb_set_delivery_time_proto = { + .func = bpf_skb_set_delivery_time, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_ANYTHING, +}; + static const u8 *bpf_search_tcp_opt(const u8 *op, const u8 *opend, u8 search_kind, const u8 *magic, u8 magic_len, bool *eol) @@ -7746,6 +7773,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_tcp_gen_syncookie_proto; case BPF_FUNC_sk_assign: return &bpf_sk_assign_proto; + case BPF_FUNC_skb_set_delivery_time: + return &bpf_skb_set_delivery_time_proto; #endif default: return bpf_sk_base_func_proto(func_id); @@ -8085,7 +8114,9 @@ static bool bpf_skb_is_valid_access(int off, int size, enum bpf_access_type type return false; info->reg_type = PTR_TO_SOCK_COMMON_OR_NULL; break; - case offsetofend(struct __sk_buff, gso_size) ... offsetof(struct __sk_buff, hwtstamp) - 1: + case offsetof(struct __sk_buff, delivery_time_type): + return false; + case offsetofend(struct __sk_buff, delivery_time_type) ... offsetof(struct __sk_buff, hwtstamp) - 1: /* Explicitly prohibit access to padding in __sk_buff. */ return false; default: @@ -8432,6 +8463,8 @@ static bool tc_cls_act_is_valid_access(int off, int size, break; case bpf_ctx_range_till(struct __sk_buff, family, local_port): return false; + case offsetof(struct __sk_buff, delivery_time_type): + return size == sizeof(__u8); } return bpf_skb_is_valid_access(off, size, type, prog, info); @@ -8848,6 +8881,45 @@ static struct bpf_insn *bpf_convert_shinfo_access(const struct bpf_insn *si, return insn; } +static struct bpf_insn *bpf_convert_dtime_type_read(const struct bpf_insn *si, + struct bpf_insn *insn) +{ + __u8 value_reg = si->dst_reg; + __u8 skb_reg = si->src_reg; + __u8 tmp_reg = BPF_REG_AX; + + *insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, + SKB_MONO_DELIVERY_TIME_OFFSET); + *insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, + SKB_MONO_DELIVERY_TIME_MASK); + *insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 2); + /* value_reg = BPF_SKB_DELIVERY_TIME_MONO */ + *insn++ = BPF_MOV32_IMM(value_reg, BPF_SKB_DELIVERY_TIME_MONO); + *insn++ = BPF_JMP_A(IS_ENABLED(CONFIG_NET_CLS_ACT) ? 10 : 5); + + *insn++ = BPF_LDX_MEM(BPF_DW, tmp_reg, skb_reg, + offsetof(struct sk_buff, tstamp)); + *insn++ = BPF_JMP_IMM(BPF_JNE, tmp_reg, 0, 2); + /* value_reg = BPF_SKB_DELIVERY_TIME_NONE */ + *insn++ = BPF_MOV32_IMM(value_reg, BPF_SKB_DELIVERY_TIME_NONE); + *insn++ = BPF_JMP_A(IS_ENABLED(CONFIG_NET_CLS_ACT) ? 6 : 1); + +#ifdef CONFIG_NET_CLS_ACT + *insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET); + *insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, TC_AT_INGRESS_MASK); + *insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 2); + /* At ingress, value_reg = 0 */ + *insn++ = BPF_MOV32_IMM(value_reg, 0); + *insn++ = BPF_JMP_A(1); +#endif + + /* value_reg = BPF_SKB_DELIVERYT_TIME_UNSPEC */ + *insn++ = BPF_MOV32_IMM(value_reg, BPF_SKB_DELIVERY_TIME_UNSPEC); + + /* 15 insns with CONFIG_NET_CLS_ACT */ + return insn; +} + static u32 bpf_convert_ctx_access(enum bpf_access_type type, const struct bpf_insn *si, struct bpf_insn *insn_buf, @@ -9169,6 +9241,11 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type type, target_size)); break; + case offsetof(struct __sk_buff, delivery_time_type): + insn = bpf_convert_dtime_type_read(si, insn); + prog->delivery_time_access = 1; + break; + case offsetof(struct __sk_buff, gso_segs): insn = bpf_convert_shinfo_access(si, insn); *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct skb_shared_info, gso_segs), diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 16a7574292a5..b36771b1bfa1 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -5076,6 +5076,31 @@ union bpf_attr { * associated to *xdp_md*, at *offset*. * Return * 0 on success, or a negative error in case of failure. + * + * long bpf_skb_set_delivery_time(struct sk_buff *skb, u64 dtime, u32 dtime_type) + * Description + * Set a *dtime* (delivery time) to the __sk_buff->tstamp and also + * change the __sk_buff->delivery_time_type to *dtime_type*. + * + * Only BPF_SKB_DELIVERY_TIME_MONO is supported in *dtime_type* + * and it is the only delivery_time_type that will be kept + * after bpf_redirect_*(). + * Only ipv4 and ipv6 skb->protocol is supported. + * + * If there is no need to change the __sk_buff->delivery_time_type, + * the delivery_time can be directly written to __sk_buff->tstamp + * instead. + * + * This function is most useful when it needs to set a + * mono delivery_time to __sk_buff->tstamp and then + * bpf_redirect_*() to the egress of an iface. For example, + * changing the (rcv) timestamp in __sk_buff->tstamp at + * ingress to a mono delivery time and then bpf_redirect_*() + * to sch_fq@phy-dev. + * Return + * 0 on success. + * **-EINVAL** for invalid input + * **-EOPNOTSUPP** for unsupported delivery_time_type and protocol */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -5269,6 +5294,7 @@ union bpf_attr { FN(xdp_get_buff_len), \ FN(xdp_load_bytes), \ FN(xdp_store_bytes), \ + FN(skb_set_delivery_time), \ /* */ /* integer value in 'imm' field of BPF_CALL instruction selects which helper @@ -5458,6 +5484,12 @@ union { \ __u64 :64; \ } __attribute__((aligned(8))) +enum { + BPF_SKB_DELIVERY_TIME_NONE, + BPF_SKB_DELIVERY_TIME_UNSPEC, + BPF_SKB_DELIVERY_TIME_MONO, +}; + /* user accessible mirror of in-kernel sk_buff. * new fields can only be added to the end of this structure */ @@ -5498,7 +5530,8 @@ struct __sk_buff { __u32 gso_segs; __bpf_md_ptr(struct bpf_sock *, sk); __u32 gso_size; - __u32 :32; /* Padding, future use. */ + __u8 delivery_time_type; + __u32 :24; /* Padding, future use. */ __u64 hwtstamp; }; From patchwork Fri Feb 11 07:13:22 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12742941 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 77B1EC433EF for ; Fri, 11 Feb 2022 07:13:35 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1347628AbiBKHNe (ORCPT ); Fri, 11 Feb 2022 02:13:34 -0500 Received: from mxb-00190b01.gslb.pphosted.com ([23.128.96.19]:36824 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1347642AbiBKHNc (ORCPT ); Fri, 11 Feb 2022 02:13:32 -0500 Received: from mx0a-00082601.pphosted.com (mx0a-00082601.pphosted.com [67.231.145.42]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8E9A126DF for ; Thu, 10 Feb 2022 23:13:30 -0800 (PST) Received: from pps.filterd (m0044012.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.1.2/8.16.1.2) with ESMTP id 21ANrgL9007907 for ; Thu, 10 Feb 2022 23:13:30 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=/jtFkhp35HfBDdv1QEOtm4rdpgGFv2fjKm6qwsS63gM=; b=WJQeVkv9GYuWglu9S0HLdpyhlEdhmG8Ev0tYs49bAwg7SKM54XpNxrqGkKMzM6WNY0Dz AE/CBuDg4+OKa4VOMbo7wMYfoJtdTjjNxilW5RxU/c/3S2/L1iLdAkZKhDKUo2aAKw5Y 3OwMfKjbarKg2Uh4IdOcOLePX61YJtr3unE= Received: from mail.thefacebook.com ([163.114.132.120]) by mx0a-00082601.pphosted.com (PPS) with ESMTPS id 3e5853c4va-3 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Thu, 10 Feb 2022 23:13:30 -0800 Received: from twshared12416.02.ash9.facebook.com (2620:10d:c085:208::11) by mail.thefacebook.com (2620:10d:c085:21d::6) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2308.21; Thu, 10 Feb 2022 23:13:29 -0800 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 9C1496C75C7C; Thu, 10 Feb 2022 23:13:22 -0800 (PST) From: Martin KaFai Lau To: , CC: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , David Miller , Eric Dumazet , Jakub Kicinski , , Willem de Bruijn Subject: [PATCH v4 net-next 8/8] bpf: selftests: test skb->tstamp in redirect_neigh Date: Thu, 10 Feb 2022 23:13:22 -0800 Message-ID: <20220211071322.894148-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220211071232.885225-1-kafai@fb.com> References: <20220211071232.885225-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-GUID: -p5gwyddRHiUTqcXluRCmTiWcbT4cFzc X-Proofpoint-ORIG-GUID: -p5gwyddRHiUTqcXluRCmTiWcbT4cFzc X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.816,Hydra:6.0.425,FMLib:17.11.62.513 definitions=2022-02-11_02,2022-02-09_01,2021-12-02_01 X-Proofpoint-Spam-Details: rule=fb_outbound_notspam policy=fb_outbound score=0 impostorscore=0 mlxlogscore=999 suspectscore=0 priorityscore=1501 lowpriorityscore=0 bulkscore=0 spamscore=0 clxscore=1015 malwarescore=0 phishscore=0 mlxscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2201110000 definitions=main-2202110040 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org This patch adds tests on forwarding the delivery_time for the following cases - tcp/udp + ip4/ip6 + bpf_redirect_neigh - tcp/udp + ip4/ip6 + ip[6]_forward - bpf_skb_set_delivery_time - The old rcv timestamp expectation on tc-bpf@ingress Signed-off-by: Martin KaFai Lau --- .../selftests/bpf/prog_tests/tc_redirect.c | 434 ++++++++++++++++++ .../selftests/bpf/progs/test_tc_dtime.c | 348 ++++++++++++++ 2 files changed, 782 insertions(+) create mode 100644 tools/testing/selftests/bpf/progs/test_tc_dtime.c diff --git a/tools/testing/selftests/bpf/prog_tests/tc_redirect.c b/tools/testing/selftests/bpf/prog_tests/tc_redirect.c index c2426df58e17..3036154916a7 100644 --- a/tools/testing/selftests/bpf/prog_tests/tc_redirect.c +++ b/tools/testing/selftests/bpf/prog_tests/tc_redirect.c @@ -17,6 +17,8 @@ #include #include #include +#include +#include #include #include #include @@ -29,6 +31,11 @@ #include "test_tc_neigh_fib.skel.h" #include "test_tc_neigh.skel.h" #include "test_tc_peer.skel.h" +#include "test_tc_dtime.skel.h" + +#ifndef TCP_TX_DELAY +#define TCP_TX_DELAY 37 +#endif #define NS_SRC "ns_src" #define NS_FWD "ns_fwd" @@ -61,6 +68,7 @@ #define CHK_PROG_PIN_FILE "/sys/fs/bpf/test_tc_chk" #define TIMEOUT_MILLIS 10000 +#define NSEC_PER_SEC 1000000000ULL #define log_err(MSG, ...) \ fprintf(stderr, "(%s:%d: errno: %s) " MSG "\n", \ @@ -440,6 +448,431 @@ static int set_forwarding(bool enable) return 0; } +static void rcv_tstamp(int fd, const char *expected, size_t s) +{ + struct __kernel_timespec pkt_ts = {}; + char ctl[CMSG_SPACE(sizeof(pkt_ts))]; + struct timespec now_ts; + struct msghdr msg = {}; + __u64 now_ns, pkt_ns; + struct cmsghdr *cmsg; + struct iovec iov; + char data[32]; + int ret; + + iov.iov_base = data; + iov.iov_len = sizeof(data); + msg.msg_iov = &iov; + msg.msg_iovlen = 1; + msg.msg_control = &ctl; + msg.msg_controllen = sizeof(ctl); + + ret = recvmsg(fd, &msg, 0); + if (!ASSERT_EQ(ret, s, "recvmsg")) + return; + ASSERT_STRNEQ(data, expected, s, "expected rcv data"); + + cmsg = CMSG_FIRSTHDR(&msg); + if (cmsg && cmsg->cmsg_level == SOL_SOCKET && + cmsg->cmsg_type == SO_TIMESTAMPNS_NEW) + memcpy(&pkt_ts, CMSG_DATA(cmsg), sizeof(pkt_ts)); + + pkt_ns = pkt_ts.tv_sec * NSEC_PER_SEC + pkt_ts.tv_nsec; + ASSERT_NEQ(pkt_ns, 0, "pkt rcv tstamp"); + + ret = clock_gettime(CLOCK_REALTIME, &now_ts); + ASSERT_OK(ret, "clock_gettime"); + now_ns = now_ts.tv_sec * NSEC_PER_SEC + now_ts.tv_nsec; + + if (ASSERT_GE(now_ns, pkt_ns, "check rcv tstamp")) + ASSERT_LT(now_ns - pkt_ns, 5 * NSEC_PER_SEC, + "check rcv tstamp"); +} + +static void snd_tstamp(int fd, char *b, size_t s) +{ + struct sock_txtime opt = { .clockid = CLOCK_TAI }; + char ctl[CMSG_SPACE(sizeof(__u64))]; + struct timespec now_ts; + struct msghdr msg = {}; + struct cmsghdr *cmsg; + struct iovec iov; + __u64 now_ns; + int ret; + + ret = clock_gettime(CLOCK_TAI, &now_ts); + ASSERT_OK(ret, "clock_get_time(CLOCK_TAI)"); + now_ns = now_ts.tv_sec * NSEC_PER_SEC + now_ts.tv_nsec; + + iov.iov_base = b; + iov.iov_len = s; + msg.msg_iov = &iov; + msg.msg_iovlen = 1; + msg.msg_control = &ctl; + msg.msg_controllen = sizeof(ctl); + + cmsg = CMSG_FIRSTHDR(&msg); + cmsg->cmsg_level = SOL_SOCKET; + cmsg->cmsg_type = SCM_TXTIME; + cmsg->cmsg_len = CMSG_LEN(sizeof(now_ns)); + *(__u64 *)CMSG_DATA(cmsg) = now_ns; + + ret = setsockopt(fd, SOL_SOCKET, SO_TXTIME, &opt, sizeof(opt)); + ASSERT_OK(ret, "setsockopt(SO_TXTIME)"); + + ret = sendmsg(fd, &msg, 0); + ASSERT_EQ(ret, s, "sendmsg"); +} + +static void test_inet_dtime(int family, int type, const char *addr, __u16 port) +{ + int opt = 1, accept_fd = -1, client_fd = -1, listen_fd, err; + char buf[] = "testing testing"; + struct nstoken *nstoken; + + nstoken = open_netns(NS_DST); + if (!ASSERT_OK_PTR(nstoken, "setns dst")) + return; + listen_fd = start_server(family, type, addr, port, 0); + close_netns(nstoken); + + if (!ASSERT_GE(listen_fd, 0, "listen")) + return; + + /* Ensure the kernel puts the (rcv) timestamp for all skb */ + err = setsockopt(listen_fd, SOL_SOCKET, SO_TIMESTAMPNS_NEW, + &opt, sizeof(opt)); + if (!ASSERT_OK(err, "setsockopt(SO_TIMESTAMPNS_NEW)")) + goto done; + + if (type == SOCK_STREAM) { + /* Ensure the kernel set EDT when sending out rst/ack + * from the kernel's ctl_sk. + */ + err = setsockopt(listen_fd, SOL_TCP, TCP_TX_DELAY, &opt, + sizeof(opt)); + if (!ASSERT_OK(err, "setsockopt(TCP_TX_DELAY)")) + goto done; + } + + nstoken = open_netns(NS_SRC); + if (!ASSERT_OK_PTR(nstoken, "setns src")) + goto done; + client_fd = connect_to_fd(listen_fd, TIMEOUT_MILLIS); + close_netns(nstoken); + + if (!ASSERT_GE(client_fd, 0, "connect_to_fd")) + goto done; + + if (type == SOCK_STREAM) { + int n; + + accept_fd = accept(listen_fd, NULL, NULL); + if (!ASSERT_GE(accept_fd, 0, "accept")) + goto done; + + n = write(client_fd, buf, sizeof(buf)); + if (!ASSERT_EQ(n, sizeof(buf), "send to server")) + goto done; + rcv_tstamp(accept_fd, buf, sizeof(buf)); + } else { + snd_tstamp(client_fd, buf, sizeof(buf)); + rcv_tstamp(listen_fd, buf, sizeof(buf)); + } + +done: + close(listen_fd); + if (accept_fd != -1) + close(accept_fd); + if (client_fd != -1) + close(client_fd); +} + +static int netns_load_dtime_bpf(struct test_tc_dtime *skel) +{ + struct nstoken *nstoken; + +#define PIN_FNAME(__file) "/sys/fs/bpf/" #__file +#define PIN(__prog) ({ \ + int err = bpf_program__pin(skel->progs.__prog, PIN_FNAME(__prog)); \ + if (!ASSERT_OK(err, "pin " #__prog)) \ + goto fail; \ + }) + + /* setup ns_src tc progs */ + nstoken = open_netns(NS_SRC); + if (!ASSERT_OK_PTR(nstoken, "setns " NS_SRC)) + return -1; + PIN(egress_host); + PIN(ingress_host); + SYS("tc qdisc add dev veth_src clsact"); + SYS("tc filter add dev veth_src ingress bpf da object-pinned " + PIN_FNAME(ingress_host)); + SYS("tc filter add dev veth_src egress bpf da object-pinned " + PIN_FNAME(egress_host)); + close_netns(nstoken); + + /* setup ns_dst tc progs */ + nstoken = open_netns(NS_DST); + if (!ASSERT_OK_PTR(nstoken, "setns " NS_DST)) + return -1; + PIN(egress_host); + PIN(ingress_host); + SYS("tc qdisc add dev veth_dst clsact"); + SYS("tc filter add dev veth_dst ingress bpf da object-pinned " + PIN_FNAME(ingress_host)); + SYS("tc filter add dev veth_dst egress bpf da object-pinned " + PIN_FNAME(egress_host)); + close_netns(nstoken); + + /* setup ns_fwd tc progs */ + nstoken = open_netns(NS_FWD); + if (!ASSERT_OK_PTR(nstoken, "setns " NS_FWD)) + return -1; + PIN(ingress_fwdns_prio100); + PIN(egress_fwdns_prio100); + PIN(ingress_fwdns_prio101); + PIN(egress_fwdns_prio101); + SYS("tc qdisc add dev veth_dst_fwd clsact"); + SYS("tc filter add dev veth_dst_fwd ingress prio 100 bpf da object-pinned " + PIN_FNAME(ingress_fwdns_prio100)); + SYS("tc filter add dev veth_dst_fwd ingress prio 101 bpf da object-pinned " + PIN_FNAME(ingress_fwdns_prio101)); + SYS("tc filter add dev veth_dst_fwd egress prio 100 bpf da object-pinned " + PIN_FNAME(egress_fwdns_prio100)); + SYS("tc filter add dev veth_dst_fwd egress prio 101 bpf da object-pinned " + PIN_FNAME(egress_fwdns_prio101)); + SYS("tc qdisc add dev veth_src_fwd clsact"); + SYS("tc filter add dev veth_src_fwd ingress prio 100 bpf da object-pinned " + PIN_FNAME(ingress_fwdns_prio100)); + SYS("tc filter add dev veth_src_fwd ingress prio 101 bpf da object-pinned " + PIN_FNAME(ingress_fwdns_prio101)); + SYS("tc filter add dev veth_src_fwd egress prio 100 bpf da object-pinned " + PIN_FNAME(egress_fwdns_prio100)); + SYS("tc filter add dev veth_src_fwd egress prio 101 bpf da object-pinned " + PIN_FNAME(egress_fwdns_prio101)); + close_netns(nstoken); + +#undef PIN + + return 0; + +fail: + close_netns(nstoken); + return -1; +} + +enum { + INGRESS_FWDNS_P100, + INGRESS_FWDNS_P101, + EGRESS_FWDNS_P100, + EGRESS_FWDNS_P101, + INGRESS_ENDHOST, + EGRESS_ENDHOST, + SET_DTIME, + __MAX_CNT, +}; + +const char *cnt_names[] = { + "ingress_fwdns_p100", + "ingress_fwdns_p101", + "egress_fwdns_p100", + "egress_fwdns_p101", + "ingress_endhost", + "egress_endhost", + "set_dtime", +}; + +enum { + TCP_IP6_CLEAR_DTIME, + TCP_IP4, + TCP_IP6, + UDP_IP4, + UDP_IP6, + TCP_IP4_RT_FWD, + TCP_IP6_RT_FWD, + UDP_IP4_RT_FWD, + UDP_IP6_RT_FWD, + UKN_TEST, + __NR_TESTS, +}; + +const char *test_names[] = { + "tcp ip6 clear dtime", + "tcp ip4", + "tcp ip6", + "udp ip4", + "udp ip6", + "tcp ip4 rt fwd", + "tcp ip6 rt fwd", + "udp ip4 rt fwd", + "udp ip6 rt fwd", +}; + +static const char *dtime_cnt_str(int test, int cnt) +{ + static char name[64]; + + snprintf(name, sizeof(name), "%s %s", test_names[test], cnt_names[cnt]); + + return name; +} + +static const char *dtime_err_str(int test, int cnt) +{ + static char name[64]; + + snprintf(name, sizeof(name), "%s %s errs", test_names[test], + cnt_names[cnt]); + + return name; +} + +static void test_tcp_clear_dtime(struct test_tc_dtime *skel) +{ + int i, t = TCP_IP6_CLEAR_DTIME; + __u32 *dtimes = skel->bss->dtimes[t]; + __u32 *errs = skel->bss->errs[t]; + + skel->bss->test = t; + test_inet_dtime(AF_INET6, SOCK_STREAM, IP6_DST, 0); + + ASSERT_EQ(dtimes[INGRESS_FWDNS_P100], 0, + dtime_cnt_str(t, INGRESS_FWDNS_P100)); + ASSERT_EQ(dtimes[INGRESS_FWDNS_P101], 0, + dtime_cnt_str(t, INGRESS_FWDNS_P101)); + ASSERT_GT(dtimes[EGRESS_FWDNS_P100], 0, + dtime_cnt_str(t, EGRESS_FWDNS_P100)); + ASSERT_EQ(dtimes[EGRESS_FWDNS_P101], 0, + dtime_cnt_str(t, EGRESS_FWDNS_P101)); + ASSERT_GT(dtimes[EGRESS_ENDHOST], 0, + dtime_cnt_str(t, EGRESS_ENDHOST)); + ASSERT_GT(dtimes[INGRESS_ENDHOST], 0, + dtime_cnt_str(t, INGRESS_ENDHOST)); + + for (i = INGRESS_FWDNS_P100; i < __MAX_CNT; i++) + ASSERT_EQ(errs[i], 0, dtime_err_str(t, i)); +} + +static void test_tcp_dtime(struct test_tc_dtime *skel, int family, bool bpf_fwd) +{ + __u32 *dtimes, *errs; + const char *addr; + int i, t; + + if (family == AF_INET) { + t = bpf_fwd ? TCP_IP4 : TCP_IP4_RT_FWD; + addr = IP4_DST; + } else { + t = bpf_fwd ? TCP_IP6 : TCP_IP6_RT_FWD; + addr = IP6_DST; + } + + dtimes = skel->bss->dtimes[t]; + errs = skel->bss->errs[t]; + + skel->bss->test = t; + test_inet_dtime(family, SOCK_STREAM, addr, 0); + + /* fwdns_prio100 prog does not read delivery_time_type, so + * kernel puts the (rcv) timetamp in __sk_buff->tstamp + */ + ASSERT_EQ(dtimes[INGRESS_FWDNS_P100], 0, + dtime_cnt_str(t, INGRESS_FWDNS_P100)); + for (i = INGRESS_FWDNS_P101; i < SET_DTIME; i++) + ASSERT_GT(dtimes[i], 0, dtime_cnt_str(t, i)); + + for (i = INGRESS_FWDNS_P100; i < __MAX_CNT; i++) + ASSERT_EQ(errs[i], 0, dtime_err_str(t, i)); +} + +static void test_udp_dtime(struct test_tc_dtime *skel, int family, bool bpf_fwd) +{ + __u32 *dtimes, *errs; + const char *addr; + int i, t; + + if (family == AF_INET) { + t = bpf_fwd ? UDP_IP4 : UDP_IP4_RT_FWD; + addr = IP4_DST; + } else { + t = bpf_fwd ? UDP_IP6 : UDP_IP6_RT_FWD; + addr = IP6_DST; + } + + dtimes = skel->bss->dtimes[t]; + errs = skel->bss->errs[t]; + + skel->bss->test = t; + test_inet_dtime(family, SOCK_DGRAM, addr, 0); + + ASSERT_EQ(dtimes[INGRESS_FWDNS_P100], 0, + dtime_cnt_str(t, INGRESS_FWDNS_P100)); + /* non mono delivery time is not forwarded */ + ASSERT_EQ(dtimes[INGRESS_FWDNS_P101], 0, + dtime_cnt_str(t, INGRESS_FWDNS_P100)); + for (i = EGRESS_FWDNS_P100; i < SET_DTIME; i++) + ASSERT_GT(dtimes[i], 0, dtime_cnt_str(t, i)); + + for (i = INGRESS_FWDNS_P100; i < __MAX_CNT; i++) + ASSERT_EQ(errs[i], 0, dtime_err_str(t, i)); +} + +static void test_tc_redirect_dtime(struct netns_setup_result *setup_result) +{ + struct test_tc_dtime *skel; + struct nstoken *nstoken; + int err; + + skel = test_tc_dtime__open(); + if (!ASSERT_OK_PTR(skel, "test_tc_dtime__open")) + return; + + skel->rodata->IFINDEX_SRC = setup_result->ifindex_veth_src_fwd; + skel->rodata->IFINDEX_DST = setup_result->ifindex_veth_dst_fwd; + + err = test_tc_dtime__load(skel); + if (!ASSERT_OK(err, "test_tc_dtime__load")) + goto done; + + if (netns_load_dtime_bpf(skel)) + goto done; + + nstoken = open_netns(NS_FWD); + if (!ASSERT_OK_PTR(nstoken, "setns fwd")) + goto done; + err = set_forwarding(false); + close_netns(nstoken); + if (!ASSERT_OK(err, "disable forwarding")) + goto done; + + test_tcp_clear_dtime(skel); + + test_tcp_dtime(skel, AF_INET, true); + test_tcp_dtime(skel, AF_INET6, true); + test_udp_dtime(skel, AF_INET, true); + test_udp_dtime(skel, AF_INET6, true); + + /* Test the kernel ip[6]_forward path instead + * of bpf_redirect_neigh(). + */ + nstoken = open_netns(NS_FWD); + if (!ASSERT_OK_PTR(nstoken, "setns fwd")) + goto done; + err = set_forwarding(true); + close_netns(nstoken); + if (!ASSERT_OK(err, "enable forwarding")) + goto done; + + test_tcp_dtime(skel, AF_INET, false); + test_tcp_dtime(skel, AF_INET6, false); + test_udp_dtime(skel, AF_INET, false); + test_udp_dtime(skel, AF_INET6, false); + +done: + test_tc_dtime__destroy(skel); +} + static void test_tc_redirect_neigh_fib(struct netns_setup_result *setup_result) { struct nstoken *nstoken = NULL; @@ -787,6 +1220,7 @@ static void *test_tc_redirect_run_tests(void *arg) RUN_TEST(tc_redirect_peer_l3); RUN_TEST(tc_redirect_neigh); RUN_TEST(tc_redirect_neigh_fib); + RUN_TEST(tc_redirect_dtime); return NULL; } diff --git a/tools/testing/selftests/bpf/progs/test_tc_dtime.c b/tools/testing/selftests/bpf/progs/test_tc_dtime.c new file mode 100644 index 000000000000..b52ed4e4696b --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_tc_dtime.c @@ -0,0 +1,348 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* veth_src --- veth_src_fwd --- veth_det_fwd --- veth_dst + * | | + * ns_src | ns_fwd | ns_dst + * + * ns_src and ns_dst: ENDHOST namespace + * ns_fwd: Fowarding namespace + */ + +#define ctx_ptr(field) (void *)(long)(field) + +#define ip4_src __bpf_htonl(0xac100164) /* 172.16.1.100 */ +#define ip4_dst __bpf_htonl(0xac100264) /* 172.16.2.100 */ + +#define ip6_src { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, \ + 0x00, 0x01, 0xde, 0xad, 0xbe, 0xef, 0xca, 0xfe } +#define ip6_dst { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, \ + 0x00, 0x02, 0xde, 0xad, 0xbe, 0xef, 0xca, 0xfe } + +#define v6_equal(a, b) (a.s6_addr32[0] == b.s6_addr32[0] && \ + a.s6_addr32[1] == b.s6_addr32[1] && \ + a.s6_addr32[2] == b.s6_addr32[2] && \ + a.s6_addr32[3] == b.s6_addr32[3]) + +volatile const __u32 IFINDEX_SRC; +volatile const __u32 IFINDEX_DST; + +#define EGRESS_ENDHOST_MAGIC 0x0b9fbeef +#define INGRESS_FWDNS_MAGIC 0x1b9fbeef +#define EGRESS_FWDNS_MAGIC 0x2b9fbeef + +enum { + INGRESS_FWDNS_P100, + INGRESS_FWDNS_P101, + EGRESS_FWDNS_P100, + EGRESS_FWDNS_P101, + INGRESS_ENDHOST, + EGRESS_ENDHOST, + SET_DTIME, + __MAX_CNT, +}; + +enum { + TCP_IP6_CLEAR_DTIME, + TCP_IP4, + TCP_IP6, + UDP_IP4, + UDP_IP6, + TCP_IP4_RT_FWD, + TCP_IP6_RT_FWD, + UDP_IP4_RT_FWD, + UDP_IP6_RT_FWD, + UKN_TEST, + __NR_TESTS, +}; + +enum { + SRC_NS = 1, + DST_NS, +}; + +__u32 dtimes[__NR_TESTS][__MAX_CNT] = {}; +__u32 errs[__NR_TESTS][__MAX_CNT] = {}; +__u32 test = 0; + +static void inc_dtimes(__u32 idx) +{ + if (test < __NR_TESTS) + dtimes[test][idx]++; + else + dtimes[UKN_TEST][idx]++; +} + +static void inc_errs(__u32 idx) +{ + if (test < __NR_TESTS) + errs[test][idx]++; + else + errs[UKN_TEST][idx]++; +} + +static int skb_proto(int type) +{ + return type & 0xff; +} + +static int skb_ns(int type) +{ + return (type >> 8) & 0xff; +} + +static bool fwdns_clear_dtime(void) +{ + return test == TCP_IP6_CLEAR_DTIME; +} + +static bool bpf_fwd(void) +{ + return test < TCP_IP4_RT_FWD; +} + +/* -1: parse error: TC_ACT_SHOT + * 0: not testing traffic: TC_ACT_OK + * >0: first byte is the inet_proto, second byte has the netns + * of the sender + */ +static int skb_get_type(struct __sk_buff *skb) +{ + void *data_end = ctx_ptr(skb->data_end); + void *data = ctx_ptr(skb->data); + __u8 inet_proto = 0, ns = 0; + struct ipv6hdr *ip6h; + struct iphdr *iph; + + switch (skb->protocol) { + case __bpf_htons(ETH_P_IP): + iph = data + sizeof(struct ethhdr); + if (iph + 1 > data_end) + return -1; + if (iph->saddr == ip4_src) + ns = SRC_NS; + else if (iph->saddr == ip4_dst) + ns = DST_NS; + inet_proto = iph->protocol; + break; + case __bpf_htons(ETH_P_IPV6): + ip6h = data + sizeof(struct ethhdr); + if (ip6h + 1 > data_end) + return -1; + if (v6_equal(ip6h->saddr, (struct in6_addr)ip6_src)) + ns = SRC_NS; + else if (v6_equal(ip6h->saddr, (struct in6_addr)ip6_dst)) + ns = DST_NS; + inet_proto = ip6h->nexthdr; + break; + default: + return 0; + } + + if ((inet_proto != IPPROTO_TCP && inet_proto != IPPROTO_UDP) || !ns) + return 0; + + return (ns << 8 | inet_proto); +} + +/* format: direction@iface@netns + * egress@veth_(src|dst)@ns_(src|dst) + */ +SEC("tc") +int egress_host(struct __sk_buff *skb) +{ + int skb_type; + + skb_type = skb_get_type(skb); + if (skb_type == -1) + return TC_ACT_SHOT; + if (!skb_type) + return TC_ACT_OK; + + if (skb_proto(skb_type) == IPPROTO_TCP) { + if (skb->delivery_time_type == BPF_SKB_DELIVERY_TIME_MONO && + skb->tstamp) + inc_dtimes(EGRESS_ENDHOST); + else + inc_errs(EGRESS_ENDHOST); + } else { + if (skb->delivery_time_type == BPF_SKB_DELIVERY_TIME_UNSPEC && + skb->tstamp) + inc_dtimes(EGRESS_ENDHOST); + else + inc_errs(EGRESS_ENDHOST); + } + + skb->tstamp = EGRESS_ENDHOST_MAGIC; + + return TC_ACT_OK; +} + +/* ingress@veth_(src|dst)@ns_(src|dst) */ +SEC("tc") +int ingress_host(struct __sk_buff *skb) +{ + int skb_type; + + skb_type = skb_get_type(skb); + if (skb_type == -1) + return TC_ACT_SHOT; + if (!skb_type) + return TC_ACT_OK; + + if (skb->delivery_time_type == BPF_SKB_DELIVERY_TIME_MONO && + skb->tstamp == EGRESS_FWDNS_MAGIC) + inc_dtimes(INGRESS_ENDHOST); + else + inc_errs(INGRESS_ENDHOST); + + return TC_ACT_OK; +} + +/* ingress@veth_(src|dst)_fwd@ns_fwd priority 100 */ +SEC("tc") +int ingress_fwdns_prio100(struct __sk_buff *skb) +{ + int skb_type; + + skb_type = skb_get_type(skb); + if (skb_type == -1) + return TC_ACT_SHOT; + if (!skb_type) + return TC_ACT_OK; + + /* delivery_time is only available to the ingress + * if the tc-bpf checks the skb->delivery_time_type. + */ + if (skb->tstamp == EGRESS_ENDHOST_MAGIC) + inc_errs(INGRESS_FWDNS_P100); + + if (fwdns_clear_dtime()) + skb->tstamp = 0; + + return TC_ACT_UNSPEC; +} + +/* egress@veth_(src|dst)_fwd@ns_fwd priority 100 */ +SEC("tc") +int egress_fwdns_prio100(struct __sk_buff *skb) +{ + int skb_type; + + skb_type = skb_get_type(skb); + if (skb_type == -1) + return TC_ACT_SHOT; + if (!skb_type) + return TC_ACT_OK; + + /* delivery_time is always available to egress even + * the tc-bpf did not use the delivery_time_type. + */ + if (skb->delivery_time_type == BPF_SKB_DELIVERY_TIME_MONO && + skb->tstamp == INGRESS_FWDNS_MAGIC) + inc_dtimes(EGRESS_FWDNS_P100); + else + inc_errs(EGRESS_FWDNS_P100); + + if (fwdns_clear_dtime()) + skb->tstamp = 0; + + return TC_ACT_UNSPEC; +} + +/* ingress@veth_(src|dst)_fwd@ns_fwd priority 101 */ +SEC("tc") +int ingress_fwdns_prio101(struct __sk_buff *skb) +{ + __u64 expected_dtime = EGRESS_ENDHOST_MAGIC; + int skb_type; + + skb_type = skb_get_type(skb); + if (skb_type == -1 || !skb_type) + /* Should have handled in prio100 */ + return TC_ACT_SHOT; + + if (skb_proto(skb_type) == IPPROTO_UDP) + expected_dtime = 0; + + if (skb->delivery_time_type) { + if (fwdns_clear_dtime() || + skb->delivery_time_type != BPF_SKB_DELIVERY_TIME_MONO || + skb->tstamp != expected_dtime) + inc_errs(INGRESS_FWDNS_P101); + else + inc_dtimes(INGRESS_FWDNS_P101); + } else { + if (!fwdns_clear_dtime() && expected_dtime) + inc_errs(INGRESS_FWDNS_P101); + } + + if (skb->delivery_time_type == BPF_SKB_DELIVERY_TIME_MONO) { + skb->tstamp = INGRESS_FWDNS_MAGIC; + } else { + if (bpf_skb_set_delivery_time(skb, INGRESS_FWDNS_MAGIC, + BPF_SKB_DELIVERY_TIME_MONO)) + inc_errs(SET_DTIME); + if (!bpf_skb_set_delivery_time(skb, INGRESS_FWDNS_MAGIC, + BPF_SKB_DELIVERY_TIME_UNSPEC)) + inc_errs(SET_DTIME); + } + + if (skb_ns(skb_type) == SRC_NS) + return bpf_fwd() ? + bpf_redirect_neigh(IFINDEX_DST, NULL, 0, 0) : TC_ACT_OK; + else + return bpf_fwd() ? + bpf_redirect_neigh(IFINDEX_SRC, NULL, 0, 0) : TC_ACT_OK; +} + +/* egress@veth_(src|dst)_fwd@ns_fwd priority 101 */ +SEC("tc") +int egress_fwdns_prio101(struct __sk_buff *skb) +{ + int skb_type; + + skb_type = skb_get_type(skb); + if (skb_type == -1 || !skb_type) + /* Should have handled in prio100 */ + return TC_ACT_SHOT; + + if (skb->delivery_time_type) { + if (fwdns_clear_dtime() || + skb->delivery_time_type != BPF_SKB_DELIVERY_TIME_MONO || + skb->tstamp != INGRESS_FWDNS_MAGIC) + inc_errs(EGRESS_FWDNS_P101); + else + inc_dtimes(EGRESS_FWDNS_P101); + } else { + if (!fwdns_clear_dtime()) + inc_errs(EGRESS_FWDNS_P101); + } + + if (skb->delivery_time_type == BPF_SKB_DELIVERY_TIME_MONO) { + skb->tstamp = EGRESS_FWDNS_MAGIC; + } else { + if (bpf_skb_set_delivery_time(skb, EGRESS_FWDNS_MAGIC, + BPF_SKB_DELIVERY_TIME_MONO)) + inc_errs(SET_DTIME); + if (!bpf_skb_set_delivery_time(skb, EGRESS_FWDNS_MAGIC, + BPF_SKB_DELIVERY_TIME_UNSPEC)) + inc_errs(SET_DTIME); + } + + return TC_ACT_OK; +} + +char __license[] SEC("license") = "GPL";