From patchwork Tue Mar 1 05:36:37 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12764118 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 033F6C433EF for ; Tue, 1 Mar 2022 05:36:51 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232155AbiCAFh3 (ORCPT ); Tue, 1 Mar 2022 00:37:29 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33606 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231182AbiCAFh3 (ORCPT ); Tue, 1 Mar 2022 00:37:29 -0500 Received: from mx0a-00082601.pphosted.com (mx0a-00082601.pphosted.com [67.231.145.42]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 98DF970F61 for ; Mon, 28 Feb 2022 21:36:48 -0800 (PST) Received: from pps.filterd (m0044012.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.1.2/8.16.1.2) with ESMTP id 21SMwXdC028621 for ; Mon, 28 Feb 2022 21:36:48 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : content-type : content-transfer-encoding : mime-version; s=facebook; bh=L2oUmkZyBi2ZIgFPWJ6KYV2/n/blzk5PDSx/W/f/Qno=; b=lLUgQUHx9HHSoX4QIewtFhcqALyPoZwhREbXBa9zyYIXHxtNI+G1/AE8aLUo9iMzm5BM 40HRK0sF9ME/J8lwM4wpSq8a2fmex49Oo2Eo6nuoaaqacOry6VIGgKbDKspE4MJv5djt fjLdkj+hW4dKEmNgzoenLihaQeJHXuxg018= Received: from mail.thefacebook.com ([163.114.132.120]) by mx0a-00082601.pphosted.com (PPS) with ESMTPS id 3egm760enu-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Mon, 28 Feb 2022 21:36:48 -0800 Received: from twshared33837.14.frc2.facebook.com (2620:10d:c085:108::4) by mail.thefacebook.com (2620:10d:c085:11d::6) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2308.21; Mon, 28 Feb 2022 21:36:47 -0800 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id D8B507A8955F; Mon, 28 Feb 2022 21:36:37 -0800 (PST) From: Martin KaFai Lau To: , CC: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , David Miller , Eric Dumazet , Jakub Kicinski , , Willem de Bruijn Subject: [PATCH v5 net-next 01/13] net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp Date: Mon, 28 Feb 2022 21:36:37 -0800 Message-ID: <20220301053637.930759-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220301053631.930498-1-kafai@fb.com> References: <20220301053631.930498-1-kafai@fb.com> X-FB-Internal: Safe X-Proofpoint-ORIG-GUID: nD3lWKp_4-XnFpsLROM6caNmY-ZH0ts2 X-Proofpoint-GUID: nD3lWKp_4-XnFpsLROM6caNmY-ZH0ts2 X-Proofpoint-UnRewURL: 0 URL was un-rewritten MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.816,Hydra:6.0.425,FMLib:17.11.64.514 definitions=2022-02-28_10,2022-02-26_01,2022-02-23_01 X-Proofpoint-Spam-Details: rule=fb_outbound_notspam policy=fb_outbound score=0 clxscore=1015 malwarescore=0 mlxscore=0 suspectscore=0 mlxlogscore=999 spamscore=0 bulkscore=0 impostorscore=0 priorityscore=1501 phishscore=0 lowpriorityscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2201110000 definitions=main-2203010026 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org skb->tstamp was first used as the (rcv) timestamp. The major usage is to report it to the user (e.g. SO_TIMESTAMP). Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP) during egress and used by the qdisc (e.g. sch_fq) to make decision on when the skb can be passed to the dev. Currently, there is no way to tell skb->tstamp having the (rcv) timestamp or the delivery_time, so it is always reset to 0 whenever forwarded between egress and ingress. While it makes sense to always clear the (rcv) timestamp in skb->tstamp to avoid confusing sch_fq that expects the delivery_time, it is a performance issue [0] to clear the delivery_time if the skb finally egress to a fq@phy-dev. For example, when forwarding from egress to ingress and then finally back to egress: tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns ^ ^ reset rest This patch adds one bit skb->mono_delivery_time to flag the skb->tstamp is storing the mono delivery_time (EDT) instead of the (rcv) timestamp. The current use case is to keep the TCP mono delivery_time (EDT) and to be used with sch_fq. A latter patch will also allow tc-bpf@ingress to read and change the mono delivery_time. In the future, another bit (e.g. skb->user_delivery_time) can be added for the SCM_TXTIME where the clock base is tracked by sk->sk_clockid. [ This patch is a prep work. The following patches will get the other parts of the stack ready first. Then another patch after that will finally set the skb->mono_delivery_time. ] skb_set_delivery_time() function is added. It is used by the tcp_output.c and during ip[6] fragmentation to assign the delivery_time to the skb->tstamp and also set the skb->mono_delivery_time. A note on the change in ip_send_unicast_reply() in ip_output.c. It is only used by TCP to send reset/ack out of a ctl_sk. Like the new skb_set_delivery_time(), this patch sets the skb->mono_delivery_time to 0 for now as a place holder. It will be enabled in a latter patch. A similar case in tcp_ipv6 can be done with skb_set_delivery_time() in tcp_v6_send_response(). [0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf Signed-off-by: Martin KaFai Lau --- include/linux/skbuff.h | 13 +++++++++++++ net/bridge/netfilter/nf_conntrack_bridge.c | 5 +++-- net/ipv4/ip_output.c | 7 +++++-- net/ipv4/tcp_output.c | 16 +++++++++------- net/ipv6/ip6_output.c | 5 +++-- net/ipv6/netfilter.c | 5 +++-- net/ipv6/tcp_ipv6.c | 2 +- 7 files changed, 37 insertions(+), 16 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index d67941f78b92..803ffa63dea6 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -795,6 +795,10 @@ typedef unsigned char *sk_buff_data_t; * @dst_pending_confirm: need to confirm neighbour * @decrypted: Decrypted SKB * @slow_gro: state present at GRO time, slower prepare step required + * @mono_delivery_time: When set, skb->tstamp has the + * delivery_time in mono clock base (i.e. EDT). Otherwise, the + * skb->tstamp has the (rcv) timestamp at ingress and + * delivery_time at egress. * @napi_id: id of the NAPI struct this skb came from * @sender_cpu: (aka @napi_id) source CPU in XPS * @secmark: security marking @@ -965,6 +969,7 @@ struct sk_buff { __u8 decrypted:1; #endif __u8 slow_gro:1; + __u8 mono_delivery_time:1; #ifdef CONFIG_NET_SCHED __u16 tc_index; /* traffic control index */ @@ -3983,6 +3988,14 @@ static inline ktime_t net_timedelta(ktime_t t) return ktime_sub(ktime_get_real(), t); } +static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt, + bool mono) +{ + skb->tstamp = kt; + /* Setting mono_delivery_time will be enabled later */ + skb->mono_delivery_time = 0; +} + static inline u8 skb_metadata_len(const struct sk_buff *skb) { return skb_shinfo(skb)->meta_len; diff --git a/net/bridge/netfilter/nf_conntrack_bridge.c b/net/bridge/netfilter/nf_conntrack_bridge.c index fdbed3158555..ebfb2a5c59e4 100644 --- a/net/bridge/netfilter/nf_conntrack_bridge.c +++ b/net/bridge/netfilter/nf_conntrack_bridge.c @@ -32,6 +32,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk, struct sk_buff *)) { int frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size; + bool mono_delivery_time = skb->mono_delivery_time; unsigned int hlen, ll_rs, mtu; ktime_t tstamp = skb->tstamp; struct ip_frag_state state; @@ -81,7 +82,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk, if (iter.frag) ip_fraglist_prepare(skb, &iter); - skb->tstamp = tstamp; + skb_set_delivery_time(skb, tstamp, mono_delivery_time); err = output(net, sk, data, skb); if (err || !iter.frag) break; @@ -112,7 +113,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk, goto blackhole; } - skb2->tstamp = tstamp; + skb_set_delivery_time(skb2, tstamp, mono_delivery_time); err = output(net, sk, data, skb2); if (err) goto blackhole; diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 6df3545c891d..a9588e0c82c5 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -761,6 +761,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, { struct iphdr *iph; struct sk_buff *skb2; + bool mono_delivery_time = skb->mono_delivery_time; struct rtable *rt = skb_rtable(skb); unsigned int mtu, hlen, ll_rs; struct ip_fraglist_iter iter; @@ -852,7 +853,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, } } - skb->tstamp = tstamp; + skb_set_delivery_time(skb, tstamp, mono_delivery_time); err = output(net, sk, skb); if (!err) @@ -908,7 +909,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, /* * Put this fragment into the sending queue. */ - skb2->tstamp = tstamp; + skb_set_delivery_time(skb2, tstamp, mono_delivery_time); err = output(net, sk, skb2); if (err) goto fail; @@ -1727,6 +1728,8 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb, arg->csumoffset) = csum_fold(csum_add(nskb->csum, arg->csum)); nskb->ip_summed = CHECKSUM_NONE; + /* Setting mono_delivery_time will be enabled later */ + nskb->mono_delivery_time = 0; ip_push_pending_frames(sk, &fl4); } out: diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index e76bf1e9251e..2319531267c6 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1253,7 +1253,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, tp = tcp_sk(sk); prior_wstamp = tp->tcp_wstamp_ns; tp->tcp_wstamp_ns = max(tp->tcp_wstamp_ns, tp->tcp_clock_cache); - skb->skb_mstamp_ns = tp->tcp_wstamp_ns; + skb_set_delivery_time(skb, tp->tcp_wstamp_ns, true); if (clone_it) { oskb = skb; @@ -1589,7 +1589,7 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue, skb_split(skb, buff, len); - buff->tstamp = skb->tstamp; + skb_set_delivery_time(buff, skb->tstamp, true); tcp_fragment_tstamp(skb, buff); old_factor = tcp_skb_pcount(skb); @@ -2616,7 +2616,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle, if (unlikely(tp->repair) && tp->repair_queue == TCP_SEND_QUEUE) { /* "skb_mstamp_ns" is used as a start point for the retransmit timer */ - skb->skb_mstamp_ns = tp->tcp_wstamp_ns = tp->tcp_clock_cache; + tp->tcp_wstamp_ns = tp->tcp_clock_cache; + skb_set_delivery_time(skb, tp->tcp_wstamp_ns, true); list_move_tail(&skb->tcp_tsorted_anchor, &tp->tsorted_sent_queue); tcp_init_tso_segs(skb, mss_now); goto repair; /* Skip network transmission */ @@ -3541,11 +3542,12 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst, now = tcp_clock_ns(); #ifdef CONFIG_SYN_COOKIES if (unlikely(synack_type == TCP_SYNACK_COOKIE && ireq->tstamp_ok)) - skb->skb_mstamp_ns = cookie_init_timestamp(req, now); + skb_set_delivery_time(skb, cookie_init_timestamp(req, now), + true); else #endif { - skb->skb_mstamp_ns = now; + skb_set_delivery_time(skb, now, true); if (!tcp_rsk(req)->snt_synack) /* Timestamp first SYNACK */ tcp_rsk(req)->snt_synack = tcp_skb_timestamp_us(skb); } @@ -3594,7 +3596,7 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst, bpf_skops_write_hdr_opt((struct sock *)sk, skb, req, syn_skb, synack_type, &opts); - skb->skb_mstamp_ns = now; + skb_set_delivery_time(skb, now, true); tcp_add_tx_delay(skb, tp); return skb; @@ -3771,7 +3773,7 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn) err = tcp_transmit_skb(sk, syn_data, 1, sk->sk_allocation); - syn->skb_mstamp_ns = syn_data->skb_mstamp_ns; + skb_set_delivery_time(syn, syn_data->skb_mstamp_ns, true); /* Now full SYN+DATA was cloned and sent (or not), * remove the SYN from the original skb (syn_data) diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index c5edc86b18bd..dad4e3d0492e 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -813,6 +813,7 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, struct rt6_info *rt = (struct rt6_info *)skb_dst(skb); struct ipv6_pinfo *np = skb->sk && !dev_recursion_level() ? inet6_sk(skb->sk) : NULL; + bool mono_delivery_time = skb->mono_delivery_time; struct ip6_frag_state state; unsigned int mtu, hlen, nexthdr_offset; ktime_t tstamp = skb->tstamp; @@ -903,7 +904,7 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, if (iter.frag) ip6_fraglist_prepare(skb, &iter); - skb->tstamp = tstamp; + skb_set_delivery_time(skb, tstamp, mono_delivery_time); err = output(net, sk, skb); if (!err) IP6_INC_STATS(net, ip6_dst_idev(&rt->dst), @@ -962,7 +963,7 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, /* * Put this fragment into the sending queue. */ - frag->tstamp = tstamp; + skb_set_delivery_time(frag, tstamp, mono_delivery_time); err = output(net, sk, frag); if (err) goto fail; diff --git a/net/ipv6/netfilter.c b/net/ipv6/netfilter.c index 6ab710b5a1a8..1da332450d98 100644 --- a/net/ipv6/netfilter.c +++ b/net/ipv6/netfilter.c @@ -121,6 +121,7 @@ int br_ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, struct sk_buff *)) { int frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size; + bool mono_delivery_time = skb->mono_delivery_time; ktime_t tstamp = skb->tstamp; struct ip6_frag_state state; u8 *prevhdr, nexthdr = 0; @@ -186,7 +187,7 @@ int br_ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, if (iter.frag) ip6_fraglist_prepare(skb, &iter); - skb->tstamp = tstamp; + skb_set_delivery_time(skb, tstamp, mono_delivery_time); err = output(net, sk, data, skb); if (err || !iter.frag) break; @@ -219,7 +220,7 @@ int br_ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, goto blackhole; } - skb2->tstamp = tstamp; + skb_set_delivery_time(skb2, tstamp, mono_delivery_time); err = output(net, sk, data, skb2); if (err) goto blackhole; diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index e98af869ff3a..cb2bb7d2e907 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -940,7 +940,7 @@ static void tcp_v6_send_response(const struct sock *sk, struct sk_buff *skb, u32 } else { mark = sk->sk_mark; } - buff->tstamp = tcp_transmit_time(sk); + skb_set_delivery_time(buff, tcp_transmit_time(sk), true); } fl6.flowi6_mark = IP6_REPLY_MARK(net, skb->mark) ?: mark; fl6.fl6_dport = t1->dest; From patchwork Tue Mar 1 05:36:44 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12764119 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1BF1BC433FE for ; Tue, 1 Mar 2022 05:37:04 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232387AbiCAFhl (ORCPT ); Tue, 1 Mar 2022 00:37:41 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33756 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232370AbiCAFhk (ORCPT ); Tue, 1 Mar 2022 00:37:40 -0500 Received: from mx0b-00082601.pphosted.com (mx0b-00082601.pphosted.com [67.231.153.30]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 847C9710C7 for ; Mon, 28 Feb 2022 21:36:59 -0800 (PST) Received: from pps.filterd (m0109332.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.1.2/8.16.1.2) with ESMTP id 21SMwSoP022915 for ; Mon, 28 Feb 2022 21:36:58 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=ao61stmVMH+WewOCZleQk5lbCn+S0Uua1Ez+aBAwE8c=; b=o5eEgLBgLRIYNkVsbWqj0xp/ZpgDAiLBoS7g0sqer/Ku0xG6Q+CqY4+TQdR9EUD2B46o FvwBcvwLkkX1iPyw8mZ8g2zkgM5OW91zq3cHfXjtYPUHXVO+AkCx8C6wF5FPg5nzeKjL XYWuh1bg7HZ2xFFm6Vh7tC5zwRtE1SE3KUg= Received: from mail.thefacebook.com ([163.114.132.120]) by mx0a-00082601.pphosted.com (PPS) with ESMTPS id 3egx7a5yd0-3 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Mon, 28 Feb 2022 21:36:58 -0800 Received: from twshared1433.06.ash8.facebook.com (2620:10d:c085:108::4) by mail.thefacebook.com (2620:10d:c085:21d::5) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2308.21; Mon, 28 Feb 2022 21:36:56 -0800 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 2625D7A8956C; Mon, 28 Feb 2022 21:36:44 -0800 (PST) From: Martin KaFai Lau To: , CC: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , David Miller , Eric Dumazet , Jakub Kicinski , , Willem de Bruijn Subject: [PATCH v5 net-next 02/13] net: Add skb_clear_tstamp() to keep the mono delivery_time Date: Mon, 28 Feb 2022 21:36:44 -0800 Message-ID: <20220301053644.931008-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220301053631.930498-1-kafai@fb.com> References: <20220301053631.930498-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-GUID: ksfoB_IzC-s-HOckkVKndb5gBKAGtFVB X-Proofpoint-ORIG-GUID: ksfoB_IzC-s-HOckkVKndb5gBKAGtFVB X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.816,Hydra:6.0.425,FMLib:17.11.64.514 definitions=2022-02-28_10,2022-02-26_01,2022-02-23_01 X-Proofpoint-Spam-Details: rule=fb_outbound_notspam policy=fb_outbound score=0 priorityscore=1501 impostorscore=0 lowpriorityscore=0 clxscore=1015 phishscore=0 mlxlogscore=999 suspectscore=0 bulkscore=0 spamscore=0 mlxscore=0 adultscore=0 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2201110000 definitions=main-2203010026 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org Right now, skb->tstamp is reset to 0 whenever the skb is forwarded. If skb->tstamp has the mono delivery_time, clearing it can hurt the performance when it finally transmits out to fq@phy-dev. The earlier patch added a skb->mono_delivery_time bit to flag the skb->tstamp carrying the mono delivery_time. This patch adds skb_clear_tstamp() helper which keeps the mono delivery_time and clears everything else. The delivery_time clearing will be postponed until the stack knows the skb will be delivered locally. It will be done in a latter patch. Signed-off-by: Martin KaFai Lau --- drivers/net/loopback.c | 2 +- include/linux/skbuff.h | 10 +++++++++- net/bridge/br_forward.c | 2 +- net/core/filter.c | 6 +++--- net/core/skbuff.c | 2 +- net/ipv4/ip_forward.c | 2 +- net/ipv6/ip6_output.c | 2 +- net/netfilter/ipvs/ip_vs_xmit.c | 6 +++--- net/netfilter/nf_dup_netdev.c | 2 +- net/netfilter/nf_flow_table_ip.c | 4 ++-- net/netfilter/nft_fwd_netdev.c | 2 +- net/openvswitch/vport.c | 2 +- net/xfrm/xfrm_interface.c | 2 +- 13 files changed, 26 insertions(+), 18 deletions(-) diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c index d05f86fe78c9..720394c0639b 100644 --- a/drivers/net/loopback.c +++ b/drivers/net/loopback.c @@ -74,7 +74,7 @@ static netdev_tx_t loopback_xmit(struct sk_buff *skb, skb_tx_timestamp(skb); /* do not fool net_timestamp_check() with various clock bases */ - skb->tstamp = 0; + skb_clear_tstamp(skb); skb_orphan(skb); diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 803ffa63dea6..27a28920e7b3 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -3996,6 +3996,14 @@ static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt, skb->mono_delivery_time = 0; } +static inline void skb_clear_tstamp(struct sk_buff *skb) +{ + if (skb->mono_delivery_time) + return; + + skb->tstamp = 0; +} + static inline u8 skb_metadata_len(const struct sk_buff *skb) { return skb_shinfo(skb)->meta_len; @@ -4852,7 +4860,7 @@ static inline void skb_set_redirected(struct sk_buff *skb, bool from_ingress) #ifdef CONFIG_NET_REDIRECT skb->from_ingress = from_ingress; if (skb->from_ingress) - skb->tstamp = 0; + skb_clear_tstamp(skb); #endif } diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c index ec646656dbf1..02bb620d3b8d 100644 --- a/net/bridge/br_forward.c +++ b/net/bridge/br_forward.c @@ -62,7 +62,7 @@ EXPORT_SYMBOL_GPL(br_dev_queue_push_xmit); int br_forward_finish(struct net *net, struct sock *sk, struct sk_buff *skb) { - skb->tstamp = 0; + skb_clear_tstamp(skb); return NF_HOOK(NFPROTO_BRIDGE, NF_BR_POST_ROUTING, net, sk, skb, NULL, skb->dev, br_dev_queue_push_xmit); diff --git a/net/core/filter.c b/net/core/filter.c index 65869fd510e8..cfcf9b4d1ec2 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2107,7 +2107,7 @@ static inline int __bpf_tx_skb(struct net_device *dev, struct sk_buff *skb) } skb->dev = dev; - skb->tstamp = 0; + skb_clear_tstamp(skb); dev_xmit_recursion_inc(); ret = dev_queue_xmit(skb); @@ -2176,7 +2176,7 @@ static int bpf_out_neigh_v6(struct net *net, struct sk_buff *skb, } skb->dev = dev; - skb->tstamp = 0; + skb_clear_tstamp(skb); if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) { skb = skb_expand_head(skb, hh_len); @@ -2274,7 +2274,7 @@ static int bpf_out_neigh_v4(struct net *net, struct sk_buff *skb, } skb->dev = dev; - skb->tstamp = 0; + skb_clear_tstamp(skb); if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) { skb = skb_expand_head(skb, hh_len); diff --git a/net/core/skbuff.c b/net/core/skbuff.c index b32c5d782fe1..9abb0028309f 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -5381,7 +5381,7 @@ void skb_scrub_packet(struct sk_buff *skb, bool xnet) ipvs_reset(skb); skb->mark = 0; - skb->tstamp = 0; + skb_clear_tstamp(skb); } EXPORT_SYMBOL_GPL(skb_scrub_packet); diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c index 00ec819f949b..92ba3350274b 100644 --- a/net/ipv4/ip_forward.c +++ b/net/ipv4/ip_forward.c @@ -79,7 +79,7 @@ static int ip_forward_finish(struct net *net, struct sock *sk, struct sk_buff *s if (unlikely(opt->optlen)) ip_forward_options(skb); - skb->tstamp = 0; + skb_clear_tstamp(skb); return dst_output(net, sk, skb); } diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index dad4e3d0492e..50db9b20d746 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -440,7 +440,7 @@ static inline int ip6_forward_finish(struct net *net, struct sock *sk, } #endif - skb->tstamp = 0; + skb_clear_tstamp(skb); return dst_output(net, sk, skb); } diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c index d2e5a8f644b8..029171379884 100644 --- a/net/netfilter/ipvs/ip_vs_xmit.c +++ b/net/netfilter/ipvs/ip_vs_xmit.c @@ -610,7 +610,7 @@ static inline int ip_vs_tunnel_xmit_prepare(struct sk_buff *skb, nf_reset_ct(skb); skb_forward_csum(skb); if (skb->dev) - skb->tstamp = 0; + skb_clear_tstamp(skb); } return ret; } @@ -652,7 +652,7 @@ static inline int ip_vs_nat_send_or_cont(int pf, struct sk_buff *skb, if (!local) { skb_forward_csum(skb); if (skb->dev) - skb->tstamp = 0; + skb_clear_tstamp(skb); NF_HOOK(pf, NF_INET_LOCAL_OUT, cp->ipvs->net, NULL, skb, NULL, skb_dst(skb)->dev, dst_output); } else @@ -674,7 +674,7 @@ static inline int ip_vs_send_or_cont(int pf, struct sk_buff *skb, ip_vs_drop_early_demux_sk(skb); skb_forward_csum(skb); if (skb->dev) - skb->tstamp = 0; + skb_clear_tstamp(skb); NF_HOOK(pf, NF_INET_LOCAL_OUT, cp->ipvs->net, NULL, skb, NULL, skb_dst(skb)->dev, dst_output); } else diff --git a/net/netfilter/nf_dup_netdev.c b/net/netfilter/nf_dup_netdev.c index a579e59ee5c5..7873bd1389c3 100644 --- a/net/netfilter/nf_dup_netdev.c +++ b/net/netfilter/nf_dup_netdev.c @@ -19,7 +19,7 @@ static void nf_do_netdev_egress(struct sk_buff *skb, struct net_device *dev) skb_push(skb, skb->mac_len); skb->dev = dev; - skb->tstamp = 0; + skb_clear_tstamp(skb); dev_queue_xmit(skb); } diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c index 889cf88d3dba..f1d387129f02 100644 --- a/net/netfilter/nf_flow_table_ip.c +++ b/net/netfilter/nf_flow_table_ip.c @@ -376,7 +376,7 @@ nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb, nf_flow_nat_ip(flow, skb, thoff, dir, iph); ip_decrease_ttl(iph); - skb->tstamp = 0; + skb_clear_tstamp(skb); if (flow_table->flags & NF_FLOWTABLE_COUNTER) nf_ct_acct_update(flow->ct, tuplehash->tuple.dir, skb->len); @@ -611,7 +611,7 @@ nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb, nf_flow_nat_ipv6(flow, skb, dir, ip6h); ip6h->hop_limit--; - skb->tstamp = 0; + skb_clear_tstamp(skb); if (flow_table->flags & NF_FLOWTABLE_COUNTER) nf_ct_acct_update(flow->ct, tuplehash->tuple.dir, skb->len); diff --git a/net/netfilter/nft_fwd_netdev.c b/net/netfilter/nft_fwd_netdev.c index 619e394a91de..08e7a289738e 100644 --- a/net/netfilter/nft_fwd_netdev.c +++ b/net/netfilter/nft_fwd_netdev.c @@ -145,7 +145,7 @@ static void nft_fwd_neigh_eval(const struct nft_expr *expr, return; skb->dev = dev; - skb->tstamp = 0; + skb_clear_tstamp(skb); neigh_xmit(neigh_table, dev, addr, skb); out: regs->verdict.code = verdict; diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c index cf2ce5812489..82a74f998966 100644 --- a/net/openvswitch/vport.c +++ b/net/openvswitch/vport.c @@ -507,7 +507,7 @@ void ovs_vport_send(struct vport *vport, struct sk_buff *skb, u8 mac_proto) } skb->dev = vport->dev; - skb->tstamp = 0; + skb_clear_tstamp(skb); vport->ops->send(skb); return; diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c index 57448fc519fc..4991e99ced9a 100644 --- a/net/xfrm/xfrm_interface.c +++ b/net/xfrm/xfrm_interface.c @@ -190,7 +190,7 @@ static void xfrmi_dev_uninit(struct net_device *dev) static void xfrmi_scrub_packet(struct sk_buff *skb, bool xnet) { - skb->tstamp = 0; + skb_clear_tstamp(skb); skb->pkt_type = PACKET_HOST; skb->skb_iif = 0; skb->ignore_df = 0; From patchwork Tue Mar 1 05:36:50 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12764120 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id EC0A0C433FE for ; Tue, 1 Mar 2022 05:37:09 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231790AbiCAFhr (ORCPT ); Tue, 1 Mar 2022 00:37:47 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33800 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232429AbiCAFhn (ORCPT ); Tue, 1 Mar 2022 00:37:43 -0500 Received: from mx0b-00082601.pphosted.com (mx0b-00082601.pphosted.com [67.231.153.30]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 69F60710D5 for ; Mon, 28 Feb 2022 21:37:03 -0800 (PST) Received: from pps.filterd (m0109331.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.1.2/8.16.1.2) with ESMTP id 21SMwemQ028278 for ; Mon, 28 Feb 2022 21:37:02 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=WRkKvD9Ln1BF31cy79ItEby3Tja1SmNRAJaXQh89H7E=; b=REf8KIvhY340BIHORZfNEu1s/BIUfNkeXatHyU0rT0chVTiNjK+V8hm6FY62hQExKCER iYn/3hH2ILqEeOeRbGazIT9JgEBsHwb8NnSEngDkaQ456YzbNyr75VqgKpW+QJ0A0pAO PHfngG7jxddXsLhiks43HQ6Dhqxf85uJ3Dc= Received: from maileast.thefacebook.com ([163.114.130.16]) by mx0a-00082601.pphosted.com (PPS) with ESMTPS id 3egxd5dwmn-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Mon, 28 Feb 2022 21:37:02 -0800 Received: from twshared33860.05.ash9.facebook.com (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:82::f) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2308.21; Mon, 28 Feb 2022 21:37:01 -0800 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 68BFF7A8957D; Mon, 28 Feb 2022 21:36:50 -0800 (PST) From: Martin KaFai Lau To: , CC: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , David Miller , Eric Dumazet , Jakub Kicinski , , Willem de Bruijn Subject: [PATCH v5 net-next 03/13] net: Handle delivery_time in skb->tstamp during network tapping with af_packet Date: Mon, 28 Feb 2022 21:36:50 -0800 Message-ID: <20220301053650.931238-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220301053631.930498-1-kafai@fb.com> References: <20220301053631.930498-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-GUID: F8T9jAGtaWwroYcCnxJIVPxXAROLysEh X-Proofpoint-ORIG-GUID: F8T9jAGtaWwroYcCnxJIVPxXAROLysEh X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.816,Hydra:6.0.425,FMLib:17.11.64.514 definitions=2022-02-28_10,2022-02-26_01,2022-02-23_01 X-Proofpoint-Spam-Details: rule=fb_outbound_notspam policy=fb_outbound score=0 mlxscore=0 adultscore=0 lowpriorityscore=0 impostorscore=0 mlxlogscore=970 malwarescore=0 suspectscore=0 bulkscore=0 spamscore=0 priorityscore=1501 clxscore=1015 phishscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2201110000 definitions=main-2203010026 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org A latter patch will set the skb->mono_delivery_time to flag the skb->tstamp is used as the mono delivery_time (EDT) instead of the (rcv) timestamp. skb_clear_tstamp() will then keep this delivery_time during forwarding. This patch is to make the network tapping (with af_packet) to handle the delivery_time stored in skb->tstamp. Regardless of tapping at the ingress or egress, the tapped skb is received by the af_packet socket, so it is ingress to the af_packet socket and it expects the (rcv) timestamp. When tapping at egress, dev_queue_xmit_nit() is used. It has already expected skb->tstamp may have delivery_time, so it does skb_clone()+net_timestamp_set() to ensure the cloned skb has the (rcv) timestamp before passing to the af_packet sk. This patch only adds to clear the skb->mono_delivery_time bit in net_timestamp_set(). When tapping at ingress, it currently expects the skb->tstamp is either 0 or the (rcv) timestamp. Meaning, the tapping at ingress path has already expected the skb->tstamp could be 0 and it will get the (rcv) timestamp by ktime_get_real() when needed. There are two cases for tapping at ingress: One case is af_packet queues the skb to its sk_receive_queue. The skb is either not shared or new clone created. The newly added skb_clear_delivery_time() is called to clear the delivery_time (if any) and set the (rcv) timestamp if needed before the skb is queued to the sk_receive_queue. Another case, the ingress skb is directly copied to the rx_ring and tpacket_get_timestamp() is used to get the (rcv) timestamp. The newly added skb_tstamp() is used in tpacket_get_timestamp() to check the skb->mono_delivery_time bit before returning skb->tstamp. As mentioned earlier, the tapping@ingress has already expected the skb may not have the (rcv) timestamp (because no sk has asked for it) and has handled this case by directly calling ktime_get_real(). Signed-off-by: Martin KaFai Lau --- include/linux/skbuff.h | 24 ++++++++++++++++++++++++ net/core/dev.c | 4 +++- net/packet/af_packet.c | 4 +++- 3 files changed, 30 insertions(+), 2 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 27a28920e7b3..7e2d796ece80 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -3996,6 +3996,22 @@ static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt, skb->mono_delivery_time = 0; } +DECLARE_STATIC_KEY_FALSE(netstamp_needed_key); + +/* It is used in the ingress path to clear the delivery_time. + * If needed, set the skb->tstamp to the (rcv) timestamp. + */ +static inline void skb_clear_delivery_time(struct sk_buff *skb) +{ + if (skb->mono_delivery_time) { + skb->mono_delivery_time = 0; + if (static_branch_unlikely(&netstamp_needed_key)) + skb->tstamp = ktime_get_real(); + else + skb->tstamp = 0; + } +} + static inline void skb_clear_tstamp(struct sk_buff *skb) { if (skb->mono_delivery_time) @@ -4004,6 +4020,14 @@ static inline void skb_clear_tstamp(struct sk_buff *skb) skb->tstamp = 0; } +static inline ktime_t skb_tstamp(const struct sk_buff *skb) +{ + if (skb->mono_delivery_time) + return 0; + + return skb->tstamp; +} + static inline u8 skb_metadata_len(const struct sk_buff *skb) { return skb_shinfo(skb)->meta_len; diff --git a/net/core/dev.c b/net/core/dev.c index 2d6771075720..6d81b5a7ef3f 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2020,7 +2020,8 @@ void net_dec_egress_queue(void) EXPORT_SYMBOL_GPL(net_dec_egress_queue); #endif -static DEFINE_STATIC_KEY_FALSE(netstamp_needed_key); +DEFINE_STATIC_KEY_FALSE(netstamp_needed_key); +EXPORT_SYMBOL(netstamp_needed_key); #ifdef CONFIG_JUMP_LABEL static atomic_t netstamp_needed_deferred; static atomic_t netstamp_wanted; @@ -2081,6 +2082,7 @@ EXPORT_SYMBOL(net_disable_timestamp); static inline void net_timestamp_set(struct sk_buff *skb) { skb->tstamp = 0; + skb->mono_delivery_time = 0; if (static_branch_unlikely(&netstamp_needed_key)) __net_timestamp(skb); } diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index ab87f22cc7ec..1b93ce1a5600 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -460,7 +460,7 @@ static __u32 tpacket_get_timestamp(struct sk_buff *skb, struct timespec64 *ts, return TP_STATUS_TS_RAW_HARDWARE; if ((flags & SOF_TIMESTAMPING_SOFTWARE) && - ktime_to_timespec64_cond(skb->tstamp, ts)) + ktime_to_timespec64_cond(skb_tstamp(skb), ts)) return TP_STATUS_TS_SOFTWARE; return 0; @@ -2199,6 +2199,7 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev, spin_lock(&sk->sk_receive_queue.lock); po->stats.stats1.tp_packets++; sock_skb_set_dropcount(sk, skb); + skb_clear_delivery_time(skb); __skb_queue_tail(&sk->sk_receive_queue, skb); spin_unlock(&sk->sk_receive_queue.lock); sk->sk_data_ready(sk); @@ -2377,6 +2378,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, po->stats.stats1.tp_packets++; if (copy_skb) { status |= TP_STATUS_COPY; + skb_clear_delivery_time(copy_skb); __skb_queue_tail(&sk->sk_receive_queue, copy_skb); } spin_unlock(&sk->sk_receive_queue.lock); From patchwork Tue Mar 1 05:36:56 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12764121 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8B0D8C4332F for ; Tue, 1 Mar 2022 05:37:10 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232429AbiCAFhs (ORCPT ); Tue, 1 Mar 2022 00:37:48 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33870 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232434AbiCAFhp (ORCPT ); Tue, 1 Mar 2022 00:37:45 -0500 Received: from mx0a-00082601.pphosted.com (mx0a-00082601.pphosted.com [67.231.145.42]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7A6A874856 for ; Mon, 28 Feb 2022 21:37:05 -0800 (PST) Received: from pps.filterd (m0148461.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.1.2/8.16.1.2) with ESMTP id 21SMwb2l010594 for ; Mon, 28 Feb 2022 21:37:05 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=Xpom6SUyvE63iMeneZRPmXaSRtLBrkFiIEIQNxzxRS0=; b=NEQxRy0ZrOsFaKXvhj3m5PvQ3QpFCoCVmpnpgNYCxM3SEmjG4xQL7RKVtAV30+V7Hldu q39oN4ZeXQhNF6O88tvmuJ3bhRmfz4KghXsmlJ1+f38eWkFv0yCxgxFlJUzTcdU2GdCk xvaJReO1xt0ixfHx/KoWL0JwSg5tMQhnD+w= Received: from maileast.thefacebook.com ([163.114.130.16]) by mx0a-00082601.pphosted.com (PPS) with ESMTPS id 3eh681ahkn-5 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Mon, 28 Feb 2022 21:37:05 -0800 Received: from twshared33860.05.ash9.facebook.com (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:82::d) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2308.21; Mon, 28 Feb 2022 21:37:01 -0800 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id B19247A89584; Mon, 28 Feb 2022 21:36:56 -0800 (PST) From: Martin KaFai Lau To: , CC: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , David Miller , Eric Dumazet , Jakub Kicinski , , Willem de Bruijn Subject: [PATCH v5 net-next 04/13] net: Clear mono_delivery_time bit in __skb_tstamp_tx() Date: Mon, 28 Feb 2022 21:36:56 -0800 Message-ID: <20220301053656.931606-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220301053631.930498-1-kafai@fb.com> References: <20220301053631.930498-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-GUID: O4b5UR7scX1tbhSyVS0layxxv00VpbWb X-Proofpoint-ORIG-GUID: O4b5UR7scX1tbhSyVS0layxxv00VpbWb X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.816,Hydra:6.0.425,FMLib:17.11.64.514 definitions=2022-02-28_10,2022-02-26_01,2022-02-23_01 X-Proofpoint-Spam-Details: rule=fb_outbound_notspam policy=fb_outbound score=0 impostorscore=0 lowpriorityscore=0 spamscore=0 bulkscore=0 suspectscore=0 mlxlogscore=967 malwarescore=0 clxscore=1015 priorityscore=1501 adultscore=0 phishscore=0 mlxscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2201110000 definitions=main-2203010026 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org In __skb_tstamp_tx(), it may clone the egress skb and queues the clone to the sk_error_queue. The outgoing skb may have the mono delivery_time while the (rcv) timestamp is expected for the clone, so the skb->mono_delivery_time bit needs to be cleared from the clone. This patch adds the skb->mono_delivery_time clearing to the existing __net_timestamp() and use it in __skb_tstamp_tx(). The __net_timestamp() fast path usage in dev.c is changed to directly call ktime_get_real() since the mono_delivery_time bit is not set at that point. Signed-off-by: Martin KaFai Lau --- include/linux/skbuff.h | 1 + net/core/dev.c | 4 ++-- net/core/skbuff.c | 2 +- 3 files changed, 4 insertions(+), 3 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 7e2d796ece80..8e8a4af4f9e2 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -3981,6 +3981,7 @@ static inline void skb_get_new_timestampns(const struct sk_buff *skb, static inline void __net_timestamp(struct sk_buff *skb) { skb->tstamp = ktime_get_real(); + skb->mono_delivery_time = 0; } static inline ktime_t net_timedelta(ktime_t t) diff --git a/net/core/dev.c b/net/core/dev.c index 6d81b5a7ef3f..3ff686cc8c84 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2084,13 +2084,13 @@ static inline void net_timestamp_set(struct sk_buff *skb) skb->tstamp = 0; skb->mono_delivery_time = 0; if (static_branch_unlikely(&netstamp_needed_key)) - __net_timestamp(skb); + skb->tstamp = ktime_get_real(); } #define net_timestamp_check(COND, SKB) \ if (static_branch_unlikely(&netstamp_needed_key)) { \ if ((COND) && !(SKB)->tstamp) \ - __net_timestamp(SKB); \ + (SKB)->tstamp = ktime_get_real(); \ } \ bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 9abb0028309f..e5082836295b 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -4851,7 +4851,7 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb, if (hwtstamps) *skb_hwtstamps(skb) = *hwtstamps; else - skb->tstamp = ktime_get_real(); + __net_timestamp(skb); __skb_complete_tx_timestamp(skb, sk, tstype, opt_stats); } From patchwork Tue Mar 1 05:37:02 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12764123 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0ADC2C433FE for ; Tue, 1 Mar 2022 05:37:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232452AbiCAFiA (ORCPT ); Tue, 1 Mar 2022 00:38:00 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34086 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232434AbiCAFh6 (ORCPT ); Tue, 1 Mar 2022 00:37:58 -0500 Received: from mx0b-00082601.pphosted.com (mx0b-00082601.pphosted.com [67.231.153.30]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DCF8875205 for ; Mon, 28 Feb 2022 21:37:14 -0800 (PST) Received: from pps.filterd (m0148460.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.1.2/8.16.1.2) with ESMTP id 21SMwTad014012 for ; Mon, 28 Feb 2022 21:37:14 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=WYhiMgtL0PnkSBux1VzTdoO6AF6lDNDcFeOTE4ROfcg=; b=pG3dnm1vesUJm2rLqHGXD4WfQOLglfTc2eAfhp7oAfR4YXHz9zMfBFeSwi65abC/eE7b ARFgqtr0ZH0TT+V2Xuklj3M3lXOG8YFDTZnpZEIqptD+E90ys9aDkITwE7JLiRLIuNsN iO7+jDLPsY4DNzV7Rr245CF7JzdX9G3AuS8= Received: from maileast.thefacebook.com ([163.114.130.16]) by mx0a-00082601.pphosted.com (PPS) with ESMTPS id 3eh3vhugc4-2 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Mon, 28 Feb 2022 21:37:13 -0800 Received: from twshared33860.05.ash9.facebook.com (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:83::6) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2308.21; Mon, 28 Feb 2022 21:37:12 -0800 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 0427A7A89590; Mon, 28 Feb 2022 21:37:03 -0800 (PST) From: Martin KaFai Lau To: , CC: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , David Miller , Eric Dumazet , Jakub Kicinski , , Willem de Bruijn Subject: [PATCH v5 net-next 05/13] net: Set skb->mono_delivery_time and clear it after sch_handle_ingress() Date: Mon, 28 Feb 2022 21:37:02 -0800 Message-ID: <20220301053702.931758-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220301053631.930498-1-kafai@fb.com> References: <20220301053631.930498-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-ORIG-GUID: -IZKVQ_LLheCUy5cJPrFDEWAzvyI0DUs X-Proofpoint-GUID: -IZKVQ_LLheCUy5cJPrFDEWAzvyI0DUs X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.816,Hydra:6.0.425,FMLib:17.11.64.514 definitions=2022-02-28_10,2022-02-26_01,2022-02-23_01 X-Proofpoint-Spam-Details: rule=fb_outbound_notspam policy=fb_outbound score=0 suspectscore=0 impostorscore=0 lowpriorityscore=0 malwarescore=0 bulkscore=0 priorityscore=1501 adultscore=0 spamscore=0 clxscore=1015 mlxscore=0 mlxlogscore=999 phishscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2201110000 definitions=main-2203010026 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org The previous patches handled the delivery_time before sch_handle_ingress(). This patch can now set the skb->mono_delivery_time to flag the skb->tstamp is used as the mono delivery_time (EDT) instead of the (rcv) timestamp and also clear it with skb_clear_delivery_time() after sch_handle_ingress(). This will make the bpf_redirect_*() to keep the mono delivery_time and used by a qdisc (fq) of the egress-ing interface. A latter patch will postpone the skb_clear_delivery_time() until the stack learns that the skb is being delivered locally and that will make other kernel forwarding paths (ip[6]_forward) able to keep the delivery_time also. Thus, like the previous patches on using the skb->mono_delivery_time bit, calling skb_clear_delivery_time() is not limited within the CONFIG_NET_INGRESS to avoid too many code churns among this set. Signed-off-by: Martin KaFai Lau --- include/linux/skbuff.h | 3 +-- net/core/dev.c | 8 ++++++-- net/ipv4/ip_output.c | 3 +-- 3 files changed, 8 insertions(+), 6 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 8e8a4af4f9e2..0f5fd53059cd 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -3993,8 +3993,7 @@ static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt, bool mono) { skb->tstamp = kt; - /* Setting mono_delivery_time will be enabled later */ - skb->mono_delivery_time = 0; + skb->mono_delivery_time = kt && mono; } DECLARE_STATIC_KEY_FALSE(netstamp_needed_key); diff --git a/net/core/dev.c b/net/core/dev.c index 3ff686cc8c84..0fc02cf32476 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -5193,8 +5193,10 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc, goto out; } - if (skb_skip_tc_classify(skb)) + if (skb_skip_tc_classify(skb)) { + skb_clear_delivery_time(skb); goto skip_classify; + } if (pfmemalloc) goto skip_taps; @@ -5223,12 +5225,14 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc, goto another_round; if (!skb) goto out; + skb_clear_delivery_time(skb); nf_skip_egress(skb, false); if (nf_ingress(skb, &pt_prev, &ret, orig_dev) < 0) goto out; - } + } else #endif + skb_clear_delivery_time(skb); skb_reset_redirect(skb); skip_classify: if (pfmemalloc && !skb_pfmemalloc_protocol(skb)) diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index a9588e0c82c5..00b4bf26fd93 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -1728,8 +1728,7 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb, arg->csumoffset) = csum_fold(csum_add(nskb->csum, arg->csum)); nskb->ip_summed = CHECKSUM_NONE; - /* Setting mono_delivery_time will be enabled later */ - nskb->mono_delivery_time = 0; + nskb->mono_delivery_time = !!transmit_time; ip_push_pending_frames(sk, &fl4); } out: From patchwork Tue Mar 1 05:37:09 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12764122 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id ABB95C4332F for ; Tue, 1 Mar 2022 05:37:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232439AbiCAFiA (ORCPT ); Tue, 1 Mar 2022 00:38:00 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34094 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232440AbiCAFh6 (ORCPT ); Tue, 1 Mar 2022 00:37:58 -0500 Received: from mx0a-00082601.pphosted.com (mx0a-00082601.pphosted.com [67.231.145.42]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 06A3875202 for ; Mon, 28 Feb 2022 21:37:14 -0800 (PST) Received: from pps.filterd (m0109334.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.1.2/8.16.1.2) with ESMTP id 21SMwh3B020160 for ; Mon, 28 Feb 2022 21:37:13 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : content-type : content-transfer-encoding : mime-version; s=facebook; bh=2x/nBiUS6InGQfgOunfdEXrfkF38AVl3WtUh7K8yhaQ=; b=HazPfTJcEMAY2eUXduOh8G+cxwJyhMcg2bEYNvD533qRuVJesbx3IGRDHbLpsZeIPGiG HnYrLFsuCoJLi8ELcPePoMRI37wDjdJ8+z2sybPeXBdm2QgcsGMpcqk6vAmiLhShnk1N V2I8NCBY0WNOET0z06C8NAFGqbWIqh+LvRg= Received: from mail.thefacebook.com ([163.114.132.120]) by mx0a-00082601.pphosted.com (PPS) with ESMTPS id 3egx7hp00u-2 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Mon, 28 Feb 2022 21:37:13 -0800 Received: from twshared33837.14.frc2.facebook.com (2620:10d:c085:108::4) by mail.thefacebook.com (2620:10d:c085:11d::4) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2308.21; Mon, 28 Feb 2022 21:37:12 -0800 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 43C6F7A895B6; Mon, 28 Feb 2022 21:37:09 -0800 (PST) From: Martin KaFai Lau To: , CC: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , David Miller , Eric Dumazet , Jakub Kicinski , , Willem de Bruijn Subject: [PATCH v5 net-next 06/13] net: ip: Handle delivery_time in ip defrag Date: Mon, 28 Feb 2022 21:37:09 -0800 Message-ID: <20220301053709.933219-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220301053631.930498-1-kafai@fb.com> References: <20220301053631.930498-1-kafai@fb.com> X-FB-Internal: Safe X-Proofpoint-GUID: eAdD9fSOAUvhXbYUYH0fNcDMU7mumtdx X-Proofpoint-ORIG-GUID: eAdD9fSOAUvhXbYUYH0fNcDMU7mumtdx X-Proofpoint-UnRewURL: 0 URL was un-rewritten MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.816,Hydra:6.0.425,FMLib:17.11.64.514 definitions=2022-02-28_10,2022-02-26_01,2022-02-23_01 X-Proofpoint-Spam-Details: rule=fb_outbound_notspam policy=fb_outbound score=0 suspectscore=0 bulkscore=0 phishscore=0 impostorscore=0 malwarescore=0 clxscore=1015 lowpriorityscore=0 adultscore=0 priorityscore=1501 mlxscore=0 spamscore=0 mlxlogscore=999 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2201110000 definitions=main-2203010026 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org A latter patch will postpone the delivery_time clearing until the stack knows the skb is being delivered locally. That will allow other kernel forwarding path (e.g. ip[6]_forward) to keep the delivery_time also. An earlier attempt was to do skb_clear_delivery_time() in ip_local_deliver() and ip6_input(). The discussion [0] requested to move it one step later into ip_local_deliver_finish() and ip6_input_finish() so that the delivery_time can be kept for the ip_vs forwarding path also. To do that, this patch also needs to take care of the (rcv) timestamp usecase in ip_is_fragment(). It needs to expect delivery_time in the skb->tstamp, so it needs to save the mono_delivery_time bit in inet_frag_queue such that the delivery_time (if any) can be restored in the final defragmented skb. [Note that it will only happen when the locally generated skb is looping from egress to ingress over a virtual interface (e.g. veth, loopback...), skb->tstamp may have the delivery time before it is known that it will be delivered locally and received by another sk.] [0]: https://lore.kernel.org/netdev/ca728d81-80e8-3767-d5e-d44f6ad96e43@ssi.bg/ Signed-off-by: Martin KaFai Lau --- include/net/inet_frag.h | 1 + net/ipv4/inet_fragment.c | 1 + net/ipv4/ip_fragment.c | 1 + 3 files changed, 3 insertions(+) diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h index 63540be0fc34..8ad0c1d6d024 100644 --- a/include/net/inet_frag.h +++ b/include/net/inet_frag.h @@ -90,6 +90,7 @@ struct inet_frag_queue { ktime_t stamp; int len; int meat; + __u8 mono_delivery_time; __u8 flags; u16 max_size; struct fqdir *fqdir; diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c index 341096807100..63948f6aeca0 100644 --- a/net/ipv4/inet_fragment.c +++ b/net/ipv4/inet_fragment.c @@ -572,6 +572,7 @@ void inet_frag_reasm_finish(struct inet_frag_queue *q, struct sk_buff *head, skb_mark_not_on_list(head); head->prev = NULL; head->tstamp = q->stamp; + head->mono_delivery_time = q->mono_delivery_time; } EXPORT_SYMBOL(inet_frag_reasm_finish); diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c index fad803d2d711..fb153569889e 100644 --- a/net/ipv4/ip_fragment.c +++ b/net/ipv4/ip_fragment.c @@ -349,6 +349,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb) qp->iif = dev->ifindex; qp->q.stamp = skb->tstamp; + qp->q.mono_delivery_time = skb->mono_delivery_time; qp->q.meat += skb->len; qp->ecn |= ecn; add_frag_mem_limit(qp->q.fqdir, skb->truesize); From patchwork Tue Mar 1 05:37:15 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12764125 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A81F0C4332F for ; Tue, 1 Mar 2022 05:37:33 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232459AbiCAFiL (ORCPT ); Tue, 1 Mar 2022 00:38:11 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34604 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232442AbiCAFiJ (ORCPT ); Tue, 1 Mar 2022 00:38:09 -0500 Received: from mx0b-00082601.pphosted.com (mx0b-00082601.pphosted.com [67.231.153.30]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 77C5624F3C for ; Mon, 28 Feb 2022 21:37:28 -0800 (PST) Received: from pps.filterd (m0109332.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.1.2/8.16.1.2) with ESMTP id 21SMwTTn023010 for ; Mon, 28 Feb 2022 21:37:27 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=M9eQVFN2mH1U2+vJTtVIVlCo0M/09mhKoFPbozTDCBs=; b=M9fhI71hXPhCoiDedRC8ePPwhaTojeuI/Di5b+ZKxC+ng3JKaZc8RwR9cFL+GCgGgIlN Ufjbb3Q9fJi8Dbv8uqbNelf4QvG2C+0E7WbMPYX8jVSMJOHe/ORdGls/Fmeh56Dezo0z 8M1DkCHaVsryzAEXb6ez27FaCTFLTlAC9Qo= Received: from maileast.thefacebook.com ([163.114.130.16]) by mx0a-00082601.pphosted.com (PPS) with ESMTPS id 3egx7a5yeu-2 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Mon, 28 Feb 2022 21:37:27 -0800 Received: from twshared7500.02.ash7.facebook.com (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:82::d) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2308.21; Mon, 28 Feb 2022 21:37:26 -0800 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 872987A895F1; Mon, 28 Feb 2022 21:37:15 -0800 (PST) From: Martin KaFai Lau To: , CC: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , David Miller , Eric Dumazet , Jakub Kicinski , , Willem de Bruijn Subject: [PATCH v5 net-next 07/13] net: ipv6: Handle delivery_time in ipv6 defrag Date: Mon, 28 Feb 2022 21:37:15 -0800 Message-ID: <20220301053715.933829-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220301053631.930498-1-kafai@fb.com> References: <20220301053631.930498-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-GUID: zYRvd6dgwrDmdzP5Snr0QMvtKh0WIJkL X-Proofpoint-ORIG-GUID: zYRvd6dgwrDmdzP5Snr0QMvtKh0WIJkL X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.816,Hydra:6.0.425,FMLib:17.11.64.514 definitions=2022-02-28_10,2022-02-26_01,2022-02-23_01 X-Proofpoint-Spam-Details: rule=fb_outbound_notspam policy=fb_outbound score=0 priorityscore=1501 impostorscore=0 lowpriorityscore=0 clxscore=1015 phishscore=0 mlxlogscore=999 suspectscore=0 bulkscore=0 spamscore=0 mlxscore=0 adultscore=0 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2201110000 definitions=main-2203010026 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org A latter patch will postpone the delivery_time clearing until the stack knows the skb is being delivered locally (i.e. calling skb_clear_delivery_time() at ip_local_deliver_finish() for IPv4 and at ip6_input_finish() for IPv6). That will allow other kernel forwarding path (e.g. ip[6]_forward) to keep the delivery_time also. A very similar IPv6 defrag codes have been duplicated in multiple places: regular IPv6, nf_conntrack, and 6lowpan. Unlike the IPv4 defrag which is done before ip_local_deliver_finish(), the regular IPv6 defrag is done after ip6_input_finish(). Thus, no change should be needed in the regular IPv6 defrag logic because skb_clear_delivery_time() should have been called. 6lowpan also does not need special handling on delivery_time because it is a non-inet packet_type. However, cf_conntrack has a case in NF_INET_PRE_ROUTING that needs to do the IPv6 defrag earlier. Thus, it needs to save the mono_delivery_time bit in the inet_frag_queue which is similar to how it is handled in the previous patch for the IPv4 defrag. This patch chooses to do it consistently and stores the mono_delivery_time in the inet_frag_queue for all cases such that it will be easier for the future refactoring effort on the IPv6 reasm code. Signed-off-by: Martin KaFai Lau --- net/ieee802154/6lowpan/reassembly.c | 1 + net/ipv6/netfilter/nf_conntrack_reasm.c | 1 + net/ipv6/reassembly.c | 1 + 3 files changed, 3 insertions(+) diff --git a/net/ieee802154/6lowpan/reassembly.c b/net/ieee802154/6lowpan/reassembly.c index be6f06adefe0..a91283d1e5bf 100644 --- a/net/ieee802154/6lowpan/reassembly.c +++ b/net/ieee802154/6lowpan/reassembly.c @@ -130,6 +130,7 @@ static int lowpan_frag_queue(struct lowpan_frag_queue *fq, goto err; fq->q.stamp = skb->tstamp; + fq->q.mono_delivery_time = skb->mono_delivery_time; if (frag_type == LOWPAN_DISPATCH_FRAG1) fq->q.flags |= INET_FRAG_FIRST_IN; diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c index 5c47be29b9ee..7dd3629dd19e 100644 --- a/net/ipv6/netfilter/nf_conntrack_reasm.c +++ b/net/ipv6/netfilter/nf_conntrack_reasm.c @@ -264,6 +264,7 @@ static int nf_ct_frag6_queue(struct frag_queue *fq, struct sk_buff *skb, fq->iif = dev->ifindex; fq->q.stamp = skb->tstamp; + fq->q.mono_delivery_time = skb->mono_delivery_time; fq->q.meat += skb->len; fq->ecn |= ecn; if (payload_len > fq->q.max_size) diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c index 28e44782c94d..ff866f2a879e 100644 --- a/net/ipv6/reassembly.c +++ b/net/ipv6/reassembly.c @@ -194,6 +194,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb, fq->iif = dev->ifindex; fq->q.stamp = skb->tstamp; + fq->q.mono_delivery_time = skb->mono_delivery_time; fq->q.meat += skb->len; fq->ecn |= ecn; add_frag_mem_limit(fq->q.fqdir, skb->truesize); From patchwork Tue Mar 1 05:37:21 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12764124 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A5507C433EF for ; Tue, 1 Mar 2022 05:37:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232434AbiCAFiK (ORCPT ); Tue, 1 Mar 2022 00:38:10 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34550 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232427AbiCAFiJ (ORCPT ); Tue, 1 Mar 2022 00:38:09 -0500 Received: from mx0b-00082601.pphosted.com (mx0b-00082601.pphosted.com [67.231.153.30]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A7CA71166 for ; Mon, 28 Feb 2022 21:37:26 -0800 (PST) Received: from pps.filterd (m0109332.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.1.2/8.16.1.2) with ESMTP id 21SMwTTk023010 for ; Mon, 28 Feb 2022 21:37:26 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=dYmfrbOh/lEFzF9JAZtCjftFPCmiiD0gqSESwxwPk54=; b=OhMlxJKJfcaNXDbqMMO89PwvSL6RnGGscElcIqZdCLXPUhble3sAvwFyHZIDrPvequJS 56F50h0E+BIe8Ztj18h48aAp8u51ZBq5ibAyhk4ZrWmtAOfIHefv4YjOPg7zd1VoQggp gb7UYBASbwtvynDAkwvp+kdZAjohzqERH+A= Received: from maileast.thefacebook.com ([163.114.130.16]) by mx0a-00082601.pphosted.com (PPS) with ESMTPS id 3egx7a5yes-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Mon, 28 Feb 2022 21:37:25 -0800 Received: from twshared33860.05.ash9.facebook.com (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:83::7) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2308.21; Mon, 28 Feb 2022 21:37:25 -0800 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id D03857A8962C; Mon, 28 Feb 2022 21:37:21 -0800 (PST) From: Martin KaFai Lau To: , CC: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , David Miller , Eric Dumazet , Jakub Kicinski , , Willem de Bruijn Subject: [PATCH v5 net-next 08/13] net: ipv6: Get rcv timestamp if needed when handling hop-by-hop IOAM option Date: Mon, 28 Feb 2022 21:37:21 -0800 Message-ID: <20220301053721.934275-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220301053631.930498-1-kafai@fb.com> References: <20220301053631.930498-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-GUID: REsKysWp1jEuJy0Mtdih5UsdnMXIOJxq X-Proofpoint-ORIG-GUID: REsKysWp1jEuJy0Mtdih5UsdnMXIOJxq X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.816,Hydra:6.0.425,FMLib:17.11.64.514 definitions=2022-02-28_10,2022-02-26_01,2022-02-23_01 X-Proofpoint-Spam-Details: rule=fb_outbound_notspam policy=fb_outbound score=0 priorityscore=1501 impostorscore=0 lowpriorityscore=0 clxscore=1015 phishscore=0 mlxlogscore=999 suspectscore=0 bulkscore=0 spamscore=0 mlxscore=0 adultscore=0 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2201110000 definitions=main-2203010026 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org IOAM is a hop-by-hop option with a temporary iana allocation (49). Since it is hop-by-hop, it is done before the input routing decision. One of the traced data field is the (rcv) timestamp. When the locally generated skb is looping from egress to ingress over a virtual interface (e.g. veth, loopback...), skb->tstamp may have the delivery time before it is known that it will be delivered locally and received by another sk. Like handling the network tapping (tcpdump) in the earlier patch, this patch gets the timestamp if needed without over-writing the delivery_time in the skb->tstamp. skb_tstamp_cond() is added to do the ktime_get_real() with an extra cond arg to check on top of the netstamp_needed_key static key. skb_tstamp_cond() will also be used in a latter patch and it needs the netstamp_needed_key check. Signed-off-by: Martin KaFai Lau --- include/linux/skbuff.h | 11 +++++++++++ net/ipv6/ioam6.c | 19 +++++++++---------- 2 files changed, 20 insertions(+), 10 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 0f5fd53059cd..4b5b926a81f2 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -4028,6 +4028,17 @@ static inline ktime_t skb_tstamp(const struct sk_buff *skb) return skb->tstamp; } +static inline ktime_t skb_tstamp_cond(const struct sk_buff *skb, bool cond) +{ + if (!skb->mono_delivery_time && skb->tstamp) + return skb->tstamp; + + if (static_branch_unlikely(&netstamp_needed_key) || cond) + return ktime_get_real(); + + return 0; +} + static inline u8 skb_metadata_len(const struct sk_buff *skb) { return skb_shinfo(skb)->meta_len; diff --git a/net/ipv6/ioam6.c b/net/ipv6/ioam6.c index e159eb4328a8..1098131ed90c 100644 --- a/net/ipv6/ioam6.c +++ b/net/ipv6/ioam6.c @@ -635,7 +635,8 @@ static void __ioam6_fill_trace_data(struct sk_buff *skb, struct ioam6_schema *sc, u8 sclen, bool is_input) { - struct __kernel_sock_timeval ts; + struct timespec64 ts; + ktime_t tstamp; u64 raw64; u32 raw32; u16 raw16; @@ -680,10 +681,9 @@ static void __ioam6_fill_trace_data(struct sk_buff *skb, if (!skb->dev) { *(__be32 *)data = cpu_to_be32(IOAM6_U32_UNAVAILABLE); } else { - if (!skb->tstamp) - __net_timestamp(skb); + tstamp = skb_tstamp_cond(skb, true); + ts = ktime_to_timespec64(tstamp); - skb_get_new_timestamp(skb, &ts); *(__be32 *)data = cpu_to_be32((u32)ts.tv_sec); } data += sizeof(__be32); @@ -694,13 +694,12 @@ static void __ioam6_fill_trace_data(struct sk_buff *skb, if (!skb->dev) { *(__be32 *)data = cpu_to_be32(IOAM6_U32_UNAVAILABLE); } else { - if (!skb->tstamp) - __net_timestamp(skb); + if (!trace->type.bit2) { + tstamp = skb_tstamp_cond(skb, true); + ts = ktime_to_timespec64(tstamp); + } - if (!trace->type.bit2) - skb_get_new_timestamp(skb, &ts); - - *(__be32 *)data = cpu_to_be32((u32)ts.tv_usec); + *(__be32 *)data = cpu_to_be32((u32)(ts.tv_nsec / NSEC_PER_USEC)); } data += sizeof(__be32); } From patchwork Tue Mar 1 05:37:28 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12764126 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7B622C433FE for ; Tue, 1 Mar 2022 05:37:43 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232460AbiCAFiV (ORCPT ); Tue, 1 Mar 2022 00:38:21 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35226 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232469AbiCAFiT (ORCPT ); Tue, 1 Mar 2022 00:38:19 -0500 Received: from mx0b-00082601.pphosted.com (mx0b-00082601.pphosted.com [67.231.153.30]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 32B952898A for ; Mon, 28 Feb 2022 21:37:38 -0800 (PST) Received: from pps.filterd (m0109332.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.1.2/8.16.1.2) with ESMTP id 21SMwSk4022945 for ; Mon, 28 Feb 2022 21:37:37 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=1xEk1X++uv3kxKRCeq0wSZwgM+EHtEHdbJ1AbZJwfgc=; b=VYkyq4G+kZRaoIIN65TFMIse6dVH+LagHlDGIPEa43l/WTsKfuq1fcQ5WpjzxfjjRLwt Q7ID4U8TrlfLhTUtWkmai8TI9FRXo4ehhdOxmDtnkSz986FiNeOnCVolF46mE6UtxPKV C6yG7oxhL8qLaJ2VcvMfx5kAuwjzDxMxXDI= Received: from maileast.thefacebook.com ([163.114.130.16]) by mx0a-00082601.pphosted.com (PPS) with ESMTPS id 3egx7a5yf9-2 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Mon, 28 Feb 2022 21:37:37 -0800 Received: from twshared33860.05.ash9.facebook.com (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:83::6) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2308.21; Mon, 28 Feb 2022 21:37:36 -0800 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 214267A89633; Mon, 28 Feb 2022 21:37:28 -0800 (PST) From: Martin KaFai Lau To: , CC: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , David Miller , Eric Dumazet , Jakub Kicinski , , Willem de Bruijn Subject: [PATCH v5 net-next 09/13] net: Get rcv tstamp if needed in nfnetlink_{log, queue}.c Date: Mon, 28 Feb 2022 21:37:28 -0800 Message-ID: <20220301053728.934886-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220301053631.930498-1-kafai@fb.com> References: <20220301053631.930498-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-GUID: nrqqSutIxtx7BAvgHemKJGAPZdDWCLqo X-Proofpoint-ORIG-GUID: nrqqSutIxtx7BAvgHemKJGAPZdDWCLqo X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.816,Hydra:6.0.425,FMLib:17.11.64.514 definitions=2022-02-28_10,2022-02-26_01,2022-02-23_01 X-Proofpoint-Spam-Details: rule=fb_outbound_notspam policy=fb_outbound score=0 priorityscore=1501 impostorscore=0 lowpriorityscore=0 clxscore=1015 phishscore=0 mlxlogscore=999 suspectscore=0 bulkscore=0 spamscore=0 mlxscore=0 adultscore=0 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2201110000 definitions=main-2203010026 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org If skb has the (rcv) timestamp available, nfnetlink_{log, queue}.c logs/outputs it to the userspace. When the locally generated skb is looping from egress to ingress over a virtual interface (e.g. veth, loopback...), skb->tstamp may have the delivery time before it is known that will be delivered locally and received by another sk. Like handling the delivery time in network tapping, use ktime_get_real() to get the (rcv) timestamp. The earlier added helper skb_tstamp_cond() is used to do this. false is passed to the second 'cond' arg such that doing ktime_get_real() or not only depends on the netstamp_needed_key static key. Signed-off-by: Martin KaFai Lau --- net/netfilter/nfnetlink_log.c | 6 ++++-- net/netfilter/nfnetlink_queue.c | 8 +++++--- 2 files changed, 9 insertions(+), 5 deletions(-) diff --git a/net/netfilter/nfnetlink_log.c b/net/netfilter/nfnetlink_log.c index ae9c0756bba5..d97eb280cb2e 100644 --- a/net/netfilter/nfnetlink_log.c +++ b/net/netfilter/nfnetlink_log.c @@ -460,6 +460,7 @@ __build_packet_message(struct nfnl_log_net *log, sk_buff_data_t old_tail = inst->skb->tail; struct sock *sk; const unsigned char *hwhdrp; + ktime_t tstamp; nlh = nfnl_msg_put(inst->skb, 0, 0, nfnl_msg_type(NFNL_SUBSYS_ULOG, NFULNL_MSG_PACKET), @@ -588,9 +589,10 @@ __build_packet_message(struct nfnl_log_net *log, goto nla_put_failure; } - if (hooknum <= NF_INET_FORWARD && skb->tstamp) { + tstamp = skb_tstamp_cond(skb, false); + if (hooknum <= NF_INET_FORWARD && tstamp) { struct nfulnl_msg_packet_timestamp ts; - struct timespec64 kts = ktime_to_timespec64(skb->tstamp); + struct timespec64 kts = ktime_to_timespec64(tstamp); ts.sec = cpu_to_be64(kts.tv_sec); ts.usec = cpu_to_be64(kts.tv_nsec / NSEC_PER_USEC); diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c index 8c15978d9258..db9b5357f2ca 100644 --- a/net/netfilter/nfnetlink_queue.c +++ b/net/netfilter/nfnetlink_queue.c @@ -392,6 +392,7 @@ nfqnl_build_packet_message(struct net *net, struct nfqnl_instance *queue, bool csum_verify; char *secdata = NULL; u32 seclen = 0; + ktime_t tstamp; size = nlmsg_total_size(sizeof(struct nfgenmsg)) + nla_total_size(sizeof(struct nfqnl_msg_packet_hdr)) @@ -407,7 +408,8 @@ nfqnl_build_packet_message(struct net *net, struct nfqnl_instance *queue, + nla_total_size(sizeof(u_int32_t)) /* skbinfo */ + nla_total_size(sizeof(u_int32_t)); /* cap_len */ - if (entskb->tstamp) + tstamp = skb_tstamp_cond(entskb, false); + if (tstamp) size += nla_total_size(sizeof(struct nfqnl_msg_packet_timestamp)); size += nfqnl_get_bridge_size(entry); @@ -582,9 +584,9 @@ nfqnl_build_packet_message(struct net *net, struct nfqnl_instance *queue, if (nfqnl_put_bridge(entry, skb) < 0) goto nla_put_failure; - if (entry->state.hook <= NF_INET_FORWARD && entskb->tstamp) { + if (entry->state.hook <= NF_INET_FORWARD && tstamp) { struct nfqnl_msg_packet_timestamp ts; - struct timespec64 kts = ktime_to_timespec64(entskb->tstamp); + struct timespec64 kts = ktime_to_timespec64(tstamp); ts.sec = cpu_to_be64(kts.tv_sec); ts.usec = cpu_to_be64(kts.tv_nsec / NSEC_PER_USEC); From patchwork Tue Mar 1 05:37:34 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12764127 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 53E12C43219 for ; Tue, 1 Mar 2022 05:37:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232474AbiCAFic (ORCPT ); Tue, 1 Mar 2022 00:38:32 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35626 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232464AbiCAFib (ORCPT ); Tue, 1 Mar 2022 00:38:31 -0500 Received: from mx0a-00082601.pphosted.com (mx0b-00082601.pphosted.com [67.231.153.30]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5532D42A1D for ; Mon, 28 Feb 2022 21:37:49 -0800 (PST) Received: from pps.filterd (m0089730.ppops.net [127.0.0.1]) by m0089730.ppops.net (8.16.1.2/8.16.1.2) with ESMTP id 21SMwS22019487 for ; Mon, 28 Feb 2022 21:37:48 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=YZh7Lt4/K+gMdsbfAzuoWK/X9ZYx2nS+p0o7fF2LUvA=; b=isDZIfdzoSFy5Je2BU1SHSYizyE8NK8ZtCjWX6IWBcZgKqI8jcarzL/bxDh8DpdNRmkQ Q57fcrTldLI1PQWG5snzcyrR9nNYPiBEZ9cmlaPZJlb3g+SwaX42Ln2R4tcQkWatSrzt N/1qZYXRIpeKRJdDTkoQoZinjUyLfmslp2A= Received: from maileast.thefacebook.com ([163.114.130.16]) by m0089730.ppops.net (PPS) with ESMTPS id 3eh11nmytx-2 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Mon, 28 Feb 2022 21:37:48 -0800 Received: from twshared33860.05.ash9.facebook.com (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:83::6) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2308.21; Mon, 28 Feb 2022 21:37:47 -0800 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 64A2B7A8963E; Mon, 28 Feb 2022 21:37:34 -0800 (PST) From: Martin KaFai Lau To: , CC: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , David Miller , Eric Dumazet , Jakub Kicinski , , Willem de Bruijn Subject: [PATCH v5 net-next 10/13] net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally Date: Mon, 28 Feb 2022 21:37:34 -0800 Message-ID: <20220301053734.935008-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220301053631.930498-1-kafai@fb.com> References: <20220301053631.930498-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-GUID: pxTlzxzqCqCBJAtZXezog4TSt1B39Cvm X-Proofpoint-ORIG-GUID: pxTlzxzqCqCBJAtZXezog4TSt1B39Cvm X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.816,Hydra:6.0.425,FMLib:17.11.64.514 definitions=2022-02-28_10,2022-02-26_01,2022-02-23_01 X-Proofpoint-Spam-Details: rule=fb_outbound_notspam policy=fb_outbound score=0 priorityscore=1501 mlxlogscore=999 clxscore=1015 phishscore=0 impostorscore=0 lowpriorityscore=0 suspectscore=0 bulkscore=0 mlxscore=0 spamscore=0 adultscore=0 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2201110000 definitions=main-2203010026 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org The previous patches handled the delivery_time in the ingress path before the routing decision is made. This patch can postpone clearing delivery_time in a skb until knowing it is delivered locally and also set the (rcv) timestamp if needed. This patch moves the skb_clear_delivery_time() from dev.c to ip_local_deliver_finish() and ip6_input_finish(). Signed-off-by: Martin KaFai Lau --- net/core/dev.c | 8 ++------ net/ipv4/ip_input.c | 1 + net/ipv6/ip6_input.c | 1 + 3 files changed, 4 insertions(+), 6 deletions(-) diff --git a/net/core/dev.c b/net/core/dev.c index 0fc02cf32476..3ff686cc8c84 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -5193,10 +5193,8 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc, goto out; } - if (skb_skip_tc_classify(skb)) { - skb_clear_delivery_time(skb); + if (skb_skip_tc_classify(skb)) goto skip_classify; - } if (pfmemalloc) goto skip_taps; @@ -5225,14 +5223,12 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc, goto another_round; if (!skb) goto out; - skb_clear_delivery_time(skb); nf_skip_egress(skb, false); if (nf_ingress(skb, &pt_prev, &ret, orig_dev) < 0) goto out; - } else + } #endif - skb_clear_delivery_time(skb); skb_reset_redirect(skb); skip_classify: if (pfmemalloc && !skb_pfmemalloc_protocol(skb)) diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c index d94f9f7e60c3..95f7bb052784 100644 --- a/net/ipv4/ip_input.c +++ b/net/ipv4/ip_input.c @@ -226,6 +226,7 @@ void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int protocol) static int ip_local_deliver_finish(struct net *net, struct sock *sk, struct sk_buff *skb) { + skb_clear_delivery_time(skb); __skb_pull(skb, skb_network_header_len(skb)); rcu_read_lock(); diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c index d4b1e2c5aa76..5b5ea35635f9 100644 --- a/net/ipv6/ip6_input.c +++ b/net/ipv6/ip6_input.c @@ -459,6 +459,7 @@ void ip6_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int nexthdr, static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *skb) { + skb_clear_delivery_time(skb); rcu_read_lock(); ip6_protocol_deliver_rcu(net, skb, 0, false); rcu_read_unlock(); From patchwork Tue Mar 1 05:37:40 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12764128 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4F8CBC433EF for ; Tue, 1 Mar 2022 05:38:01 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232509AbiCAFij (ORCPT ); Tue, 1 Mar 2022 00:38:39 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35712 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232404AbiCAFie (ORCPT ); Tue, 1 Mar 2022 00:38:34 -0500 Received: from mx0a-00082601.pphosted.com (mx0b-00082601.pphosted.com [67.231.153.30]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 005394756D for ; Mon, 28 Feb 2022 21:37:52 -0800 (PST) Received: from pps.filterd (m0001303.ppops.net [127.0.0.1]) by m0001303.ppops.net (8.16.1.2/8.16.1.2) with ESMTP id 21SMwSUg007691 for ; Mon, 28 Feb 2022 21:37:51 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : content-type : content-transfer-encoding : mime-version; s=facebook; bh=gqoixQn5XvJ6Rw4nescxiFXcRHQjGvVuIl55oKjfWrI=; b=nEL+H5t4wLlS7Nj838yanEEH9oNdXoTmIrXtMfVm94TzJ8vxcb47V9RLcVCFMzNuWUYo 1hrvd7Ue9Xk0/xEJVzMCGnI9Lmp6DazbDy2PO/OuMiPbsgWuLXkUK5w42kwq5+XuaB1D 6CAZYpvEaujQV9/t8Uhcg+cx69LU0ww8+uM= Received: from maileast.thefacebook.com ([163.114.130.16]) by m0001303.ppops.net (PPS) with ESMTPS id 3eh2jm471b-2 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Mon, 28 Feb 2022 21:37:51 -0800 Received: from twshared7500.02.ash7.facebook.com (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:83::7) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2308.21; Mon, 28 Feb 2022 21:37:50 -0800 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id A8A657A89649; Mon, 28 Feb 2022 21:37:40 -0800 (PST) From: Martin KaFai Lau To: , CC: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , David Miller , Eric Dumazet , Jakub Kicinski , , Willem de Bruijn Subject: [PATCH v5 net-next 11/13] bpf: Keep the (rcv) timestamp behavior for the existing tc-bpf@ingress Date: Mon, 28 Feb 2022 21:37:40 -0800 Message-ID: <20220301053740.935214-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220301053631.930498-1-kafai@fb.com> References: <20220301053631.930498-1-kafai@fb.com> X-FB-Internal: Safe X-Proofpoint-ORIG-GUID: yeXLBz-x-SkHfBA0hUrS-6Bz1eDXsnIT X-Proofpoint-GUID: yeXLBz-x-SkHfBA0hUrS-6Bz1eDXsnIT X-Proofpoint-UnRewURL: 0 URL was un-rewritten MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.816,Hydra:6.0.425,FMLib:17.11.64.514 definitions=2022-02-28_10,2022-02-26_01,2022-02-23_01 X-Proofpoint-Spam-Details: rule=fb_outbound_notspam policy=fb_outbound score=0 clxscore=1015 malwarescore=0 adultscore=0 impostorscore=0 spamscore=0 suspectscore=0 priorityscore=1501 phishscore=0 mlxscore=0 bulkscore=0 lowpriorityscore=0 mlxlogscore=999 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2201110000 definitions=main-2203010026 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org The current tc-bpf@ingress reads and writes the __sk_buff->tstamp as a (rcv) timestamp which currently could either be 0 (not available) or ktime_get_real(). This patch is to backward compatible with the (rcv) timestamp expectation at ingress. If the skb->tstamp has the delivery_time, the bpf insn rewrite will read 0 for tc-bpf running at ingress as it is not available. When writing at ingress, it will also clear the skb->mono_delivery_time bit. /* BPF_READ: a = __sk_buff->tstamp */ if (!skb->tc_at_ingress || !skb->mono_delivery_time) a = skb->tstamp; else a = 0 /* BPF_WRITE: __sk_buff->tstamp = a */ if (skb->tc_at_ingress) skb->mono_delivery_time = 0; skb->tstamp = a; [ A note on the BPF_CGROUP_INET_INGRESS which can also access skb->tstamp. At that point, the skb is delivered locally and skb_clear_delivery_time() has already been done, so the skb->tstamp will only have the (rcv) timestamp. ] If the tc-bpf@egress writes 0 to skb->tstamp, the skb->mono_delivery_time has to be cleared also. It could be done together during convert_ctx_access(). However, the latter patch will also expose the skb->mono_delivery_time bit as __sk_buff->delivery_time_type. Changing the delivery_time_type in the background may surprise the user, e.g. the 2nd read on __sk_buff->delivery_time_type may need a READ_ONCE() to avoid compiler optimization. Thus, in expecting the needs in the latter patch, this patch does a check on !skb->tstamp after running the tc-bpf and clears the skb->mono_delivery_time bit if needed. The earlier discussion on v4 [0]. The bpf insn rewrite requires the skb's mono_delivery_time bit and tc_at_ingress bit. They are moved up in sk_buff so that bpf rewrite can be done at a fixed offset. tc_skip_classify is moved together with tc_at_ingress. To get one bit for mono_delivery_time, csum_not_inet is moved down and this bit is currently used by sctp. [0]: https://lore.kernel.org/bpf/20220217015043.khqwqklx45c4m4se@kafai-mbp.dhcp.thefacebook.com/ Signed-off-by: Martin KaFai Lau --- include/linux/skbuff.h | 18 +++++++---- net/core/filter.c | 71 ++++++++++++++++++++++++++++++++++++------ net/sched/act_bpf.c | 2 ++ net/sched/cls_bpf.c | 2 ++ 4 files changed, 77 insertions(+), 16 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 4b5b926a81f2..5445860e1ba6 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -941,8 +941,12 @@ struct sk_buff { __u8 vlan_present:1; /* See PKT_VLAN_PRESENT_BIT */ __u8 csum_complete_sw:1; __u8 csum_level:2; - __u8 csum_not_inet:1; __u8 dst_pending_confirm:1; + __u8 mono_delivery_time:1; +#ifdef CONFIG_NET_CLS_ACT + __u8 tc_skip_classify:1; + __u8 tc_at_ingress:1; +#endif #ifdef CONFIG_IPV6_NDISC_NODETYPE __u8 ndisc_nodetype:2; #endif @@ -953,10 +957,6 @@ struct sk_buff { #ifdef CONFIG_NET_SWITCHDEV __u8 offload_fwd_mark:1; __u8 offload_l3_fwd_mark:1; -#endif -#ifdef CONFIG_NET_CLS_ACT - __u8 tc_skip_classify:1; - __u8 tc_at_ingress:1; #endif __u8 redirected:1; #ifdef CONFIG_NET_REDIRECT @@ -969,7 +969,7 @@ struct sk_buff { __u8 decrypted:1; #endif __u8 slow_gro:1; - __u8 mono_delivery_time:1; + __u8 csum_not_inet:1; #ifdef CONFIG_NET_SCHED __u16 tc_index; /* traffic control index */ @@ -1047,10 +1047,16 @@ struct sk_buff { /* if you move pkt_vlan_present around you also must adapt these constants */ #ifdef __BIG_ENDIAN_BITFIELD #define PKT_VLAN_PRESENT_BIT 7 +#define TC_AT_INGRESS_MASK (1 << 0) +#define SKB_MONO_DELIVERY_TIME_MASK (1 << 2) #else #define PKT_VLAN_PRESENT_BIT 0 +#define TC_AT_INGRESS_MASK (1 << 7) +#define SKB_MONO_DELIVERY_TIME_MASK (1 << 5) #endif #define PKT_VLAN_PRESENT_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset) +#define TC_AT_INGRESS_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset) +#define SKB_MONO_DELIVERY_TIME_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset) #ifdef __KERNEL__ /* diff --git a/net/core/filter.c b/net/core/filter.c index cfcf9b4d1ec2..5072733743e9 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -8859,6 +8859,65 @@ static struct bpf_insn *bpf_convert_shinfo_access(const struct bpf_insn *si, return insn; } +static struct bpf_insn *bpf_convert_tstamp_read(const struct bpf_insn *si, + struct bpf_insn *insn) +{ + __u8 value_reg = si->dst_reg; + __u8 skb_reg = si->src_reg; + +#ifdef CONFIG_NET_CLS_ACT + __u8 tmp_reg = BPF_REG_AX; + + *insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET); + *insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, TC_AT_INGRESS_MASK); + *insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 5); + /* @ingress, read __sk_buff->tstamp as the (rcv) timestamp, + * so check the skb->mono_delivery_time. + */ + *insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, + SKB_MONO_DELIVERY_TIME_OFFSET); + *insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, + SKB_MONO_DELIVERY_TIME_MASK); + *insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 2); + /* skb->mono_delivery_time is set, read 0 as the (rcv) timestamp. */ + *insn++ = BPF_MOV64_IMM(value_reg, 0); + *insn++ = BPF_JMP_A(1); +#endif + + *insn++ = BPF_LDX_MEM(BPF_DW, value_reg, skb_reg, + offsetof(struct sk_buff, tstamp)); + return insn; +} + +static struct bpf_insn *bpf_convert_tstamp_write(const struct bpf_insn *si, + struct bpf_insn *insn) +{ + __u8 value_reg = si->src_reg; + __u8 skb_reg = si->dst_reg; + +#ifdef CONFIG_NET_CLS_ACT + __u8 tmp_reg = BPF_REG_AX; + + *insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET); + *insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, TC_AT_INGRESS_MASK); + *insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 3); + /* Writing __sk_buff->tstamp at ingress as the (rcv) timestamp. + * Clear the skb->mono_delivery_time. + */ + *insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, + SKB_MONO_DELIVERY_TIME_OFFSET); + *insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, + ~SKB_MONO_DELIVERY_TIME_MASK); + *insn++ = BPF_STX_MEM(BPF_B, skb_reg, tmp_reg, + SKB_MONO_DELIVERY_TIME_OFFSET); +#endif + + /* skb->tstamp = tstamp */ + *insn++ = BPF_STX_MEM(BPF_DW, skb_reg, value_reg, + offsetof(struct sk_buff, tstamp)); + return insn; +} + static u32 bpf_convert_ctx_access(enum bpf_access_type type, const struct bpf_insn *si, struct bpf_insn *insn_buf, @@ -9167,17 +9226,9 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type type, BUILD_BUG_ON(sizeof_field(struct sk_buff, tstamp) != 8); if (type == BPF_WRITE) - *insn++ = BPF_STX_MEM(BPF_DW, - si->dst_reg, si->src_reg, - bpf_target_off(struct sk_buff, - tstamp, 8, - target_size)); + insn = bpf_convert_tstamp_write(si, insn); else - *insn++ = BPF_LDX_MEM(BPF_DW, - si->dst_reg, si->src_reg, - bpf_target_off(struct sk_buff, - tstamp, 8, - target_size)); + insn = bpf_convert_tstamp_read(si, insn); break; case offsetof(struct __sk_buff, gso_segs): diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c index a77d8908e737..fea2d78b9ddc 100644 --- a/net/sched/act_bpf.c +++ b/net/sched/act_bpf.c @@ -53,6 +53,8 @@ static int tcf_bpf_act(struct sk_buff *skb, const struct tc_action *act, bpf_compute_data_pointers(skb); filter_res = bpf_prog_run(filter, skb); } + if (unlikely(!skb->tstamp && skb->mono_delivery_time)) + skb->mono_delivery_time = 0; if (skb_sk_is_prefetched(skb) && filter_res != TC_ACT_OK) skb_orphan(skb); diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c index df19a847829e..c85b85a192bf 100644 --- a/net/sched/cls_bpf.c +++ b/net/sched/cls_bpf.c @@ -102,6 +102,8 @@ static int cls_bpf_classify(struct sk_buff *skb, const struct tcf_proto *tp, bpf_compute_data_pointers(skb); filter_res = bpf_prog_run(prog->filter, skb); } + if (unlikely(!skb->tstamp && skb->mono_delivery_time)) + skb->mono_delivery_time = 0; if (prog->exts_integrated) { res->class = 0; From patchwork Tue Mar 1 05:37:46 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12764129 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D423EC433EF for ; Tue, 1 Mar 2022 05:38:13 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232498AbiCAFiw (ORCPT ); Tue, 1 Mar 2022 00:38:52 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36222 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232501AbiCAFim (ORCPT ); Tue, 1 Mar 2022 00:38:42 -0500 Received: from mx0a-00082601.pphosted.com (mx0a-00082601.pphosted.com [67.231.145.42]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6BCAD49F10 for ; Mon, 28 Feb 2022 21:37:58 -0800 (PST) Received: from pps.filterd (m0109334.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.1.2/8.16.1.2) with ESMTP id 21SMwccw019902 for ; Mon, 28 Feb 2022 21:37:58 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=hGFX2LDuDKaw6ZLgTyhmgP9Z1Lhpf8924/hIS5ltyz4=; b=Vyf6BgdHntErre3GDTyzNm8HWwDj2uvz5a9BoGV/TStcJFYQMxzE0PPcJaaEUGjzXGdt JlhWq1ee88si/e6yaWjdEgmjXf6ajM/5s1DFdL4cvJEJ6XXcO07KMGSmvsakciJvoBub CzSGpzJulKy0qH/84wq9zY42S1edy90+7DM= Received: from mail.thefacebook.com ([163.114.132.120]) by mx0a-00082601.pphosted.com (PPS) with ESMTPS id 3egx7hp03u-2 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Mon, 28 Feb 2022 21:37:58 -0800 Received: from twshared21672.25.frc3.facebook.com (2620:10d:c085:108::4) by mail.thefacebook.com (2620:10d:c085:11d::4) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2308.21; Mon, 28 Feb 2022 21:37:56 -0800 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id EC4A77A89655; Mon, 28 Feb 2022 21:37:46 -0800 (PST) From: Martin KaFai Lau To: , CC: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , David Miller , Eric Dumazet , Jakub Kicinski , , Willem de Bruijn Subject: [PATCH v5 net-next 12/13] bpf: Add __sk_buff->delivery_time_type and bpf_skb_set_skb_delivery_time() Date: Mon, 28 Feb 2022 21:37:46 -0800 Message-ID: <20220301053746.935420-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220301053631.930498-1-kafai@fb.com> References: <20220301053631.930498-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-GUID: 0HbH8INGO2xZMofpnQqwXdlnrRWMz9I8 X-Proofpoint-ORIG-GUID: 0HbH8INGO2xZMofpnQqwXdlnrRWMz9I8 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.816,Hydra:6.0.425,FMLib:17.11.64.514 definitions=2022-02-28_10,2022-02-26_01,2022-02-23_01 X-Proofpoint-Spam-Details: rule=fb_outbound_notspam policy=fb_outbound score=0 suspectscore=0 bulkscore=0 phishscore=0 impostorscore=0 malwarescore=0 clxscore=1015 lowpriorityscore=0 adultscore=0 priorityscore=1501 mlxscore=0 spamscore=0 mlxlogscore=999 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2201110000 definitions=main-2203010026 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org * __sk_buff->delivery_time_type: This patch adds __sk_buff->delivery_time_type. It tells if the delivery_time is stored in __sk_buff->tstamp or not. It will be most useful for ingress to tell if the __sk_buff->tstamp has the (rcv) timestamp or delivery_time. If delivery_time_type is 0 (BPF_SKB_DELIVERY_TIME_NONE), it has the (rcv) timestamp. Two non-zero types are defined for the delivery_time_type, BPF_SKB_DELIVERY_TIME_MONO and BPF_SKB_DELIVERY_TIME_UNSPEC. For UNSPEC, it can only happen in egress because only mono delivery_time can be forwarded to ingress now. The clock of UNSPEC delivery_time can be deduced from the skb->sk->sk_clockid which is how the sch_etf doing it also. * Provide forwarded delivery_time to tc-bpf@ingress: With the help of the new delivery_time_type, the tc-bpf has a way to tell if the __sk_buff->tstamp has the (rcv) timestamp or the delivery_time. During bpf load time, the verifier will learn if the bpf prog has accessed the new __sk_buff->delivery_time_type. If it does, it means the tc-bpf@ingress is expecting the skb->tstamp could have the delivery_time. The kernel will then read the skb->tstamp as-is during bpf insn rewrite without checking the skb->mono_delivery_time. This is done by adding a new prog->delivery_time_access bit. The same goes for writing skb->tstamp. * bpf_skb_set_delivery_time(): The bpf_skb_set_delivery_time() helper is added to allow setting both delivery_time and the delivery_time_type at the same time. If the tc-bpf does not need to change the delivery_time_type, it can directly write to the __sk_buff->tstamp as the existing tc-bpf has already been doing. It will be most useful at ingress to change the __sk_buff->tstamp from the (rcv) timestamp to a mono delivery_time and then bpf_redirect_*(). bpf only has mono clock helper (bpf_ktime_get_ns), and the current known use case is the mono EDT for fq, and only mono delivery time can be kept during forward now, so bpf_skb_set_delivery_time() only supports setting BPF_SKB_DELIVERY_TIME_MONO. It can be extended later when use cases come up and the forwarding path also supports other clock bases. Signed-off-by: Martin KaFai Lau --- include/linux/filter.h | 3 +- include/uapi/linux/bpf.h | 41 +++++++- net/core/filter.c | 169 ++++++++++++++++++++++++++------- tools/include/uapi/linux/bpf.h | 41 +++++++- 4 files changed, 216 insertions(+), 38 deletions(-) diff --git a/include/linux/filter.h b/include/linux/filter.h index 1cb1af917617..9bf26307247f 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -572,7 +572,8 @@ struct bpf_prog { has_callchain_buf:1, /* callchain buffer allocated? */ enforce_expected_attach_type:1, /* Enforce expected_attach_type checking at attach time */ call_get_stack:1, /* Do we call bpf_get_stack() or bpf_get_stackid() */ - call_get_func_ip:1; /* Do we call get_func_ip() */ + call_get_func_ip:1, /* Do we call get_func_ip() */ + delivery_time_access:1; /* Accessed __sk_buff->delivery_time_type */ enum bpf_prog_type type; /* Type of BPF program */ enum bpf_attach_type expected_attach_type; /* For some prog types */ u32 len; /* Number of filter blocks */ diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index afe3d0d7f5f2..4eebea830613 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -5086,6 +5086,37 @@ union bpf_attr { * Return * 0 on success, or a negative error in case of failure. On error * *dst* buffer is zeroed out. + * + * long bpf_skb_set_delivery_time(struct sk_buff *skb, u64 dtime, u32 dtime_type) + * Description + * Set a *dtime* (delivery time) to the __sk_buff->tstamp and also + * change the __sk_buff->delivery_time_type to *dtime_type*. + * + * When setting a delivery time (non zero *dtime*) to + * __sk_buff->tstamp, only BPF_SKB_DELIVERY_TIME_MONO *dtime_type* + * is supported. It is the only delivery_time_type that will be + * kept after bpf_redirect_*(). + * + * If there is no need to change the __sk_buff->delivery_time_type, + * the delivery time can be directly written to __sk_buff->tstamp + * instead. + * + * *dtime* 0 and *dtime_type* BPF_SKB_DELIVERY_TIME_NONE + * can be used to clear any delivery time stored in + * __sk_buff->tstamp. + * + * Only IPv4 and IPv6 skb->protocol are supported. + * + * This function is most useful when it needs to set a + * mono delivery time to __sk_buff->tstamp and then + * bpf_redirect_*() to the egress of an iface. For example, + * changing the (rcv) timestamp in __sk_buff->tstamp at + * ingress to a mono delivery time and then bpf_redirect_*() + * to sch_fq@phy-dev. + * Return + * 0 on success. + * **-EINVAL** for invalid input + * **-EOPNOTSUPP** for unsupported delivery_time_type and protocol */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -5280,6 +5311,7 @@ union bpf_attr { FN(xdp_load_bytes), \ FN(xdp_store_bytes), \ FN(copy_from_user_task), \ + FN(skb_set_delivery_time), \ /* */ /* integer value in 'imm' field of BPF_CALL instruction selects which helper @@ -5469,6 +5501,12 @@ union { \ __u64 :64; \ } __attribute__((aligned(8))) +enum { + BPF_SKB_DELIVERY_TIME_NONE, + BPF_SKB_DELIVERY_TIME_UNSPEC, + BPF_SKB_DELIVERY_TIME_MONO, +}; + /* user accessible mirror of in-kernel sk_buff. * new fields can only be added to the end of this structure */ @@ -5509,7 +5547,8 @@ struct __sk_buff { __u32 gso_segs; __bpf_md_ptr(struct bpf_sock *, sk); __u32 gso_size; - __u32 :32; /* Padding, future use. */ + __u8 delivery_time_type; + __u32 :24; /* Padding, future use. */ __u64 hwtstamp; }; diff --git a/net/core/filter.c b/net/core/filter.c index 5072733743e9..88767f7da150 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -7388,6 +7388,43 @@ static const struct bpf_func_proto bpf_sock_ops_reserve_hdr_opt_proto = { .arg3_type = ARG_ANYTHING, }; +BPF_CALL_3(bpf_skb_set_delivery_time, struct sk_buff *, skb, + u64, dtime, u32, dtime_type) +{ + /* skb_clear_delivery_time() is done for inet protocol */ + if (skb->protocol != htons(ETH_P_IP) && + skb->protocol != htons(ETH_P_IPV6)) + return -EOPNOTSUPP; + + switch (dtime_type) { + case BPF_SKB_DELIVERY_TIME_MONO: + if (!dtime) + return -EINVAL; + skb->tstamp = dtime; + skb->mono_delivery_time = 1; + break; + case BPF_SKB_DELIVERY_TIME_NONE: + if (dtime) + return -EINVAL; + skb->tstamp = 0; + skb->mono_delivery_time = 0; + break; + default: + return -EOPNOTSUPP; + } + + return 0; +} + +static const struct bpf_func_proto bpf_skb_set_delivery_time_proto = { + .func = bpf_skb_set_delivery_time, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_ANYTHING, +}; + #endif /* CONFIG_INET */ bool bpf_helper_changes_pkt_data(void *func) @@ -7749,6 +7786,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_tcp_gen_syncookie_proto; case BPF_FUNC_sk_assign: return &bpf_sk_assign_proto; + case BPF_FUNC_skb_set_delivery_time: + return &bpf_skb_set_delivery_time_proto; #endif default: return bpf_sk_base_func_proto(func_id); @@ -8088,7 +8127,9 @@ static bool bpf_skb_is_valid_access(int off, int size, enum bpf_access_type type return false; info->reg_type = PTR_TO_SOCK_COMMON_OR_NULL; break; - case offsetofend(struct __sk_buff, gso_size) ... offsetof(struct __sk_buff, hwtstamp) - 1: + case offsetof(struct __sk_buff, delivery_time_type): + return false; + case offsetofend(struct __sk_buff, delivery_time_type) ... offsetof(struct __sk_buff, hwtstamp) - 1: /* Explicitly prohibit access to padding in __sk_buff. */ return false; default: @@ -8443,6 +8484,15 @@ static bool tc_cls_act_is_valid_access(int off, int size, break; case bpf_ctx_range_till(struct __sk_buff, family, local_port): return false; + case offsetof(struct __sk_buff, delivery_time_type): + /* The convert_ctx_access() on reading and writing + * __sk_buff->tstamp depends on whether the bpf prog + * has used __sk_buff->delivery_time_type or not. + * Thus, we need to set prog->delivery_time_access + * earlier during is_valid_access() here. + */ + ((struct bpf_prog *)prog)->delivery_time_access = 1; + return size == sizeof(__u8); } return bpf_skb_is_valid_access(off, size, type, prog, info); @@ -8838,6 +8888,45 @@ static u32 flow_dissector_convert_ctx_access(enum bpf_access_type type, return insn - insn_buf; } +static struct bpf_insn *bpf_convert_dtime_type_read(const struct bpf_insn *si, + struct bpf_insn *insn) +{ + __u8 value_reg = si->dst_reg; + __u8 skb_reg = si->src_reg; + __u8 tmp_reg = BPF_REG_AX; + + *insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, + SKB_MONO_DELIVERY_TIME_OFFSET); + *insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, + SKB_MONO_DELIVERY_TIME_MASK); + *insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 2); + /* value_reg = BPF_SKB_DELIVERY_TIME_MONO */ + *insn++ = BPF_MOV32_IMM(value_reg, BPF_SKB_DELIVERY_TIME_MONO); + *insn++ = BPF_JMP_A(IS_ENABLED(CONFIG_NET_CLS_ACT) ? 10 : 5); + + *insn++ = BPF_LDX_MEM(BPF_DW, tmp_reg, skb_reg, + offsetof(struct sk_buff, tstamp)); + *insn++ = BPF_JMP_IMM(BPF_JNE, tmp_reg, 0, 2); + /* value_reg = BPF_SKB_DELIVERY_TIME_NONE */ + *insn++ = BPF_MOV32_IMM(value_reg, BPF_SKB_DELIVERY_TIME_NONE); + *insn++ = BPF_JMP_A(IS_ENABLED(CONFIG_NET_CLS_ACT) ? 6 : 1); + +#ifdef CONFIG_NET_CLS_ACT + *insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET); + *insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, TC_AT_INGRESS_MASK); + *insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 2); + /* At ingress, value_reg = 0 */ + *insn++ = BPF_MOV32_IMM(value_reg, 0); + *insn++ = BPF_JMP_A(1); +#endif + + /* value_reg = BPF_SKB_DELIVERYT_TIME_UNSPEC */ + *insn++ = BPF_MOV32_IMM(value_reg, BPF_SKB_DELIVERY_TIME_UNSPEC); + + /* 15 insns with CONFIG_NET_CLS_ACT */ + return insn; +} + static struct bpf_insn *bpf_convert_shinfo_access(const struct bpf_insn *si, struct bpf_insn *insn) { @@ -8859,29 +8948,32 @@ static struct bpf_insn *bpf_convert_shinfo_access(const struct bpf_insn *si, return insn; } -static struct bpf_insn *bpf_convert_tstamp_read(const struct bpf_insn *si, +static struct bpf_insn *bpf_convert_tstamp_read(const struct bpf_prog *prog, + const struct bpf_insn *si, struct bpf_insn *insn) { __u8 value_reg = si->dst_reg; __u8 skb_reg = si->src_reg; #ifdef CONFIG_NET_CLS_ACT - __u8 tmp_reg = BPF_REG_AX; - - *insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET); - *insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, TC_AT_INGRESS_MASK); - *insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 5); - /* @ingress, read __sk_buff->tstamp as the (rcv) timestamp, - * so check the skb->mono_delivery_time. - */ - *insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, - SKB_MONO_DELIVERY_TIME_OFFSET); - *insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, - SKB_MONO_DELIVERY_TIME_MASK); - *insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 2); - /* skb->mono_delivery_time is set, read 0 as the (rcv) timestamp. */ - *insn++ = BPF_MOV64_IMM(value_reg, 0); - *insn++ = BPF_JMP_A(1); + if (!prog->delivery_time_access) { + __u8 tmp_reg = BPF_REG_AX; + + *insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET); + *insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, TC_AT_INGRESS_MASK); + *insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 5); + /* @ingress, read __sk_buff->tstamp as the (rcv) timestamp, + * so check the skb->mono_delivery_time. + */ + *insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, + SKB_MONO_DELIVERY_TIME_OFFSET); + *insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, + SKB_MONO_DELIVERY_TIME_MASK); + *insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 2); + /* skb->mono_delivery_time is set, read 0 as the (rcv) timestamp. */ + *insn++ = BPF_MOV64_IMM(value_reg, 0); + *insn++ = BPF_JMP_A(1); + } #endif *insn++ = BPF_LDX_MEM(BPF_DW, value_reg, skb_reg, @@ -8889,27 +8981,30 @@ static struct bpf_insn *bpf_convert_tstamp_read(const struct bpf_insn *si, return insn; } -static struct bpf_insn *bpf_convert_tstamp_write(const struct bpf_insn *si, +static struct bpf_insn *bpf_convert_tstamp_write(const struct bpf_prog *prog, + const struct bpf_insn *si, struct bpf_insn *insn) { __u8 value_reg = si->src_reg; __u8 skb_reg = si->dst_reg; #ifdef CONFIG_NET_CLS_ACT - __u8 tmp_reg = BPF_REG_AX; - - *insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET); - *insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, TC_AT_INGRESS_MASK); - *insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 3); - /* Writing __sk_buff->tstamp at ingress as the (rcv) timestamp. - * Clear the skb->mono_delivery_time. - */ - *insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, - SKB_MONO_DELIVERY_TIME_OFFSET); - *insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, - ~SKB_MONO_DELIVERY_TIME_MASK); - *insn++ = BPF_STX_MEM(BPF_B, skb_reg, tmp_reg, - SKB_MONO_DELIVERY_TIME_OFFSET); + if (!prog->delivery_time_access) { + __u8 tmp_reg = BPF_REG_AX; + + *insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET); + *insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, TC_AT_INGRESS_MASK); + *insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 3); + /* Writing __sk_buff->tstamp at ingress as the (rcv) timestamp. + * Clear the skb->mono_delivery_time. + */ + *insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, + SKB_MONO_DELIVERY_TIME_OFFSET); + *insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, + ~SKB_MONO_DELIVERY_TIME_MASK); + *insn++ = BPF_STX_MEM(BPF_B, skb_reg, tmp_reg, + SKB_MONO_DELIVERY_TIME_OFFSET); + } #endif /* skb->tstamp = tstamp */ @@ -9226,9 +9321,13 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type type, BUILD_BUG_ON(sizeof_field(struct sk_buff, tstamp) != 8); if (type == BPF_WRITE) - insn = bpf_convert_tstamp_write(si, insn); + insn = bpf_convert_tstamp_write(prog, si, insn); else - insn = bpf_convert_tstamp_read(si, insn); + insn = bpf_convert_tstamp_read(prog, si, insn); + break; + + case offsetof(struct __sk_buff, delivery_time_type): + insn = bpf_convert_dtime_type_read(si, insn); break; case offsetof(struct __sk_buff, gso_segs): diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index afe3d0d7f5f2..4eebea830613 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -5086,6 +5086,37 @@ union bpf_attr { * Return * 0 on success, or a negative error in case of failure. On error * *dst* buffer is zeroed out. + * + * long bpf_skb_set_delivery_time(struct sk_buff *skb, u64 dtime, u32 dtime_type) + * Description + * Set a *dtime* (delivery time) to the __sk_buff->tstamp and also + * change the __sk_buff->delivery_time_type to *dtime_type*. + * + * When setting a delivery time (non zero *dtime*) to + * __sk_buff->tstamp, only BPF_SKB_DELIVERY_TIME_MONO *dtime_type* + * is supported. It is the only delivery_time_type that will be + * kept after bpf_redirect_*(). + * + * If there is no need to change the __sk_buff->delivery_time_type, + * the delivery time can be directly written to __sk_buff->tstamp + * instead. + * + * *dtime* 0 and *dtime_type* BPF_SKB_DELIVERY_TIME_NONE + * can be used to clear any delivery time stored in + * __sk_buff->tstamp. + * + * Only IPv4 and IPv6 skb->protocol are supported. + * + * This function is most useful when it needs to set a + * mono delivery time to __sk_buff->tstamp and then + * bpf_redirect_*() to the egress of an iface. For example, + * changing the (rcv) timestamp in __sk_buff->tstamp at + * ingress to a mono delivery time and then bpf_redirect_*() + * to sch_fq@phy-dev. + * Return + * 0 on success. + * **-EINVAL** for invalid input + * **-EOPNOTSUPP** for unsupported delivery_time_type and protocol */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -5280,6 +5311,7 @@ union bpf_attr { FN(xdp_load_bytes), \ FN(xdp_store_bytes), \ FN(copy_from_user_task), \ + FN(skb_set_delivery_time), \ /* */ /* integer value in 'imm' field of BPF_CALL instruction selects which helper @@ -5469,6 +5501,12 @@ union { \ __u64 :64; \ } __attribute__((aligned(8))) +enum { + BPF_SKB_DELIVERY_TIME_NONE, + BPF_SKB_DELIVERY_TIME_UNSPEC, + BPF_SKB_DELIVERY_TIME_MONO, +}; + /* user accessible mirror of in-kernel sk_buff. * new fields can only be added to the end of this structure */ @@ -5509,7 +5547,8 @@ struct __sk_buff { __u32 gso_segs; __bpf_md_ptr(struct bpf_sock *, sk); __u32 gso_size; - __u32 :32; /* Padding, future use. */ + __u8 delivery_time_type; + __u32 :24; /* Padding, future use. */ __u64 hwtstamp; }; From patchwork Tue Mar 1 05:37:53 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12764130 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id B8622C433FE for ; Tue, 1 Mar 2022 05:38:23 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232520AbiCAFi7 (ORCPT ); Tue, 1 Mar 2022 00:38:59 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36998 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232518AbiCAFi5 (ORCPT ); Tue, 1 Mar 2022 00:38:57 -0500 Received: from mx0b-00082601.pphosted.com (mx0b-00082601.pphosted.com [67.231.153.30]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C1724517F7 for ; Mon, 28 Feb 2022 21:38:09 -0800 (PST) Received: from pps.filterd (m0109332.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.1.2/8.16.1.2) with ESMTP id 21SMwTbB023008 for ; Mon, 28 Feb 2022 21:38:08 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=MxX0lSSmDJE0TuvUqa7uSenDEh6DeagKOc2+fg/4yfI=; b=lywQNO9Rsxy7eNgmn9sP9cICHOHHyZcsM68S/jXjBIhJoyZSbGVC2gBZt4qu4VNK2/0y TUW2mf209UTbUeeQYXa7QlEAaR7JpWU7tNeHCuq+IZgKJ9MxJmWEWIZf8jzsJomKXhhn GEIMN8KR9Fz4dlxWm8cS/vCBDWHsUHKyo3I= Received: from maileast.thefacebook.com ([163.114.130.16]) by mx0a-00082601.pphosted.com (PPS) with ESMTPS id 3egx7a5yh5-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Mon, 28 Feb 2022 21:38:08 -0800 Received: from twshared11487.23.frc3.facebook.com (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:83::6) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2308.21; Mon, 28 Feb 2022 21:38:07 -0800 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 3B0DB7A89664; Mon, 28 Feb 2022 21:37:53 -0800 (PST) From: Martin KaFai Lau To: , CC: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , David Miller , Eric Dumazet , Jakub Kicinski , , Willem de Bruijn Subject: [PATCH v5 net-next 13/13] bpf: selftests: test skb->tstamp in redirect_neigh Date: Mon, 28 Feb 2022 21:37:53 -0800 Message-ID: <20220301053753.935762-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220301053631.930498-1-kafai@fb.com> References: <20220301053631.930498-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-GUID: wZMjTZUJBRhks0u10eWNfm8QOkSimmSZ X-Proofpoint-ORIG-GUID: wZMjTZUJBRhks0u10eWNfm8QOkSimmSZ X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.816,Hydra:6.0.425,FMLib:17.11.64.514 definitions=2022-02-28_10,2022-02-26_01,2022-02-23_01 X-Proofpoint-Spam-Details: rule=fb_outbound_notspam policy=fb_outbound score=0 priorityscore=1501 impostorscore=0 lowpriorityscore=0 clxscore=1015 phishscore=0 mlxlogscore=999 suspectscore=0 bulkscore=0 spamscore=0 mlxscore=0 adultscore=0 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2201110000 definitions=main-2203010026 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org This patch adds tests on forwarding the delivery_time for the following cases - tcp/udp + ip4/ip6 + bpf_redirect_neigh - tcp/udp + ip4/ip6 + ip[6]_forward - bpf_skb_set_delivery_time - The old rcv timestamp expectation on tc-bpf@ingress Signed-off-by: Martin KaFai Lau --- .../selftests/bpf/prog_tests/tc_redirect.c | 434 ++++++++++++++++++ .../selftests/bpf/progs/test_tc_dtime.c | 347 ++++++++++++++ 2 files changed, 781 insertions(+) create mode 100644 tools/testing/selftests/bpf/progs/test_tc_dtime.c diff --git a/tools/testing/selftests/bpf/prog_tests/tc_redirect.c b/tools/testing/selftests/bpf/prog_tests/tc_redirect.c index 647b0a833628..2b255e28ed26 100644 --- a/tools/testing/selftests/bpf/prog_tests/tc_redirect.c +++ b/tools/testing/selftests/bpf/prog_tests/tc_redirect.c @@ -17,6 +17,8 @@ #include #include #include +#include +#include #include #include #include @@ -29,6 +31,11 @@ #include "test_tc_neigh_fib.skel.h" #include "test_tc_neigh.skel.h" #include "test_tc_peer.skel.h" +#include "test_tc_dtime.skel.h" + +#ifndef TCP_TX_DELAY +#define TCP_TX_DELAY 37 +#endif #define NS_SRC "ns_src" #define NS_FWD "ns_fwd" @@ -61,6 +68,7 @@ #define CHK_PROG_PIN_FILE "/sys/fs/bpf/test_tc_chk" #define TIMEOUT_MILLIS 10000 +#define NSEC_PER_SEC 1000000000ULL #define log_err(MSG, ...) \ fprintf(stderr, "(%s:%d: errno: %s) " MSG "\n", \ @@ -440,6 +448,431 @@ static int set_forwarding(bool enable) return 0; } +static void rcv_tstamp(int fd, const char *expected, size_t s) +{ + struct __kernel_timespec pkt_ts = {}; + char ctl[CMSG_SPACE(sizeof(pkt_ts))]; + struct timespec now_ts; + struct msghdr msg = {}; + __u64 now_ns, pkt_ns; + struct cmsghdr *cmsg; + struct iovec iov; + char data[32]; + int ret; + + iov.iov_base = data; + iov.iov_len = sizeof(data); + msg.msg_iov = &iov; + msg.msg_iovlen = 1; + msg.msg_control = &ctl; + msg.msg_controllen = sizeof(ctl); + + ret = recvmsg(fd, &msg, 0); + if (!ASSERT_EQ(ret, s, "recvmsg")) + return; + ASSERT_STRNEQ(data, expected, s, "expected rcv data"); + + cmsg = CMSG_FIRSTHDR(&msg); + if (cmsg && cmsg->cmsg_level == SOL_SOCKET && + cmsg->cmsg_type == SO_TIMESTAMPNS_NEW) + memcpy(&pkt_ts, CMSG_DATA(cmsg), sizeof(pkt_ts)); + + pkt_ns = pkt_ts.tv_sec * NSEC_PER_SEC + pkt_ts.tv_nsec; + ASSERT_NEQ(pkt_ns, 0, "pkt rcv tstamp"); + + ret = clock_gettime(CLOCK_REALTIME, &now_ts); + ASSERT_OK(ret, "clock_gettime"); + now_ns = now_ts.tv_sec * NSEC_PER_SEC + now_ts.tv_nsec; + + if (ASSERT_GE(now_ns, pkt_ns, "check rcv tstamp")) + ASSERT_LT(now_ns - pkt_ns, 5 * NSEC_PER_SEC, + "check rcv tstamp"); +} + +static void snd_tstamp(int fd, char *b, size_t s) +{ + struct sock_txtime opt = { .clockid = CLOCK_TAI }; + char ctl[CMSG_SPACE(sizeof(__u64))]; + struct timespec now_ts; + struct msghdr msg = {}; + struct cmsghdr *cmsg; + struct iovec iov; + __u64 now_ns; + int ret; + + ret = clock_gettime(CLOCK_TAI, &now_ts); + ASSERT_OK(ret, "clock_get_time(CLOCK_TAI)"); + now_ns = now_ts.tv_sec * NSEC_PER_SEC + now_ts.tv_nsec; + + iov.iov_base = b; + iov.iov_len = s; + msg.msg_iov = &iov; + msg.msg_iovlen = 1; + msg.msg_control = &ctl; + msg.msg_controllen = sizeof(ctl); + + cmsg = CMSG_FIRSTHDR(&msg); + cmsg->cmsg_level = SOL_SOCKET; + cmsg->cmsg_type = SCM_TXTIME; + cmsg->cmsg_len = CMSG_LEN(sizeof(now_ns)); + *(__u64 *)CMSG_DATA(cmsg) = now_ns; + + ret = setsockopt(fd, SOL_SOCKET, SO_TXTIME, &opt, sizeof(opt)); + ASSERT_OK(ret, "setsockopt(SO_TXTIME)"); + + ret = sendmsg(fd, &msg, 0); + ASSERT_EQ(ret, s, "sendmsg"); +} + +static void test_inet_dtime(int family, int type, const char *addr, __u16 port) +{ + int opt = 1, accept_fd = -1, client_fd = -1, listen_fd, err; + char buf[] = "testing testing"; + struct nstoken *nstoken; + + nstoken = open_netns(NS_DST); + if (!ASSERT_OK_PTR(nstoken, "setns dst")) + return; + listen_fd = start_server(family, type, addr, port, 0); + close_netns(nstoken); + + if (!ASSERT_GE(listen_fd, 0, "listen")) + return; + + /* Ensure the kernel puts the (rcv) timestamp for all skb */ + err = setsockopt(listen_fd, SOL_SOCKET, SO_TIMESTAMPNS_NEW, + &opt, sizeof(opt)); + if (!ASSERT_OK(err, "setsockopt(SO_TIMESTAMPNS_NEW)")) + goto done; + + if (type == SOCK_STREAM) { + /* Ensure the kernel set EDT when sending out rst/ack + * from the kernel's ctl_sk. + */ + err = setsockopt(listen_fd, SOL_TCP, TCP_TX_DELAY, &opt, + sizeof(opt)); + if (!ASSERT_OK(err, "setsockopt(TCP_TX_DELAY)")) + goto done; + } + + nstoken = open_netns(NS_SRC); + if (!ASSERT_OK_PTR(nstoken, "setns src")) + goto done; + client_fd = connect_to_fd(listen_fd, TIMEOUT_MILLIS); + close_netns(nstoken); + + if (!ASSERT_GE(client_fd, 0, "connect_to_fd")) + goto done; + + if (type == SOCK_STREAM) { + int n; + + accept_fd = accept(listen_fd, NULL, NULL); + if (!ASSERT_GE(accept_fd, 0, "accept")) + goto done; + + n = write(client_fd, buf, sizeof(buf)); + if (!ASSERT_EQ(n, sizeof(buf), "send to server")) + goto done; + rcv_tstamp(accept_fd, buf, sizeof(buf)); + } else { + snd_tstamp(client_fd, buf, sizeof(buf)); + rcv_tstamp(listen_fd, buf, sizeof(buf)); + } + +done: + close(listen_fd); + if (accept_fd != -1) + close(accept_fd); + if (client_fd != -1) + close(client_fd); +} + +static int netns_load_dtime_bpf(struct test_tc_dtime *skel) +{ + struct nstoken *nstoken; + +#define PIN_FNAME(__file) "/sys/fs/bpf/" #__file +#define PIN(__prog) ({ \ + int err = bpf_program__pin(skel->progs.__prog, PIN_FNAME(__prog)); \ + if (!ASSERT_OK(err, "pin " #__prog)) \ + goto fail; \ + }) + + /* setup ns_src tc progs */ + nstoken = open_netns(NS_SRC); + if (!ASSERT_OK_PTR(nstoken, "setns " NS_SRC)) + return -1; + PIN(egress_host); + PIN(ingress_host); + SYS("tc qdisc add dev veth_src clsact"); + SYS("tc filter add dev veth_src ingress bpf da object-pinned " + PIN_FNAME(ingress_host)); + SYS("tc filter add dev veth_src egress bpf da object-pinned " + PIN_FNAME(egress_host)); + close_netns(nstoken); + + /* setup ns_dst tc progs */ + nstoken = open_netns(NS_DST); + if (!ASSERT_OK_PTR(nstoken, "setns " NS_DST)) + return -1; + PIN(egress_host); + PIN(ingress_host); + SYS("tc qdisc add dev veth_dst clsact"); + SYS("tc filter add dev veth_dst ingress bpf da object-pinned " + PIN_FNAME(ingress_host)); + SYS("tc filter add dev veth_dst egress bpf da object-pinned " + PIN_FNAME(egress_host)); + close_netns(nstoken); + + /* setup ns_fwd tc progs */ + nstoken = open_netns(NS_FWD); + if (!ASSERT_OK_PTR(nstoken, "setns " NS_FWD)) + return -1; + PIN(ingress_fwdns_prio100); + PIN(egress_fwdns_prio100); + PIN(ingress_fwdns_prio101); + PIN(egress_fwdns_prio101); + SYS("tc qdisc add dev veth_dst_fwd clsact"); + SYS("tc filter add dev veth_dst_fwd ingress prio 100 bpf da object-pinned " + PIN_FNAME(ingress_fwdns_prio100)); + SYS("tc filter add dev veth_dst_fwd ingress prio 101 bpf da object-pinned " + PIN_FNAME(ingress_fwdns_prio101)); + SYS("tc filter add dev veth_dst_fwd egress prio 100 bpf da object-pinned " + PIN_FNAME(egress_fwdns_prio100)); + SYS("tc filter add dev veth_dst_fwd egress prio 101 bpf da object-pinned " + PIN_FNAME(egress_fwdns_prio101)); + SYS("tc qdisc add dev veth_src_fwd clsact"); + SYS("tc filter add dev veth_src_fwd ingress prio 100 bpf da object-pinned " + PIN_FNAME(ingress_fwdns_prio100)); + SYS("tc filter add dev veth_src_fwd ingress prio 101 bpf da object-pinned " + PIN_FNAME(ingress_fwdns_prio101)); + SYS("tc filter add dev veth_src_fwd egress prio 100 bpf da object-pinned " + PIN_FNAME(egress_fwdns_prio100)); + SYS("tc filter add dev veth_src_fwd egress prio 101 bpf da object-pinned " + PIN_FNAME(egress_fwdns_prio101)); + close_netns(nstoken); + +#undef PIN + + return 0; + +fail: + close_netns(nstoken); + return -1; +} + +enum { + INGRESS_FWDNS_P100, + INGRESS_FWDNS_P101, + EGRESS_FWDNS_P100, + EGRESS_FWDNS_P101, + INGRESS_ENDHOST, + EGRESS_ENDHOST, + SET_DTIME, + __MAX_CNT, +}; + +const char *cnt_names[] = { + "ingress_fwdns_p100", + "ingress_fwdns_p101", + "egress_fwdns_p100", + "egress_fwdns_p101", + "ingress_endhost", + "egress_endhost", + "set_dtime", +}; + +enum { + TCP_IP6_CLEAR_DTIME, + TCP_IP4, + TCP_IP6, + UDP_IP4, + UDP_IP6, + TCP_IP4_RT_FWD, + TCP_IP6_RT_FWD, + UDP_IP4_RT_FWD, + UDP_IP6_RT_FWD, + UKN_TEST, + __NR_TESTS, +}; + +const char *test_names[] = { + "tcp ip6 clear dtime", + "tcp ip4", + "tcp ip6", + "udp ip4", + "udp ip6", + "tcp ip4 rt fwd", + "tcp ip6 rt fwd", + "udp ip4 rt fwd", + "udp ip6 rt fwd", +}; + +static const char *dtime_cnt_str(int test, int cnt) +{ + static char name[64]; + + snprintf(name, sizeof(name), "%s %s", test_names[test], cnt_names[cnt]); + + return name; +} + +static const char *dtime_err_str(int test, int cnt) +{ + static char name[64]; + + snprintf(name, sizeof(name), "%s %s errs", test_names[test], + cnt_names[cnt]); + + return name; +} + +static void test_tcp_clear_dtime(struct test_tc_dtime *skel) +{ + int i, t = TCP_IP6_CLEAR_DTIME; + __u32 *dtimes = skel->bss->dtimes[t]; + __u32 *errs = skel->bss->errs[t]; + + skel->bss->test = t; + test_inet_dtime(AF_INET6, SOCK_STREAM, IP6_DST, 0); + + ASSERT_EQ(dtimes[INGRESS_FWDNS_P100], 0, + dtime_cnt_str(t, INGRESS_FWDNS_P100)); + ASSERT_EQ(dtimes[INGRESS_FWDNS_P101], 0, + dtime_cnt_str(t, INGRESS_FWDNS_P101)); + ASSERT_GT(dtimes[EGRESS_FWDNS_P100], 0, + dtime_cnt_str(t, EGRESS_FWDNS_P100)); + ASSERT_EQ(dtimes[EGRESS_FWDNS_P101], 0, + dtime_cnt_str(t, EGRESS_FWDNS_P101)); + ASSERT_GT(dtimes[EGRESS_ENDHOST], 0, + dtime_cnt_str(t, EGRESS_ENDHOST)); + ASSERT_GT(dtimes[INGRESS_ENDHOST], 0, + dtime_cnt_str(t, INGRESS_ENDHOST)); + + for (i = INGRESS_FWDNS_P100; i < __MAX_CNT; i++) + ASSERT_EQ(errs[i], 0, dtime_err_str(t, i)); +} + +static void test_tcp_dtime(struct test_tc_dtime *skel, int family, bool bpf_fwd) +{ + __u32 *dtimes, *errs; + const char *addr; + int i, t; + + if (family == AF_INET) { + t = bpf_fwd ? TCP_IP4 : TCP_IP4_RT_FWD; + addr = IP4_DST; + } else { + t = bpf_fwd ? TCP_IP6 : TCP_IP6_RT_FWD; + addr = IP6_DST; + } + + dtimes = skel->bss->dtimes[t]; + errs = skel->bss->errs[t]; + + skel->bss->test = t; + test_inet_dtime(family, SOCK_STREAM, addr, 0); + + /* fwdns_prio100 prog does not read delivery_time_type, so + * kernel puts the (rcv) timetamp in __sk_buff->tstamp + */ + ASSERT_EQ(dtimes[INGRESS_FWDNS_P100], 0, + dtime_cnt_str(t, INGRESS_FWDNS_P100)); + for (i = INGRESS_FWDNS_P101; i < SET_DTIME; i++) + ASSERT_GT(dtimes[i], 0, dtime_cnt_str(t, i)); + + for (i = INGRESS_FWDNS_P100; i < __MAX_CNT; i++) + ASSERT_EQ(errs[i], 0, dtime_err_str(t, i)); +} + +static void test_udp_dtime(struct test_tc_dtime *skel, int family, bool bpf_fwd) +{ + __u32 *dtimes, *errs; + const char *addr; + int i, t; + + if (family == AF_INET) { + t = bpf_fwd ? UDP_IP4 : UDP_IP4_RT_FWD; + addr = IP4_DST; + } else { + t = bpf_fwd ? UDP_IP6 : UDP_IP6_RT_FWD; + addr = IP6_DST; + } + + dtimes = skel->bss->dtimes[t]; + errs = skel->bss->errs[t]; + + skel->bss->test = t; + test_inet_dtime(family, SOCK_DGRAM, addr, 0); + + ASSERT_EQ(dtimes[INGRESS_FWDNS_P100], 0, + dtime_cnt_str(t, INGRESS_FWDNS_P100)); + /* non mono delivery time is not forwarded */ + ASSERT_EQ(dtimes[INGRESS_FWDNS_P101], 0, + dtime_cnt_str(t, INGRESS_FWDNS_P100)); + for (i = EGRESS_FWDNS_P100; i < SET_DTIME; i++) + ASSERT_GT(dtimes[i], 0, dtime_cnt_str(t, i)); + + for (i = INGRESS_FWDNS_P100; i < __MAX_CNT; i++) + ASSERT_EQ(errs[i], 0, dtime_err_str(t, i)); +} + +static void test_tc_redirect_dtime(struct netns_setup_result *setup_result) +{ + struct test_tc_dtime *skel; + struct nstoken *nstoken; + int err; + + skel = test_tc_dtime__open(); + if (!ASSERT_OK_PTR(skel, "test_tc_dtime__open")) + return; + + skel->rodata->IFINDEX_SRC = setup_result->ifindex_veth_src_fwd; + skel->rodata->IFINDEX_DST = setup_result->ifindex_veth_dst_fwd; + + err = test_tc_dtime__load(skel); + if (!ASSERT_OK(err, "test_tc_dtime__load")) + goto done; + + if (netns_load_dtime_bpf(skel)) + goto done; + + nstoken = open_netns(NS_FWD); + if (!ASSERT_OK_PTR(nstoken, "setns fwd")) + goto done; + err = set_forwarding(false); + close_netns(nstoken); + if (!ASSERT_OK(err, "disable forwarding")) + goto done; + + test_tcp_clear_dtime(skel); + + test_tcp_dtime(skel, AF_INET, true); + test_tcp_dtime(skel, AF_INET6, true); + test_udp_dtime(skel, AF_INET, true); + test_udp_dtime(skel, AF_INET6, true); + + /* Test the kernel ip[6]_forward path instead + * of bpf_redirect_neigh(). + */ + nstoken = open_netns(NS_FWD); + if (!ASSERT_OK_PTR(nstoken, "setns fwd")) + goto done; + err = set_forwarding(true); + close_netns(nstoken); + if (!ASSERT_OK(err, "enable forwarding")) + goto done; + + test_tcp_dtime(skel, AF_INET, false); + test_tcp_dtime(skel, AF_INET6, false); + test_udp_dtime(skel, AF_INET, false); + test_udp_dtime(skel, AF_INET6, false); + +done: + test_tc_dtime__destroy(skel); +} + static void test_tc_redirect_neigh_fib(struct netns_setup_result *setup_result) { struct nstoken *nstoken = NULL; @@ -787,6 +1220,7 @@ static void *test_tc_redirect_run_tests(void *arg) RUN_TEST(tc_redirect_peer_l3); RUN_TEST(tc_redirect_neigh); RUN_TEST(tc_redirect_neigh_fib); + RUN_TEST(tc_redirect_dtime); return NULL; } diff --git a/tools/testing/selftests/bpf/progs/test_tc_dtime.c b/tools/testing/selftests/bpf/progs/test_tc_dtime.c new file mode 100644 index 000000000000..bbe3b158f061 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_tc_dtime.c @@ -0,0 +1,347 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* veth_src --- veth_src_fwd --- veth_det_fwd --- veth_dst + * | | + * ns_src | ns_fwd | ns_dst + * + * ns_src and ns_dst: ENDHOST namespace + * ns_fwd: Fowarding namespace + */ + +#define ctx_ptr(field) (void *)(long)(field) + +#define ip4_src __bpf_htonl(0xac100164) /* 172.16.1.100 */ +#define ip4_dst __bpf_htonl(0xac100264) /* 172.16.2.100 */ + +#define ip6_src { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, \ + 0x00, 0x01, 0xde, 0xad, 0xbe, 0xef, 0xca, 0xfe } +#define ip6_dst { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, \ + 0x00, 0x02, 0xde, 0xad, 0xbe, 0xef, 0xca, 0xfe } + +#define v6_equal(a, b) (a.s6_addr32[0] == b.s6_addr32[0] && \ + a.s6_addr32[1] == b.s6_addr32[1] && \ + a.s6_addr32[2] == b.s6_addr32[2] && \ + a.s6_addr32[3] == b.s6_addr32[3]) + +volatile const __u32 IFINDEX_SRC; +volatile const __u32 IFINDEX_DST; + +#define EGRESS_ENDHOST_MAGIC 0x0b9fbeef +#define INGRESS_FWDNS_MAGIC 0x1b9fbeef +#define EGRESS_FWDNS_MAGIC 0x2b9fbeef + +enum { + INGRESS_FWDNS_P100, + INGRESS_FWDNS_P101, + EGRESS_FWDNS_P100, + EGRESS_FWDNS_P101, + INGRESS_ENDHOST, + EGRESS_ENDHOST, + SET_DTIME, + __MAX_CNT, +}; + +enum { + TCP_IP6_CLEAR_DTIME, + TCP_IP4, + TCP_IP6, + UDP_IP4, + UDP_IP6, + TCP_IP4_RT_FWD, + TCP_IP6_RT_FWD, + UDP_IP4_RT_FWD, + UDP_IP6_RT_FWD, + UKN_TEST, + __NR_TESTS, +}; + +enum { + SRC_NS = 1, + DST_NS, +}; + +__u32 dtimes[__NR_TESTS][__MAX_CNT] = {}; +__u32 errs[__NR_TESTS][__MAX_CNT] = {}; +__u32 test = 0; + +static void inc_dtimes(__u32 idx) +{ + if (test < __NR_TESTS) + dtimes[test][idx]++; + else + dtimes[UKN_TEST][idx]++; +} + +static void inc_errs(__u32 idx) +{ + if (test < __NR_TESTS) + errs[test][idx]++; + else + errs[UKN_TEST][idx]++; +} + +static int skb_proto(int type) +{ + return type & 0xff; +} + +static int skb_ns(int type) +{ + return (type >> 8) & 0xff; +} + +static bool fwdns_clear_dtime(void) +{ + return test == TCP_IP6_CLEAR_DTIME; +} + +static bool bpf_fwd(void) +{ + return test < TCP_IP4_RT_FWD; +} + +/* -1: parse error: TC_ACT_SHOT + * 0: not testing traffic: TC_ACT_OK + * >0: first byte is the inet_proto, second byte has the netns + * of the sender + */ +static int skb_get_type(struct __sk_buff *skb) +{ + void *data_end = ctx_ptr(skb->data_end); + void *data = ctx_ptr(skb->data); + __u8 inet_proto = 0, ns = 0; + struct ipv6hdr *ip6h; + struct iphdr *iph; + + switch (skb->protocol) { + case __bpf_htons(ETH_P_IP): + iph = data + sizeof(struct ethhdr); + if (iph + 1 > data_end) + return -1; + if (iph->saddr == ip4_src) + ns = SRC_NS; + else if (iph->saddr == ip4_dst) + ns = DST_NS; + inet_proto = iph->protocol; + break; + case __bpf_htons(ETH_P_IPV6): + ip6h = data + sizeof(struct ethhdr); + if (ip6h + 1 > data_end) + return -1; + if (v6_equal(ip6h->saddr, (struct in6_addr)ip6_src)) + ns = SRC_NS; + else if (v6_equal(ip6h->saddr, (struct in6_addr)ip6_dst)) + ns = DST_NS; + inet_proto = ip6h->nexthdr; + break; + default: + return 0; + } + + if ((inet_proto != IPPROTO_TCP && inet_proto != IPPROTO_UDP) || !ns) + return 0; + + return (ns << 8 | inet_proto); +} + +/* format: direction@iface@netns + * egress@veth_(src|dst)@ns_(src|dst) + */ +SEC("tc") +int egress_host(struct __sk_buff *skb) +{ + int skb_type; + + skb_type = skb_get_type(skb); + if (skb_type == -1) + return TC_ACT_SHOT; + if (!skb_type) + return TC_ACT_OK; + + if (skb_proto(skb_type) == IPPROTO_TCP) { + if (skb->delivery_time_type == BPF_SKB_DELIVERY_TIME_MONO && + skb->tstamp) + inc_dtimes(EGRESS_ENDHOST); + else + inc_errs(EGRESS_ENDHOST); + } else { + if (skb->delivery_time_type == BPF_SKB_DELIVERY_TIME_UNSPEC && + skb->tstamp) + inc_dtimes(EGRESS_ENDHOST); + else + inc_errs(EGRESS_ENDHOST); + } + + skb->tstamp = EGRESS_ENDHOST_MAGIC; + + return TC_ACT_OK; +} + +/* ingress@veth_(src|dst)@ns_(src|dst) */ +SEC("tc") +int ingress_host(struct __sk_buff *skb) +{ + int skb_type; + + skb_type = skb_get_type(skb); + if (skb_type == -1) + return TC_ACT_SHOT; + if (!skb_type) + return TC_ACT_OK; + + if (skb->delivery_time_type == BPF_SKB_DELIVERY_TIME_MONO && + skb->tstamp == EGRESS_FWDNS_MAGIC) + inc_dtimes(INGRESS_ENDHOST); + else + inc_errs(INGRESS_ENDHOST); + + return TC_ACT_OK; +} + +/* ingress@veth_(src|dst)_fwd@ns_fwd priority 100 */ +SEC("tc") +int ingress_fwdns_prio100(struct __sk_buff *skb) +{ + int skb_type; + + skb_type = skb_get_type(skb); + if (skb_type == -1) + return TC_ACT_SHOT; + if (!skb_type) + return TC_ACT_OK; + + /* delivery_time is only available to the ingress + * if the tc-bpf checks the skb->delivery_time_type. + */ + if (skb->tstamp == EGRESS_ENDHOST_MAGIC) + inc_errs(INGRESS_FWDNS_P100); + + if (fwdns_clear_dtime()) + skb->tstamp = 0; + + return TC_ACT_UNSPEC; +} + +/* egress@veth_(src|dst)_fwd@ns_fwd priority 100 */ +SEC("tc") +int egress_fwdns_prio100(struct __sk_buff *skb) +{ + int skb_type; + + skb_type = skb_get_type(skb); + if (skb_type == -1) + return TC_ACT_SHOT; + if (!skb_type) + return TC_ACT_OK; + + /* delivery_time is always available to egress even + * the tc-bpf did not use the delivery_time_type. + */ + if (skb->tstamp == INGRESS_FWDNS_MAGIC) + inc_dtimes(EGRESS_FWDNS_P100); + else + inc_errs(EGRESS_FWDNS_P100); + + if (fwdns_clear_dtime()) + skb->tstamp = 0; + + return TC_ACT_UNSPEC; +} + +/* ingress@veth_(src|dst)_fwd@ns_fwd priority 101 */ +SEC("tc") +int ingress_fwdns_prio101(struct __sk_buff *skb) +{ + __u64 expected_dtime = EGRESS_ENDHOST_MAGIC; + int skb_type; + + skb_type = skb_get_type(skb); + if (skb_type == -1 || !skb_type) + /* Should have handled in prio100 */ + return TC_ACT_SHOT; + + if (skb_proto(skb_type) == IPPROTO_UDP) + expected_dtime = 0; + + if (skb->delivery_time_type) { + if (fwdns_clear_dtime() || + skb->delivery_time_type != BPF_SKB_DELIVERY_TIME_MONO || + skb->tstamp != expected_dtime) + inc_errs(INGRESS_FWDNS_P101); + else + inc_dtimes(INGRESS_FWDNS_P101); + } else { + if (!fwdns_clear_dtime() && expected_dtime) + inc_errs(INGRESS_FWDNS_P101); + } + + if (skb->delivery_time_type == BPF_SKB_DELIVERY_TIME_MONO) { + skb->tstamp = INGRESS_FWDNS_MAGIC; + } else { + if (bpf_skb_set_delivery_time(skb, INGRESS_FWDNS_MAGIC, + BPF_SKB_DELIVERY_TIME_MONO)) + inc_errs(SET_DTIME); + if (!bpf_skb_set_delivery_time(skb, INGRESS_FWDNS_MAGIC, + BPF_SKB_DELIVERY_TIME_UNSPEC)) + inc_errs(SET_DTIME); + } + + if (skb_ns(skb_type) == SRC_NS) + return bpf_fwd() ? + bpf_redirect_neigh(IFINDEX_DST, NULL, 0, 0) : TC_ACT_OK; + else + return bpf_fwd() ? + bpf_redirect_neigh(IFINDEX_SRC, NULL, 0, 0) : TC_ACT_OK; +} + +/* egress@veth_(src|dst)_fwd@ns_fwd priority 101 */ +SEC("tc") +int egress_fwdns_prio101(struct __sk_buff *skb) +{ + int skb_type; + + skb_type = skb_get_type(skb); + if (skb_type == -1 || !skb_type) + /* Should have handled in prio100 */ + return TC_ACT_SHOT; + + if (skb->delivery_time_type) { + if (fwdns_clear_dtime() || + skb->delivery_time_type != BPF_SKB_DELIVERY_TIME_MONO || + skb->tstamp != INGRESS_FWDNS_MAGIC) + inc_errs(EGRESS_FWDNS_P101); + else + inc_dtimes(EGRESS_FWDNS_P101); + } else { + if (!fwdns_clear_dtime()) + inc_errs(EGRESS_FWDNS_P101); + } + + if (skb->delivery_time_type == BPF_SKB_DELIVERY_TIME_MONO) { + skb->tstamp = EGRESS_FWDNS_MAGIC; + } else { + if (bpf_skb_set_delivery_time(skb, EGRESS_FWDNS_MAGIC, + BPF_SKB_DELIVERY_TIME_MONO)) + inc_errs(SET_DTIME); + if (!bpf_skb_set_delivery_time(skb, EGRESS_FWDNS_MAGIC, + BPF_SKB_DELIVERY_TIME_UNSPEC)) + inc_errs(SET_DTIME); + } + + return TC_ACT_OK; +} + +char __license[] SEC("license") = "GPL";