From patchwork Fri Jan 21 07:30:32 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12719377 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id BF958C433F5 for ; Fri, 21 Jan 2022 07:30:43 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1378988AbiAUHam (ORCPT ); Fri, 21 Jan 2022 02:30:42 -0500 Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:56022 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1378984AbiAUHal (ORCPT ); Fri, 21 Jan 2022 02:30:41 -0500 Received: from pps.filterd (m0044010.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.1.2/8.16.1.2) with ESMTP id 20L06hpG010918 for ; Thu, 20 Jan 2022 23:30:41 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : content-type : content-transfer-encoding : mime-version; s=facebook; bh=zGcLKfg+7a+bE612doocqIiqggKz3lrLWiGk1EO+Ufs=; b=Yf0DXlF8nDAih2OzArtZgPLZbxiMORyGcz9XGFmCIQlZ0Ti5UukMaJ8cytBalUQSq7NV AGeTflGrtdx38MLuOP+Aod5mQacSzLR9BUS2E/+uL/WsTfKkpiyT9KM/gnud+dIu2qzd lqZ/yqWr5LIWXi0RVacehFGOtN2t1zp0a9I= Received: from mail.thefacebook.com ([163.114.132.120]) by mx0a-00082601.pphosted.com (PPS) with ESMTPS id 3dqj0ghnc1-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Thu, 20 Jan 2022 23:30:41 -0800 Received: from twshared7634.08.ash8.facebook.com (2620:10d:c085:208::11) by mail.thefacebook.com (2620:10d:c085:21d::4) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2308.21; Thu, 20 Jan 2022 23:30:40 -0800 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 745C55BDEC00; Thu, 20 Jan 2022 23:30:32 -0800 (PST) From: Martin KaFai Lau To: , CC: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , David Miller , Eric Dumazet , Jakub Kicinski , , Willem de Bruijn Subject: [RFC PATCH v3 net-next 1/4] net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp Date: Thu, 20 Jan 2022 23:30:32 -0800 Message-ID: <20220121073032.4176877-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220121073026.4173996-1-kafai@fb.com> References: <20220121073026.4173996-1-kafai@fb.com> X-FB-Internal: Safe X-Proofpoint-ORIG-GUID: H1qyM9wsG3Si0Q6akOLo8Aryq5wvXvKx X-Proofpoint-GUID: H1qyM9wsG3Si0Q6akOLo8Aryq5wvXvKx X-Proofpoint-UnRewURL: 0 URL was un-rewritten MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.816,Hydra:6.0.425,FMLib:17.11.62.513 definitions=2022-01-21_02,2022-01-20_01,2021-12-02_01 X-Proofpoint-Spam-Details: rule=fb_outbound_notspam policy=fb_outbound score=0 mlxscore=0 adultscore=0 clxscore=1015 bulkscore=0 impostorscore=0 spamscore=0 phishscore=0 suspectscore=0 mlxlogscore=999 lowpriorityscore=0 malwarescore=0 priorityscore=1501 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2201110000 definitions=main-2201210047 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org X-Patchwork-State: RFC skb->tstamp was first used as the (rcv) timestamp in real time clock base. The major usage is to report it to the user (e.g. SO_TIMESTAMP). Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP) during egress and used by the qdisc (e.g. sch_fq) to make decision on when the skb can be passed to the dev. Currently, there is no way to tell skb->tstamp having the (rcv) timestamp or the delivery_time, so it is always reset to 0 whenever forwarded between egress and ingress. While it makes sense to always clear the (rcv) timestamp in skb->tstamp to avoid confusing sch_fq that expects the delivery_time, it is a performance issue [0] to clear the delivery_time if the skb finally egress to a fq@phy-dev. For example, when forwarding from egress to ingress and then finally back to egress: tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns ^ ^ reset rest [0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf This patch adds one bit skb->mono_delivery_time to flag the skb->tstamp is storing the mono delivery_time instead of the (rcv) timestamp. The current use case is to keep the TCP mono delivery_time (EDT) and to be used with sch_fq. The later patch will also allow tc-bpf to read and change the mono delivery_time. In the future, another bit (e.g. skb->user_delivery_time) can be added for the SCM_TXTIME where the clock base is tracked by sk->sk_clockid. [ This patch is a prep work. The following patch will get the other parts of the stack ready first. Then another patch after that will finally set the skb->mono_delivery_time. ] skb_set_delivery_time() function is added. It is used by the tcp_output.c and during ip[6] fragmentation to assign the delivery_time to the skb->tstamp and also set the skb->mono_delivery_time. Signed-off-by: Martin KaFai Lau --- include/linux/skbuff.h | 13 +++++++++++++ net/bridge/netfilter/nf_conntrack_bridge.c | 5 +++-- net/ipv4/ip_output.c | 5 +++-- net/ipv4/tcp_output.c | 16 +++++++++------- net/ipv6/ip6_output.c | 5 +++-- net/ipv6/netfilter.c | 5 +++-- 6 files changed, 34 insertions(+), 15 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index bf11e1fbd69b..b9e20187242a 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -720,6 +720,10 @@ typedef unsigned char *sk_buff_data_t; * @dst_pending_confirm: need to confirm neighbour * @decrypted: Decrypted SKB * @slow_gro: state present at GRO time, slower prepare step required + * @mono_delivery_time: When set, skb->tstamap has the + * delivery_time in mono clock base (i.e. EDT). Otherwise, the + * skb->tstamp has the (rcv) timestamp at ingress and + * delivery_time at egress. * @napi_id: id of the NAPI struct this skb came from * @sender_cpu: (aka @napi_id) source CPU in XPS * @secmark: security marking @@ -890,6 +894,7 @@ struct sk_buff { __u8 decrypted:1; #endif __u8 slow_gro:1; + __u8 mono_delivery_time:1; #ifdef CONFIG_NET_SCHED __u16 tc_index; /* traffic control index */ @@ -3903,6 +3908,14 @@ static inline ktime_t net_invalid_timestamp(void) return 0; } +static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt, + bool mono) +{ + skb->tstamp = kt; + /* Setting mono_delivery_time will be enabled later */ + /* skb->mono_delivery_time = kt && mono; */ +} + static inline u8 skb_metadata_len(const struct sk_buff *skb) { return skb_shinfo(skb)->meta_len; diff --git a/net/bridge/netfilter/nf_conntrack_bridge.c b/net/bridge/netfilter/nf_conntrack_bridge.c index fdbed3158555..ebfb2a5c59e4 100644 --- a/net/bridge/netfilter/nf_conntrack_bridge.c +++ b/net/bridge/netfilter/nf_conntrack_bridge.c @@ -32,6 +32,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk, struct sk_buff *)) { int frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size; + bool mono_delivery_time = skb->mono_delivery_time; unsigned int hlen, ll_rs, mtu; ktime_t tstamp = skb->tstamp; struct ip_frag_state state; @@ -81,7 +82,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk, if (iter.frag) ip_fraglist_prepare(skb, &iter); - skb->tstamp = tstamp; + skb_set_delivery_time(skb, tstamp, mono_delivery_time); err = output(net, sk, data, skb); if (err || !iter.frag) break; @@ -112,7 +113,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk, goto blackhole; } - skb2->tstamp = tstamp; + skb_set_delivery_time(skb2, tstamp, mono_delivery_time); err = output(net, sk, data, skb2); if (err) goto blackhole; diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 57c1d8431386..6ba08d9bdf8e 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -754,6 +754,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, { struct iphdr *iph; struct sk_buff *skb2; + bool mono_delivery_time = skb->mono_delivery_time; struct rtable *rt = skb_rtable(skb); unsigned int mtu, hlen, ll_rs; struct ip_fraglist_iter iter; @@ -836,7 +837,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, ip_fraglist_prepare(skb, &iter); } - skb->tstamp = tstamp; + skb_set_delivery_time(skb, tstamp, mono_delivery_time); err = output(net, sk, skb); if (!err) @@ -892,7 +893,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, /* * Put this fragment into the sending queue. */ - skb2->tstamp = tstamp; + skb_set_delivery_time(skb2, tstamp, mono_delivery_time); err = output(net, sk, skb2); if (err) goto fail; diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 5079832af5c1..261e9584f81e 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1253,7 +1253,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, tp = tcp_sk(sk); prior_wstamp = tp->tcp_wstamp_ns; tp->tcp_wstamp_ns = max(tp->tcp_wstamp_ns, tp->tcp_clock_cache); - skb->skb_mstamp_ns = tp->tcp_wstamp_ns; + skb_set_delivery_time(skb, tp->tcp_wstamp_ns, true); if (clone_it) { oskb = skb; @@ -1589,7 +1589,7 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue, skb_split(skb, buff, len); - buff->tstamp = skb->tstamp; + skb_set_delivery_time(buff, skb->tstamp, true); tcp_fragment_tstamp(skb, buff); old_factor = tcp_skb_pcount(skb); @@ -2616,7 +2616,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle, if (unlikely(tp->repair) && tp->repair_queue == TCP_SEND_QUEUE) { /* "skb_mstamp_ns" is used as a start point for the retransmit timer */ - skb->skb_mstamp_ns = tp->tcp_wstamp_ns = tp->tcp_clock_cache; + tp->tcp_wstamp_ns = tp->tcp_clock_cache; + skb_set_delivery_time(skb, tp->tcp_wstamp_ns, true); list_move_tail(&skb->tcp_tsorted_anchor, &tp->tsorted_sent_queue); tcp_init_tso_segs(skb, mss_now); goto repair; /* Skip network transmission */ @@ -3541,11 +3542,12 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst, now = tcp_clock_ns(); #ifdef CONFIG_SYN_COOKIES if (unlikely(synack_type == TCP_SYNACK_COOKIE && ireq->tstamp_ok)) - skb->skb_mstamp_ns = cookie_init_timestamp(req, now); + skb_set_delivery_time(skb, cookie_init_timestamp(req, now), + true); else #endif { - skb->skb_mstamp_ns = now; + skb_set_delivery_time(skb, now, true); if (!tcp_rsk(req)->snt_synack) /* Timestamp first SYNACK */ tcp_rsk(req)->snt_synack = tcp_skb_timestamp_us(skb); } @@ -3594,7 +3596,7 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst, bpf_skops_write_hdr_opt((struct sock *)sk, skb, req, syn_skb, synack_type, &opts); - skb->skb_mstamp_ns = now; + skb_set_delivery_time(skb, now, true); tcp_add_tx_delay(skb, tp); return skb; @@ -3771,7 +3773,7 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn) err = tcp_transmit_skb(sk, syn_data, 1, sk->sk_allocation); - syn->skb_mstamp_ns = syn_data->skb_mstamp_ns; + skb_set_delivery_time(syn, syn_data->skb_mstamp_ns, true); /* Now full SYN+DATA was cloned and sent (or not), * remove the SYN from the original skb (syn_data) diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 2995f8d89e7e..4b873413edc2 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -813,6 +813,7 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, struct rt6_info *rt = (struct rt6_info *)skb_dst(skb); struct ipv6_pinfo *np = skb->sk && !dev_recursion_level() ? inet6_sk(skb->sk) : NULL; + bool mono_delivery_time = skb->mono_delivery_time; struct ip6_frag_state state; unsigned int mtu, hlen, nexthdr_offset; ktime_t tstamp = skb->tstamp; @@ -903,7 +904,7 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, if (iter.frag) ip6_fraglist_prepare(skb, &iter); - skb->tstamp = tstamp; + skb_set_delivery_time(skb, tstamp, mono_delivery_time); err = output(net, sk, skb); if (!err) IP6_INC_STATS(net, ip6_dst_idev(&rt->dst), @@ -962,7 +963,7 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, /* * Put this fragment into the sending queue. */ - frag->tstamp = tstamp; + skb_set_delivery_time(frag, tstamp, mono_delivery_time); err = output(net, sk, frag); if (err) goto fail; diff --git a/net/ipv6/netfilter.c b/net/ipv6/netfilter.c index 6ab710b5a1a8..1da332450d98 100644 --- a/net/ipv6/netfilter.c +++ b/net/ipv6/netfilter.c @@ -121,6 +121,7 @@ int br_ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, struct sk_buff *)) { int frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size; + bool mono_delivery_time = skb->mono_delivery_time; ktime_t tstamp = skb->tstamp; struct ip6_frag_state state; u8 *prevhdr, nexthdr = 0; @@ -186,7 +187,7 @@ int br_ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, if (iter.frag) ip6_fraglist_prepare(skb, &iter); - skb->tstamp = tstamp; + skb_set_delivery_time(skb, tstamp, mono_delivery_time); err = output(net, sk, data, skb); if (err || !iter.frag) break; @@ -219,7 +220,7 @@ int br_ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, goto blackhole; } - skb2->tstamp = tstamp; + skb_set_delivery_time(skb2, tstamp, mono_delivery_time); err = output(net, sk, data, skb2); if (err) goto blackhole; From patchwork Fri Jan 21 07:30:38 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12719378 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 361B4C433EF for ; Fri, 21 Jan 2022 07:30:51 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1378995AbiAUHau (ORCPT ); Fri, 21 Jan 2022 02:30:50 -0500 Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:5132 "EHLO mx0b-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1378983AbiAUHat (ORCPT ); Fri, 21 Jan 2022 02:30:49 -0500 Received: from pps.filterd (m0109332.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.1.2/8.16.1.2) with ESMTP id 20L05BtS019552 for ; Thu, 20 Jan 2022 23:30:48 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=dCqdu9s+X+WtVUJZK56g6V/cYpOVw7ifi2FIjwnEeXY=; b=G//Uj+PDLPxy3EyZnXBwoZeLRHQTvgyEQq1ZAVeBv42mH59987DZmFmoDEp15Xre0ZUG 83Oobbg24Wg022ohQJRYzCoyOMKZcHQbeA05KEqa5xZQfdZgrWaVck5v0RGliM0gHrSB sM8Y/nshQR8VWx6n5FrGdak3qbKLPg/aBOA= Received: from mail.thefacebook.com ([163.114.132.120]) by mx0a-00082601.pphosted.com (PPS) with ESMTPS id 3dqhyvsn56-3 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Thu, 20 Jan 2022 23:30:48 -0800 Received: from twshared3205.02.ash9.facebook.com (2620:10d:c085:208::11) by mail.thefacebook.com (2620:10d:c085:21d::5) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2308.21; Thu, 20 Jan 2022 23:30:45 -0800 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id C53495BDEC3B; Thu, 20 Jan 2022 23:30:38 -0800 (PST) From: Martin KaFai Lau To: , CC: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , David Miller , Eric Dumazet , Jakub Kicinski , , Willem de Bruijn Subject: [RFC PATCH v3 net-next 2/4] net: Add skb_clear_tstamp() to keep the mono delivery_time Date: Thu, 20 Jan 2022 23:30:38 -0800 Message-ID: <20220121073038.4178331-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220121073026.4173996-1-kafai@fb.com> References: <20220121073026.4173996-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-ORIG-GUID: 2uv_OzutxHpTLYrVdbOhAawyxDo-x5nH X-Proofpoint-GUID: 2uv_OzutxHpTLYrVdbOhAawyxDo-x5nH X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.816,Hydra:6.0.425,FMLib:17.11.62.513 definitions=2022-01-21_02,2022-01-20_01,2021-12-02_01 X-Proofpoint-Spam-Details: rule=fb_outbound_notspam policy=fb_outbound score=0 bulkscore=0 phishscore=0 lowpriorityscore=0 priorityscore=1501 adultscore=0 suspectscore=0 spamscore=0 impostorscore=0 clxscore=1015 mlxscore=0 malwarescore=0 mlxlogscore=999 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2201110000 definitions=main-2201210047 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org X-Patchwork-State: RFC Right now, skb->tstamp is reset to 0 whenever the skb is forwarded. If skb->tstamp has the mono delivery_time, clearing it can hurt the performance when it finally transmits out to fq@phy-dev. The earlier patch added a skb->mono_delivery_time bit to flag the skb->tstamp carrying the mono delivery_time. This patch adds skb_clear_tstamp() helper which keeps the mono delivery_time and clear everything else. The delivery_time clearing is postponed until the stack knows the skb will be delivered locally. The postponed delivery_time clearing will be done together in a later patch when the skb->mono_delivery_time bit will finally be turned on. Signed-off-by: Martin KaFai Lau --- drivers/net/loopback.c | 2 +- include/linux/skbuff.h | 10 +++++++++- net/bridge/br_forward.c | 2 +- net/core/filter.c | 6 +++--- net/core/skbuff.c | 2 +- net/ipv4/ip_forward.c | 2 +- net/ipv6/ip6_output.c | 2 +- net/netfilter/ipvs/ip_vs_xmit.c | 6 +++--- net/netfilter/nf_dup_netdev.c | 2 +- net/netfilter/nf_flow_table_ip.c | 4 ++-- net/netfilter/nft_fwd_netdev.c | 2 +- net/openvswitch/vport.c | 2 +- net/xfrm/xfrm_interface.c | 2 +- 13 files changed, 26 insertions(+), 18 deletions(-) diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c index ed0edf5884ef..70a38fa09299 100644 --- a/drivers/net/loopback.c +++ b/drivers/net/loopback.c @@ -74,7 +74,7 @@ static netdev_tx_t loopback_xmit(struct sk_buff *skb, skb_tx_timestamp(skb); /* do not fool net_timestamp_check() with various clock bases */ - skb->tstamp = 0; + skb_clear_tstamp(skb); skb_orphan(skb); diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index b9e20187242a..8de555513b94 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -3916,6 +3916,14 @@ static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt, /* skb->mono_delivery_time = kt && mono; */ } +static inline void skb_clear_tstamp(struct sk_buff *skb) +{ + if (skb->mono_delivery_time) + return; + + skb->tstamp = 0; +} + static inline u8 skb_metadata_len(const struct sk_buff *skb) { return skb_shinfo(skb)->meta_len; @@ -4772,7 +4780,7 @@ static inline void skb_set_redirected(struct sk_buff *skb, bool from_ingress) #ifdef CONFIG_NET_REDIRECT skb->from_ingress = from_ingress; if (skb->from_ingress) - skb->tstamp = 0; + skb_clear_tstamp(skb); #endif } diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c index ec646656dbf1..02bb620d3b8d 100644 --- a/net/bridge/br_forward.c +++ b/net/bridge/br_forward.c @@ -62,7 +62,7 @@ EXPORT_SYMBOL_GPL(br_dev_queue_push_xmit); int br_forward_finish(struct net *net, struct sock *sk, struct sk_buff *skb) { - skb->tstamp = 0; + skb_clear_tstamp(skb); return NF_HOOK(NFPROTO_BRIDGE, NF_BR_POST_ROUTING, net, sk, skb, NULL, skb->dev, br_dev_queue_push_xmit); diff --git a/net/core/filter.c b/net/core/filter.c index 4603b7cd3cd1..4fc53d645a01 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2107,7 +2107,7 @@ static inline int __bpf_tx_skb(struct net_device *dev, struct sk_buff *skb) } skb->dev = dev; - skb->tstamp = 0; + skb_clear_tstamp(skb); dev_xmit_recursion_inc(); ret = dev_queue_xmit(skb); @@ -2176,7 +2176,7 @@ static int bpf_out_neigh_v6(struct net *net, struct sk_buff *skb, } skb->dev = dev; - skb->tstamp = 0; + skb_clear_tstamp(skb); if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) { skb = skb_expand_head(skb, hh_len); @@ -2274,7 +2274,7 @@ static int bpf_out_neigh_v4(struct net *net, struct sk_buff *skb, } skb->dev = dev; - skb->tstamp = 0; + skb_clear_tstamp(skb); if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) { skb = skb_expand_head(skb, hh_len); diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 0118f0afaa4f..3e3da8fdf8f5 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -5350,7 +5350,7 @@ void skb_scrub_packet(struct sk_buff *skb, bool xnet) ipvs_reset(skb); skb->mark = 0; - skb->tstamp = 0; + skb_clear_tstamp(skb); } EXPORT_SYMBOL_GPL(skb_scrub_packet); diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c index 00ec819f949b..92ba3350274b 100644 --- a/net/ipv4/ip_forward.c +++ b/net/ipv4/ip_forward.c @@ -79,7 +79,7 @@ static int ip_forward_finish(struct net *net, struct sock *sk, struct sk_buff *s if (unlikely(opt->optlen)) ip_forward_options(skb); - skb->tstamp = 0; + skb_clear_tstamp(skb); return dst_output(net, sk, skb); } diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 4b873413edc2..0875ab61eb95 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -440,7 +440,7 @@ static inline int ip6_forward_finish(struct net *net, struct sock *sk, } #endif - skb->tstamp = 0; + skb_clear_tstamp(skb); return dst_output(net, sk, skb); } diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c index d2e5a8f644b8..029171379884 100644 --- a/net/netfilter/ipvs/ip_vs_xmit.c +++ b/net/netfilter/ipvs/ip_vs_xmit.c @@ -610,7 +610,7 @@ static inline int ip_vs_tunnel_xmit_prepare(struct sk_buff *skb, nf_reset_ct(skb); skb_forward_csum(skb); if (skb->dev) - skb->tstamp = 0; + skb_clear_tstamp(skb); } return ret; } @@ -652,7 +652,7 @@ static inline int ip_vs_nat_send_or_cont(int pf, struct sk_buff *skb, if (!local) { skb_forward_csum(skb); if (skb->dev) - skb->tstamp = 0; + skb_clear_tstamp(skb); NF_HOOK(pf, NF_INET_LOCAL_OUT, cp->ipvs->net, NULL, skb, NULL, skb_dst(skb)->dev, dst_output); } else @@ -674,7 +674,7 @@ static inline int ip_vs_send_or_cont(int pf, struct sk_buff *skb, ip_vs_drop_early_demux_sk(skb); skb_forward_csum(skb); if (skb->dev) - skb->tstamp = 0; + skb_clear_tstamp(skb); NF_HOOK(pf, NF_INET_LOCAL_OUT, cp->ipvs->net, NULL, skb, NULL, skb_dst(skb)->dev, dst_output); } else diff --git a/net/netfilter/nf_dup_netdev.c b/net/netfilter/nf_dup_netdev.c index a579e59ee5c5..7873bd1389c3 100644 --- a/net/netfilter/nf_dup_netdev.c +++ b/net/netfilter/nf_dup_netdev.c @@ -19,7 +19,7 @@ static void nf_do_netdev_egress(struct sk_buff *skb, struct net_device *dev) skb_push(skb, skb->mac_len); skb->dev = dev; - skb->tstamp = 0; + skb_clear_tstamp(skb); dev_queue_xmit(skb); } diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c index 889cf88d3dba..f1d387129f02 100644 --- a/net/netfilter/nf_flow_table_ip.c +++ b/net/netfilter/nf_flow_table_ip.c @@ -376,7 +376,7 @@ nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb, nf_flow_nat_ip(flow, skb, thoff, dir, iph); ip_decrease_ttl(iph); - skb->tstamp = 0; + skb_clear_tstamp(skb); if (flow_table->flags & NF_FLOWTABLE_COUNTER) nf_ct_acct_update(flow->ct, tuplehash->tuple.dir, skb->len); @@ -611,7 +611,7 @@ nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb, nf_flow_nat_ipv6(flow, skb, dir, ip6h); ip6h->hop_limit--; - skb->tstamp = 0; + skb_clear_tstamp(skb); if (flow_table->flags & NF_FLOWTABLE_COUNTER) nf_ct_acct_update(flow->ct, tuplehash->tuple.dir, skb->len); diff --git a/net/netfilter/nft_fwd_netdev.c b/net/netfilter/nft_fwd_netdev.c index fa9301ca6033..4b2b0946c0b6 100644 --- a/net/netfilter/nft_fwd_netdev.c +++ b/net/netfilter/nft_fwd_netdev.c @@ -140,7 +140,7 @@ static void nft_fwd_neigh_eval(const struct nft_expr *expr, return; skb->dev = dev; - skb->tstamp = 0; + skb_clear_tstamp(skb); neigh_xmit(neigh_table, dev, addr, skb); out: regs->verdict.code = verdict; diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c index cf2ce5812489..82a74f998966 100644 --- a/net/openvswitch/vport.c +++ b/net/openvswitch/vport.c @@ -507,7 +507,7 @@ void ovs_vport_send(struct vport *vport, struct sk_buff *skb, u8 mac_proto) } skb->dev = vport->dev; - skb->tstamp = 0; + skb_clear_tstamp(skb); vport->ops->send(skb); return; diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c index 57448fc519fc..4991e99ced9a 100644 --- a/net/xfrm/xfrm_interface.c +++ b/net/xfrm/xfrm_interface.c @@ -190,7 +190,7 @@ static void xfrmi_dev_uninit(struct net_device *dev) static void xfrmi_scrub_packet(struct sk_buff *skb, bool xnet) { - skb->tstamp = 0; + skb_clear_tstamp(skb); skb->pkt_type = PACKET_HOST; skb->skb_iif = 0; skb->ignore_df = 0; From patchwork Fri Jan 21 07:30:45 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12719379 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E70C7C433F5 for ; Fri, 21 Jan 2022 07:30:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1378984AbiAUHay (ORCPT ); Fri, 21 Jan 2022 02:30:54 -0500 Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:11822 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1378996AbiAUHax (ORCPT ); Fri, 21 Jan 2022 02:30:53 -0500 Received: from pps.filterd (m0044010.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.1.2/8.16.1.2) with ESMTP id 20L06frF010885 for ; Thu, 20 Jan 2022 23:30:53 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=ZjLc9/FhMWqSpoW30pSI3qdHvYoyCOCVI3xsBFk/tvU=; b=JRMX6SzZ/CxLdz7hFQuYaj8PGoFHbQf7YI/uG1xpk0OeT9DswddncYdBw7umBaP1q7GY Lgi4F1/wG+lf69UVHFzrRf1sIfyeynFsMXwGwPiE46oeNaGAzUw1tlv+mW30f8WRHUWy arIhKEF4O0A9moZOOBSamfR3UTCB8v6a2HA= Received: from mail.thefacebook.com ([163.114.132.120]) by mx0a-00082601.pphosted.com (PPS) with ESMTPS id 3dqj0ghncy-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Thu, 20 Jan 2022 23:30:53 -0800 Received: from twshared3205.02.ash9.facebook.com (2620:10d:c085:208::11) by mail.thefacebook.com (2620:10d:c085:21d::6) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2308.21; Thu, 20 Jan 2022 23:30:52 -0800 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 1F3615BDEC49; Thu, 20 Jan 2022 23:30:45 -0800 (PST) From: Martin KaFai Lau To: , CC: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , David Miller , Eric Dumazet , Jakub Kicinski , , Willem de Bruijn Subject: [RFC PATCH v3 net-next 3/4] net: Set skb->mono_delivery_time and clear it when delivering locally Date: Thu, 20 Jan 2022 23:30:45 -0800 Message-ID: <20220121073045.4179438-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220121073026.4173996-1-kafai@fb.com> References: <20220121073026.4173996-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-ORIG-GUID: lS2YRcsb6841knl9KA-6XQXUibBOOC5x X-Proofpoint-GUID: lS2YRcsb6841knl9KA-6XQXUibBOOC5x X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.816,Hydra:6.0.425,FMLib:17.11.62.513 definitions=2022-01-21_02,2022-01-20_01,2021-12-02_01 X-Proofpoint-Spam-Details: rule=fb_outbound_notspam policy=fb_outbound score=0 mlxscore=0 adultscore=0 clxscore=1015 bulkscore=0 impostorscore=0 spamscore=0 phishscore=0 suspectscore=0 mlxlogscore=999 lowpriorityscore=0 malwarescore=0 priorityscore=1501 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2201110000 definitions=main-2201210047 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org X-Patchwork-State: RFC This patch sets the skb->mono_delivery_time to flag the skb->tstamp is used as the mono delivery_time (EDT) instead of the (rcv) timestamp. skb_clear_delivery_time() is added to clear the delivery_time and set back to the (rcv) timestamp if needed when the skb is being delivered locally (to a sk). skb_clear_delivery_time() is called in ip_local_deliver() and ip6_input(). In most of the regular ingress cases, the skb->tstamp should already have the (rcv) timestamp. For the egress loop back to ingress cases, the marking of the (rcv) timestamp is postponed from dev.c to ip_local_deliver() and ip6_input(). Another case needs to clear the delivery_time is the network tapping (e.g. af_packet by tcpdump). Regardless of tapping at the ingress or egress, the tapped skb is received by the af_packet socket, so it is ingress to the af_packet socket and it expects the (rcv) timestamp. When tapping at egress, dev_queue_xmit_nit() is used. It has already expected skb->tstamp may have delivery_time, so it does skb_clone()+net_timestamp_set() to ensure the cloned skb has the (rcv) timestamp before passing to the af_packet sk. This patch only adds to clear the skb->mono_delivery_time bit in net_timestamp_set(). When tapping at ingress, it currently expects the skb->tstamp is either 0 or has the (rcv) timestamp. Meaning, the tapping at ingress path has already expected the skb->tstamp could be 0 and it will get the (rcv) timestamp by ktime_get_real() when needed. There are two cases for tapping at ingress: One case is af_packet queues the skb to its sk_receive_queue. The skb is either not shared or new clone created. The skb_clear_delivery_time() is called to clear the delivery_time (if any) before it is queued to the sk_receive_queue. Another case, the ingress skb is directly copied to the rx_ring and tpacket_get_timestamp() is used to get the (rcv) timestamp. skb_tstamp() is used in tpacket_get_timestamp() to check the skb->mono_delivery_time bit before returning skb->tstamp. As mentioned earlier, the tapping@ingress has already expected the skb may not have the (rcv) timestamp (because no sk has asked for it) and has handled this case by directly calling ktime_get_real(). In __skb_tstamp_tx, it clones the egress skb and queues the clone to the sk_error_queue. The outgoing skb may have the mono delivery_time while the (rcv) timestamp is expected for the clone, so the skb->mono_delivery_time bit is also cleared from the clone. Signed-off-by: Martin KaFai Lau --- include/linux/skbuff.h | 27 +++++++++++++++++++++++++-- net/core/dev.c | 4 +++- net/core/skbuff.c | 6 ++++-- net/ipv4/ip_input.c | 1 + net/ipv6/ip6_input.c | 1 + net/packet/af_packet.c | 4 +++- 6 files changed, 37 insertions(+), 6 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 8de555513b94..4677bb6c7279 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -3912,8 +3912,23 @@ static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt, bool mono) { skb->tstamp = kt; - /* Setting mono_delivery_time will be enabled later */ - /* skb->mono_delivery_time = kt && mono; */ + skb->mono_delivery_time = kt && mono; +} + +DECLARE_STATIC_KEY_FALSE(netstamp_needed_key); + +/* skb is delivering locally. If needed, set it to the (rcv) timestamp. + * Otherwise, clear the delivery time. + */ +static inline void skb_clear_delivery_time(struct sk_buff *skb) +{ + if (unlikely(skb->mono_delivery_time)) { + skb->mono_delivery_time = 0; + if (static_branch_unlikely(&netstamp_needed_key)) + skb->tstamp = ktime_get_real(); + else + skb->tstamp = 0; + } } static inline void skb_clear_tstamp(struct sk_buff *skb) @@ -3924,6 +3939,14 @@ static inline void skb_clear_tstamp(struct sk_buff *skb) skb->tstamp = 0; } +static inline ktime_t skb_tstamp(const struct sk_buff *skb) +{ + if (unlikely(skb->mono_delivery_time)) + return 0; + + return skb->tstamp; +} + static inline u8 skb_metadata_len(const struct sk_buff *skb) { return skb_shinfo(skb)->meta_len; diff --git a/net/core/dev.c b/net/core/dev.c index 84a0d9542fe9..b4b392f5ef9f 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2000,7 +2000,8 @@ void net_dec_egress_queue(void) EXPORT_SYMBOL_GPL(net_dec_egress_queue); #endif -static DEFINE_STATIC_KEY_FALSE(netstamp_needed_key); +DEFINE_STATIC_KEY_FALSE(netstamp_needed_key); +EXPORT_SYMBOL(netstamp_needed_key); #ifdef CONFIG_JUMP_LABEL static atomic_t netstamp_needed_deferred; static atomic_t netstamp_wanted; @@ -2061,6 +2062,7 @@ EXPORT_SYMBOL(net_disable_timestamp); static inline void net_timestamp_set(struct sk_buff *skb) { skb->tstamp = 0; + skb->mono_delivery_time = 0; if (static_branch_unlikely(&netstamp_needed_key)) __net_timestamp(skb); } diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 3e3da8fdf8f5..93dc763da8cb 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -4817,10 +4817,12 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb, skb_shinfo(skb)->tskey = skb_shinfo(orig_skb)->tskey; } - if (hwtstamps) + if (hwtstamps) { *skb_hwtstamps(skb) = *hwtstamps; - else + } else { skb->tstamp = ktime_get_real(); + skb->mono_delivery_time = 0; + } __skb_complete_tx_timestamp(skb, sk, tstype, opt_stats); } diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c index 3a025c011971..35311ca75496 100644 --- a/net/ipv4/ip_input.c +++ b/net/ipv4/ip_input.c @@ -244,6 +244,7 @@ int ip_local_deliver(struct sk_buff *skb) */ struct net *net = dev_net(skb->dev); + skb_clear_delivery_time(skb); if (ip_is_fragment(ip_hdr(skb))) { if (ip_defrag(net, skb, IP_DEFRAG_LOCAL_DELIVER)) return 0; diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c index 80256717868e..84f93864b774 100644 --- a/net/ipv6/ip6_input.c +++ b/net/ipv6/ip6_input.c @@ -469,6 +469,7 @@ static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *sk int ip6_input(struct sk_buff *skb) { + skb_clear_delivery_time(skb); return NF_HOOK(NFPROTO_IPV6, NF_INET_LOCAL_IN, dev_net(skb->dev), NULL, skb, skb->dev, NULL, ip6_input_finish); diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 5bd409ab4cc2..ab55adff3500 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -460,7 +460,7 @@ static __u32 tpacket_get_timestamp(struct sk_buff *skb, struct timespec64 *ts, return TP_STATUS_TS_RAW_HARDWARE; if ((flags & SOF_TIMESTAMPING_SOFTWARE) && - ktime_to_timespec64_cond(skb->tstamp, ts)) + ktime_to_timespec64_cond(skb_tstamp(skb), ts)) return TP_STATUS_TS_SOFTWARE; return 0; @@ -2195,6 +2195,7 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev, spin_lock(&sk->sk_receive_queue.lock); po->stats.stats1.tp_packets++; sock_skb_set_dropcount(sk, skb); + skb_clear_delivery_time(skb); __skb_queue_tail(&sk->sk_receive_queue, skb); spin_unlock(&sk->sk_receive_queue.lock); sk->sk_data_ready(sk); @@ -2373,6 +2374,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, po->stats.stats1.tp_packets++; if (copy_skb) { status |= TP_STATUS_COPY; + skb_clear_delivery_time(copy_skb); __skb_queue_tail(&sk->sk_receive_queue, copy_skb); } spin_unlock(&sk->sk_receive_queue.lock); From patchwork Fri Jan 21 07:30:51 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12719380 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C012CC433F5 for ; Fri, 21 Jan 2022 07:31:03 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1378997AbiAUHbC (ORCPT ); Fri, 21 Jan 2022 02:31:02 -0500 Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:34690 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1378998AbiAUHa5 (ORCPT ); Fri, 21 Jan 2022 02:30:57 -0500 Received: from pps.filterd (m0001303.ppops.net [127.0.0.1]) by m0001303.ppops.net (8.16.1.2/8.16.1.2) with ESMTP id 20L04Rsf005971 for ; Thu, 20 Jan 2022 23:30:56 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=Sg0ix8wtmyhIXwiOi+i3PRlqwkIUNiicrkJlSL/oteI=; b=YEa341z0tBsRQ+QttUHztehf7cALFTdflI3zoWwF18bbfPF//jvM1skO4sJ06Dab9hF7 YAVIbbA3UOQaqC679UOtx+coIOXNpxZ2voXp8ZCzoWoZ1jh2bX+g0Rxt9xbcmNzmnH+/ rZMH9/7pP2IkAwGgpcZDhHOXSapErHtQifI= Received: from maileast.thefacebook.com ([163.114.130.16]) by m0001303.ppops.net (PPS) with ESMTPS id 3dqhy4hmsp-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Thu, 20 Jan 2022 23:30:56 -0800 Received: from twshared18912.14.frc2.facebook.com (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:82::f) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2308.20; Thu, 20 Jan 2022 23:30:55 -0800 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 69F975BDEC4B; Thu, 20 Jan 2022 23:30:51 -0800 (PST) From: Martin KaFai Lau To: , CC: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , David Miller , Eric Dumazet , Jakub Kicinski , , Willem de Bruijn Subject: [RFC PATCH v3 net-next 4/4] bpf: Add __sk_buff->mono_delivery_time and handle __sk_buff->tstamp based on tc_at_ingress Date: Thu, 20 Jan 2022 23:30:51 -0800 Message-ID: <20220121073051.4180328-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220121073026.4173996-1-kafai@fb.com> References: <20220121073026.4173996-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-ORIG-GUID: dEthWMJbrKln4HcFSt_cP1PPK9Q6atp8 X-Proofpoint-GUID: dEthWMJbrKln4HcFSt_cP1PPK9Q6atp8 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.816,Hydra:6.0.425,FMLib:17.11.62.513 definitions=2022-01-21_02,2022-01-20_01,2021-12-02_01 X-Proofpoint-Spam-Details: rule=fb_outbound_notspam policy=fb_outbound score=0 impostorscore=0 spamscore=0 malwarescore=0 adultscore=0 clxscore=1015 mlxlogscore=999 bulkscore=0 phishscore=0 suspectscore=0 lowpriorityscore=0 mlxscore=0 priorityscore=1501 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2201110000 definitions=main-2201210047 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org X-Patchwork-State: RFC __sk_buff->mono_delivery_time: This patch adds __sk_buff->mono_delivery_time to read and write the mono delivery_time in skb->tstamp. The bpf rewrite is like: /* BPF_READ: __u64 a = __sk_buff->mono_delivery_time; */ if (skb->mono_delivery_time) a = skb->tstamp; else a = 0; /* BPF_WRITE: __sk_buff->mono_delivery_time = a; */ skb->tstamp = a; skb->mono_delivery_time = !!a; __sk_buff->tstamp: The bpf rewrite is like: /* BPF_READ: __u64 a = __sk_buff->tstamp; */ if (skb->tc_at_ingress && skb->mono_delivery_time) a = 0; else a = skb->tstamp; /* BPF_WRITE: __sk_buff->tstamp = a; */ skb->tstamp = a; if (skb->tc_at_ingress || !a) skb->mono_delivery_time = 0; At egress, reading is the same as before. All skb->tstamp is the delivery_time. Writing will not change the (kernel) skb->mono_delivery_time also unless 0 is being written. This will be the same behavior as before. (#) At ingress, the current bpf prog can only expect the (rcv) timestamp. Thus, both reading and writing are now treated as operating on the (rcv) timestamp for the existing bpf prog. During bpf load time, the verifier will learn if the bpf prog has accessed the new __sk_buff->mono_delivery_time. When reading at ingress, if the bpf prog does not access the new __sk_buff->mono_delivery_time, it will be treated as the existing behavior as mentioned in (#) above. If the (kernel) skb->tstamp currently has a delivery_time, it will temporary be saved first and then set the skb->tstamp to either the ktime_get_real() or zero. After the bpf prog finished running, if the bpf prog did not change the skb->tstamp, the saved delivery_time will be restored back to the skb->tstamp. When writing __sk_buff->tstamp at ingress, the skb->mono_delivery_time will be cleared because of the (#) mentioned above. If the bpf prog does access the new __sk_buff->mono_delivery_time at ingress, it indicates that the bpf prog is aware of this new kernel support: the (kernel) skb->tstamp can have the delivery_time or the (rcv) timestamp at ingress. If the __sk_buff->mono_delivery_time is available, the __sk_buff->tstamp will not be available and it will be zero. The bpf rewrite needs to access the skb's mono_delivery_time and tc_at_ingress bit. They are moved up in sk_buff so that bpf rewrite can be done at a fixed offset. tc_skip_classify is moved together with tc_at_ingress. To get one bit for mono_delivery_time, csum_not_inet is moved down and this bit is currently used by sctp. Signed-off-by: Martin KaFai Lau --- include/linux/filter.h | 31 +++++++- include/linux/skbuff.h | 20 +++-- include/uapi/linux/bpf.h | 1 + net/core/filter.c | 134 ++++++++++++++++++++++++++++++--- net/sched/act_bpf.c | 5 +- net/sched/cls_bpf.c | 6 +- tools/include/uapi/linux/bpf.h | 1 + 7 files changed, 171 insertions(+), 27 deletions(-) diff --git a/include/linux/filter.h b/include/linux/filter.h index 71fa57b88bfc..5cef695d6575 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -572,7 +572,8 @@ struct bpf_prog { has_callchain_buf:1, /* callchain buffer allocated? */ enforce_expected_attach_type:1, /* Enforce expected_attach_type checking at attach time */ call_get_stack:1, /* Do we call bpf_get_stack() or bpf_get_stackid() */ - call_get_func_ip:1; /* Do we call get_func_ip() */ + call_get_func_ip:1, /* Do we call get_func_ip() */ + delivery_time_access:1; /* Accessed __sk_buff->mono_delivery_time */ enum bpf_prog_type type; /* Type of BPF program */ enum bpf_attach_type expected_attach_type; /* For some prog types */ u32 len; /* Number of filter blocks */ @@ -699,6 +700,34 @@ static inline void bpf_compute_data_pointers(struct sk_buff *skb) cb->data_end = skb->data + skb_headlen(skb); } +static __always_inline u32 bpf_prog_run_at_ingress(const struct bpf_prog *prog, + struct sk_buff *skb) +{ + ktime_t tstamp, delivery_time = 0; + int filter_res; + + if (unlikely(skb->mono_delivery_time) && !prog->delivery_time_access) { + delivery_time = skb->tstamp; + skb->mono_delivery_time = 0; + if (static_branch_unlikely(&netstamp_needed_key)) + skb->tstamp = tstamp = ktime_get_real(); + else + skb->tstamp = tstamp = 0; + } + + /* It is safe to push/pull even if skb_shared() */ + __skb_push(skb, skb->mac_len); + bpf_compute_data_pointers(skb); + filter_res = bpf_prog_run(prog, skb); + __skb_pull(skb, skb->mac_len); + + /* __sk_buff->tstamp was not changed, restore the delivery_time */ + if (unlikely(delivery_time) && skb_tstamp(skb) == tstamp) + skb_set_delivery_time(skb, delivery_time, true); + + return filter_res; +} + /* Similar to bpf_compute_data_pointers(), except that save orginal * data in cb->data and cb->meta_data for restore. */ diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 4677bb6c7279..a14b04b86c13 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -866,22 +866,23 @@ struct sk_buff { __u8 vlan_present:1; /* See PKT_VLAN_PRESENT_BIT */ __u8 csum_complete_sw:1; __u8 csum_level:2; - __u8 csum_not_inet:1; __u8 dst_pending_confirm:1; + __u8 mono_delivery_time:1; + +#ifdef CONFIG_NET_CLS_ACT + __u8 tc_skip_classify:1; + __u8 tc_at_ingress:1; +#endif #ifdef CONFIG_IPV6_NDISC_NODETYPE __u8 ndisc_nodetype:2; #endif - + __u8 csum_not_inet:1; __u8 ipvs_property:1; __u8 inner_protocol_type:1; __u8 remcsum_offload:1; #ifdef CONFIG_NET_SWITCHDEV __u8 offload_fwd_mark:1; __u8 offload_l3_fwd_mark:1; -#endif -#ifdef CONFIG_NET_CLS_ACT - __u8 tc_skip_classify:1; - __u8 tc_at_ingress:1; #endif __u8 redirected:1; #ifdef CONFIG_NET_REDIRECT @@ -894,7 +895,6 @@ struct sk_buff { __u8 decrypted:1; #endif __u8 slow_gro:1; - __u8 mono_delivery_time:1; #ifdef CONFIG_NET_SCHED __u16 tc_index; /* traffic control index */ @@ -972,10 +972,16 @@ struct sk_buff { /* if you move pkt_vlan_present around you also must adapt these constants */ #ifdef __BIG_ENDIAN_BITFIELD #define PKT_VLAN_PRESENT_BIT 7 +#define TC_AT_INGRESS_SHIFT 0 +#define SKB_MONO_DELIVERY_TIME_SHIFT 2 #else #define PKT_VLAN_PRESENT_BIT 0 +#define TC_AT_INGRESS_SHIFT 7 +#define SKB_MONO_DELIVERY_TIME_SHIFT 5 #endif #define PKT_VLAN_PRESENT_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset) +#define TC_AT_INGRESS_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset) +#define SKB_MONO_DELIVERY_TIME_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset) #ifdef __KERNEL__ /* diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index b0383d371b9a..83725c891f3c 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -5437,6 +5437,7 @@ struct __sk_buff { __u32 gso_size; __u32 :32; /* Padding, future use. */ __u64 hwtstamp; + __u64 mono_delivery_time; }; struct bpf_tunnel_key { diff --git a/net/core/filter.c b/net/core/filter.c index 4fc53d645a01..db17812f0f8c 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -7832,6 +7832,7 @@ static bool bpf_skb_is_valid_access(int off, int size, enum bpf_access_type type return false; break; case bpf_ctx_range(struct __sk_buff, tstamp): + case bpf_ctx_range(struct __sk_buff, mono_delivery_time): if (size != sizeof(__u64)) return false; break; @@ -7872,6 +7873,7 @@ static bool sk_filter_is_valid_access(int off, int size, case bpf_ctx_range(struct __sk_buff, tstamp): case bpf_ctx_range(struct __sk_buff, wire_len): case bpf_ctx_range(struct __sk_buff, hwtstamp): + case bpf_ctx_range(struct __sk_buff, mono_delivery_time): return false; } @@ -7911,6 +7913,7 @@ static bool cg_skb_is_valid_access(int off, int size, case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]): break; case bpf_ctx_range(struct __sk_buff, tstamp): + case bpf_ctx_range(struct __sk_buff, mono_delivery_time): if (!bpf_capable()) return false; break; @@ -7943,6 +7946,7 @@ static bool lwt_is_valid_access(int off, int size, case bpf_ctx_range(struct __sk_buff, tstamp): case bpf_ctx_range(struct __sk_buff, wire_len): case bpf_ctx_range(struct __sk_buff, hwtstamp): + case bpf_ctx_range(struct __sk_buff, mono_delivery_time): return false; } @@ -8169,6 +8173,7 @@ static bool tc_cls_act_is_valid_access(int off, int size, case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]): case bpf_ctx_range(struct __sk_buff, tstamp): case bpf_ctx_range(struct __sk_buff, queue_mapping): + case bpf_ctx_range(struct __sk_buff, mono_delivery_time): break; default: return false; @@ -8445,6 +8450,7 @@ static bool sk_skb_is_valid_access(int off, int size, case bpf_ctx_range(struct __sk_buff, tstamp): case bpf_ctx_range(struct __sk_buff, wire_len): case bpf_ctx_range(struct __sk_buff, hwtstamp): + case bpf_ctx_range(struct __sk_buff, mono_delivery_time): return false; } @@ -8603,6 +8609,114 @@ static struct bpf_insn *bpf_convert_shinfo_access(const struct bpf_insn *si, return insn; } +static struct bpf_insn *bpf_convert_tstamp_read(const struct bpf_insn *si, + struct bpf_insn *insn) +{ + __u8 value_reg = si->dst_reg; + __u8 skb_reg = si->src_reg; + __u8 tmp_reg = BPF_REG_AX; + +#ifdef CONFIG_NET_CLS_ACT + *insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET); + *insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, 1 << TC_AT_INGRESS_SHIFT); + *insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 5); + /* @ingress, read __sk_buff->tstamp as the (rcv) timestamp, + * so check the skb->mono_delivery_time. + */ + *insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, + SKB_MONO_DELIVERY_TIME_OFFSET); + *insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, + 1 << SKB_MONO_DELIVERY_TIME_SHIFT); + *insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 2); + /* skb->mono_delivery_time is set, read 0 as the (rcv) timestamp. */ + *insn++ = BPF_MOV64_IMM(value_reg, 0); + *insn++ = BPF_JMP_A(1); +#endif + + *insn++ = BPF_LDX_MEM(BPF_DW, value_reg, skb_reg, + offsetof(struct sk_buff, tstamp)); + return insn; +} + +static struct bpf_insn *bpf_convert_tstamp_write(const struct bpf_insn *si, + struct bpf_insn *insn) +{ + __u8 value_reg = si->src_reg; + __u8 skb_reg = si->dst_reg; + __u8 tmp_reg = BPF_REG_AX; + + /* skb->tstamp = tstamp */ + *insn++ = BPF_STX_MEM(BPF_DW, skb_reg, value_reg, + offsetof(struct sk_buff, tstamp)); + +#ifdef CONFIG_NET_CLS_ACT + *insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET); + *insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, 1 << TC_AT_INGRESS_SHIFT); + *insn++ = BPF_JMP32_IMM(BPF_JNE, tmp_reg, 0, 1); +#endif + + /* test tstamp != 0 */ + *insn++ = BPF_JMP_IMM(BPF_JNE, value_reg, 0, 3); + /* writing __sk_buff->tstamp at ingress or writing 0, + * clear the skb->mono_delivery_time. + */ + *insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, + SKB_MONO_DELIVERY_TIME_OFFSET); + *insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, + ~(1 << SKB_MONO_DELIVERY_TIME_SHIFT)); + *insn++ = BPF_STX_MEM(BPF_B, skb_reg, tmp_reg, + SKB_MONO_DELIVERY_TIME_OFFSET); + + return insn; +} + +static struct bpf_insn * +bpf_convert_mono_delivery_time_read(const struct bpf_insn *si, + struct bpf_insn *insn) +{ + __u8 value_reg = si->dst_reg; + __u8 skb_reg = si->src_reg; + __u8 tmp_reg = BPF_REG_AX; + + *insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, + SKB_MONO_DELIVERY_TIME_OFFSET); + *insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, + 1 << SKB_MONO_DELIVERY_TIME_SHIFT); + *insn++ = BPF_JMP32_IMM(BPF_JNE, tmp_reg, 0, 2); + *insn++ = BPF_MOV64_IMM(value_reg, 0); + *insn++ = BPF_JMP_A(1); + *insn++ = BPF_LDX_MEM(BPF_DW, value_reg, skb_reg, + offsetof(struct sk_buff, tstamp)); + + return insn; +} + +static struct bpf_insn * +bpf_convert_mono_delivery_time_write(const struct bpf_insn *si, + struct bpf_insn *insn) +{ + __u8 value_reg = si->src_reg; + __u8 skb_reg = si->dst_reg; + __u8 tmp_reg = BPF_REG_AX; + + *insn++ = BPF_STX_MEM(BPF_DW, skb_reg, value_reg, + offsetof(struct sk_buff, tstamp)); + *insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, + SKB_MONO_DELIVERY_TIME_OFFSET); + *insn++ = BPF_JMP_IMM(BPF_JNE, value_reg, 0, 2); + /* write zero, clear the skb->mono_delivery_time */ + *insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, + ~(1 << SKB_MONO_DELIVERY_TIME_SHIFT)); + *insn++ = BPF_JMP_A(1); + /* write non zero, set skb->mono_delivery_time */ + *insn++ = BPF_ALU32_IMM(BPF_OR, tmp_reg, + 1 << SKB_MONO_DELIVERY_TIME_SHIFT); + *insn++ = BPF_STX_MEM(BPF_B, skb_reg, tmp_reg, + SKB_MONO_DELIVERY_TIME_OFFSET); + + return insn; +} + static u32 bpf_convert_ctx_access(enum bpf_access_type type, const struct bpf_insn *si, struct bpf_insn *insn_buf, @@ -8911,17 +9025,17 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type type, BUILD_BUG_ON(sizeof_field(struct sk_buff, tstamp) != 8); if (type == BPF_WRITE) - *insn++ = BPF_STX_MEM(BPF_DW, - si->dst_reg, si->src_reg, - bpf_target_off(struct sk_buff, - tstamp, 8, - target_size)); + insn = bpf_convert_tstamp_write(si, insn); else - *insn++ = BPF_LDX_MEM(BPF_DW, - si->dst_reg, si->src_reg, - bpf_target_off(struct sk_buff, - tstamp, 8, - target_size)); + insn = bpf_convert_tstamp_read(si, insn); + break; + + case offsetof(struct __sk_buff, mono_delivery_time): + if (type == BPF_WRITE) + insn = bpf_convert_mono_delivery_time_write(si, insn); + else + insn = bpf_convert_mono_delivery_time_read(si, insn); + prog->delivery_time_access = 1; break; case offsetof(struct __sk_buff, gso_segs): diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c index a77d8908e737..14c3bd0a5088 100644 --- a/net/sched/act_bpf.c +++ b/net/sched/act_bpf.c @@ -45,10 +45,7 @@ static int tcf_bpf_act(struct sk_buff *skb, const struct tc_action *act, filter = rcu_dereference(prog->filter); if (at_ingress) { - __skb_push(skb, skb->mac_len); - bpf_compute_data_pointers(skb); - filter_res = bpf_prog_run(filter, skb); - __skb_pull(skb, skb->mac_len); + filter_res = bpf_prog_run_at_ingress(filter, skb); } else { bpf_compute_data_pointers(skb); filter_res = bpf_prog_run(filter, skb); diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c index df19a847829e..036b2e1f74af 100644 --- a/net/sched/cls_bpf.c +++ b/net/sched/cls_bpf.c @@ -93,11 +93,7 @@ static int cls_bpf_classify(struct sk_buff *skb, const struct tcf_proto *tp, if (tc_skip_sw(prog->gen_flags)) { filter_res = prog->exts_integrated ? TC_ACT_UNSPEC : 0; } else if (at_ingress) { - /* It is safe to push/pull even if skb_shared() */ - __skb_push(skb, skb->mac_len); - bpf_compute_data_pointers(skb); - filter_res = bpf_prog_run(prog->filter, skb); - __skb_pull(skb, skb->mac_len); + filter_res = bpf_prog_run_at_ingress(prog->filter, skb); } else { bpf_compute_data_pointers(skb); filter_res = bpf_prog_run(prog->filter, skb); diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index b0383d371b9a..83725c891f3c 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -5437,6 +5437,7 @@ struct __sk_buff { __u32 gso_size; __u32 :32; /* Padding, future use. */ __u64 hwtstamp; + __u64 mono_delivery_time; }; struct bpf_tunnel_key {