| Message ID | 20220211071238.885669-1-kafai@fb.com (mailing list archive) |
|---|---|
| State | Changes Requested |
| Delegated to | Netdev Maintainers |
| Series | Preserve mono delivery time (EDT) in skb->tstamp |

| Context | Check | Description |
|---|---|---|
| netdev/tree_selection | success | Clearly marked for net-next |
| netdev/apply | fail | Patch does not apply to net-next |
On Thu, Feb 10, 2022 at 11:12:38PM -0800, Martin KaFai Lau wrote:
> skb->tstamp was first used as the (rcv) timestamp.
> The major usage is to report it to the user (e.g. SO_TIMESTAMP).
>
> Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
> during egress and used by the qdisc (e.g. sch_fq) to make decision on when
> the skb can be passed to the dev.
>
> Currently, there is no way to tell skb->tstamp having the (rcv) timestamp
> or the delivery_time, so it is always reset to 0 whenever forwarded
> between egress and ingress.
>
> While it makes sense to always clear the (rcv) timestamp in skb->tstamp
> to avoid confusing sch_fq that expects the delivery_time, it is a
> performance issue [0] to clear the delivery_time if the skb finally
> egress to a fq@phy-dev. For example, when forwarding from egress to
> ingress and then finally back to egress:
>
> tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns
>                             ^              ^
>                             reset          reset
>
> This patch adds one bit skb->mono_delivery_time to flag the skb->tstamp
> is storing the mono delivery_time (EDT) instead of the (rcv) timestamp.
>
> The current use case is to keep the TCP mono delivery_time (EDT) and
> to be used with sch_fq. A later patch will also allow tc-bpf@ingress
> to read and change the mono delivery_time.

Can you be more specific? How is the fq in the hostns even visible to
the container ns? More importantly, why are the packets always going out
from the container to eth0?

From the sender's point of view, it can't see the hostns and can't even
know whether the packets are routed to eth0 or other containers on the
same host. So I don't see how this makes sense.

Crossing netns is pretty much like delivering on wire; *generally speaking*,
if the skb metadata is not preserved on wire, it probably should not be
for crossing netns either.

Thanks.
On Sat, Feb 12, 2022 at 11:13:53AM -0800, Cong Wang wrote:
> On Thu, Feb 10, 2022 at 11:12:38PM -0800, Martin KaFai Lau wrote:
> > skb->tstamp was first used as the (rcv) timestamp.
> > The major usage is to report it to the user (e.g. SO_TIMESTAMP).
> >
> > Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
> > during egress and used by the qdisc (e.g. sch_fq) to make decision on when
> > the skb can be passed to the dev.
> >
> > Currently, there is no way to tell skb->tstamp having the (rcv) timestamp
> > or the delivery_time, so it is always reset to 0 whenever forwarded
> > between egress and ingress.
> >
> > While it makes sense to always clear the (rcv) timestamp in skb->tstamp
> > to avoid confusing sch_fq that expects the delivery_time, it is a
> > performance issue [0] to clear the delivery_time if the skb finally
> > egress to a fq@phy-dev. For example, when forwarding from egress to
> > ingress and then finally back to egress:
> >
> > tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns
> >                             ^              ^
> >                             reset          reset
> >
> > This patch adds one bit skb->mono_delivery_time to flag the skb->tstamp
> > is storing the mono delivery_time (EDT) instead of the (rcv) timestamp.
> >
> > The current use case is to keep the TCP mono delivery_time (EDT) and
> > to be used with sch_fq. A later patch will also allow tc-bpf@ingress
> > to read and change the mono delivery_time.
>
> Can you be more specific? How is the fq in the hostns even visible to
> the container ns? More importantly, why are the packets always going out
> from the container to eth0?
>
> From the sender's point of view, it can't see the hostns and can't even
> know whether the packets are routed to eth0 or other containers on the
> same host. So I don't see how this makes sense.
The sender does not need to know if there is fq installed anywhere or how
the packet will be routed. It is completely orthogonal.

Today, TCP always sets the EDT without knowing where the packet will be
routed and whether there is fq (or any lower-layer code) installed
anywhere in the routing path that will be using it.

> Crossing netns is pretty much like delivering on wire; *generally speaking*,
> if the skb metadata is not preserved on wire, it probably should not be
> for crossing netns either.
There are many fields in the skb that are not cleared. In general, a field
is cleared only when clearing is needed: e.g. skb->sk in the veth case
above, since sk has info that is not even in the tcp/ip packet itself.

The delivery time needed to be cleared because there was no way to
distinguish between the (rcv) timestamp and the delivery time.
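To make that last point concrete, here is a self-contained sketch of the distinction the new bit lets a forwarding path make. The struct, the helper name and the call site are illustrative only (a simplified skb model, not the kernel struct, and not code from this patch):

#include <stdbool.h>
#include <stdint.h>

/* Simplified skb model for illustration only. */
struct skb_sketch {
	int64_t tstamp;             /* (rcv) timestamp or delivery_time, in ns */
	bool    mono_delivery_time; /* the bit added by this patch */
};

/* With the flag available, a forwarding/ingress path can keep a mono
 * delivery_time (EDT) for a possible fq@phy-dev later in the path,
 * while still clearing a (rcv) timestamp that would confuse sch_fq.
 */
static void clear_tstamp_on_forward(struct skb_sketch *skb)
{
	if (skb->mono_delivery_time)
		return;          /* keep the EDT */

	skb->tstamp = 0;         /* drop the (rcv) timestamp */
}

Without the bit, the only safe choice is the unconditional skb->tstamp = 0 that the cover letter identifies as the performance problem.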
On 2/11/22 8:12 AM, Martin KaFai Lau wrote:
[...]
> The current use case is to keep the TCP mono delivery_time (EDT) and
> to be used with sch_fq. A later patch will also allow tc-bpf@ingress
> to read and change the mono delivery_time.
>
> In the future, another bit (e.g. skb->user_delivery_time) can be added
[...]
> ---
>  include/linux/skbuff.h                     | 13 +++++++++++++
>  net/bridge/netfilter/nf_conntrack_bridge.c |  5 +++--
>  net/ipv4/ip_output.c                       |  7 +++++--
>  net/ipv4/tcp_output.c                      | 16 +++++++++-------
>  net/ipv6/ip6_output.c                      |  5 +++--
>  net/ipv6/netfilter.c                       |  5 +++--
>  net/ipv6/tcp_ipv6.c                        |  2 +-
>  7 files changed, 37 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index a5adbf6b51e8..32c793de3801 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -747,6 +747,10 @@ typedef unsigned char *sk_buff_data_t;
>   *	@dst_pending_confirm: need to confirm neighbour
>   *	@decrypted: Decrypted SKB
>   *	@slow_gro: state present at GRO time, slower prepare step required
> + *	@mono_delivery_time: When set, skb->tstamp has the
> + *		delivery_time in mono clock base (i.e. EDT). Otherwise, the
> + *		skb->tstamp has the (rcv) timestamp at ingress and
> + *		delivery_time at egress.
>   *	@napi_id: id of the NAPI struct this skb came from
>   *	@sender_cpu: (aka @napi_id) source CPU in XPS
>   *	@secmark: security marking
> @@ -917,6 +921,7 @@ struct sk_buff {
>  	__u8			decrypted:1;
>  #endif
>  	__u8			slow_gro:1;
> +	__u8			mono_delivery_time:1;
>

Don't you also need to extend sch_fq to treat any non-mono_delivery_time-marked
skb as if it hadn't been marked with a delivery time?
On Tue, Feb 15, 2022 at 09:49:07PM +0100, Daniel Borkmann wrote:
> On 2/11/22 8:12 AM, Martin KaFai Lau wrote:
> [...]
> > The current use case is to keep the TCP mono delivery_time (EDT) and
> > to be used with sch_fq. A later patch will also allow tc-bpf@ingress
> > to read and change the mono delivery_time.
> >
> > In the future, another bit (e.g. skb->user_delivery_time) can be added
> [...]
> > ---
> >  include/linux/skbuff.h                     | 13 +++++++++++++
> >  net/bridge/netfilter/nf_conntrack_bridge.c |  5 +++--
> >  net/ipv4/ip_output.c                       |  7 +++++--
> >  net/ipv4/tcp_output.c                      | 16 +++++++++-------
> >  net/ipv6/ip6_output.c                      |  5 +++--
> >  net/ipv6/netfilter.c                       |  5 +++--
> >  net/ipv6/tcp_ipv6.c                        |  2 +-
> >  7 files changed, 37 insertions(+), 16 deletions(-)
> >
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index a5adbf6b51e8..32c793de3801 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -747,6 +747,10 @@ typedef unsigned char *sk_buff_data_t;
> >   *	@dst_pending_confirm: need to confirm neighbour
> >   *	@decrypted: Decrypted SKB
> >   *	@slow_gro: state present at GRO time, slower prepare step required
> > + *	@mono_delivery_time: When set, skb->tstamp has the
> > + *		delivery_time in mono clock base (i.e. EDT). Otherwise, the
> > + *		skb->tstamp has the (rcv) timestamp at ingress and
> > + *		delivery_time at egress.
> >   *	@napi_id: id of the NAPI struct this skb came from
> >   *	@sender_cpu: (aka @napi_id) source CPU in XPS
> >   *	@secmark: security marking
> > @@ -917,6 +921,7 @@ struct sk_buff {
> >  	__u8			decrypted:1;
> >  #endif
> >  	__u8			slow_gro:1;
> > +	__u8			mono_delivery_time:1;
> >
> Don't you also need to extend sch_fq to treat any non-mono_delivery_time-marked
> skb as if it hadn't been marked with a delivery time?
The current skb does not have clock base info, so sch_fq does not depend
on it to work either.

fq_packet_beyond_horizon() should already be a good guard against a clock
base difference.  An extra check on the new mono_delivery_time bit would
not necessarily add extra value on top.
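For readers who have not looked at sch_fq: the guard Martin refers to is, in spirit, a bounds check on how far in the future skb->tstamp may lie. A self-contained sketch of that idea follows; the struct and names are simplified, modelled loosely on sch_fq's fq_packet_beyond_horizon() rather than copied from net/sched/sch_fq.c:

#include <stdbool.h>
#include <stdint.h>

/* Simplified model of the qdisc state used by the check. */
struct fq_sketch {
	int64_t now_ns;      /* qdisc's cached notion of "now" (mono), in ns */
	int64_t horizon_ns;  /* maximum scheduling horizon, in ns */
};

/* A packet stamped further in the future than the horizon is treated as
 * out of range (the qdisc can drop it or cap its timestamp), so a
 * timestamp from a wildly different clock base cannot park packets in
 * the qdisc indefinitely.
 */
static bool packet_beyond_horizon(int64_t skb_tstamp_ns, const struct fq_sketch *q)
{
	return skb_tstamp_ns > q->now_ns + q->horizon_ns;
}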
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a5adbf6b51e8..32c793de3801 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -747,6 +747,10 @@ typedef unsigned char *sk_buff_data_t;
  *	@dst_pending_confirm: need to confirm neighbour
  *	@decrypted: Decrypted SKB
  *	@slow_gro: state present at GRO time, slower prepare step required
+ *	@mono_delivery_time: When set, skb->tstamp has the
+ *		delivery_time in mono clock base (i.e. EDT). Otherwise, the
+ *		skb->tstamp has the (rcv) timestamp at ingress and
+ *		delivery_time at egress.
  *	@napi_id: id of the NAPI struct this skb came from
  *	@sender_cpu: (aka @napi_id) source CPU in XPS
  *	@secmark: security marking
@@ -917,6 +921,7 @@ struct sk_buff {
 	__u8			decrypted:1;
 #endif
 	__u8			slow_gro:1;
+	__u8			mono_delivery_time:1;
 
 #ifdef CONFIG_NET_SCHED
 	__u16			tc_index;	/* traffic control index */
@@ -3925,6 +3930,14 @@ static inline ktime_t net_timedelta(ktime_t t)
 	return ktime_sub(ktime_get_real(), t);
 }
 
+static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt,
+					 bool mono)
+{
+	skb->tstamp = kt;
+	/* Setting mono_delivery_time will be enabled later */
+	skb->mono_delivery_time = 0;
+}
+
 static inline u8 skb_metadata_len(const struct sk_buff *skb)
 {
 	return skb_shinfo(skb)->meta_len;
diff --git a/net/bridge/netfilter/nf_conntrack_bridge.c b/net/bridge/netfilter/nf_conntrack_bridge.c
index fdbed3158555..ebfb2a5c59e4 100644
--- a/net/bridge/netfilter/nf_conntrack_bridge.c
+++ b/net/bridge/netfilter/nf_conntrack_bridge.c
@@ -32,6 +32,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk,
 					   struct sk_buff *))
 {
 	int frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size;
+	bool mono_delivery_time = skb->mono_delivery_time;
 	unsigned int hlen, ll_rs, mtu;
 	ktime_t tstamp = skb->tstamp;
 	struct ip_frag_state state;
@@ -81,7 +82,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk,
 			if (iter.frag)
 				ip_fraglist_prepare(skb, &iter);
 
-			skb->tstamp = tstamp;
+			skb_set_delivery_time(skb, tstamp, mono_delivery_time);
 			err = output(net, sk, data, skb);
 			if (err || !iter.frag)
 				break;
@@ -112,7 +113,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk,
 			goto blackhole;
 		}
 
-		skb2->tstamp = tstamp;
+		skb_set_delivery_time(skb2, tstamp, mono_delivery_time);
 		err = output(net, sk, data, skb2);
 		if (err)
 			goto blackhole;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 0c0574eb5f5b..7af5d1849bc9 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -761,6 +761,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 {
 	struct iphdr *iph;
 	struct sk_buff *skb2;
+	bool mono_delivery_time = skb->mono_delivery_time;
 	struct rtable *rt = skb_rtable(skb);
 	unsigned int mtu, hlen, ll_rs;
 	struct ip_fraglist_iter iter;
@@ -852,7 +853,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 				}
 			}
 
-			skb->tstamp = tstamp;
+			skb_set_delivery_time(skb, tstamp, mono_delivery_time);
 			err = output(net, sk, skb);
 
 			if (!err)
@@ -908,7 +909,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 		/*
 		 *	Put this fragment into the sending queue.
 		 */
-		skb2->tstamp = tstamp;
+		skb_set_delivery_time(skb2, tstamp, mono_delivery_time);
 		err = output(net, sk, skb2);
 		if (err)
 			goto fail;
@@ -1727,6 +1728,8 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
 			  arg->csumoffset) = csum_fold(csum_add(nskb->csum,
 								arg->csum));
 		nskb->ip_summed = CHECKSUM_NONE;
+		/* Setting mono_delivery_time will be enabled later */
+		nskb->mono_delivery_time = 0;
 		ip_push_pending_frames(sk, &fl4);
 	}
 out:
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e76bf1e9251e..2319531267c6 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1253,7 +1253,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
 	tp = tcp_sk(sk);
 	prior_wstamp = tp->tcp_wstamp_ns;
 	tp->tcp_wstamp_ns = max(tp->tcp_wstamp_ns, tp->tcp_clock_cache);
-	skb->skb_mstamp_ns = tp->tcp_wstamp_ns;
+	skb_set_delivery_time(skb, tp->tcp_wstamp_ns, true);
 
 	if (clone_it) {
 		oskb = skb;
@@ -1589,7 +1589,7 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
 
 	skb_split(skb, buff, len);
 
-	buff->tstamp = skb->tstamp;
+	skb_set_delivery_time(buff, skb->tstamp, true);
 	tcp_fragment_tstamp(skb, buff);
 
 	old_factor = tcp_skb_pcount(skb);
@@ -2616,7 +2616,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 
 		if (unlikely(tp->repair) && tp->repair_queue == TCP_SEND_QUEUE) {
 			/* "skb_mstamp_ns" is used as a start point for the retransmit timer */
-			skb->skb_mstamp_ns = tp->tcp_wstamp_ns = tp->tcp_clock_cache;
+			tp->tcp_wstamp_ns = tp->tcp_clock_cache;
+			skb_set_delivery_time(skb, tp->tcp_wstamp_ns, true);
 			list_move_tail(&skb->tcp_tsorted_anchor, &tp->tsorted_sent_queue);
 			tcp_init_tso_segs(skb, mss_now);
 			goto repair; /* Skip network transmission */
@@ -3541,11 +3542,12 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 	now = tcp_clock_ns();
 #ifdef CONFIG_SYN_COOKIES
 	if (unlikely(synack_type == TCP_SYNACK_COOKIE && ireq->tstamp_ok))
-		skb->skb_mstamp_ns = cookie_init_timestamp(req, now);
+		skb_set_delivery_time(skb, cookie_init_timestamp(req, now),
+				      true);
 	else
 #endif
 	{
-		skb->skb_mstamp_ns = now;
+		skb_set_delivery_time(skb, now, true);
 		if (!tcp_rsk(req)->snt_synack) /* Timestamp first SYNACK */
 			tcp_rsk(req)->snt_synack = tcp_skb_timestamp_us(skb);
 	}
@@ -3594,7 +3596,7 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 	bpf_skops_write_hdr_opt((struct sock *)sk, skb, req, syn_skb,
 				synack_type, &opts);
 
-	skb->skb_mstamp_ns = now;
+	skb_set_delivery_time(skb, now, true);
 	tcp_add_tx_delay(skb, tp);
 
 	return skb;
@@ -3771,7 +3773,7 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
 
 	err = tcp_transmit_skb(sk, syn_data, 1, sk->sk_allocation);
 
-	syn->skb_mstamp_ns = syn_data->skb_mstamp_ns;
+	skb_set_delivery_time(syn, syn_data->skb_mstamp_ns, true);
 
 	/* Now full SYN+DATA was cloned and sent (or not),
 	 * remove the SYN from the original skb (syn_data)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 0c6c971ce0a5..55665f3f7a77 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -813,6 +813,7 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 	struct rt6_info *rt = (struct rt6_info *)skb_dst(skb);
 	struct ipv6_pinfo *np = skb->sk && !dev_recursion_level() ?
 				inet6_sk(skb->sk) : NULL;
+	bool mono_delivery_time = skb->mono_delivery_time;
 	struct ip6_frag_state state;
 	unsigned int mtu, hlen, nexthdr_offset;
 	ktime_t tstamp = skb->tstamp;
@@ -903,7 +904,7 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 			if (iter.frag)
 				ip6_fraglist_prepare(skb, &iter);
 
-			skb->tstamp = tstamp;
+			skb_set_delivery_time(skb, tstamp, mono_delivery_time);
 			err = output(net, sk, skb);
 			if (!err)
 				IP6_INC_STATS(net, ip6_dst_idev(&rt->dst),
@@ -962,7 +963,7 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 		/*
 		 *	Put this fragment into the sending queue.
 		 */
-		frag->tstamp = tstamp;
+		skb_set_delivery_time(frag, tstamp, mono_delivery_time);
 		err = output(net, sk, frag);
 		if (err)
 			goto fail;
diff --git a/net/ipv6/netfilter.c b/net/ipv6/netfilter.c
index 6ab710b5a1a8..1da332450d98 100644
--- a/net/ipv6/netfilter.c
+++ b/net/ipv6/netfilter.c
@@ -121,6 +121,7 @@ int br_ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 				  struct sk_buff *))
 {
 	int frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size;
+	bool mono_delivery_time = skb->mono_delivery_time;
 	ktime_t tstamp = skb->tstamp;
 	struct ip6_frag_state state;
 	u8 *prevhdr, nexthdr = 0;
@@ -186,7 +187,7 @@ int br_ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 			if (iter.frag)
 				ip6_fraglist_prepare(skb, &iter);
 
-			skb->tstamp = tstamp;
+			skb_set_delivery_time(skb, tstamp, mono_delivery_time);
 			err = output(net, sk, data, skb);
 			if (err || !iter.frag)
 				break;
@@ -219,7 +220,7 @@ int br_ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 			goto blackhole;
 		}
 
-		skb2->tstamp = tstamp;
+		skb_set_delivery_time(skb2, tstamp, mono_delivery_time);
 		err = output(net, sk, data, skb2);
 		if (err)
 			goto blackhole;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 0c648bf07f39..d4fc6641afa4 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -992,7 +992,7 @@ static void tcp_v6_send_response(const struct sock *sk, struct sk_buff *skb, u32
 		} else {
 			mark = sk->sk_mark;
 		}
-		buff->tstamp = tcp_transmit_time(sk);
+		skb_set_delivery_time(buff, tcp_transmit_time(sk), true);
 	}
 	fl6.flowi6_mark = IP6_REPLY_MARK(net, skb->mark) ?: mark;
 	fl6.fl6_dport = t1->dest;
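As the comments in the hunks above note, skb_set_delivery_time() deliberately leaves mono_delivery_time at 0 for now. The following is a self-contained sketch of what the helper is presumably meant to do once a later patch in the series enables the bit; it uses a simplified skb model for illustration and is not the kernel code:

#include <stdbool.h>
#include <stdint.h>

struct skb_sketch {
	int64_t tstamp;             /* delivery_time once set here */
	bool    mono_delivery_time;
};

/* Only flag the tstamp as a mono delivery_time when a non-zero value is
 * actually stored; a zero tstamp means "no EDT", so the bit stays clear.
 */
static void set_delivery_time(struct skb_sketch *skb, int64_t kt, bool mono)
{
	skb->tstamp = kt;
	skb->mono_delivery_time = kt && mono;
}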
skb->tstamp was first used as the (rcv) timestamp.
The major usage is to report it to the user (e.g. SO_TIMESTAMP).

Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
during egress and used by the qdisc (e.g. sch_fq) to make decision on when
the skb can be passed to the dev.

Currently, there is no way to tell skb->tstamp having the (rcv) timestamp
or the delivery_time, so it is always reset to 0 whenever forwarded
between egress and ingress.

While it makes sense to always clear the (rcv) timestamp in skb->tstamp
to avoid confusing sch_fq that expects the delivery_time, it is a
performance issue [0] to clear the delivery_time if the skb finally
egress to a fq@phy-dev. For example, when forwarding from egress to
ingress and then finally back to egress:

            tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns
                                        ^              ^
                                        reset          reset

This patch adds one bit skb->mono_delivery_time to flag the skb->tstamp
is storing the mono delivery_time (EDT) instead of the (rcv) timestamp.

The current use case is to keep the TCP mono delivery_time (EDT) and
to be used with sch_fq. A later patch will also allow tc-bpf@ingress
to read and change the mono delivery_time.

In the future, another bit (e.g. skb->user_delivery_time) can be added
for the SCM_TXTIME where the clock base is tracked by sk->sk_clockid.

[ This patch is a prep work.  The following patch will get the other
  parts of the stack ready first.  Then another patch after that will
  finally set the skb->mono_delivery_time. ]

skb_set_delivery_time() function is added.  It is used by the tcp_output.c
and during ip[6] fragmentation to assign the delivery_time to
the skb->tstamp and also set the skb->mono_delivery_time.

A note on the change in ip_send_unicast_reply() in ip_output.c.
It is only used by TCP to send reset/ack out of a ctl_sk.
Like the new skb_set_delivery_time(), this patch sets
the skb->mono_delivery_time to 0 for now as a place holder.
It will be enabled in a later patch.

A similar case in tcp_ipv6 can be done with skb_set_delivery_time()
in tcp_v6_send_response().

[0] (slide 22):
https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/skbuff.h                     | 13 +++++++++++++
 net/bridge/netfilter/nf_conntrack_bridge.c |  5 +++--
 net/ipv4/ip_output.c                       |  7 +++++--
 net/ipv4/tcp_output.c                      | 16 +++++++++-------
 net/ipv6/ip6_output.c                      |  5 +++--
 net/ipv6/netfilter.c                       |  5 +++--
 net/ipv6/tcp_ipv6.c                        |  2 +-
 7 files changed, 37 insertions(+), 16 deletions(-)
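To make the EDT notion in the commit message concrete: conceptually, the delivery_time TCP places in skb->tstamp is "now plus the pacing delay for this packet", and a pacing qdisc such as sch_fq simply holds the packet until that time. A minimal, self-contained sketch of the idea follows; the arithmetic and names are illustrative and not taken from the kernel's pacing code:

#include <stdint.h>

/* Earliest departure time: the sender stamps each packet with the
 * earliest moment it should leave the host, derived from the pacing
 * rate; sch_fq releases the packet no earlier than that time.
 */
static uint64_t edt_ns(uint64_t now_ns, uint32_t pkt_len_bytes,
		       uint64_t pacing_rate_bytes_per_sec)
{
	uint64_t delay_ns = 0;

	if (pacing_rate_bytes_per_sec)
		delay_ns = (uint64_t)pkt_len_bytes * 1000000000ull /
			   pacing_rate_bytes_per_sec;

	return now_ns + delay_ns;	/* value stored as the delivery_time */
}

The point of this series is that this value, computed against the mono clock at the original sender, stays meaningful all the way to a fq@phy-dev in the host namespace, so clearing it at every netns crossing throws away useful pacing information.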