Message ID | 20220121073032.4176877-1-kafai@fb.com (mailing list archive) |
---|---|
State | RFC |
Delegated to: | Netdev Maintainers |
Series | Preserve mono delivery time (EDT) in skb->tstamp |
On Fri, Jan 21, 2022 at 2:30 AM Martin KaFai Lau <kafai@fb.com> wrote: > > skb->tstamp was first used as the (rcv) timestamp in real time clock base. > The major usage is to report it to the user (e.g. SO_TIMESTAMP). > > Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP) > during egress and used by the qdisc (e.g. sch_fq) to make decision on when > the skb can be passed to the dev. > > Currently, there is no way to tell skb->tstamp having the (rcv) timestamp > or the delivery_time, so it is always reset to 0 whenever forwarded > between egress and ingress. > > While it makes sense to always clear the (rcv) timestamp in skb->tstamp > to avoid confusing sch_fq that expects the delivery_time, it is a > performance issue [0] to clear the delivery_time if the skb finally > egress to a fq@phy-dev. For example, when forwarding from egress to > ingress and then finally back to egress: > > tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns > ^ ^ > reset rest > > [0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf > > This patch adds one bit skb->mono_delivery_time to flag the skb->tstamp > is storing the mono delivery_time instead of the (rcv) timestamp. > > The current use case is to keep the TCP mono delivery_time (EDT) and > to be used with sch_fq. The later patch will also allow tc-bpf to read > and change the mono delivery_time. > > In the future, another bit (e.g. skb->user_delivery_time) can be added > for the SCM_TXTIME where the clock base is tracked by sk->sk_clockid. > > [ This patch is a prep work. The following patch will > get the other parts of the stack ready first. Then another patch > after that will finally set the skb->mono_delivery_time. ] > > skb_set_delivery_time() function is added. It is used by the tcp_output.c > and during ip[6] fragmentation to assign the delivery_time to > the skb->tstamp and also set the skb->mono_delivery_time. > > Signed-off-by: Martin KaFai Lau <kafai@fb.com> > --- > include/linux/skbuff.h | 13 +++++++++++++ > net/bridge/netfilter/nf_conntrack_bridge.c | 5 +++-- > net/ipv4/ip_output.c | 5 +++-- > net/ipv4/tcp_output.c | 16 +++++++++------- > net/ipv6/ip6_output.c | 5 +++-- > net/ipv6/netfilter.c | 5 +++-- > 6 files changed, 34 insertions(+), 15 deletions(-) > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > index bf11e1fbd69b..b9e20187242a 100644 > --- a/include/linux/skbuff.h > +++ b/include/linux/skbuff.h > @@ -720,6 +720,10 @@ typedef unsigned char *sk_buff_data_t; > * @dst_pending_confirm: need to confirm neighbour > * @decrypted: Decrypted SKB > * @slow_gro: state present at GRO time, slower prepare step required > + * @mono_delivery_time: When set, skb->tstamap has the tstamp > + * delivery_time in mono clock base (i.e. EDT). Otherwise, the > + * skb->tstamp has the (rcv) timestamp at ingress and > + * delivery_time at egress. > * @napi_id: id of the NAPI struct this skb came from > * @sender_cpu: (aka @napi_id) source CPU in XPS > * @secmark: security marking > @@ -890,6 +894,7 @@ struct sk_buff { > __u8 decrypted:1; > #endif > __u8 slow_gro:1; > + __u8 mono_delivery_time:1; This bit fills a hole, does not change sk_buff size, right? 
> > #ifdef CONFIG_NET_SCHED > __u16 tc_index; /* traffic control index */ > @@ -3903,6 +3908,14 @@ static inline ktime_t net_invalid_timestamp(void) > return 0; > } > > +static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt, > + bool mono) > +{ > + skb->tstamp = kt; > + /* Setting mono_delivery_time will be enabled later */ > + /* skb->mono_delivery_time = kt && mono; */ > +} > + > static inline u8 skb_metadata_len(const struct sk_buff *skb) > { > return skb_shinfo(skb)->meta_len; > diff --git a/net/bridge/netfilter/nf_conntrack_bridge.c b/net/bridge/netfilter/nf_conntrack_bridge.c > index fdbed3158555..ebfb2a5c59e4 100644 > --- a/net/bridge/netfilter/nf_conntrack_bridge.c > +++ b/net/bridge/netfilter/nf_conntrack_bridge.c > @@ -32,6 +32,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk, > struct sk_buff *)) > { > int frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size; > + bool mono_delivery_time = skb->mono_delivery_time; This use of a local variable is just a style choice, not needed for correctness, correct?
On Sat, Jan 22, 2022 at 10:32:16AM -0500, Willem de Bruijn wrote: > On Fri, Jan 21, 2022 at 2:30 AM Martin KaFai Lau <kafai@fb.com> wrote: > > > > skb->tstamp was first used as the (rcv) timestamp in real time clock base. > > The major usage is to report it to the user (e.g. SO_TIMESTAMP). > > > > Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP) > > during egress and used by the qdisc (e.g. sch_fq) to make decision on when > > the skb can be passed to the dev. > > > > Currently, there is no way to tell skb->tstamp having the (rcv) timestamp > > or the delivery_time, so it is always reset to 0 whenever forwarded > > between egress and ingress. > > > > While it makes sense to always clear the (rcv) timestamp in skb->tstamp > > to avoid confusing sch_fq that expects the delivery_time, it is a > > performance issue [0] to clear the delivery_time if the skb finally > > egress to a fq@phy-dev. For example, when forwarding from egress to > > ingress and then finally back to egress: > > > > tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns > > ^ ^ > > reset rest > > > > [0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf > > > > This patch adds one bit skb->mono_delivery_time to flag the skb->tstamp > > is storing the mono delivery_time instead of the (rcv) timestamp. > > > > The current use case is to keep the TCP mono delivery_time (EDT) and > > to be used with sch_fq. The later patch will also allow tc-bpf to read > > and change the mono delivery_time. > > > > In the future, another bit (e.g. skb->user_delivery_time) can be added > > for the SCM_TXTIME where the clock base is tracked by sk->sk_clockid. > > > > [ This patch is a prep work. The following patch will > > get the other parts of the stack ready first. Then another patch > > after that will finally set the skb->mono_delivery_time. ] > > > > skb_set_delivery_time() function is added. It is used by the tcp_output.c > > and during ip[6] fragmentation to assign the delivery_time to > > the skb->tstamp and also set the skb->mono_delivery_time. > > > > Signed-off-by: Martin KaFai Lau <kafai@fb.com> > > --- > > include/linux/skbuff.h | 13 +++++++++++++ > > net/bridge/netfilter/nf_conntrack_bridge.c | 5 +++-- > > net/ipv4/ip_output.c | 5 +++-- > > net/ipv4/tcp_output.c | 16 +++++++++------- > > net/ipv6/ip6_output.c | 5 +++-- > > net/ipv6/netfilter.c | 5 +++-- > > 6 files changed, 34 insertions(+), 15 deletions(-) > > > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > > index bf11e1fbd69b..b9e20187242a 100644 > > --- a/include/linux/skbuff.h > > +++ b/include/linux/skbuff.h > > @@ -720,6 +720,10 @@ typedef unsigned char *sk_buff_data_t; > > * @dst_pending_confirm: need to confirm neighbour > > * @decrypted: Decrypted SKB > > * @slow_gro: state present at GRO time, slower prepare step required > > + * @mono_delivery_time: When set, skb->tstamap has the > > tstamp > > > + * delivery_time in mono clock base (i.e. EDT). Otherwise, the > > + * skb->tstamp has the (rcv) timestamp at ingress and > > + * delivery_time at egress. > > * @napi_id: id of the NAPI struct this skb came from > > * @sender_cpu: (aka @napi_id) source CPU in XPS > > * @secmark: security marking > > @@ -890,6 +894,7 @@ struct sk_buff { > > __u8 decrypted:1; > > #endif > > __u8 slow_gro:1; > > + __u8 mono_delivery_time:1; > > This bit fills a hole, does not change sk_buff size, right? 
> > > > > #ifdef CONFIG_NET_SCHED > > __u16 tc_index; /* traffic control index */ > > @@ -3903,6 +3908,14 @@ static inline ktime_t net_invalid_timestamp(void) > > return 0; > > } > > > > +static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt, > > + bool mono) > > +{ > > + skb->tstamp = kt; > > + /* Setting mono_delivery_time will be enabled later */ > > + /* skb->mono_delivery_time = kt && mono; */ > > +} > > + > > static inline u8 skb_metadata_len(const struct sk_buff *skb) > > { > > return skb_shinfo(skb)->meta_len; > > diff --git a/net/bridge/netfilter/nf_conntrack_bridge.c b/net/bridge/netfilter/nf_conntrack_bridge.c > > index fdbed3158555..ebfb2a5c59e4 100644 > > --- a/net/bridge/netfilter/nf_conntrack_bridge.c > > +++ b/net/bridge/netfilter/nf_conntrack_bridge.c > > @@ -32,6 +32,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk, > > struct sk_buff *)) > > { > > int frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size; > > + bool mono_delivery_time = skb->mono_delivery_time; > > This use of a local variable is just a style choice, not needed for > correctness, correct? It is needed. The current code also saves the very first skb->tstamp (two lines below): > > unsigned int hlen, ll_rs, mtu; > > ktime_t tstamp = skb->tstamp; My understanding is all frags reuse the first (/original) skb tstamp info. The frags output() is one-by-one, meaning when sending the later frags, the first skb has already been output()-ed, so the first skb tstamp info needs to be saved in local variables.
On Sat, Jan 22, 2022 at 10:32:16AM -0500, Willem de Bruijn wrote:
> > + *		delivery_time in mono clock base (i.e. EDT). Otherwise, the
> > + *		skb->tstamp has the (rcv) timestamp at ingress and
> > + *		delivery_time at egress.
> >  *	@napi_id: id of the NAPI struct this skb came from
> >  *	@sender_cpu: (aka @napi_id) source CPU in XPS
> >  *	@secmark: security marking
> > @@ -890,6 +894,7 @@ struct sk_buff {
> >  	__u8			decrypted:1;
> >  #endif
> >  	__u8			slow_gro:1;
> > +	__u8			mono_delivery_time:1;
>
> This bit fills a hole, does not change sk_buff size, right?

[ Sorry, cut too much when sending the last reply ]

Correct. With this +1 bit, there is still a 3-bit hole and a 1-byte hole before tc_index when every CONFIG_* among these bits is turned on.
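For quick reference, the affected slice of struct sk_buff from the skbuff.h hunk, with that layout note folded in as a comment (padding figures taken from the reply above, assuming all relevant CONFIG_* options enabled):

	__u8			slow_gro:1;
	__u8			mono_delivery_time:1;	/* new bit: fills an existing hole,
							 * sizeof(struct sk_buff) unchanged
							 */
	/* ~3 spare bits plus a 1-byte hole remain here before tc_index */
#ifdef CONFIG_NET_SCHED
	__u16			tc_index;	/* traffic control index */
#endif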
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index bf11e1fbd69b..b9e20187242a 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -720,6 +720,10 @@ typedef unsigned char *sk_buff_data_t; * @dst_pending_confirm: need to confirm neighbour * @decrypted: Decrypted SKB * @slow_gro: state present at GRO time, slower prepare step required + * @mono_delivery_time: When set, skb->tstamap has the + * delivery_time in mono clock base (i.e. EDT). Otherwise, the + * skb->tstamp has the (rcv) timestamp at ingress and + * delivery_time at egress. * @napi_id: id of the NAPI struct this skb came from * @sender_cpu: (aka @napi_id) source CPU in XPS * @secmark: security marking @@ -890,6 +894,7 @@ struct sk_buff { __u8 decrypted:1; #endif __u8 slow_gro:1; + __u8 mono_delivery_time:1; #ifdef CONFIG_NET_SCHED __u16 tc_index; /* traffic control index */ @@ -3903,6 +3908,14 @@ static inline ktime_t net_invalid_timestamp(void) return 0; } +static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt, + bool mono) +{ + skb->tstamp = kt; + /* Setting mono_delivery_time will be enabled later */ + /* skb->mono_delivery_time = kt && mono; */ +} + static inline u8 skb_metadata_len(const struct sk_buff *skb) { return skb_shinfo(skb)->meta_len; diff --git a/net/bridge/netfilter/nf_conntrack_bridge.c b/net/bridge/netfilter/nf_conntrack_bridge.c index fdbed3158555..ebfb2a5c59e4 100644 --- a/net/bridge/netfilter/nf_conntrack_bridge.c +++ b/net/bridge/netfilter/nf_conntrack_bridge.c @@ -32,6 +32,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk, struct sk_buff *)) { int frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size; + bool mono_delivery_time = skb->mono_delivery_time; unsigned int hlen, ll_rs, mtu; ktime_t tstamp = skb->tstamp; struct ip_frag_state state; @@ -81,7 +82,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk, if (iter.frag) ip_fraglist_prepare(skb, &iter); - skb->tstamp = tstamp; + skb_set_delivery_time(skb, tstamp, mono_delivery_time); err = output(net, sk, data, skb); if (err || !iter.frag) break; @@ -112,7 +113,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk, goto blackhole; } - skb2->tstamp = tstamp; + skb_set_delivery_time(skb2, tstamp, mono_delivery_time); err = output(net, sk, data, skb2); if (err) goto blackhole; diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 57c1d8431386..6ba08d9bdf8e 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -754,6 +754,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, { struct iphdr *iph; struct sk_buff *skb2; + bool mono_delivery_time = skb->mono_delivery_time; struct rtable *rt = skb_rtable(skb); unsigned int mtu, hlen, ll_rs; struct ip_fraglist_iter iter; @@ -836,7 +837,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, ip_fraglist_prepare(skb, &iter); } - skb->tstamp = tstamp; + skb_set_delivery_time(skb, tstamp, mono_delivery_time); err = output(net, sk, skb); if (!err) @@ -892,7 +893,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, /* * Put this fragment into the sending queue. 
*/ - skb2->tstamp = tstamp; + skb_set_delivery_time(skb2, tstamp, mono_delivery_time); err = output(net, sk, skb2); if (err) goto fail; diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 5079832af5c1..261e9584f81e 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1253,7 +1253,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, tp = tcp_sk(sk); prior_wstamp = tp->tcp_wstamp_ns; tp->tcp_wstamp_ns = max(tp->tcp_wstamp_ns, tp->tcp_clock_cache); - skb->skb_mstamp_ns = tp->tcp_wstamp_ns; + skb_set_delivery_time(skb, tp->tcp_wstamp_ns, true); if (clone_it) { oskb = skb; @@ -1589,7 +1589,7 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue, skb_split(skb, buff, len); - buff->tstamp = skb->tstamp; + skb_set_delivery_time(buff, skb->tstamp, true); tcp_fragment_tstamp(skb, buff); old_factor = tcp_skb_pcount(skb); @@ -2616,7 +2616,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle, if (unlikely(tp->repair) && tp->repair_queue == TCP_SEND_QUEUE) { /* "skb_mstamp_ns" is used as a start point for the retransmit timer */ - skb->skb_mstamp_ns = tp->tcp_wstamp_ns = tp->tcp_clock_cache; + tp->tcp_wstamp_ns = tp->tcp_clock_cache; + skb_set_delivery_time(skb, tp->tcp_wstamp_ns, true); list_move_tail(&skb->tcp_tsorted_anchor, &tp->tsorted_sent_queue); tcp_init_tso_segs(skb, mss_now); goto repair; /* Skip network transmission */ @@ -3541,11 +3542,12 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst, now = tcp_clock_ns(); #ifdef CONFIG_SYN_COOKIES if (unlikely(synack_type == TCP_SYNACK_COOKIE && ireq->tstamp_ok)) - skb->skb_mstamp_ns = cookie_init_timestamp(req, now); + skb_set_delivery_time(skb, cookie_init_timestamp(req, now), + true); else #endif { - skb->skb_mstamp_ns = now; + skb_set_delivery_time(skb, now, true); if (!tcp_rsk(req)->snt_synack) /* Timestamp first SYNACK */ tcp_rsk(req)->snt_synack = tcp_skb_timestamp_us(skb); } @@ -3594,7 +3596,7 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst, bpf_skops_write_hdr_opt((struct sock *)sk, skb, req, syn_skb, synack_type, &opts); - skb->skb_mstamp_ns = now; + skb_set_delivery_time(skb, now, true); tcp_add_tx_delay(skb, tp); return skb; @@ -3771,7 +3773,7 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn) err = tcp_transmit_skb(sk, syn_data, 1, sk->sk_allocation); - syn->skb_mstamp_ns = syn_data->skb_mstamp_ns; + skb_set_delivery_time(syn, syn_data->skb_mstamp_ns, true); /* Now full SYN+DATA was cloned and sent (or not), * remove the SYN from the original skb (syn_data) diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 2995f8d89e7e..4b873413edc2 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -813,6 +813,7 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, struct rt6_info *rt = (struct rt6_info *)skb_dst(skb); struct ipv6_pinfo *np = skb->sk && !dev_recursion_level() ? 
inet6_sk(skb->sk) : NULL; + bool mono_delivery_time = skb->mono_delivery_time; struct ip6_frag_state state; unsigned int mtu, hlen, nexthdr_offset; ktime_t tstamp = skb->tstamp; @@ -903,7 +904,7 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, if (iter.frag) ip6_fraglist_prepare(skb, &iter); - skb->tstamp = tstamp; + skb_set_delivery_time(skb, tstamp, mono_delivery_time); err = output(net, sk, skb); if (!err) IP6_INC_STATS(net, ip6_dst_idev(&rt->dst), @@ -962,7 +963,7 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, /* * Put this fragment into the sending queue. */ - frag->tstamp = tstamp; + skb_set_delivery_time(frag, tstamp, mono_delivery_time); err = output(net, sk, frag); if (err) goto fail; diff --git a/net/ipv6/netfilter.c b/net/ipv6/netfilter.c index 6ab710b5a1a8..1da332450d98 100644 --- a/net/ipv6/netfilter.c +++ b/net/ipv6/netfilter.c @@ -121,6 +121,7 @@ int br_ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, struct sk_buff *)) { int frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size; + bool mono_delivery_time = skb->mono_delivery_time; ktime_t tstamp = skb->tstamp; struct ip6_frag_state state; u8 *prevhdr, nexthdr = 0; @@ -186,7 +187,7 @@ int br_ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, if (iter.frag) ip6_fraglist_prepare(skb, &iter); - skb->tstamp = tstamp; + skb_set_delivery_time(skb, tstamp, mono_delivery_time); err = output(net, sk, data, skb); if (err || !iter.frag) break; @@ -219,7 +220,7 @@ int br_ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, goto blackhole; } - skb2->tstamp = tstamp; + skb_set_delivery_time(skb2, tstamp, mono_delivery_time); err = output(net, sk, data, skb2); if (err) goto blackhole;
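The skb_set_delivery_time() helper added in include/linux/skbuff.h above intentionally leaves the bit assignment commented out; per the commit message, a later patch in the series flips it on once the rest of the stack is ready. The end state is presumably just the commented-out line made live:

static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt,
					 bool mono)
{
	skb->tstamp = kt;
	/* Only mark the tstamp as a mono delivery_time when a real
	 * (non-zero) time is being stored; kt == 0 keeps the bit cleared.
	 */
	skb->mono_delivery_time = kt && mono;
}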
skb->tstamp was first used as the (rcv) timestamp in real time clock base. The major usage is to report it to the user (e.g. SO_TIMESTAMP).

Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP) during egress and used by the qdisc (e.g. sch_fq) to decide when the skb can be passed to the dev.

Currently, there is no way to tell whether skb->tstamp holds the (rcv) timestamp or the delivery_time, so it is always reset to 0 whenever forwarded between egress and ingress.

While it makes sense to always clear the (rcv) timestamp in skb->tstamp to avoid confusing sch_fq, which expects the delivery_time, it is a performance issue [0] to clear the delivery_time if the skb finally egresses to a fq@phy-dev. For example, when forwarding from egress to ingress and then finally back to egress:

    tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns
                             ^              ^
                             reset          reset

[0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf

This patch adds one bit, skb->mono_delivery_time, to flag that skb->tstamp is storing the mono delivery_time instead of the (rcv) timestamp.

The current use case is to keep the TCP mono delivery_time (EDT) so that it can be used with sch_fq. A later patch will also allow tc-bpf to read and change the mono delivery_time.

In the future, another bit (e.g. skb->user_delivery_time) can be added for SCM_TXTIME, where the clock base is tracked by sk->sk_clockid.

[ This patch is a prep work. The following patch will get the other parts of the stack ready first. Then another patch after that will finally set the skb->mono_delivery_time. ]

A skb_set_delivery_time() helper is added. It is used by tcp_output.c and during ip[6] fragmentation to assign the delivery_time to skb->tstamp and also set skb->mono_delivery_time.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/skbuff.h                     | 13 +++++++++++++
 net/bridge/netfilter/nf_conntrack_bridge.c |  5 +++--
 net/ipv4/ip_output.c                       |  5 +++--
 net/ipv4/tcp_output.c                      | 16 +++++++++-------
 net/ipv6/ip6_output.c                      |  5 +++--
 net/ipv6/netfilter.c                       |  5 +++--
 6 files changed, 34 insertions(+), 15 deletions(-)
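To tie this back to the veth forwarding example: the point of the new bit is that the egress-to-ingress boundary no longer has to zero skb->tstamp unconditionally. A rough illustration of that intended end state (not part of this patch; the helper name here is invented for illustration only):

/* Hypothetical helper, illustration only. Today the forwarding path
 * simply resets skb->tstamp to 0; with the new bit, only a (rcv)
 * timestamp needs to be cleared, while a mono delivery_time (EDT)
 * can be kept for a downstream fq@phy-dev.
 */
static inline void skb_clear_rcv_tstamp(struct sk_buff *skb)
{
	if (skb->mono_delivery_time)
		return;		/* keep the EDT for sch_fq */

	skb->tstamp = 0;	/* clear the (rcv) timestamp */
}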