diff mbox series

[RFC,v3,net-next,3/4] net: Set skb->mono_delivery_time and clear it when delivering locally

Message ID 20220121073045.4179438-1-kafai@fb.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Series Preserve mono delivery time (EDT) in skb->tstamp

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 6057 this patch: 6057
netdev/cc_maintainers warning 9 maintainers not CCed: pablo@netfilter.org alobakin@pm.me kpsingh@kernel.org john.fastabend@gmail.com songliubraving@fb.com dsahern@kernel.org keescook@chromium.org yhs@fb.com yoshfuji@linux-ipv6.org
netdev/build_clang success Errors and warnings before: 927 this patch: 927
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 6211 this patch: 6211
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 105 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Martin KaFai Lau Jan. 21, 2022, 7:30 a.m. UTC
This patch sets skb->mono_delivery_time to flag that skb->tstamp
holds the mono delivery_time (EDT) instead of the (rcv) timestamp.

skb_clear_delivery_time() is added to clear the delivery_time and set
back to the (rcv) timestamp if needed when the skb is being delivered
locally (to a sk).  skb_clear_delivery_time() is called in
ip_local_deliver() and ip6_input().  In most of the regular ingress
cases, the skb->tstamp should already have the (rcv) timestamp.
For the egress loop back to ingress cases, the marking of the (rcv)
timestamp is postponed from dev.c to ip_local_deliver() and
ip6_input().
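
The local-delivery behaviour described above can be sketched in
userspace as follows.  This is only a model, not the kernel code:
struct model_skb, model_netstamp_needed and model_ktime_get_real()
are hypothetical stand-ins for struct sk_buff, netstamp_needed_key
and ktime_get_real().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the two sk_buff fields this patch touches. */
struct model_skb {
	int64_t tstamp;          /* nanoseconds; 0 means "no timestamp" */
	bool mono_delivery_time; /* true: tstamp holds a mono EDT */
};

/* Hypothetical stand-in for the netstamp_needed_key static branch. */
static bool model_netstamp_needed;

/* Hypothetical stand-in for ktime_get_real(): wall clock in ns. */
static int64_t model_ktime_get_real(void)
{
	struct timespec ts;

	timespec_get(&ts, TIME_UTC);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Mirrors the logic of skb_clear_delivery_time() in this patch. */
static void model_skb_clear_delivery_time(struct model_skb *skb)
{
	assert(skb != NULL);

	/* tstamp is already 0 or a (rcv) timestamp: leave it alone. */
	if (!skb->mono_delivery_time)
		return;

	skb->mono_delivery_time = false;
	if (model_netstamp_needed)
		skb->tstamp = model_ktime_get_real();
	else
		skb->tstamp = 0;
}
```

In this model, an skb that does not carry a delivery_time passes
through untouched, which matches the regular ingress case where
skb->tstamp already has the (rcv) timestamp.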

Another case that needs to clear the delivery_time is network
tapping (e.g. af_packet by tcpdump).  Regardless of tapping at ingress
or egress, the tapped skb is received by the af_packet socket, so
it is ingress to the af_packet socket, which expects
the (rcv) timestamp.

When tapping at egress, dev_queue_xmit_nit() is used.  It already
expects that skb->tstamp may hold a delivery_time, so it does
skb_clone()+net_timestamp_set() to ensure the cloned skb has
the (rcv) timestamp before passing it to the af_packet sk.
This patch only adds clearing of the skb->mono_delivery_time
bit in net_timestamp_set().
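
A rough userspace model of the egress tap, under simplifying
assumptions: a plain struct copy stands in for skb_clone(), and
the need_stamp and now_real parameters are hypothetical stand-ins
for netstamp_needed_key and the wall clock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two sk_buff fields this patch touches. */
struct model_skb {
	int64_t tstamp;          /* nanoseconds; 0 means "no timestamp" */
	bool mono_delivery_time; /* true: tstamp holds a mono EDT */
};

/* Models net_timestamp_set() with the one-line change from this patch. */
static void model_net_timestamp_set(struct model_skb *skb,
				    bool need_stamp, int64_t now_real)
{
	skb->tstamp = 0;
	skb->mono_delivery_time = false;	/* added by this patch */
	if (need_stamp)
		skb->tstamp = now_real;
}

/* Models the tap: copy the skb, then re-stamp only the copy. */
static struct model_skb model_egress_tap(const struct model_skb *orig,
					 bool need_stamp, int64_t now_real)
{
	struct model_skb clone;

	assert(orig != NULL);
	clone = *orig;		/* stands in for skb_clone() */
	model_net_timestamp_set(&clone, need_stamp, now_real);
	return clone;
}
```

The outgoing skb keeps its delivery_time in this model; only the
copy handed to the tap is rewritten.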

When tapping at ingress, the code currently expects skb->tstamp to be
either 0 or the (rcv) timestamp.  That is, the ingress tapping path
already expects that skb->tstamp could be 0 and gets
the (rcv) timestamp from ktime_get_real() when needed.

There are two cases for tapping at ingress:

In the first case, af_packet queues the skb to its sk_receive_queue.
The skb is either not shared or is a newly created clone.
skb_clear_delivery_time() is called to clear the delivery_time (if any)
before the skb is queued to the sk_receive_queue.

In the second case, the ingress skb is directly copied to the rx_ring
and tpacket_get_timestamp() is used to get the (rcv) timestamp.
skb_tstamp() is used in tpacket_get_timestamp() to check
the skb->mono_delivery_time bit before returning skb->tstamp.
As mentioned earlier, tapping at ingress already expects that
the skb may not have the (rcv) timestamp (because no sk has asked
for it) and handles this case by directly calling ktime_get_real().
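
The rx_ring case can be modelled roughly as below.  This is a sketch
only: model_rx_ring_timestamp() is a hypothetical helper that folds
the existing "fall back to the wall clock" behaviour into one function,
and now_real stands in for the value ktime_get_real() would return.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two sk_buff fields this patch touches. */
struct model_skb {
	int64_t tstamp;          /* nanoseconds; 0 means "no timestamp" */
	bool mono_delivery_time; /* true: tstamp holds a mono EDT */
};

/* Mirrors skb_tstamp(): never expose a mono EDT as a (rcv) timestamp. */
static int64_t model_skb_tstamp(const struct model_skb *skb)
{
	assert(skb != NULL);
	if (skb->mono_delivery_time)
		return 0;

	return skb->tstamp;
}

/* Hypothetical helper: use the (rcv) timestamp if there is one,
 * otherwise fall back to the current wall-clock time.
 */
static int64_t model_rx_ring_timestamp(const struct model_skb *skb,
				       int64_t now_real)
{
	int64_t ts = model_skb_tstamp(skb);

	return ts ? ts : now_real;
}
```

Because the skb is only read here, the model takes a const pointer,
just as skb_tstamp() does in the patch.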

__skb_tstamp_tx() clones the egress skb and queues the clone to the
sk_error_queue.  The outgoing skb may have the mono delivery_time while
the (rcv) timestamp is expected for the clone, so the
skb->mono_delivery_time bit is also cleared from the clone.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/skbuff.h | 27 +++++++++++++++++++++++++--
 net/core/dev.c         |  4 +++-
 net/core/skbuff.c      |  6 ++++--
 net/ipv4/ip_input.c    |  1 +
 net/ipv6/ip6_input.c   |  1 +
 net/packet/af_packet.c |  4 +++-
 6 files changed, 37 insertions(+), 6 deletions(-)

Comments

Julian Anastasov Jan. 21, 2022, 12:02 p.m. UTC | #1
Hello,

On Thu, 20 Jan 2022, Martin KaFai Lau wrote:

> This patch sets the skb->mono_delivery_time to flag the skb->tstamp
> is used as the mono delivery_time (EDT) instead of the (rcv) timestamp.
> 
> skb_clear_delivery_time() is added to clear the delivery_time and set
> back to the (rcv) timestamp if needed when the skb is being delivered
> locally (to a sk).  skb_clear_delivery_time() is called in
> ip_local_deliver() and ip6_input().  In most of the regular ingress
> cases, the skb->tstamp should already have the (rcv) timestamp.
> For the egress loop back to ingress cases, the marking of the (rcv)
> timestamp is postponed from dev.c to ip_local_deliver() and
> ip6_input().
> 
> Another case needs to clear the delivery_time is the network
> tapping (e.g. af_packet by tcpdump).  Regardless of tapping at the ingress
> or egress,  the tapped skb is received by the af_packet socket, so
> it is ingress to the af_packet socket and it expects
> the (rcv) timestamp.
> 
> When tapping at egress, dev_queue_xmit_nit() is used.  It has already
> expected skb->tstamp may have delivery_time,  so it does
> skb_clone()+net_timestamp_set() to ensure the cloned skb has
> the (rcv) timestamp before passing to the af_packet sk.
> This patch only adds to clear the skb->mono_delivery_time
> bit in net_timestamp_set().
> 
> When tapping at ingress, it currently expects the skb->tstamp is either 0
> or has the (rcv) timestamp.  Meaning, the tapping at ingress path
> has already expected the skb->tstamp could be 0 and it will get
> the (rcv) timestamp by ktime_get_real() when needed.
> 
> There are two cases for tapping at ingress:
> 
> One case is af_packet queues the skb to its sk_receive_queue.  The skb
> is either not shared or new clone created.  The skb_clear_delivery_time()
> is called to clear the delivery_time (if any) before it is queued to the
> sk_receive_queue.
> 
> Another case, the ingress skb is directly copied to the rx_ring
> and tpacket_get_timestamp() is used to get the (rcv) timestamp.
> skb_tstamp() is used in tpacket_get_timestamp() to check
> the skb->mono_delivery_time bit before returning skb->tstamp.
> As mentioned earlier, the tapping@ingress has already expected
> the skb may not have the (rcv) timestamp (because no sk has asked
> for it) and has handled this case by directly calling ktime_get_real().
> 
> In __skb_tstamp_tx, it clones the egress skb and queues the clone to the
> sk_error_queue.  The outgoing skb may have the mono delivery_time while
> the (rcv) timestamp is expected for the clone, so the
> skb->mono_delivery_time bit is also cleared from the clone.
> 
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>  include/linux/skbuff.h | 27 +++++++++++++++++++++++++--
>  net/core/dev.c         |  4 +++-
>  net/core/skbuff.c      |  6 ++++--
>  net/ipv4/ip_input.c    |  1 +
>  net/ipv6/ip6_input.c   |  1 +
>  net/packet/af_packet.c |  4 +++-
>  6 files changed, 37 insertions(+), 6 deletions(-)
> 

> diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
> index 3a025c011971..35311ca75496 100644
> --- a/net/ipv4/ip_input.c
> +++ b/net/ipv4/ip_input.c
> @@ -244,6 +244,7 @@ int ip_local_deliver(struct sk_buff *skb)
>  	 */
>  	struct net *net = dev_net(skb->dev);
>  
> +	skb_clear_delivery_time(skb);

	Is it safe to move this line into ip_local_deliver_finish ?

>  	if (ip_is_fragment(ip_hdr(skb))) {
>  		if (ip_defrag(net, skb, IP_DEFRAG_LOCAL_DELIVER))
>  			return 0;
> diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
> index 80256717868e..84f93864b774 100644
> --- a/net/ipv6/ip6_input.c
> +++ b/net/ipv6/ip6_input.c
> @@ -469,6 +469,7 @@ static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *sk
>  
>  int ip6_input(struct sk_buff *skb)
>  {
> +	skb_clear_delivery_time(skb);

	Is it safe to move this line into ip6_input_finish?
The problem for both cases is that IPVS hooks at LOCAL_IN and
can decide to forward the packet by returning NF_STOLEN and
avoiding the _finish code. In short, before reaching the
_finish code it is still not decided that packet reaches the
sockets.

>  	return NF_HOOK(NFPROTO_IPV6, NF_INET_LOCAL_IN,
>  		       dev_net(skb->dev), NULL, skb, skb->dev, NULL,
>  		       ip6_input_finish);

Regards

--
Julian Anastasov <ja@ssi.bg>

Martin KaFai Lau Jan. 22, 2022, 3:28 a.m. UTC | #2
On Fri, Jan 21, 2022 at 02:02:23PM +0200, Julian Anastasov wrote:
> > diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
> > index 3a025c011971..35311ca75496 100644
> > --- a/net/ipv4/ip_input.c
> > +++ b/net/ipv4/ip_input.c
> > @@ -244,6 +244,7 @@ int ip_local_deliver(struct sk_buff *skb)
> >  	 */
> >  	struct net *net = dev_net(skb->dev);
> >  
> > +	skb_clear_delivery_time(skb);
> 
> 	Is it safe to move this line into ip_local_deliver_finish ?
> 
> >  	if (ip_is_fragment(ip_hdr(skb))) {
> >  		if (ip_defrag(net, skb, IP_DEFRAG_LOCAL_DELIVER))
> >  			return 0;
> > diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
> > index 80256717868e..84f93864b774 100644
> > --- a/net/ipv6/ip6_input.c
> > +++ b/net/ipv6/ip6_input.c
> > @@ -469,6 +469,7 @@ static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *sk
> >  
> >  int ip6_input(struct sk_buff *skb)
> >  {
> > +	skb_clear_delivery_time(skb);
> 
> 	Is it safe to move this line into ip6_input_finish?
> The problem for both cases is that IPVS hooks at LOCAL_IN and
> can decide to forward the packet by returning NF_STOLEN and
> avoiding the _finish code. In short, before reaching the
> _finish code it is still not decided that packet reaches the
> sockets.
hmm...

Theoretically, it should be doable to push it later because the
ingress path cannot assume the (rcv) timestamp is always available,
so it should already expect to handle the 0 case and do ktime_get_real(),
e.g. the tapping case used by af_packet.  The tradeoff is just
a later (rcv) timestamp and also more code churn, e.g.
somewhere in ip_is_fragment() may need to change.

My initial attempt was to call skb_clear_delivery_time()
right after sch_handle_ingress() in dev.c.  However, it did not seem to
take much to make ip[6]_forward work as well, so I pushed it here.  That,
however, makes the other kernel forward paths inconsistent in what they
expect about keeping the delivery_time.

I will give it a try in v4 but not very sure for now before looking
closer.  The worst is to move it back to just after sch_handle_ingress()
so that the kernel forward path will still behave consistently
but I will give it a try first.

Thanks for the review !

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 8de555513b94..4677bb6c7279 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3912,8 +3912,23 @@  static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt,
 					 bool mono)
 {
 	skb->tstamp = kt;
-	/* Setting mono_delivery_time will be enabled later */
-	/* skb->mono_delivery_time = kt && mono; */
+	skb->mono_delivery_time = kt && mono;
+}
+
+DECLARE_STATIC_KEY_FALSE(netstamp_needed_key);
+
+/* skb is delivering locally.  If needed, set it to the (rcv) timestamp.
+ * Otherwise, clear the delivery time.
+ */
+static inline void skb_clear_delivery_time(struct sk_buff *skb)
+{
+	if (unlikely(skb->mono_delivery_time)) {
+		skb->mono_delivery_time = 0;
+		if (static_branch_unlikely(&netstamp_needed_key))
+			skb->tstamp = ktime_get_real();
+		else
+			skb->tstamp = 0;
+	}
 }
 
 static inline void skb_clear_tstamp(struct sk_buff *skb)
@@ -3924,6 +3939,14 @@  static inline void skb_clear_tstamp(struct sk_buff *skb)
 	skb->tstamp = 0;
 }
 
+static inline ktime_t skb_tstamp(const struct sk_buff *skb)
+{
+	if (unlikely(skb->mono_delivery_time))
+		return 0;
+
+	return skb->tstamp;
+}
+
 static inline u8 skb_metadata_len(const struct sk_buff *skb)
 {
 	return skb_shinfo(skb)->meta_len;
diff --git a/net/core/dev.c b/net/core/dev.c
index 84a0d9542fe9..b4b392f5ef9f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2000,7 +2000,8 @@  void net_dec_egress_queue(void)
 EXPORT_SYMBOL_GPL(net_dec_egress_queue);
 #endif
 
-static DEFINE_STATIC_KEY_FALSE(netstamp_needed_key);
+DEFINE_STATIC_KEY_FALSE(netstamp_needed_key);
+EXPORT_SYMBOL(netstamp_needed_key);
 #ifdef CONFIG_JUMP_LABEL
 static atomic_t netstamp_needed_deferred;
 static atomic_t netstamp_wanted;
@@ -2061,6 +2062,7 @@  EXPORT_SYMBOL(net_disable_timestamp);
 static inline void net_timestamp_set(struct sk_buff *skb)
 {
 	skb->tstamp = 0;
+	skb->mono_delivery_time = 0;
 	if (static_branch_unlikely(&netstamp_needed_key))
 		__net_timestamp(skb);
 }
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3e3da8fdf8f5..93dc763da8cb 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4817,10 +4817,12 @@  void __skb_tstamp_tx(struct sk_buff *orig_skb,
 		skb_shinfo(skb)->tskey = skb_shinfo(orig_skb)->tskey;
 	}
 
-	if (hwtstamps)
+	if (hwtstamps) {
 		*skb_hwtstamps(skb) = *hwtstamps;
-	else
+	} else {
 		skb->tstamp = ktime_get_real();
+		skb->mono_delivery_time = 0;
+	}
 
 	__skb_complete_tx_timestamp(skb, sk, tstype, opt_stats);
 }
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 3a025c011971..35311ca75496 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -244,6 +244,7 @@  int ip_local_deliver(struct sk_buff *skb)
 	 */
 	struct net *net = dev_net(skb->dev);
 
+	skb_clear_delivery_time(skb);
 	if (ip_is_fragment(ip_hdr(skb))) {
 		if (ip_defrag(net, skb, IP_DEFRAG_LOCAL_DELIVER))
 			return 0;
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index 80256717868e..84f93864b774 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -469,6 +469,7 @@  static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *sk
 
 int ip6_input(struct sk_buff *skb)
 {
+	skb_clear_delivery_time(skb);
 	return NF_HOOK(NFPROTO_IPV6, NF_INET_LOCAL_IN,
 		       dev_net(skb->dev), NULL, skb, skb->dev, NULL,
 		       ip6_input_finish);
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 5bd409ab4cc2..ab55adff3500 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -460,7 +460,7 @@  static __u32 tpacket_get_timestamp(struct sk_buff *skb, struct timespec64 *ts,
 		return TP_STATUS_TS_RAW_HARDWARE;
 
 	if ((flags & SOF_TIMESTAMPING_SOFTWARE) &&
-	    ktime_to_timespec64_cond(skb->tstamp, ts))
+	    ktime_to_timespec64_cond(skb_tstamp(skb), ts))
 		return TP_STATUS_TS_SOFTWARE;
 
 	return 0;
@@ -2195,6 +2195,7 @@  static int packet_rcv(struct sk_buff *skb, struct net_device *dev,
 	spin_lock(&sk->sk_receive_queue.lock);
 	po->stats.stats1.tp_packets++;
 	sock_skb_set_dropcount(sk, skb);
+	skb_clear_delivery_time(skb);
 	__skb_queue_tail(&sk->sk_receive_queue, skb);
 	spin_unlock(&sk->sk_receive_queue.lock);
 	sk->sk_data_ready(sk);
@@ -2373,6 +2374,7 @@  static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	po->stats.stats1.tp_packets++;
 	if (copy_skb) {
 		status |= TP_STATUS_COPY;
+		skb_clear_delivery_time(copy_skb);
 		__skb_queue_tail(&sk->sk_receive_queue, copy_skb);
 	}
 	spin_unlock(&sk->sk_receive_queue.lock);