diff mbox series

[RFC,v2,net-next] net: Preserve skb delivery time during forward

Message ID 20211215201158.271976-1-kafai@fb.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Headers show
Series [RFC,v2,net-next] net: Preserve skb delivery time during forward | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 6011 this patch: 6011
netdev/cc_maintainers warning 14 maintainers not CCed: bpf@vger.kernel.org nikolay@nvidia.com alobakin@pm.me roopa@nvidia.com yoshfuji@linux-ipv6.org songliubraving@fb.com dsahern@kernel.org bridge@lists.linux-foundation.org kpsingh@kernel.org john.fastabend@gmail.com pablo@netfilter.org yhs@fb.com jonathan.lemon@gmail.com andrii@kernel.org
netdev/build_clang success Errors and warnings before: 1022 this patch: 1022
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 6162 this patch: 6162
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 221 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Martin KaFai Lau Dec. 15, 2021, 8:11 p.m. UTC
The skb->skb_mstamp_ns is used as the EDT (Earliest Department Time)
in TCP.  skb->skb_mstamp_ns is a union member of skb->tstamp.

When the skb traveling veth and being forwarded like below, the skb->tstamp
is reset to 0 at multiple points.

                                                                        (c: skb->tstamp = 0)
                                                                         vv
tcp-sender => veth@netns => veth@hostns(b: rx: skb->tstamp = real_clock) => fq@eth0
                         ^^
                        (a: skb->tstamp = 0)

(a) veth@netns TX to veth@hostns:
    skb->tstamp (mono clock) is a EDT and it is in future time.
    Reset to 0 so that it won't skip the net_timestamp_check at the
    RX side in (b).
(b) RX (netif_rx) in veth@hostns:
    net_timestamp_check puts a current time (real clock) in skb->tstamp.
(c) veth@hostns forward to fq@eth0:
    skb->tstamp is reset back to 0 again because fq is using
    mono clock.

This leads to an unstable TCP throughput issue described by Daniel in [0].

We also have a use case that a bpf runs at ingress@veth@hostns
to set EDT in skb->tstamp to limit the bandwidth usage
of a particular netns.  This EDT currently also gets
reset in step (c) as described above.

Unlike RFC v1 trying to migrate rx tstamp to mono first,
this patch is to preserve the EDT in skb->skb_mstamp_ns during forward.

The idea is to temporarily store skb->skb_mstamp_ns during forward.
skb_shinfo(skb)->hwtstamps is used as a temporary store and
it is union-ed with the newly added "u64 tx_delivery_tstamp".
hwtstamps should only be used when a packet is received or
sent out of a hw device.

During forward, skb->tstamp will be temporarily stored in
skb_shinfo(skb)->tx_delivery_tstamp and a new bit
(SKBTX_DELIVERY_TSTAMP) in skb_shinfo(skb)->tx_flags
will also be set to tell tx_delivery_tstamp is in use.
hwtstamps is accessed through the skb_hwtstamps() getter,
so unlikely(tx_flags & SKBTX_DELIVERY_TSTAMP) can
be tested in there and reset tx_delivery_tstamp to 0
before hwtstamps is used.

After moving the skb->tstamp to skb_shinfo(skb)->tx_delivery_tstamp,
the skb->tstamp will still be reset to 0 during forward.  Thus,
on the RX side (__netif_receive_skb_core), all existing code paths
will still get the received time in real clock and will work as-is.

When this skb finally xmit-ing out in __dev_queue_xmit(),
it will check the SKBTX_DELIVERY_TSTAMP bit in skb_shinfo(skb)->tx_flags
and restore the skb->tstamp from skb_shinfo(skb)->tx_delivery_tstamp
if needed.  This bit test is done immediately after another existing
bit test 'skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP'.

Another bit SKBTX_DELIVERY_TSTAMP_ALLOW_FWD is added
to skb_shinfo(skb)->tx_flags.  It is used to specify
the skb->tstamp is set as a delivery time and can be
temporarily stored during forward.  This bit is now set
when EDT is stored in skb->skb_mstamp_ns in tcp_output.c
This will avoid packet received from a NIC with real-clock
in skb->tstamp being forwarded without reset.

The change in af_packet.c is to avoid it calling skb_hwtstamps()
which will reset the skb_shinfo(skb)->tx_delivery_tstamp.
af_packet.c only wants to read the hwtstamps instead of
storing a time in it, so a new read only getter skb_hwtstamps_ktime()
is added.  Otherwise, a tcpdump will trigger this code path
and unnecessarily reset the EDT stored in tx_delivery_tstamp.

[Note: not all skb->tstamp=0 reset has been changed in this RFC yet]

[0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/skbuff.h  | 52 ++++++++++++++++++++++++++++++++++++++++-
 net/bridge/br_forward.c |  2 +-
 net/core/dev.c          |  1 +
 net/core/filter.c       |  6 ++---
 net/core/skbuff.c       |  2 +-
 net/ipv4/ip_forward.c   |  2 +-
 net/ipv4/tcp_output.c   | 21 +++++++++++------
 net/ipv6/ip6_output.c   |  2 +-
 net/packet/af_packet.c  |  8 +++----
 9 files changed, 77 insertions(+), 19 deletions(-)

Comments

Julian Anastasov Dec. 15, 2021, 10:07 p.m. UTC | #1
Hello,

On Wed, 15 Dec 2021, Martin KaFai Lau wrote:

> diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
> index 00ec819f949b..f216fc97c6ce 100644
> --- a/net/ipv4/ip_forward.c
> +++ b/net/ipv4/ip_forward.c
> @@ -79,7 +79,7 @@ static int ip_forward_finish(struct net *net, struct sock *sk, struct sk_buff *s
>  	if (unlikely(opt->optlen))
>  		ip_forward_options(skb);
>  
> -	skb->tstamp = 0;
> +	skb_scrub_tstamp(skb);

	Just to let you know, you can consider all places in
net/netfilter/ipvs/ip_vs_xmit.c that reset tstamp as a forwarding path
like this one.

>  	return dst_output(net, sk, skb);
>  }

Regards

--
Julian Anastasov <ja@ssi.bg>
Willem de Bruijn Dec. 16, 2021, 3:32 p.m. UTC | #2
h

On Wed, Dec 15, 2021 at 3:12 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> The skb->skb_mstamp_ns is used as the EDT (Earliest Department Time)
> in TCP.  skb->skb_mstamp_ns is a union member of skb->tstamp.
>
> When the skb traveling veth and being forwarded like below, the skb->tstamp
> is reset to 0 at multiple points.
>
>                                                                         (c: skb->tstamp = 0)
>                                                                          vv
> tcp-sender => veth@netns => veth@hostns(b: rx: skb->tstamp = real_clock) => fq@eth0
>                          ^^
>                         (a: skb->tstamp = 0)
>
> (a) veth@netns TX to veth@hostns:
>     skb->tstamp (mono clock) is a EDT and it is in future time.
>     Reset to 0 so that it won't skip the net_timestamp_check at the
>     RX side in (b).
> (b) RX (netif_rx) in veth@hostns:
>     net_timestamp_check puts a current time (real clock) in skb->tstamp.
> (c) veth@hostns forward to fq@eth0:
>     skb->tstamp is reset back to 0 again because fq is using
>     mono clock.
>
> This leads to an unstable TCP throughput issue described by Daniel in [0].
>
> We also have a use case that a bpf runs at ingress@veth@hostns
> to set EDT in skb->tstamp to limit the bandwidth usage
> of a particular netns.  This EDT currently also gets
> reset in step (c) as described above.
>
> Unlike RFC v1 trying to migrate rx tstamp to mono first,
> this patch is to preserve the EDT in skb->skb_mstamp_ns during forward.

Sucks we have to do this complex dance, because there is no room for
an skb->delivery_time.

> The idea is to temporarily store skb->skb_mstamp_ns during forward.
> skb_shinfo(skb)->hwtstamps is used as a temporary store and
> it is union-ed with the newly added "u64 tx_delivery_tstamp".
> hwtstamps should only be used when a packet is received or
> sent out of a hw device.
>
> During forward, skb->tstamp will be temporarily stored in
> skb_shinfo(skb)->tx_delivery_tstamp and a new bit
> (SKBTX_DELIVERY_TSTAMP) in skb_shinfo(skb)->tx_flags
> will also be set to tell tx_delivery_tstamp is in use.
> hwtstamps is accessed through the skb_hwtstamps() getter,
> so unlikely(tx_flags & SKBTX_DELIVERY_TSTAMP) can
> be tested in there and reset tx_delivery_tstamp to 0
> before hwtstamps is used.
>
> After moving the skb->tstamp to skb_shinfo(skb)->tx_delivery_tstamp,
> the skb->tstamp will still be reset to 0 during forward.  Thus,
> on the RX side (__netif_receive_skb_core), all existing code paths
> will still get the received time in real clock and will work as-is.
>
> When this skb finally xmit-ing out in __dev_queue_xmit(),
> it will check the SKBTX_DELIVERY_TSTAMP bit in skb_shinfo(skb)->tx_flags
> and restore the skb->tstamp from skb_shinfo(skb)->tx_delivery_tstamp
> if needed.  This bit test is done immediately after another existing
> bit test 'skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP'.
>
> Another bit SKBTX_DELIVERY_TSTAMP_ALLOW_FWD is added
> to skb_shinfo(skb)->tx_flags.  It is used to specify
> the skb->tstamp is set as a delivery time and can be
> temporarily stored during forward.  This bit is now set
> when EDT is stored in skb->skb_mstamp_ns in tcp_output.c
> This will avoid packet received from a NIC with real-clock
> in skb->tstamp being forwarded without reset.
>
> The change in af_packet.c is to avoid it calling skb_hwtstamps()
> which will reset the skb_shinfo(skb)->tx_delivery_tstamp.
> af_packet.c only wants to read the hwtstamps instead of
> storing a time in it, so a new read only getter skb_hwtstamps_ktime()
> is added.  Otherwise, a tcpdump will trigger this code path
> and unnecessarily reset the EDT stored in tx_delivery_tstamp.
>
> [Note: not all skb->tstamp=0 reset has been changed in this RFC yet]
>
> [0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>  include/linux/skbuff.h  | 52 ++++++++++++++++++++++++++++++++++++++++-
>  net/bridge/br_forward.c |  2 +-
>  net/core/dev.c          |  1 +
>  net/core/filter.c       |  6 ++---
>  net/core/skbuff.c       |  2 +-
>  net/ipv4/ip_forward.c   |  2 +-
>  net/ipv4/tcp_output.c   | 21 +++++++++++------
>  net/ipv6/ip6_output.c   |  2 +-
>  net/packet/af_packet.c  |  8 +++----
>  9 files changed, 77 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 6535294f6a48..9bf0a1e2a1bd 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -435,9 +435,17 @@ enum {
>         /* device driver is going to provide hardware time stamp */
>         SKBTX_IN_PROGRESS = 1 << 2,
>
> +       /* shinfo stores a future tx_delivery_tstamp instead of hwtstamps */
> +       SKBTX_DELIVERY_TSTAMP = 1 << 3,
> +
>         /* generate wifi status information (where possible) */
>         SKBTX_WIFI_STATUS = 1 << 4,
>
> +       /* skb->tstamp stored a future delivery time which
> +        * was set by a local sk and it can be fowarded.
> +        */
> +       SKBTX_DELIVERY_TSTAMP_ALLOW_FWD = 1 << 5,
> +
>         /* generate software time stamp when entering packet scheduling */
>         SKBTX_SCHED_TSTAMP = 1 << 6,
>  };
> @@ -530,7 +538,14 @@ struct skb_shared_info {
>         /* Warning: this field is not always filled in (UFO)! */
>         unsigned short  gso_segs;
>         struct sk_buff  *frag_list;
> -       struct skb_shared_hwtstamps hwtstamps;
> +       union {
> +               /* If SKBTX_DELIVERY_TSTAMP is set in tx_flags,
> +                * tx_delivery_tstamp is stored instead of
> +                * hwtstamps.
> +                */

Should we just encode the timebase and/or type { timestamp,
delivery_time } in th lower bits of the timestamp field? Its
resolution is higher than actual clock precision.

> +               struct skb_shared_hwtstamps hwtstamps;
> +               u64 tx_delivery_tstamp;
> +       };
>         unsigned int    gso_type;
>         u32             tskey;
>
> @@ -1463,9 +1478,44 @@ static inline unsigned int skb_end_offset(const struct sk_buff *skb)
>
>  static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
>  {
> +       if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_DELIVERY_TSTAMP)) {
> +               skb_shinfo(skb)->tx_flags &= ~SKBTX_DELIVERY_TSTAMP;
> +               skb_shinfo(skb)->tx_delivery_tstamp = 0;
> +       }
>         return &skb_shinfo(skb)->hwtstamps;
>  }
>
> +/* Caller only needs to read the hwtstamps as ktime.
> + * To update hwtstamps,  HW device driver should call the writable
> + * version skb_hwtstamps() that returns a pointer.
> + */
> +static inline ktime_t skb_hwtstamps_ktime(const struct sk_buff *skb)
> +{
> +       if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_DELIVERY_TSTAMP))
> +               return 0;
> +       return skb_shinfo(skb)->hwtstamps.hwtstamp;
> +}
> +
> +static inline void skb_scrub_tstamp(struct sk_buff *skb)

skb_save_delivery_time?

is non-zero skb->tstamp test not sufficient, instead of
SKBTX_DELIVERY_TSTAMP_ALLOW_FWD.

It is if only called on the egress path. Is bpf on ingress the only
reason for this?

> +{
> +       if (skb_shinfo(skb)->tx_flags & SKBTX_DELIVERY_TSTAMP_ALLOW_FWD) {
> +               skb_shinfo(skb)->tx_delivery_tstamp = skb->tstamp;
> +               skb_shinfo(skb)->tx_flags |= SKBTX_DELIVERY_TSTAMP;
> +               skb_shinfo(skb)->tx_flags &= ~SKBTX_DELIVERY_TSTAMP_ALLOW_FWD;
> +       }

Is this only called when there are no clones/shares?

> +       skb->tstamp = 0;
> +}
> +
> +static inline void skb_restore_delivery_time(struct sk_buff *skb)
> +{
> +       if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_DELIVERY_TSTAMP)) {
> +               skb->tstamp = skb_shinfo(skb)->tx_delivery_tstamp;
> +               skb_shinfo(skb)->tx_delivery_tstamp = 0;
> +               skb_shinfo(skb)->tx_flags &= ~SKBTX_DELIVERY_TSTAMP;
> +               skb_shinfo(skb)->tx_flags |= SKBTX_DELIVERY_TSTAMP_ALLOW_FWD;
> +       }
> +}
> +
>  static inline struct ubuf_info *skb_zcopy(struct sk_buff *skb)
>  {
>         bool is_zcopy = skb && skb_shinfo(skb)->flags & SKBFL_ZEROCOPY_ENABLE;
> diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
> index ec646656dbf1..a3ba6195f2e3 100644
> --- a/net/bridge/br_forward.c
> +++ b/net/bridge/br_forward.c
> @@ -62,7 +62,7 @@ EXPORT_SYMBOL_GPL(br_dev_queue_push_xmit);
>
>  int br_forward_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
>  {
> -       skb->tstamp = 0;
> +       skb_scrub_tstamp(skb);
>         return NF_HOOK(NFPROTO_BRIDGE, NF_BR_POST_ROUTING,
>                        net, sk, skb, NULL, skb->dev,
>                        br_dev_queue_push_xmit);
> diff --git a/net/core/dev.c b/net/core/dev.c
> index a855e41bbe39..e9e7de758cba 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
Martin KaFai Lau Dec. 16, 2021, 10:23 p.m. UTC | #3
On Thu, Dec 16, 2021 at 10:32:23AM -0500, Willem de Bruijn wrote:
[ ... ]

> >                                                                         (c: skb->tstamp = 0)
> >                                                                          vv
> > tcp-sender => veth@netns => veth@hostns(b: rx: skb->tstamp = real_clock) => fq@eth0
> >                          ^^
> >                         (a: skb->tstamp = 0)
> >
> > (a) veth@netns TX to veth@hostns:
> >     skb->tstamp (mono clock) is a EDT and it is in future time.
> >     Reset to 0 so that it won't skip the net_timestamp_check at the
> >     RX side in (b).
> > (b) RX (netif_rx) in veth@hostns:
> >     net_timestamp_check puts a current time (real clock) in skb->tstamp.
> > (c) veth@hostns forward to fq@eth0:
> >     skb->tstamp is reset back to 0 again because fq is using
> >     mono clock.
> >
> > This leads to an unstable TCP throughput issue described by Daniel in [0].
> >
> > We also have a use case that a bpf runs at ingress@veth@hostns
> > to set EDT in skb->tstamp to limit the bandwidth usage
> > of a particular netns.  This EDT currently also gets
> > reset in step (c) as described above.

[ ... ]

> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 6535294f6a48..9bf0a1e2a1bd 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -435,9 +435,17 @@ enum {
> >         /* device driver is going to provide hardware time stamp */
> >         SKBTX_IN_PROGRESS = 1 << 2,
> >
> > +       /* shinfo stores a future tx_delivery_tstamp instead of hwtstamps */
> > +       SKBTX_DELIVERY_TSTAMP = 1 << 3,
> > +
> >         /* generate wifi status information (where possible) */
> >         SKBTX_WIFI_STATUS = 1 << 4,
> >
> > +       /* skb->tstamp stored a future delivery time which
> > +        * was set by a local sk and it can be fowarded.
> > +        */
> > +       SKBTX_DELIVERY_TSTAMP_ALLOW_FWD = 1 << 5,
> > +
> >         /* generate software time stamp when entering packet scheduling */
> >         SKBTX_SCHED_TSTAMP = 1 << 6,
> >  };
> > @@ -530,7 +538,14 @@ struct skb_shared_info {
> >         /* Warning: this field is not always filled in (UFO)! */
> >         unsigned short  gso_segs;
> >         struct sk_buff  *frag_list;
> > -       struct skb_shared_hwtstamps hwtstamps;
> > +       union {
> > +               /* If SKBTX_DELIVERY_TSTAMP is set in tx_flags,
> > +                * tx_delivery_tstamp is stored instead of
> > +                * hwtstamps.
> > +                */
> 
> Should we just encode the timebase and/or type { timestamp,
> delivery_time } in th lower bits of the timestamp field? Its
> resolution is higher than actual clock precision.
In skb->tstamp ?

> > +               struct skb_shared_hwtstamps hwtstamps;
> > +               u64 tx_delivery_tstamp;
> > +       };
> >         unsigned int    gso_type;
> >         u32             tskey;
> >
> > @@ -1463,9 +1478,44 @@ static inline unsigned int skb_end_offset(const struct sk_buff *skb)
> >
> >  static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
> >  {
> > +       if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_DELIVERY_TSTAMP)) {
> > +               skb_shinfo(skb)->tx_flags &= ~SKBTX_DELIVERY_TSTAMP;
> > +               skb_shinfo(skb)->tx_delivery_tstamp = 0;
> > +       }
> >         return &skb_shinfo(skb)->hwtstamps;
> >  }
> >
> > +/* Caller only needs to read the hwtstamps as ktime.
> > + * To update hwtstamps,  HW device driver should call the writable
> > + * version skb_hwtstamps() that returns a pointer.
> > + */
> > +static inline ktime_t skb_hwtstamps_ktime(const struct sk_buff *skb)
> > +{
> > +       if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_DELIVERY_TSTAMP))
> > +               return 0;
> > +       return skb_shinfo(skb)->hwtstamps.hwtstamp;
> > +}
> > +
> > +static inline void skb_scrub_tstamp(struct sk_buff *skb)
> 
> skb_save_delivery_time?
yep. ok.

> 
> is non-zero skb->tstamp test not sufficient, instead of
> SKBTX_DELIVERY_TSTAMP_ALLOW_FWD.
>
> It is if only called on the egress path. Is bpf on ingress the only
> reason for this?
Ah. ic.  meaning testing non-zero skb->tstamp and then call
skb_save_delivery_time() only during the veth-egress-path:
somewhere in veth_xmit() => veth_forward_skb() but before
skb->tstamp was reset to 0 in __dev_forward_skb().

Keep *_forward() and bpf_out_*() unchanged (i.e. keep skb->tstamp = 0)
because the skb->tstamp could be stamped by net_timestamp_check().

Then SKBTX_DELIVERY_TSTAMP_ALLOW_FWD is not needed.

Did I understand your suggestion correctly?

However, we still need a bit to distinguish tx_delivery_tstamp
from hwtstamps.

> 
> > +{
> > +       if (skb_shinfo(skb)->tx_flags & SKBTX_DELIVERY_TSTAMP_ALLOW_FWD) {
> > +               skb_shinfo(skb)->tx_delivery_tstamp = skb->tstamp;
> > +               skb_shinfo(skb)->tx_flags |= SKBTX_DELIVERY_TSTAMP;
> > +               skb_shinfo(skb)->tx_flags &= ~SKBTX_DELIVERY_TSTAMP_ALLOW_FWD;
> > +       }
> 
> Is this only called when there are no clones/shares?
No, I don't think so.  TCP clone it.  I also started thinking about
this after noticing a mistake in the change in  __tcp_transmit_skb().

There are other places that change tx_flags, e.g. tcp_offload.c.
It is not shared at those places or there is some specific points
in the stack that is safe to change ?

> 
> > +       skb->tstamp = 0;
> > +}
> > +
> > +static inline void skb_restore_delivery_time(struct sk_buff *skb)
> > +{
> > +       if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_DELIVERY_TSTAMP)) {
> > +               skb->tstamp = skb_shinfo(skb)->tx_delivery_tstamp;
> > +               skb_shinfo(skb)->tx_delivery_tstamp = 0;
> > +               skb_shinfo(skb)->tx_flags &= ~SKBTX_DELIVERY_TSTAMP;
> > +               skb_shinfo(skb)->tx_flags |= SKBTX_DELIVERY_TSTAMP_ALLOW_FWD;
> > +       }
> > +}
> > +
> >  static inline struct ubuf_info *skb_zcopy(struct sk_buff *skb)
> >  {
> >         bool is_zcopy = skb && skb_shinfo(skb)->flags & SKBFL_ZEROCOPY_ENABLE;
> > diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
> > index ec646656dbf1..a3ba6195f2e3 100644
> > --- a/net/bridge/br_forward.c
> > +++ b/net/bridge/br_forward.c
> > @@ -62,7 +62,7 @@ EXPORT_SYMBOL_GPL(br_dev_queue_push_xmit);
> >
> >  int br_forward_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
> >  {
> > -       skb->tstamp = 0;
> > +       skb_scrub_tstamp(skb);
> >         return NF_HOOK(NFPROTO_BRIDGE, NF_BR_POST_ROUTING,
> >                        net, sk, skb, NULL, skb->dev,
> >                        br_dev_queue_push_xmit);
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index a855e41bbe39..e9e7de758cba 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
Willem de Bruijn Dec. 16, 2021, 10:58 p.m. UTC | #4
> > > @@ -530,7 +538,14 @@ struct skb_shared_info {
> > >         /* Warning: this field is not always filled in (UFO)! */
> > >         unsigned short  gso_segs;
> > >         struct sk_buff  *frag_list;
> > > -       struct skb_shared_hwtstamps hwtstamps;
> > > +       union {
> > > +               /* If SKBTX_DELIVERY_TSTAMP is set in tx_flags,
> > > +                * tx_delivery_tstamp is stored instead of
> > > +                * hwtstamps.
> > > +                */
> >
> > Should we just encode the timebase and/or type { timestamp,
> > delivery_time } in th lower bits of the timestamp field? Its
> > resolution is higher than actual clock precision.
> In skb->tstamp ?

Yes. Arguably a hack, but those bits are in the noise now, and it
avoids the clone issue with skb_shinfo (and scarcity of flag bits
there).

> >
> > is non-zero skb->tstamp test not sufficient, instead of
> > SKBTX_DELIVERY_TSTAMP_ALLOW_FWD.
> >
> > It is if only called on the egress path. Is bpf on ingress the only
> > reason for this?
> Ah. ic.  meaning testing non-zero skb->tstamp and then call
> skb_save_delivery_time() only during the veth-egress-path:
> somewhere in veth_xmit() => veth_forward_skb() but before
> skb->tstamp was reset to 0 in __dev_forward_skb().

Right. If delivery_time is the only use of skb->tstamp on egress, and
timestamp is the only use on ingress, then the only time the
delivery_time needs to be cached if when looping from egress to
ingress and this field is non-zero.

>
> Keep *_forward() and bpf_out_*() unchanged (i.e. keep skb->tstamp = 0)
> because the skb->tstamp could be stamped by net_timestamp_check().
>
> Then SKBTX_DELIVERY_TSTAMP_ALLOW_FWD is not needed.
>
> Did I understand your suggestion correctly?

I think so.

But the reality is complicated if something may be setting a delivery
time on ingress (a BPF filter?)
>
> However, we still need a bit to distinguish tx_delivery_tstamp
> from hwtstamps.
>
> >
> > > +{
> > > +       if (skb_shinfo(skb)->tx_flags & SKBTX_DELIVERY_TSTAMP_ALLOW_FWD) {
> > > +               skb_shinfo(skb)->tx_delivery_tstamp = skb->tstamp;
> > > +               skb_shinfo(skb)->tx_flags |= SKBTX_DELIVERY_TSTAMP;
> > > +               skb_shinfo(skb)->tx_flags &= ~SKBTX_DELIVERY_TSTAMP_ALLOW_FWD;
> > > +       }
> >
> > Is this only called when there are no clones/shares?
> No, I don't think so.  TCP clone it.  I also started thinking about
> this after noticing a mistake in the change in  __tcp_transmit_skb().
>
> There are other places that change tx_flags, e.g. tcp_offload.c.
> It is not shared at those places or there is some specific points
> in the stack that is safe to change ?

The packet probably is not yet shared. Until the TCP stack gives a
packet to the IP layer, it can treat it as exclusive.

Though it does seem that these fields are accessed in a possibly racy
manner. Drivers with hardware tx timestamp offload may set
skb_shinfo(orig_skb)->tx_flags & SKBTX_IN_PROGRESS without checking
whether the skb may be cloned.
Daniel Borkmann Dec. 16, 2021, 11:42 p.m. UTC | #5
On 12/16/21 11:58 PM, Willem de Bruijn wrote:
>>>> @@ -530,7 +538,14 @@ struct skb_shared_info {
>>>>          /* Warning: this field is not always filled in (UFO)! */
>>>>          unsigned short  gso_segs;
>>>>          struct sk_buff  *frag_list;
>>>> -       struct skb_shared_hwtstamps hwtstamps;
>>>> +       union {
>>>> +               /* If SKBTX_DELIVERY_TSTAMP is set in tx_flags,
>>>> +                * tx_delivery_tstamp is stored instead of
>>>> +                * hwtstamps.
>>>> +                */
>>>
>>> Should we just encode the timebase and/or type { timestamp,
>>> delivery_time } in th lower bits of the timestamp field? Its
>>> resolution is higher than actual clock precision.
>> In skb->tstamp ?
> 
> Yes. Arguably a hack, but those bits are in the noise now, and it
> avoids the clone issue with skb_shinfo (and scarcity of flag bits
> there).
> 
>>> is non-zero skb->tstamp test not sufficient, instead of
>>> SKBTX_DELIVERY_TSTAMP_ALLOW_FWD.
>>>
>>> It is if only called on the egress path. Is bpf on ingress the only
>>> reason for this?
>> Ah. ic.  meaning testing non-zero skb->tstamp and then call
>> skb_save_delivery_time() only during the veth-egress-path:
>> somewhere in veth_xmit() => veth_forward_skb() but before
>> skb->tstamp was reset to 0 in __dev_forward_skb().
> 
> Right. If delivery_time is the only use of skb->tstamp on egress, and
> timestamp is the only use on ingress, then the only time the
> delivery_time needs to be cached if when looping from egress to
> ingress and this field is non-zero.
> 
>> Keep *_forward() and bpf_out_*() unchanged (i.e. keep skb->tstamp = 0)
>> because the skb->tstamp could be stamped by net_timestamp_check().
>>
>> Then SKBTX_DELIVERY_TSTAMP_ALLOW_FWD is not needed.
>>
>> Did I understand your suggestion correctly?
> 
> I think so.
> 
> But the reality is complicated if something may be setting a delivery
> time on ingress (a BPF filter?)

I'm not quite following the 'bpf_out_*() unchanged (i.e. keep skb->tstamp = 0)'
part yet; in our case we would need to preserve it as well, for example, we are
redirecting via bpf from bpf@tc-ingress@host-veth to bpf@tc-egress@phys-dev in
the egress path and fq sits on phys-dev.. (I mean if needed we could easily do
that as shown in my prev diff with a flag for the helper).

>> However, we still need a bit to distinguish tx_delivery_tstamp
>> from hwtstamps.
>>
>>>
>>>> +{
>>>> +       if (skb_shinfo(skb)->tx_flags & SKBTX_DELIVERY_TSTAMP_ALLOW_FWD) {
>>>> +               skb_shinfo(skb)->tx_delivery_tstamp = skb->tstamp;
>>>> +               skb_shinfo(skb)->tx_flags |= SKBTX_DELIVERY_TSTAMP;
>>>> +               skb_shinfo(skb)->tx_flags &= ~SKBTX_DELIVERY_TSTAMP_ALLOW_FWD;
>>>> +       }
>>>
>>> Is this only called when there are no clones/shares?
>> No, I don't think so.  TCP clone it.  I also started thinking about
>> this after noticing a mistake in the change in  __tcp_transmit_skb().
>>
>> There are other places that change tx_flags, e.g. tcp_offload.c.
>> It is not shared at those places or there is some specific points
>> in the stack that is safe to change ?
> 
> The packet probably is not yet shared. Until the TCP stack gives a
> packet to the IP layer, it can treat it as exclusive.
> 
> Though it does seem that these fields are accessed in a possibly racy
> manner. Drivers with hardware tx timestamp offload may set
> skb_shinfo(orig_skb)->tx_flags & SKBTX_IN_PROGRESS without checking
> whether the skb may be cloned.
Martin KaFai Lau Dec. 17, 2021, 12:33 a.m. UTC | #6
On Thu, Dec 16, 2021 at 05:58:49PM -0500, Willem de Bruijn wrote:
> > > > @@ -530,7 +538,14 @@ struct skb_shared_info {
> > > >         /* Warning: this field is not always filled in (UFO)! */
> > > >         unsigned short  gso_segs;
> > > >         struct sk_buff  *frag_list;
> > > > -       struct skb_shared_hwtstamps hwtstamps;
> > > > +       union {
> > > > +               /* If SKBTX_DELIVERY_TSTAMP is set in tx_flags,
> > > > +                * tx_delivery_tstamp is stored instead of
> > > > +                * hwtstamps.
> > > > +                */
> > >
> > > Should we just encode the timebase and/or type { timestamp,
> > > delivery_time } in th lower bits of the timestamp field? Its
> > > resolution is higher than actual clock precision.
> > In skb->tstamp ?
> 
> Yes. Arguably a hack, but those bits are in the noise now, and it
> avoids the clone issue with skb_shinfo (and scarcity of flag bits
> there).
> 
> > >
> > > is non-zero skb->tstamp test not sufficient, instead of
> > > SKBTX_DELIVERY_TSTAMP_ALLOW_FWD.
> > >
> > > It is if only called on the egress path. Is bpf on ingress the only
> > > reason for this?
> > Ah. ic.  meaning testing non-zero skb->tstamp and then call
> > skb_save_delivery_time() only during the veth-egress-path:
> > somewhere in veth_xmit() => veth_forward_skb() but before
> > skb->tstamp was reset to 0 in __dev_forward_skb().
> 
> Right. If delivery_time is the only use of skb->tstamp on egress, and
> timestamp is the only use on ingress, then the only time the
> delivery_time needs to be cached if when looping from egress to
> ingress and this field is non-zero.
> 
> >
> > Keep *_forward() and bpf_out_*() unchanged (i.e. keep skb->tstamp = 0)
> > because the skb->tstamp could be stamped by net_timestamp_check().
> >
> > Then SKBTX_DELIVERY_TSTAMP_ALLOW_FWD is not needed.
> >
> > Did I understand your suggestion correctly?
> 
> I think so.
> 
> But the reality is complicated if something may be setting a delivery
> time on ingress (a BPF filter?)
If bpf@ingress needs to set a delivery_time, the only reasonable
usecase is to finally egress it out by calling bpf_redirect_neigh().
One option is to have a new bpf_redirect_*() helper to take the fifth
'delivery_time' argument and have the skb_do_redirect() to set
the delivery_time in skb.  An extra helper is not ideal but probably
acceptable considering other tricky constraints we are working with.

Another potential issue is,
after looping from egress to ingress, the skb->tstamp has the delivery_time.
If it is passing up to the stack, it needs to be reset back to
timestamp (skb->tstamp = ktime_get_real()) before it is used.
Not sure what is the best place to do it (?) ....hmm 

> >
> > However, we still need a bit to distinguish tx_delivery_tstamp
> > from hwtstamps.
> >
> > >
> > > > +{
> > > > +       if (skb_shinfo(skb)->tx_flags & SKBTX_DELIVERY_TSTAMP_ALLOW_FWD) {
> > > > +               skb_shinfo(skb)->tx_delivery_tstamp = skb->tstamp;
> > > > +               skb_shinfo(skb)->tx_flags |= SKBTX_DELIVERY_TSTAMP;
> > > > +               skb_shinfo(skb)->tx_flags &= ~SKBTX_DELIVERY_TSTAMP_ALLOW_FWD;
> > > > +       }
> > >
> > > Is this only called when there are no clones/shares?
> > No, I don't think so.  TCP clone it.  I also started thinking about
> > this after noticing a mistake in the change in  __tcp_transmit_skb().
> >
> > There are other places that change tx_flags, e.g. tcp_offload.c.
> > It is not shared at those places or there is some specific points
> > in the stack that is safe to change ?
> 
> The packet probably is not yet shared. Until the TCP stack gives a
> packet to the IP layer, it can treat it as exclusive.
> 
> Though it does seem that these fields are accessed in a possibly racy
> manner. Drivers with hardware tx timestamp offload may set
> skb_shinfo(orig_skb)->tx_flags & SKBTX_IN_PROGRESS without checking
> whether the skb may be cloned.
Martin KaFai Lau Dec. 17, 2021, 7:33 a.m. UTC | #7
On Fri, Dec 17, 2021 at 12:42:30AM +0100, Daniel Borkmann wrote:
> On 12/16/21 11:58 PM, Willem de Bruijn wrote:
> > > > > @@ -530,7 +538,14 @@ struct skb_shared_info {
> > > > >          /* Warning: this field is not always filled in (UFO)! */
> > > > >          unsigned short  gso_segs;
> > > > >          struct sk_buff  *frag_list;
> > > > > -       struct skb_shared_hwtstamps hwtstamps;
> > > > > +       union {
> > > > > +               /* If SKBTX_DELIVERY_TSTAMP is set in tx_flags,
> > > > > +                * tx_delivery_tstamp is stored instead of
> > > > > +                * hwtstamps.
> > > > > +                */
> > > > 
> > > > Should we just encode the timebase and/or type { timestamp,
> > > > delivery_time } in th lower bits of the timestamp field? Its
> > > > resolution is higher than actual clock precision.
> > > In skb->tstamp ?
> > 
> > Yes. Arguably a hack, but those bits are in the noise now, and it
> > avoids the clone issue with skb_shinfo (and scarcity of flag bits
> > there).
> > 
> > > > is non-zero skb->tstamp test not sufficient, instead of
> > > > SKBTX_DELIVERY_TSTAMP_ALLOW_FWD.
> > > > 
> > > > It is if only called on the egress path. Is bpf on ingress the only
> > > > reason for this?
> > > Ah. ic.  meaning testing non-zero skb->tstamp and then call
> > > skb_save_delivery_time() only during the veth-egress-path:
> > > somewhere in veth_xmit() => veth_forward_skb() but before
> > > skb->tstamp was reset to 0 in __dev_forward_skb().
> > 
> > Right. If delivery_time is the only use of skb->tstamp on egress, and
> > timestamp is the only use on ingress, then the only time the
> > delivery_time needs to be cached if when looping from egress to
> > ingress and this field is non-zero.
> > 
> > > Keep *_forward() and bpf_out_*() unchanged (i.e. keep skb->tstamp = 0)
> > > because the skb->tstamp could be stamped by net_timestamp_check().
> > > 
> > > Then SKBTX_DELIVERY_TSTAMP_ALLOW_FWD is not needed.
> > > 
> > > Did I understand your suggestion correctly?
> > 
> > I think so.
> > 
> > But the reality is complicated if something may be setting a delivery
> > time on ingress (a BPF filter?)
> 
> I'm not quite following the 'bpf_out_*() unchanged (i.e. keep skb->tstamp = 0)'
> part yet; in our case we would need to preserve it as well, for example, we are
> redirecting via bpf from bpf@tc-ingress@host-veth to bpf@tc-egress@phys-dev in
> the egress path and fq sits on phys-dev.. (I mean if needed we could easily do
> that as shown in my prev diff with a flag for the helper).
Right, we have the same use case:
    redirecting from bpf@tc-ingress@host-veth to bpf@tc-egress@phys-dev in
    the egress path and fq sits on phys-dev

My earlier comment was on having the delivery_time preserved in
the skb_shared_hwtstamps.  The delivery_time (e.g. EDT) and
timestamp (timestamp as RX timestamp) are separately stored when
looping from veth-egress to veth-ingress:

	delivery_time in skb_shared_hwtstamps
	rx timestamp in skb->tstamp

Thus, when bpf_redirect_neigh(phys-dev) happens, bpf_out_*() can
continue to reset skb->tstamp as-is while delivery_time will
automatically be kept in skb_shared_hwtstamps.  When the skb
reaches the egress@phys-dev (__dev_queue_xmit), the delivery_time
in skb_shared_hwtstamps will be restored into skb->tstamp (done
in skb_restore_delivery_time in this patch).

> > > However, we still need a bit to distinguish tx_delivery_tstamp
> > > from hwtstamps.
> > > 
> > > > 
> > > > > +{
> > > > > +       if (skb_shinfo(skb)->tx_flags & SKBTX_DELIVERY_TSTAMP_ALLOW_FWD) {
> > > > > +               skb_shinfo(skb)->tx_delivery_tstamp = skb->tstamp;
> > > > > +               skb_shinfo(skb)->tx_flags |= SKBTX_DELIVERY_TSTAMP;
> > > > > +               skb_shinfo(skb)->tx_flags &= ~SKBTX_DELIVERY_TSTAMP_ALLOW_FWD;
> > > > > +       }
> > > > 
> > > > Is this only called when there are no clones/shares?
> > > No, I don't think so.  TCP clone it.  I also started thinking about
> > > this after noticing a mistake in the change in  __tcp_transmit_skb().
> > > 
> > > There are other places that change tx_flags, e.g. tcp_offload.c.
> > > It is not shared at those places or there is some specific points
> > > in the stack that is safe to change ?
> > 
> > The packet probably is not yet shared. Until the TCP stack gives a
> > packet to the IP layer, it can treat it as exclusive.
> > 
> > Though it does seem that these fields are accessed in a possibly racy
> > manner. Drivers with hardware tx timestamp offload may set
> > skb_shinfo(orig_skb)->tx_flags & SKBTX_IN_PROGRESS without checking
> > whether the skb may be cloned.
Daniel Borkmann Dec. 17, 2021, 11:13 a.m. UTC | #8
On 12/17/21 8:33 AM, Martin KaFai Lau wrote:
> On Fri, Dec 17, 2021 at 12:42:30AM +0100, Daniel Borkmann wrote:
>> On 12/16/21 11:58 PM, Willem de Bruijn wrote:
>>>>>> @@ -530,7 +538,14 @@ struct skb_shared_info {
>>>>>>           /* Warning: this field is not always filled in (UFO)! */
>>>>>>           unsigned short  gso_segs;
>>>>>>           struct sk_buff  *frag_list;
>>>>>> -       struct skb_shared_hwtstamps hwtstamps;
>>>>>> +       union {
>>>>>> +               /* If SKBTX_DELIVERY_TSTAMP is set in tx_flags,
>>>>>> +                * tx_delivery_tstamp is stored instead of
>>>>>> +                * hwtstamps.
>>>>>> +                */
>>>>>
>>>>> Should we just encode the timebase and/or type { timestamp,
>>>>> delivery_time } in th lower bits of the timestamp field? Its
>>>>> resolution is higher than actual clock precision.
>>>> In skb->tstamp ?
>>>
>>> Yes. Arguably a hack, but those bits are in the noise now, and it
>>> avoids the clone issue with skb_shinfo (and scarcity of flag bits
>>> there).
>>>
>>>>> is non-zero skb->tstamp test not sufficient, instead of
>>>>> SKBTX_DELIVERY_TSTAMP_ALLOW_FWD.
>>>>>
>>>>> It is if only called on the egress path. Is bpf on ingress the only
>>>>> reason for this?
>>>> Ah. ic.  meaning testing non-zero skb->tstamp and then call
>>>> skb_save_delivery_time() only during the veth-egress-path:
>>>> somewhere in veth_xmit() => veth_forward_skb() but before
>>>> skb->tstamp was reset to 0 in __dev_forward_skb().
>>>
>>> Right. If delivery_time is the only use of skb->tstamp on egress, and
>>> timestamp is the only use on ingress, then the only time the
>>> delivery_time needs to be cached if when looping from egress to
>>> ingress and this field is non-zero.
>>>
>>>> Keep *_forward() and bpf_out_*() unchanged (i.e. keep skb->tstamp = 0)
>>>> because the skb->tstamp could be stamped by net_timestamp_check().
>>>>
>>>> Then SKBTX_DELIVERY_TSTAMP_ALLOW_FWD is not needed.
>>>>
>>>> Did I understand your suggestion correctly?
>>>
>>> I think so.
>>>
>>> But the reality is complicated if something may be setting a delivery
>>> time on ingress (a BPF filter?)
>>
>> I'm not quite following the 'bpf_out_*() unchanged (i.e. keep skb->tstamp = 0)'
>> part yet; in our case we would need to preserve it as well, for example, we are
>> redirecting via bpf from bpf@tc-ingress@host-veth to bpf@tc-egress@phys-dev in
>> the egress path and fq sits on phys-dev.. (I mean if needed we could easily do
>> that as shown in my prev diff with a flag for the helper).
> Right, we have the same use case:
>      redirecting from bpf@tc-ingress@host-veth to bpf@tc-egress@phys-dev in
>      the egress path and fq sits on phys-dev
> 
> My earlier comment was on having the delivery_time preserved in
> the skb_shared_hwtstamps.  The delivery_time (e.g. EDT) and
> timestamp (timestamp as RX timestamp) are separately stored when
> looping from veth-egress to veth-ingress:
> 
> 	delivery_time in skb_shared_hwtstamps
> 	rx timestamp in skb->tstamp
> 
> Thus, when bpf_redirect_neigh(phys-dev) happens, bpf_out_*() can
> continue to reset skb->tstamp as-is while delivery_time will
> automatically be kept in skb_shared_hwtstamps.  When the skb
> reaches the egress@phys-dev (__dev_queue_xmit), the delivery_time
> in skb_shared_hwtstamps will be restored into skb->tstamp (done
> in skb_restore_delivery_time in this patch).

I think that could probably work, also in particular if it's restored once
on stacked devs e.g. when instead of phys-dev we're dealing with upper
tunnel dev (e.g. vxlan/geneve + BPF with collect_md). Wouldn't we still
need something like SKBTX_DELIVERY_TSTAMP_ALLOW_FWD, e.g. when the phys
driver sets skb_hwtstamps(skb)->hwtstamp on RX, and this gets carried on
the ingress path to the target namespaces' socket?

Thanks,
Daniel
Martin KaFai Lau Dec. 17, 2021, 5:57 p.m. UTC | #9
On Fri, Dec 17, 2021 at 12:13:14PM +0100, Daniel Borkmann wrote:
> On 12/17/21 8:33 AM, Martin KaFai Lau wrote:
> > On Fri, Dec 17, 2021 at 12:42:30AM +0100, Daniel Borkmann wrote:
> > > On 12/16/21 11:58 PM, Willem de Bruijn wrote:
> > > > > > > @@ -530,7 +538,14 @@ struct skb_shared_info {
> > > > > > >           /* Warning: this field is not always filled in (UFO)! */
> > > > > > >           unsigned short  gso_segs;
> > > > > > >           struct sk_buff  *frag_list;
> > > > > > > -       struct skb_shared_hwtstamps hwtstamps;
> > > > > > > +       union {
> > > > > > > +               /* If SKBTX_DELIVERY_TSTAMP is set in tx_flags,
> > > > > > > +                * tx_delivery_tstamp is stored instead of
> > > > > > > +                * hwtstamps.
> > > > > > > +                */
> > > > > > 
> > > > > > Should we just encode the timebase and/or type { timestamp,
> > > > > > delivery_time } in th lower bits of the timestamp field? Its
> > > > > > resolution is higher than actual clock precision.
> > > > > In skb->tstamp ?
> > > > 
> > > > Yes. Arguably a hack, but those bits are in the noise now, and it
> > > > avoids the clone issue with skb_shinfo (and scarcity of flag bits
> > > > there).
> > > > 
> > > > > > is non-zero skb->tstamp test not sufficient, instead of
> > > > > > SKBTX_DELIVERY_TSTAMP_ALLOW_FWD.
> > > > > > 
> > > > > > It is if only called on the egress path. Is bpf on ingress the only
> > > > > > reason for this?
> > > > > Ah. ic.  meaning testing non-zero skb->tstamp and then call
> > > > > skb_save_delivery_time() only during the veth-egress-path:
> > > > > somewhere in veth_xmit() => veth_forward_skb() but before
> > > > > skb->tstamp was reset to 0 in __dev_forward_skb().
> > > > 
> > > > Right. If delivery_time is the only use of skb->tstamp on egress, and
> > > > timestamp is the only use on ingress, then the only time the
> > > > delivery_time needs to be cached if when looping from egress to
> > > > ingress and this field is non-zero.
> > > > 
> > > > > Keep *_forward() and bpf_out_*() unchanged (i.e. keep skb->tstamp = 0)
> > > > > because the skb->tstamp could be stamped by net_timestamp_check().
> > > > > 
> > > > > Then SKBTX_DELIVERY_TSTAMP_ALLOW_FWD is not needed.
> > > > > 
> > > > > Did I understand your suggestion correctly?
> > > > 
> > > > I think so.
> > > > 
> > > > But the reality is complicated if something may be setting a delivery
> > > > time on ingress (a BPF filter?)
> > > 
> > > I'm not quite following the 'bpf_out_*() unchanged (i.e. keep skb->tstamp = 0)'
> > > part yet; in our case we would need to preserve it as well, for example, we are
> > > redirecting via bpf from bpf@tc-ingress@host-veth to bpf@tc-egress@phys-dev in
> > > the egress path and fq sits on phys-dev.. (I mean if needed we could easily do
> > > that as shown in my prev diff with a flag for the helper).
> > Right, we have the same use case:
> >      redirecting from bpf@tc-ingress@host-veth to bpf@tc-egress@phys-dev in
> >      the egress path and fq sits on phys-dev
> > 
> > My earlier comment was on having the delivery_time preserved in
> > the skb_shared_hwtstamps.  The delivery_time (e.g. EDT) and
> > timestamp (timestamp as RX timestamp) are separately stored when
> > looping from veth-egress to veth-ingress:
> > 
> > 	delivery_time in skb_shared_hwtstamps
> > 	rx timestamp in skb->tstamp
> > 
> > Thus, when bpf_redirect_neigh(phys-dev) happens, bpf_out_*() can
> > continue to reset skb->tstamp as-is while delivery_time will
> > automatically be kept in skb_shared_hwtstamps.  When the skb
> > reaches the egress@phys-dev (__dev_queue_xmit), the delivery_time
> > in skb_shared_hwtstamps will be restored into skb->tstamp (done
> > in skb_restore_delivery_time in this patch).
> 
> I think that could probably work, also in particular if it's restored once
> on stacked devs e.g. when instead of phys-dev we're dealing with upper
> tunnel dev (e.g. vxlan/geneve + BPF with collect_md). Wouldn't we still
> need something like SKBTX_DELIVERY_TSTAMP_ALLOW_FWD, e.g. when the phys
> driver sets skb_hwtstamps(skb)->hwtstamp on RX, and this gets carried on
> the ingress path to the target namespaces' socket?
In the opposite direction phys-dev => host-veth => netns-veth,
the phys driver can set the hwtstamp but it won't set the
SKBTX_DELIVERY_TSTAMP.  This hwstamp gets carried onto the ingress
path of the target namesapce which I think is the current behavior
also ?

From phys-dev => host-veth, the skb is forwarded and its
skb->tstamp reset to 0.  veth_xmit() won't save zero
skb->tstamp into hwstamp and SKBTX_DELIVERY_TSTAMP won't
be set also.
diff mbox series

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6535294f6a48..9bf0a1e2a1bd 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -435,9 +435,17 @@  enum {
 	/* device driver is going to provide hardware time stamp */
 	SKBTX_IN_PROGRESS = 1 << 2,
 
+	/* shinfo stores a future tx_delivery_tstamp instead of hwtstamps */
+	SKBTX_DELIVERY_TSTAMP = 1 << 3,
+
 	/* generate wifi status information (where possible) */
 	SKBTX_WIFI_STATUS = 1 << 4,
 
+	/* skb->tstamp stored a future delivery time which
+	 * was set by a local sk and it can be fowarded.
+	 */
+	SKBTX_DELIVERY_TSTAMP_ALLOW_FWD = 1 << 5,
+
 	/* generate software time stamp when entering packet scheduling */
 	SKBTX_SCHED_TSTAMP = 1 << 6,
 };
@@ -530,7 +538,14 @@  struct skb_shared_info {
 	/* Warning: this field is not always filled in (UFO)! */
 	unsigned short	gso_segs;
 	struct sk_buff	*frag_list;
-	struct skb_shared_hwtstamps hwtstamps;
+	union {
+		/* If SKBTX_DELIVERY_TSTAMP is set in tx_flags,
+		 * tx_delivery_tstamp is stored instead of
+		 * hwtstamps.
+		 */
+		struct skb_shared_hwtstamps hwtstamps;
+		u64 tx_delivery_tstamp;
+	};
 	unsigned int	gso_type;
 	u32		tskey;
 
@@ -1463,9 +1478,44 @@  static inline unsigned int skb_end_offset(const struct sk_buff *skb)
 
 static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
 {
+	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_DELIVERY_TSTAMP)) {
+		skb_shinfo(skb)->tx_flags &= ~SKBTX_DELIVERY_TSTAMP;
+		skb_shinfo(skb)->tx_delivery_tstamp = 0;
+	}
 	return &skb_shinfo(skb)->hwtstamps;
 }
 
+/* Caller only needs to read the hwtstamps as ktime.
+ * To update hwtstamps,  HW device driver should call the writable
+ * version skb_hwtstamps() that returns a pointer.
+ */
+static inline ktime_t skb_hwtstamps_ktime(const struct sk_buff *skb)
+{
+	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_DELIVERY_TSTAMP))
+		return 0;
+	return skb_shinfo(skb)->hwtstamps.hwtstamp;
+}
+
+static inline void skb_scrub_tstamp(struct sk_buff *skb)
+{
+	if (skb_shinfo(skb)->tx_flags & SKBTX_DELIVERY_TSTAMP_ALLOW_FWD) {
+		skb_shinfo(skb)->tx_delivery_tstamp = skb->tstamp;
+		skb_shinfo(skb)->tx_flags |= SKBTX_DELIVERY_TSTAMP;
+		skb_shinfo(skb)->tx_flags &= ~SKBTX_DELIVERY_TSTAMP_ALLOW_FWD;
+	}
+	skb->tstamp = 0;
+}
+
+static inline void skb_restore_delivery_time(struct sk_buff *skb)
+{
+	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_DELIVERY_TSTAMP)) {
+		skb->tstamp = skb_shinfo(skb)->tx_delivery_tstamp;
+		skb_shinfo(skb)->tx_delivery_tstamp = 0;
+		skb_shinfo(skb)->tx_flags &= ~SKBTX_DELIVERY_TSTAMP;
+		skb_shinfo(skb)->tx_flags |= SKBTX_DELIVERY_TSTAMP_ALLOW_FWD;
+	}
+}
+
 static inline struct ubuf_info *skb_zcopy(struct sk_buff *skb)
 {
 	bool is_zcopy = skb && skb_shinfo(skb)->flags & SKBFL_ZEROCOPY_ENABLE;
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index ec646656dbf1..a3ba6195f2e3 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -62,7 +62,7 @@  EXPORT_SYMBOL_GPL(br_dev_queue_push_xmit);
 
 int br_forward_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
-	skb->tstamp = 0;
+	skb_scrub_tstamp(skb);
 	return NF_HOOK(NFPROTO_BRIDGE, NF_BR_POST_ROUTING,
 		       net, sk, skb, NULL, skb->dev,
 		       br_dev_queue_push_xmit);
diff --git a/net/core/dev.c b/net/core/dev.c
index a855e41bbe39..e9e7de758cba 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4039,6 +4039,7 @@  static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 
 	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
 		__skb_tstamp_tx(skb, NULL, NULL, skb->sk, SCM_TSTAMP_SCHED);
+	skb_restore_delivery_time(skb);
 
 	/* Disable soft irqs for various locks below. Also
 	 * stops preemption for RCU.
diff --git a/net/core/filter.c b/net/core/filter.c
index e4cc3aff5bf7..7c3c6abc25a9 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2107,7 +2107,7 @@  static inline int __bpf_tx_skb(struct net_device *dev, struct sk_buff *skb)
 	}
 
 	skb->dev = dev;
-	skb->tstamp = 0;
+	skb_scrub_tstamp(skb);
 
 	dev_xmit_recursion_inc();
 	ret = dev_queue_xmit(skb);
@@ -2176,7 +2176,7 @@  static int bpf_out_neigh_v6(struct net *net, struct sk_buff *skb,
 	}
 
 	skb->dev = dev;
-	skb->tstamp = 0;
+	skb_scrub_tstamp(skb);
 
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
 		skb = skb_expand_head(skb, hh_len);
@@ -2274,7 +2274,7 @@  static int bpf_out_neigh_v4(struct net *net, struct sk_buff *skb,
 	}
 
 	skb->dev = dev;
-	skb->tstamp = 0;
+	skb_scrub_tstamp(skb);
 
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
 		skb = skb_expand_head(skb, hh_len);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a33247fdb8f5..a53ca773afb5 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5324,7 +5324,7 @@  void skb_scrub_packet(struct sk_buff *skb, bool xnet)
 
 	ipvs_reset(skb);
 	skb->mark = 0;
-	skb->tstamp = 0;
+	skb_scrub_tstamp(skb);
 }
 EXPORT_SYMBOL_GPL(skb_scrub_packet);
 
diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 00ec819f949b..f216fc97c6ce 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -79,7 +79,7 @@  static int ip_forward_finish(struct net *net, struct sock *sk, struct sk_buff *s
 	if (unlikely(opt->optlen))
 		ip_forward_options(skb);
 
-	skb->tstamp = 0;
+	skb_scrub_tstamp(skb);
 	return dst_output(net, sk, skb);
 }
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5079832af5c1..d5524ea5fcfa 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1223,6 +1223,12 @@  INDIRECT_CALLABLE_DECLARE(int ip_queue_xmit(struct sock *sk, struct sk_buff *skb
 INDIRECT_CALLABLE_DECLARE(int inet6_csk_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl));
 INDIRECT_CALLABLE_DECLARE(void tcp_v4_send_check(struct sock *sk, struct sk_buff *skb));
 
+static void tcp_set_skb_edt(struct sk_buff *skb, u64 edt)
+{
+	skb->skb_mstamp_ns = edt;
+	skb_shinfo(skb)->tx_flags |= SKBTX_DELIVERY_TSTAMP_ALLOW_FWD;
+}
+
 /* This routine actually transmits TCP packets queued in by
  * tcp_do_sendmsg().  This is used by both the initial
  * transmission and possible later retransmissions.
@@ -1253,7 +1259,7 @@  static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
 	tp = tcp_sk(sk);
 	prior_wstamp = tp->tcp_wstamp_ns;
 	tp->tcp_wstamp_ns = max(tp->tcp_wstamp_ns, tp->tcp_clock_cache);
-	skb->skb_mstamp_ns = tp->tcp_wstamp_ns;
+	tcp_set_skb_edt(skb, tp->tcp_wstamp_ns);
 	if (clone_it) {
 		oskb = skb;
 
@@ -1589,7 +1595,7 @@  int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
 
 	skb_split(skb, buff, len);
 
-	buff->tstamp = skb->tstamp;
+	tcp_set_skb_edt(buff, skb->skb_mstamp_ns);
 	tcp_fragment_tstamp(skb, buff);
 
 	old_factor = tcp_skb_pcount(skb);
@@ -2616,7 +2622,8 @@  static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 
 		if (unlikely(tp->repair) && tp->repair_queue == TCP_SEND_QUEUE) {
 			/* "skb_mstamp_ns" is used as a start point for the retransmit timer */
-			skb->skb_mstamp_ns = tp->tcp_wstamp_ns = tp->tcp_clock_cache;
+			tp->tcp_wstamp_ns = tp->tcp_clock_cache;
+			tcp_set_skb_edt(skb, tp->tcp_wstamp_ns);
 			list_move_tail(&skb->tcp_tsorted_anchor, &tp->tsorted_sent_queue);
 			tcp_init_tso_segs(skb, mss_now);
 			goto repair; /* Skip network transmission */
@@ -3541,11 +3548,11 @@  struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 	now = tcp_clock_ns();
 #ifdef CONFIG_SYN_COOKIES
 	if (unlikely(synack_type == TCP_SYNACK_COOKIE && ireq->tstamp_ok))
-		skb->skb_mstamp_ns = cookie_init_timestamp(req, now);
+		tcp_set_skb_edt(skb, cookie_init_timestamp(req, now));
 	else
 #endif
 	{
-		skb->skb_mstamp_ns = now;
+		tcp_set_skb_edt(skb, now);
 		if (!tcp_rsk(req)->snt_synack) /* Timestamp first SYNACK */
 			tcp_rsk(req)->snt_synack = tcp_skb_timestamp_us(skb);
 	}
@@ -3594,7 +3601,7 @@  struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 	bpf_skops_write_hdr_opt((struct sock *)sk, skb, req, syn_skb,
 				synack_type, &opts);
 
-	skb->skb_mstamp_ns = now;
+	tcp_set_skb_edt(skb, now);
 	tcp_add_tx_delay(skb, tp);
 
 	return skb;
@@ -3771,7 +3778,7 @@  static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
 
 	err = tcp_transmit_skb(sk, syn_data, 1, sk->sk_allocation);
 
-	syn->skb_mstamp_ns = syn_data->skb_mstamp_ns;
+	tcp_set_skb_edt(syn, syn_data->skb_mstamp_ns);
 
 	/* Now full SYN+DATA was cloned and sent (or not),
 	 * remove the SYN from the original skb (syn_data)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 2995f8d89e7e..f5436afdafc7 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -440,7 +440,7 @@  static inline int ip6_forward_finish(struct net *net, struct sock *sk,
 	}
 #endif
 
-	skb->tstamp = 0;
+	skb_scrub_tstamp(skb);
 	return dst_output(net, sk, skb);
 }
 
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index a1ffdb48cc47..79a57836853a 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -451,12 +451,12 @@  static int __packet_get_status(const struct packet_sock *po, void *frame)
 static __u32 tpacket_get_timestamp(struct sk_buff *skb, struct timespec64 *ts,
 				   unsigned int flags)
 {
-	struct skb_shared_hwtstamps *shhwtstamps = skb_hwtstamps(skb);
+	ktime_t hwtstamp = skb_hwtstamps_ktime(skb);
 
-	if (shhwtstamps &&
-	    (flags & SOF_TIMESTAMPING_RAW_HARDWARE) &&
-	    ktime_to_timespec64_cond(shhwtstamps->hwtstamp, ts))
+	if (hwtstamp && (flags & SOF_TIMESTAMPING_RAW_HARDWARE)) {
+		*ts = ktime_to_timespec64(hwtstamp);
 		return TP_STATUS_TS_RAW_HARDWARE;
+	}
 
 	if ((flags & SOF_TIMESTAMPING_SOFTWARE) &&
 	    ktime_to_timespec64_cond(skb->tstamp, ts))