Message ID | 20220211071303.890169-1-kafai@fb.com (mailing list archive)
---|---
State | Changes Requested
Delegated to | Netdev Maintainers
Series | Preserve mono delivery time (EDT) in skb->tstamp
Context | Check | Description |
---|---|---
netdev/tree_selection | success | Clearly marked for net-next |
netdev/apply | fail | Patch does not apply to net-next |
On 2/11/22 8:13 AM, Martin KaFai Lau wrote:
> The current tc-bpf@ingress reads and writes the __sk_buff->tstamp
> as a (rcv) timestamp. This patch maintains backward compatibility
> with the (rcv) timestamp expectation when the skb->tstamp has a
> mono delivery_time.
>
> If needed, the patch first saves the mono delivery_time. Depending on
> the static key "netstamp_needed_key", it then resets the skb->tstamp to
> either 0 or ktime_get_real() before running the tc-bpf@ingress. After
> the tc-bpf prog returns, if the (rcv) timestamp in skb->tstamp has not
> been changed, it will restore the earlier saved mono delivery_time.
>
> The current logic to run tc-bpf@ingress is refactored into a new
> bpf_prog_run_at_ingress() function and shared between cls_bpf and
> act_bpf. The new delivery_time save/restore logic is also done in
> this function.
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>  include/linux/filter.h | 28 ++++++++++++++++++++++++++++
>  net/sched/act_bpf.c    |  5 +----
>  net/sched/cls_bpf.c    |  6 +-----
>  3 files changed, 30 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index d23e999dc032..e43e1701a80e 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -699,6 +699,34 @@ static inline void bpf_compute_data_pointers(struct sk_buff *skb)
>  	cb->data_end = skb->data + skb_headlen(skb);
>  }
>
> +static __always_inline u32 bpf_prog_run_at_ingress(const struct bpf_prog *prog,
> +						   struct sk_buff *skb)
> +{
> +	ktime_t tstamp, saved_mono_dtime = 0;
> +	int filter_res;
> +
> +	if (unlikely(skb->mono_delivery_time)) {
> +		saved_mono_dtime = skb->tstamp;
> +		skb->mono_delivery_time = 0;
> +		if (static_branch_unlikely(&netstamp_needed_key))
> +			skb->tstamp = tstamp = ktime_get_real();
> +		else
> +			skb->tstamp = tstamp = 0;
> +	}
> +
> +	/* It is safe to push/pull even if skb_shared() */
> +	__skb_push(skb, skb->mac_len);
> +	bpf_compute_data_pointers(skb);
> +	filter_res = bpf_prog_run(prog, skb);
> +	__skb_pull(skb, skb->mac_len);
> +
> +	/* __sk_buff->tstamp was not changed, restore the delivery_time */
> +	if (unlikely(saved_mono_dtime) && skb_tstamp(skb) == tstamp)
> +		skb_set_delivery_time(skb, saved_mono_dtime, true);

So above detour is for skb->tstamp backwards compatibility, so users will see real
time. I don't see why we special-case {cls,act}_bpf only, given this will also be
the case for other subsystems (e.g. netfilter) when they read the plain skb->tstamp
and get the egress one instead of ktime_get_real() upon deferred
skb_clear_delivery_time().

If we would generally ignore it, then the above bpf_prog_run_at_ingress()
save/restore detour is not needed (so patches 5 and 6 should be dropped). (Meaning,
if we need to special-case {cls,act}_bpf only, we could also have gone for a
simpler bpf-only solution..)

> +	return filter_res;
> +}
> +
>  /* Similar to bpf_compute_data_pointers(), except that save orginal
>   * data in cb->data and cb->meta_data for restore.
>   */
> diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
> index a77d8908e737..14c3bd0a5088 100644
> --- a/net/sched/act_bpf.c
> +++ b/net/sched/act_bpf.c
> @@ -45,10 +45,7 @@ static int tcf_bpf_act(struct sk_buff *skb, const struct tc_action *act,
>
>  	filter = rcu_dereference(prog->filter);
>  	if (at_ingress) {
> -		__skb_push(skb, skb->mac_len);
> -		bpf_compute_data_pointers(skb);
> -		filter_res = bpf_prog_run(filter, skb);
> -		__skb_pull(skb, skb->mac_len);
> +		filter_res = bpf_prog_run_at_ingress(filter, skb);
>  	} else {
>  		bpf_compute_data_pointers(skb);
>  		filter_res = bpf_prog_run(filter, skb);
> diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
> index df19a847829e..036b2e1f74af 100644
> --- a/net/sched/cls_bpf.c
> +++ b/net/sched/cls_bpf.c
> @@ -93,11 +93,7 @@ static int cls_bpf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
>  	if (tc_skip_sw(prog->gen_flags)) {
>  		filter_res = prog->exts_integrated ? TC_ACT_UNSPEC : 0;
>  	} else if (at_ingress) {
> -		/* It is safe to push/pull even if skb_shared() */
> -		__skb_push(skb, skb->mac_len);
> -		bpf_compute_data_pointers(skb);
> -		filter_res = bpf_prog_run(prog->filter, skb);
> -		__skb_pull(skb, skb->mac_len);
> +		filter_res = bpf_prog_run_at_ingress(prog->filter, skb);
>  	} else {
>  		bpf_compute_data_pointers(skb);
>  		filter_res = bpf_prog_run(prog->filter, skb);
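The reader-side pattern behind Daniel's point is already visible in the hunk above via skb_tstamp(): ingress consumers that must not mistake a mono delivery_time for a (rcv) timestamp read through a helper that hides it. Below is a minimal sketch of that helper, paraphrased from how this series uses it rather than quoted from it; the af_packet change of patch 3, mentioned in the reply that follows, takes the same approach:

	/* Sketch, not quoted from the series: report only a (rcv)
	 * timestamp; a mono delivery_time is returned as 0 so the
	 * reader can take its own timestamp instead.
	 */
	static inline ktime_t skb_tstamp(const struct sk_buff *skb)
	{
		if (skb->mono_delivery_time)
			return 0;
		return skb->tstamp;
	}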
On Wed, Feb 16, 2022 at 12:30:53AM +0100, Daniel Borkmann wrote:
> On 2/11/22 8:13 AM, Martin KaFai Lau wrote:
> > [ ... ]
> > +	/* __sk_buff->tstamp was not changed, restore the delivery_time */
> > +	if (unlikely(saved_mono_dtime) && skb_tstamp(skb) == tstamp)
> > +		skb_set_delivery_time(skb, saved_mono_dtime, true);
>
> So above detour is for skb->tstamp backwards compatibility, so users will see real
> time. I don't see why we special-case {cls,act}_bpf only, given this will also be
> the case for other subsystems (e.g. netfilter) when they read the plain skb->tstamp
> and get the egress one instead of ktime_get_real() upon deferred
> skb_clear_delivery_time().
>
> If we would generally ignore it, then the above bpf_prog_run_at_ingress()
> save/restore detour is not needed (so patches 5 and 6 should be dropped). (Meaning,
> if we need to special-case {cls,act}_bpf only, we could also have gone for a
> simpler bpf-only solution..)
The limitation here is that there is only one skb->tstamp field. I don't
see how a bpf-only solution would make a difference here.

Regarding netfilter (good point!), I only see skb->tstamp used in
nfnetlink_log.c and nfnetlink_queue.c. Like the tapping cases (which run
earlier than the bpf run-point) and other ingress cases in general, they
cannot assume the rcv timestamp is always there, so they can be changed
like af_packet in patch 3, which is a straightforward change. I can make
the change in v5.

Going back to cls_bpf at ingress. If the concern is code cleanliness,
how about removing this dance for now while the current rcv tstamp usage
at ingress is unclear? Meaning, keep the delivery_time (if any) in
skb->tstamp. This dance could be brought back in later if breakage and a
legit use case are reported. A new bpf prog will have to use the
__sk_buff->delivery_time_type regardless if it wants to use skb->tstamp
as the delivery_time, so it won't assume the delivery_time is always in
skb->tstamp, and it will be fine even if this dance is brought back in
later. I prefer to keep it as-is to avoid unlikely breakage, but I am
open to removing it for now.

Regarding patch 6, it is unrelated. It needs to clear the
mono_delivery_time bit if the bpf prog writes 0 to skb->tstamp.
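For illustration, here is a minimal tc-bpf sketch of the pattern Martin describes: consult __sk_buff->delivery_time_type before interpreting skb->tstamp. The field and the BPF_SKB_DELIVERY_TIME_MONO value are assumed from patch 7 of the series (not shown on this page), and the drop policy and horizon are made up:

	/* SPDX-License-Identifier: GPL-2.0 */
	#include <linux/bpf.h>
	#include <linux/pkt_cls.h>

	#define MAX_EDT_HORIZON_NS	(2ULL * 1000 * 1000 * 1000)	/* made up */

	static __u64 (*bpf_ktime_get_ns)(void) = (void *)BPF_FUNC_ktime_get_ns;

	__attribute__((section("classifier"), used))
	int edt_aware(struct __sk_buff *skb)
	{
		/* Only treat skb->tstamp as a mono EDT when the type says so;
		 * otherwise it is 0 or a (rcv) timestamp at ingress.
		 */
		if (skb->delivery_time_type == BPF_SKB_DELIVERY_TIME_MONO &&
		    skb->tstamp > bpf_ktime_get_ns() + MAX_EDT_HORIZON_NS)
			return TC_ACT_SHOT;	/* EDT too far out; illustrative */

		return TC_ACT_OK;
	}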
On 2/16/22 6:51 AM, Martin KaFai Lau wrote:
> On Wed, Feb 16, 2022 at 12:30:53AM +0100, Daniel Borkmann wrote:
>> On 2/11/22 8:13 AM, Martin KaFai Lau wrote:
>>> [ ... ]
>>
>> So above detour is for skb->tstamp backwards compatibility, so users will see real
>> time. I don't see why we special-case {cls,act}_bpf only, given this will also be
>> the case for other subsystems (e.g. netfilter) when they read the plain skb->tstamp
>> and get the egress one instead of ktime_get_real() upon deferred
>> skb_clear_delivery_time().
>>
>> If we would generally ignore it, then the above bpf_prog_run_at_ingress()
>> save/restore detour is not needed (so patches 5 and 6 should be dropped). (Meaning,
>> if we need to special-case {cls,act}_bpf only, we could also have gone for a
>> simpler bpf-only solution..)
> The limitation here is that there is only one skb->tstamp field. I don't
> see how a bpf-only solution would make a difference here.

A BPF-only solution would probably just treat the skb->tstamp as
(semi-)opaque, meaning there are no further bits on clock type needed in
the skb; but given the environment is controlled by an orchestrator, it
can decide which tstamps to retain and which to reset (e.g. by looking
at skb->sk). (The other approach is exposing info on the clock base, as
done here to some degree for mono/real.)

> Regarding netfilter (good point!), I only see skb->tstamp used in
> nfnetlink_log.c and nfnetlink_queue.c. Like the tapping cases (which run
> earlier than the bpf run-point) and other ingress cases in general, they
> cannot assume the rcv timestamp is always there, so they can be changed
> like af_packet in patch 3, which is a straightforward change. I can make
> the change in v5.
>
> Going back to cls_bpf at ingress. If the concern is code cleanliness,
> how about removing this dance for now while the current rcv tstamp usage
> at ingress is unclear? Meaning, keep the delivery_time (if any) in
> skb->tstamp. This dance could be brought back in later if breakage and a
> legit use case are reported. A new bpf prog will have to use the
> __sk_buff->delivery_time_type regardless if it wants to use skb->tstamp
> as the delivery_time, so it won't assume the delivery_time is always in
> skb->tstamp, and it will be fine even if this dance is brought back in
> later.

Yes, imho, this is still better than the bpf_prog_run_at_ingress()
workaround.

Ideally, we know when we call helpers like ktime_get_ns() that the clock
will be mono. We could track that on the verifier side in the register
type, and when we end up writing to skb->tstamp, we could implicitly
also set the clock base bits in the skb via the ctx rewrite, telling
that it's of type 'mono'. Same for reading: we could add a
__sk_buff->tstamp_type which the program can access (imo tstamp_type is
more generic than a __sk_buff->delivery_time_type). If someone needs
ktime_get_clocktai_ns() for sch_etf in future, it could be a similar
tracking mechanism. Also, setting skb->tstamp to 0 ...

> Regarding patch 6, it is unrelated. It needs to clear the
> mono_delivery_time bit if the bpf prog writes 0 to skb->tstamp.

... doesn't need to be done as code after bpf_prog_run(), but should be
brought closer to where we write to the ctx, where the verifier
generates the relevant insns. Imo, that's better than having this
outside in bpf_prog_run(), which is then checked no matter what the
program was doing or whether it even accessed tstamp.

Thanks,
Daniel
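In plain C, the write-side semantics Daniel sketches would amount to roughly the following; ctx_store_tstamp_mono() is a made-up name, the real thing would be BPF insns emitted by the verifier's ctx conversion, and whether the mono provenance of the value can always be proven statically is exactly what the next reply questions:

	#include <linux/skbuff.h>

	/* Sketch only: a store to __sk_buff->tstamp whose source value the
	 * verifier has proven came from ktime_get_ns() (CLOCK_MONOTONIC)
	 * would also set the clock-base bit, with no separate helper call.
	 */
	static inline void ctx_store_tstamp_mono(struct sk_buff *skb, u64 val)
	{
		skb->tstamp = val;
		skb->mono_delivery_time = 1;	/* set implicitly by generated insns */
	}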
On Thu, Feb 17, 2022 at 01:03:21AM +0100, Daniel Borkmann wrote:
> On 2/16/22 6:51 AM, Martin KaFai Lau wrote:
> > On Wed, Feb 16, 2022 at 12:30:53AM +0100, Daniel Borkmann wrote:
> > > [ ... ] (Meaning, if we need to special-case {cls,act}_bpf only, we
> > > could also have gone for a simpler bpf-only solution..)
> > The limitation here is that there is only one skb->tstamp field. I
> > don't see how a bpf-only solution would make a difference here.
>
> A BPF-only solution would probably just treat the skb->tstamp as
> (semi-)opaque, meaning there are no further bits on clock type needed in
> the skb; but given the environment is controlled by an orchestrator, it
> can decide which tstamps to retain and which to reset (e.g. by looking
> at skb->sk). (The other approach is exposing info on the clock base, as
> done here to some degree for mono/real.)
hmm... I think we may be talking about different things.

Using a bit or not still does not change the fact that there is only one
skb->tstamp field, which may have a delivery time or a rcv tstamp. If
the delivery time is reset before forwarding to ingress, or the delivery
time was never there, then the skb will be stamped with the rcv
timestamp at ingress. The bpf prog needs a way to distinguish between
them. skb->sk can at most tell the clock base, if skb->tstamp does
indeed have the delivery_time.

> > Going back to cls_bpf at ingress. If the concern is code cleanliness,
> > how about removing this dance for now while the current rcv tstamp
> > usage at ingress is unclear? [ ... ]
>
> Yes, imho, this is still better than the bpf_prog_run_at_ingress()
> workaround.
ic. so it is ok to remove the mono dtime save/restore logic here and
only bring it back in if legit breakage is reported?

btw, did you look at patch 7, which added the
__sk_buff->delivery_time_type and bpf_set_delivery_time()?

> Ideally, we know when we call helpers like ktime_get_ns() that the clock
> will be mono. We could track that on the verifier side in the register
> type, and when we end up writing to skb->tstamp, we could implicitly
> also set the clock base bits in the skb via the ctx rewrite, telling
> that it's of type 'mono'. Same for reading: we could add a
> __sk_buff->tstamp_type which the program can access (imo tstamp_type is
> more generic than a __sk_buff->delivery_time_type). If someone needs
> ktime_get_clocktai_ns() for sch_etf in future, it could be a similar
> tracking mechanism. Also, setting skb->tstamp to 0 ...
hmm... I think this is describing a way to automatically update the
__sk_buff->delivery_time_type (the mono_delivery_time bit) and also to
avoid adding the new bpf_set_delivery_time() helper in patch 7?

There may be cases where the time does not come from the ktime_get_ns()
helper and cannot be decided statically, e.g. what if we want to set the
current skb->tstamp based on when the previous skb was sent in a cgroup?
There will be cases coming up that require a runtime decision.

Also, imo, it may be surprising behavior for a user who only changed
__sk_buff->tstamp and then figures out that
__sk_buff->delivery_time_type was also changed in the background.

Besides, I am not sure the compiler will avoid optimizing away the 2nd
read of __sk_buff->delivery_time_type. The bpf prog may need a
READ_ONCE(__sk_buff->delivery_time_type) after writing
__sk_buff->tstamp. We could add volatile to the delivery_time_type in
the UAPI, but I think that is overkill.

It is better to treat tstamp and delivery_time_type separately for
direct access, and to have bpf_set_delivery_time() change both of them
together. Also, more checks can be done in bpf_set_delivery_time(),
which is more flexible in that it can return errors. For TCP, the tstamp
will already be in mono, so bpf_set_delivery_time() is usually not
needed.

Regarding the name delivery_time_type vs tstamp_type, I thought about
that and finally picked delivery_time_type because I want a clear
separation from the rcv tstamp for now, until there is more clarity on
how the rcv tstamp is used in tc-bpf.

> > Regarding patch 6, it is unrelated. It needs to clear the
> > mono_delivery_time bit if the bpf prog writes 0 to skb->tstamp.
>
> ... doesn't need to be done as code after bpf_prog_run(), but should be
> brought closer to where we write to the ctx, where the verifier
> generates the relevant insns. Imo, that's better than having this
> outside in bpf_prog_run(), which is then checked no matter what the
> program was doing or whether it even accessed tstamp.
This will also change the mono_delivery_time bit in the background, so
it runs into the same points as above.
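Martin's compiler concern can be made concrete. In the sketch below (delivery_time_type again assumed from patch 7), nothing tells the compiler that the store to tstamp can change delivery_time_type, so it may legally reuse the first read instead of emitting a second load:

	/* SPDX-License-Identifier: GPL-2.0 */
	#include <linux/bpf.h>
	#include <linux/pkt_cls.h>

	__attribute__((section("classifier"), used))
	int type_reread(struct __sk_buff *skb)
	{
		__u32 before = skb->delivery_time_type;	/* assumed field, patch 7 */

		skb->tstamp = 0;	/* under the implicit ctx-rewrite scheme, this
					 * would also clear the mono type behind the
					 * prog's back */

		/* The compiler may fold this re-read into 'before', since it
		 * sees no write to delivery_time_type; the branch below could
		 * then be optimized away and the prog acts on a stale type.
		 */
		if (skb->delivery_time_type != before)
			return TC_ACT_SHOT;	/* illustrative reaction */

		return TC_ACT_OK;
	}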
On Wed, Feb 16, 2022 at 05:50:43PM -0800, Martin KaFai Lau wrote:
> On Thu, Feb 17, 2022 at 01:03:21AM +0100, Daniel Borkmann wrote:
> > On 2/16/22 6:51 AM, Martin KaFai Lau wrote:
> > > [ ... ]
> > > Going back to cls_bpf at ingress. If the concern is code cleanliness,
> > > how about removing this dance for now while the current rcv tstamp
> > > usage at ingress is unclear? [ ... ]
> >
> > Yes, imho, this is still better than the bpf_prog_run_at_ingress()
> > workaround.
> ic. so it is ok to remove the mono dtime save/restore logic here and
> only bring it back in if legit breakage is reported?
Another idea on this, which I think is a better middle-ground solution
to remove the dance here. When reading __sk_buff->tstamp, the verifier
can do a rewrite if needed. The rewrite will depend on whether the
__sk_buff->delivery_time_type has been read or not. If
delivery_time_type is not read, it will rewrite to this:

	/* BPF_READ: __u64 a = __sk_buff->tstamp; */
	if (!skb->tc_at_ingress || !skb->mono_delivery_time)
		a = skb->tstamp;
	else
		a = 0;

That will be consistent with the expectation on other kernel ingress
paths (either 0 or the rcv tstamp). If __sk_buff->delivery_time_type is
read, no rewrite is needed and skb->tstamp will be read as is.
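Written out as a plain-C equivalent for clarity (a sketch: the in-kernel version would be insns emitted by the ctx conversion, the function name is made up, and tc_at_ingress/mono_delivery_time are the skb bits from earlier patches in this series):

	#include <linux/skbuff.h>

	/* Sketch: what a rewritten BPF_READ of __sk_buff->tstamp would
	 * compute when the prog never reads delivery_time_type.
	 */
	static inline u64 bpf_skb_load_tstamp(const struct sk_buff *skb)
	{
		/* Keep the ingress expectation: either 0 or a (rcv)
		 * timestamp, never a leftover mono delivery_time.
		 */
		if (skb->tc_at_ingress && skb->mono_delivery_time)
			return 0;
		return ktime_to_ns(skb->tstamp);
	}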
diff --git a/include/linux/filter.h b/include/linux/filter.h
index d23e999dc032..e43e1701a80e 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -699,6 +699,34 @@ static inline void bpf_compute_data_pointers(struct sk_buff *skb)
 	cb->data_end = skb->data + skb_headlen(skb);
 }
 
+static __always_inline u32 bpf_prog_run_at_ingress(const struct bpf_prog *prog,
+						   struct sk_buff *skb)
+{
+	ktime_t tstamp, saved_mono_dtime = 0;
+	int filter_res;
+
+	if (unlikely(skb->mono_delivery_time)) {
+		saved_mono_dtime = skb->tstamp;
+		skb->mono_delivery_time = 0;
+		if (static_branch_unlikely(&netstamp_needed_key))
+			skb->tstamp = tstamp = ktime_get_real();
+		else
+			skb->tstamp = tstamp = 0;
+	}
+
+	/* It is safe to push/pull even if skb_shared() */
+	__skb_push(skb, skb->mac_len);
+	bpf_compute_data_pointers(skb);
+	filter_res = bpf_prog_run(prog, skb);
+	__skb_pull(skb, skb->mac_len);
+
+	/* __sk_buff->tstamp was not changed, restore the delivery_time */
+	if (unlikely(saved_mono_dtime) && skb_tstamp(skb) == tstamp)
+		skb_set_delivery_time(skb, saved_mono_dtime, true);
+
+	return filter_res;
+}
+
 /* Similar to bpf_compute_data_pointers(), except that save orginal
  * data in cb->data and cb->meta_data for restore.
  */
diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index a77d8908e737..14c3bd0a5088 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -45,10 +45,7 @@ static int tcf_bpf_act(struct sk_buff *skb, const struct tc_action *act,
 
 	filter = rcu_dereference(prog->filter);
 	if (at_ingress) {
-		__skb_push(skb, skb->mac_len);
-		bpf_compute_data_pointers(skb);
-		filter_res = bpf_prog_run(filter, skb);
-		__skb_pull(skb, skb->mac_len);
+		filter_res = bpf_prog_run_at_ingress(filter, skb);
 	} else {
 		bpf_compute_data_pointers(skb);
 		filter_res = bpf_prog_run(filter, skb);
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index df19a847829e..036b2e1f74af 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -93,11 +93,7 @@ static int cls_bpf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 	if (tc_skip_sw(prog->gen_flags)) {
 		filter_res = prog->exts_integrated ? TC_ACT_UNSPEC : 0;
 	} else if (at_ingress) {
-		/* It is safe to push/pull even if skb_shared() */
-		__skb_push(skb, skb->mac_len);
-		bpf_compute_data_pointers(skb);
-		filter_res = bpf_prog_run(prog->filter, skb);
-		__skb_pull(skb, skb->mac_len);
+		filter_res = bpf_prog_run_at_ingress(prog->filter, skb);
 	} else {
 		bpf_compute_data_pointers(skb);
 		filter_res = bpf_prog_run(prog->filter, skb);
The current tc-bpf@ingress reads and writes the __sk_buff->tstamp as a
(rcv) timestamp. This patch maintains backward compatibility with the
(rcv) timestamp expectation when the skb->tstamp has a mono
delivery_time.

If needed, the patch first saves the mono delivery_time. Depending on
the static key "netstamp_needed_key", it then resets the skb->tstamp to
either 0 or ktime_get_real() before running the tc-bpf@ingress. After
the tc-bpf prog returns, if the (rcv) timestamp in skb->tstamp has not
been changed, it will restore the earlier saved mono delivery_time.

The current logic to run tc-bpf@ingress is refactored into a new
bpf_prog_run_at_ingress() function and shared between cls_bpf and
act_bpf. The new delivery_time save/restore logic is also done in this
function.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/filter.h | 28 ++++++++++++++++++++++++++++
 net/sched/act_bpf.c    |  5 +----
 net/sched/cls_bpf.c    |  6 +-----
 3 files changed, 30 insertions(+), 9 deletions(-)