Message ID | 20210928095538.114207-1-pablo@netfilter.org (mailing list archive) |
---|---|
Headers | show |
Series | Netfilter egress hook | expand |
On 9/28/21 11:55 AM, Pablo Neira Ayuso wrote: > Hi, > > This patchset v5 that re-adds the Netfilter egress: > > 1) Rename linux/netfilter_ingress.h to linux/netfilter_netdev.h > from Lukas Wunner. > > 2) Generalize ingress hook file to accomodate egress support, > from Lukas Wunner. > > 3) Modularize Netfilter ingress hook into nf_tables_netdev: Daniel > Borkmann is requesting for a mechanism to allow to blacklist > Netfilter, this allows users to blacklist this new module that > includes ingress chain and the new egress chain for the netdev > family. There is no other in-tree user of the ingress and egress > hooks than this which might interfer with his matter. > > 4) Place the egress hook again before the tc egress hook as requested > by Daniel Borkmann. Patch to add egress hook from Lukas Wunner. > The Netfilter egress hook remains behind the static key, if unused > performance degradation is negligible. > > 5) Add netfilter egress handling to af_packet. > > Arguably, distributors might decide to compile nf_tables_netdev > built-in. Traditionally, distributors have compiled their kernels using > the default configuration that Netfilter Kconfig provides (ie. use > modules whenever possible). In any case, I consider that distributor > policy is out of scope in this discussion, providing a mechanism to > allow Daniel to prevent Netfilter ingress and egress chains to be loaded > should be sufficient IMHO. Hm, so in the case of SRv6 users were running into a similar issue and commit 7a3f5b0de364 ("netfilter: add netfilter hooks to SRv6 data plane") [0] added a new hook along with a sysctl which defaults the new hook to off. The rationale for it was given as "the hooks are enabled via nf_hooks_lwtunnel sysctl to make sure existing netfilter rulesets do not break." [0,1] If the suggestion to flag the skb [2] one way or another from the tc forwarding path (e.g. skb bit or per-cpu marker) is not technically feasible, then why not do a sysctl toggle like in the SRv6 case? [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7a3f5b0de3647c854e34269c3332d7a1e902901a [1] https://lore.kernel.org/netdev/20210830093852.21654-1-pablo@netfilter.org/ [2] https://lore.kernel.org/netdev/11584665-e3b5-3afe-d72c-ef92ad0b9313@iogearbox.net/
On Thu, Sep 30, 2021 at 08:08:53AM +0200, Daniel Borkmann wrote: > Hm, so in the case of SRv6 users were running into a similar issue > and commit 7a3f5b0de364 ("netfilter: add netfilter hooks to SRv6 > data plane") [0] added a new hook along with a sysctl which defaults > the new hook to off. > > The rationale for it was given as "the hooks are enabled via > nf_hooks_lwtunnel sysctl to make sure existing netfilter rulesets > do not break." [0,1] > > If the suggestion to flag the skb [2] one way or another from the > tc forwarding path (e.g. skb bit or per-cpu marker) is not > technically feasible, then why not do a sysctl toggle like in the > SRv6 case? The skb flag *is* technically feasible. I amended the patches with the flag and was going to post them this week, but Pablo beat me to the punch and posted his alternative version, which lacks the flag but modularizes netfilter ingress/egress processing instead. Honestly I think a hodge-podge of config options and sysctl toggles is awful and I would prefer the skb flag you suggested. I kind of like your idea of considering tc and netfilter as layers. FWIW the finished patches *with* the flag are on this branch: https://github.com/l1k/linux/commits/nft_egress_v5 Below is the "git range-diff" between Pablo's patches and mine (just the hunks which pertain to the skb flag, plus excerpts from the commit message). Would you find the patch set acceptable with this skb flag? -- >8 -- + Alternatively or in addition to netfilter, packets can be classified + with traffic control (tc). On ingress, packets are classified first by + tc, then by netfilter. On egress, the order is reversed for symmetry. + Conceptually, tc and netfilter can be thought of as layers, with + netfilter layered above tc. + Traffic control is capable of redirecting packets to another interface + (man 8 tc-mirred). E.g., an ingress packet may be redirected from the + host namespace to a container via a veth connection: + tc ingress (host) -> tc egress (veth host) -> tc ingress (veth container) + In this case, netfilter egress classifying is not performed when leaving + the host namespace! That's because the packet is still on the tc layer. + If tc redirects the packet to a physical interface in the host namespace + such that it leaves the system, the packet is never subjected to + netfilter egress classifying. That is only logical since it hasn't + passed through netfilter ingress classifying either. + Packets can alternatively be redirected at the netfilter layer using + nft fwd. Such a packet *is* subjected to netfilter egress classifying. + Internally, the skb->nf_skip_egress flag controls whether netfilter is + invoked on egress by __dev_queue_xmit(). + + Interaction between tc and netfilter is possible by setting and querying + skb->mark. @@ include/linux/netfilter_netdev.h: static inline int nf_hook_ingress(struct sk_bu +static inline struct sk_buff *nf_hook_egress(struct sk_buff *skb, int *rc, + struct net_device *dev) +{ -+ struct nf_hook_entries *e = rcu_dereference(dev->nf_hooks_egress); ++ struct nf_hook_entries *e; + struct nf_hook_state state; + int ret; + ++ if (skb->nf_skip_egress) ++ return skb; ++ ++ e = rcu_dereference(dev->nf_hooks_egress); + if (!e) + return skb; + @@ include/linux/netfilter_netdev.h: static inline int nf_hook_ingress(struct sk_bu + return NULL; + } +} ++ ++static inline void nf_skip_egress(struct sk_buff *skb, bool skip) ++{ ++ skb->nf_skip_egress = skip; ++} +#else /* CONFIG_NETFILTER_EGRESS */ +static inline bool nf_hook_egress_active(void) +{ @@ include/linux/netfilter_netdev.h: static inline int nf_hook_ingress(struct sk_bu +{ + return skb; +} ++ ++static inline void nf_skip_egress(struct sk_buff *skb, bool skip) ++{ ++} +#endif /* CONFIG_NETFILTER_EGRESS */ + static inline void nf_hook_netdev_init(struct net_device *dev) @@ include/linux/netfilter_netdev.h: static inline int nf_hook_ingress(struct sk_bu #endif /* _NETFILTER_NETDEV_H_ */ + ## include/linux/skbuff.h ## +@@ include/linux/skbuff.h: typedef unsigned char *sk_buff_data_t; + * @tc_at_ingress: used within tc_classify to distinguish in/egress + * @redirected: packet was redirected by packet classifier + * @from_ingress: packet was redirected from the ingress path ++ * @nf_skip_egress: packet shall skip netfilter egress processing + * @peeked: this packet has been seen already, so stats have been + * done for it, don't do them again + * @nf_trace: netfilter packet trace flag +@@ include/linux/skbuff.h: struct sk_buff { + #ifdef CONFIG_NET_REDIRECT + __u8 from_ingress:1; + #endif ++#ifdef CONFIG_NETFILTER_EGRESS ++ __u8 nf_skip_egress:1; ++#endif + #ifdef CONFIG_TLS_DEVICE + __u8 decrypted:1; + #endif + ## include/uapi/linux/netfilter.h ## @@ include/uapi/linux/netfilter.h: enum nf_inet_hooks { @@ net/core/dev.c: static int __dev_queue_xmit(struct sk_buff *skb, struct net_devi + if (!skb) + goto out; + } ++ nf_skip_egress(skb, true); skb = sch_handle_egress(skb, &rc, dev); if (!skb) goto out; @@ net/core/dev.c: static int __dev_queue_xmit(struct sk_buff *skb, struct net_devi #endif /* If device/qdisc don't need skb->dst, release it right now while * its hot in this cpu cache. +@@ net/core/dev.c: static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc, + if (static_branch_unlikely(&ingress_needed_key)) { + bool another = false; + ++ nf_skip_egress(skb, true); + skb = sch_handle_ingress(skb, &pt_prev, &ret, orig_dev, + &another); + if (another) +@@ net/core/dev.c: static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc, + if (!skb) + goto out; + ++ nf_skip_egress(skb, false); + if (nf_ingress(skb, &pt_prev, &ret, orig_dev) < 0) + goto out; + }
On 9/30/21 8:52 AM, Lukas Wunner wrote: > On Thu, Sep 30, 2021 at 08:08:53AM +0200, Daniel Borkmann wrote: >> Hm, so in the case of SRv6 users were running into a similar issue >> and commit 7a3f5b0de364 ("netfilter: add netfilter hooks to SRv6 >> data plane") [0] added a new hook along with a sysctl which defaults >> the new hook to off. >> >> The rationale for it was given as "the hooks are enabled via >> nf_hooks_lwtunnel sysctl to make sure existing netfilter rulesets >> do not break." [0,1] >> >> If the suggestion to flag the skb [2] one way or another from the >> tc forwarding path (e.g. skb bit or per-cpu marker) is not >> technically feasible, then why not do a sysctl toggle like in the >> SRv6 case? > > The skb flag *is* technically feasible. I amended the patches with > the flag and was going to post them this week, but Pablo beat me to > the punch and posted his alternative version, which lacks the flag > but modularizes netfilter ingress/egress processing instead. > > Honestly I think a hodge-podge of config options and sysctl toggles > is awful and I would prefer the skb flag you suggested. I kind of > like your idea of considering tc and netfilter as layers. > > FWIW the finished patches *with* the flag are on this branch: > https://github.com/l1k/linux/commits/nft_egress_v5 > > Below is the "git range-diff" between Pablo's patches and mine > (just the hunks which pertain to the skb flag, plus excerpts > from the commit message). > > Would you find the patch set acceptable with this skb flag? Yes, that looks good to me, thanks Lukas!
On Thu, Sep 30, 2021 at 08:08:53AM +0200, Daniel Borkmann wrote: > On 9/28/21 11:55 AM, Pablo Neira Ayuso wrote: > > Hi, > > > > This patchset v5 that re-adds the Netfilter egress: > > > > 1) Rename linux/netfilter_ingress.h to linux/netfilter_netdev.h > > from Lukas Wunner. > > > > 2) Generalize ingress hook file to accomodate egress support, > > from Lukas Wunner. > > > > 3) Modularize Netfilter ingress hook into nf_tables_netdev: Daniel > > Borkmann is requesting for a mechanism to allow to blacklist > > Netfilter, this allows users to blacklist this new module that > > includes ingress chain and the new egress chain for the netdev > > family. There is no other in-tree user of the ingress and egress > > hooks than this which might interfer with his matter. > > > > 4) Place the egress hook again before the tc egress hook as requested > > by Daniel Borkmann. Patch to add egress hook from Lukas Wunner. > > The Netfilter egress hook remains behind the static key, if unused > > performance degradation is negligible. > > > > 5) Add netfilter egress handling to af_packet. > > > > Arguably, distributors might decide to compile nf_tables_netdev > > built-in. Traditionally, distributors have compiled their kernels using > > the default configuration that Netfilter Kconfig provides (ie. use > > modules whenever possible). In any case, I consider that distributor > > policy is out of scope in this discussion, providing a mechanism to > > allow Daniel to prevent Netfilter ingress and egress chains to be loaded > > should be sufficient IMHO. > > Hm, so in the case of SRv6 users were running into a similar issue and commit > 7a3f5b0de364 ("netfilter: add netfilter hooks to SRv6 data plane") [0] added > a new hook along with a sysctl which defaults the new hook to off. > > The rationale for it was given as "the hooks are enabled via nf_hooks_lwtunnel > sysctl to make sure existing netfilter rulesets do not break." [0,1] > > If the suggestion to flag the skb [2] one way or another from the tc forwarding > path (e.g. skb bit or per-cpu marker) is not technically feasible, then why not > do a sysctl toggle like in the SRv6 case? I am already providing a global toggle to disable netdev ingress/egress hooks? In the SRv6 case that is not possible. Why do you need you need a sysctl knob when my proposal is already addressing your needs?
On Thu, Sep 30, 2021 at 08:52:38AM +0200, Lukas Wunner wrote: > On Thu, Sep 30, 2021 at 08:08:53AM +0200, Daniel Borkmann wrote: > > Hm, so in the case of SRv6 users were running into a similar issue > > and commit 7a3f5b0de364 ("netfilter: add netfilter hooks to SRv6 > > data plane") [0] added a new hook along with a sysctl which defaults > > the new hook to off. > > > > The rationale for it was given as "the hooks are enabled via > > nf_hooks_lwtunnel sysctl to make sure existing netfilter rulesets > > do not break." [0,1] > > > > If the suggestion to flag the skb [2] one way or another from the > > tc forwarding path (e.g. skb bit or per-cpu marker) is not > > technically feasible, then why not do a sysctl toggle like in the > > SRv6 case? > > The skb flag *is* technically feasible. I amended the patches with > the flag and was going to post them this week, but Pablo beat me to > the punch and posted his alternative version, which lacks the flag > but modularizes netfilter ingress/egress processing instead. > > Honestly I think a hodge-podge of config options and sysctl toggles > is awful and I would prefer the skb flag you suggested. I kind of > like your idea of considering tc and netfilter as layers. > > FWIW the finished patches *with* the flag are on this branch: > https://github.com/l1k/linux/commits/nft_egress_v5 > > Below is the "git range-diff" between Pablo's patches and mine > (just the hunks which pertain to the skb flag, plus excerpts > from the commit message). > > Would you find the patch set acceptable with this skb flag? Why do you need a programmatic skb flag? Are you planning to do: skb->skip_nf_egress = random(); from the packet path? Seriously, Daniel is asking for a global toggle to disable Netfilter. What is wrong with the Netfilter blacklisting approach?
On 9/30/21 9:19 AM, Pablo Neira Ayuso wrote: > On Thu, Sep 30, 2021 at 08:08:53AM +0200, Daniel Borkmann wrote: >> On 9/28/21 11:55 AM, Pablo Neira Ayuso wrote: >>> Hi, >>> >>> This patchset v5 that re-adds the Netfilter egress: >>> >>> 1) Rename linux/netfilter_ingress.h to linux/netfilter_netdev.h >>> from Lukas Wunner. >>> >>> 2) Generalize ingress hook file to accomodate egress support, >>> from Lukas Wunner. >>> >>> 3) Modularize Netfilter ingress hook into nf_tables_netdev: Daniel >>> Borkmann is requesting for a mechanism to allow to blacklist >>> Netfilter, this allows users to blacklist this new module that >>> includes ingress chain and the new egress chain for the netdev >>> family. There is no other in-tree user of the ingress and egress >>> hooks than this which might interfer with his matter. >>> >>> 4) Place the egress hook again before the tc egress hook as requested >>> by Daniel Borkmann. Patch to add egress hook from Lukas Wunner. >>> The Netfilter egress hook remains behind the static key, if unused >>> performance degradation is negligible. >>> >>> 5) Add netfilter egress handling to af_packet. >>> >>> Arguably, distributors might decide to compile nf_tables_netdev >>> built-in. Traditionally, distributors have compiled their kernels using >>> the default configuration that Netfilter Kconfig provides (ie. use >>> modules whenever possible). In any case, I consider that distributor >>> policy is out of scope in this discussion, providing a mechanism to >>> allow Daniel to prevent Netfilter ingress and egress chains to be loaded >>> should be sufficient IMHO. >> >> Hm, so in the case of SRv6 users were running into a similar issue and commit >> 7a3f5b0de364 ("netfilter: add netfilter hooks to SRv6 data plane") [0] added >> a new hook along with a sysctl which defaults the new hook to off. >> >> The rationale for it was given as "the hooks are enabled via nf_hooks_lwtunnel >> sysctl to make sure existing netfilter rulesets do not break." [0,1] >> >> If the suggestion to flag the skb [2] one way or another from the tc forwarding >> path (e.g. skb bit or per-cpu marker) is not technically feasible, then why not >> do a sysctl toggle like in the SRv6 case? > > I am already providing a global toggle to disable netdev > ingress/egress hooks? > > In the SRv6 case that is not possible. > > Why do you need you need a sysctl knob when my proposal is already > addressing your needs? Well, it's not addressing anything ... you even mention it yourself "arguably, distributors might decide to compile nf_tables_netdev built-in".
On Thu, Sep 30, 2021 at 09:33:23AM +0200, Daniel Borkmann wrote: > On 9/30/21 9:19 AM, Pablo Neira Ayuso wrote: [...] > > Why do you need you need a sysctl knob when my proposal is already > > addressing your needs? > > Well, it's not addressing anything ... you even mention it yourself "arguably, > distributors might decide to compile nf_tables_netdev built-in". I said distributors traditionally select the option that we signal to them, which is to enable this as module. We can document this in Kconfig. I think distributors should select whatever is better for their needs. Anyway, I'll tell you why module blacklisting is bad: It is a hammer, it is a band aid to a problem. Blacklisting is just making things worst because it makes some people believe that something is unfixable. Yes, it took me a while to figure out. We already entered the let's bloat the skbuff for many years already, this is stuffing one more bit into the skbuff just because maybe users might break an existing setup when they load new rules to the new netfilter egress hook. Probably the sysctl for this new egress hook is the way to go as you suggest. Thanks.
On Thu, 30 Sep 2021 11:21:42 +0200 Pablo Neira Ayuso wrote: > On Thu, Sep 30, 2021 at 09:33:23AM +0200, Daniel Borkmann wrote: > > On 9/30/21 9:19 AM, Pablo Neira Ayuso wrote: > > > Why do you need you need a sysctl knob when my proposal is already > > > addressing your needs? > > > > Well, it's not addressing anything ... you even mention it yourself "arguably, > > distributors might decide to compile nf_tables_netdev built-in". > > I said distributors traditionally select the option that we signal to > them, which is to enable this as module. We can document this in > Kconfig. I think distributors should select whatever is better for > their needs. > > Anyway, I'll tell you why module blacklisting is bad: It is a hammer, > it is a band aid to a problem. Blacklisting is just making things > worst because it makes some people believe that something is > unfixable. Yes, it took me a while to figure out. > > We already entered the let's bloat the skbuff for many years already, > this is stuffing one more bit into the skbuff just because maybe users > might break an existing setup when they load new rules to the new > netfilter egress hook. The lifetime of this information is constrained, can't it be a percpu flag, like xmit_more? > Probably the sysctl for this new egress hook is the way to go as you > suggest. Knobs is making users pay, let's do our best to avoid that.
On Thu, Sep 30, 2021 at 07:28:35AM -0700, Jakub Kicinski wrote: > On Thu, 30 Sep 2021 11:21:42 +0200 Pablo Neira Ayuso wrote: [...] > The lifetime of this information is constrained, can't it be a percpu > flag, like xmit_more? It's just one single bit in this case after all. > > Probably the sysctl for this new egress hook is the way to go as you > > suggest. > > Knobs is making users pay, let's do our best to avoid that. Could you elaborate? Thanks.
On Thu, 30 Sep 2021 17:13:37 +0200 Pablo Neira Ayuso wrote: > On Thu, Sep 30, 2021 at 07:28:35AM -0700, Jakub Kicinski wrote: > > The lifetime of this information is constrained, can't it be a percpu > > flag, like xmit_more? > > It's just one single bit in this case after all. ?? > > > Probably the sysctl for this new egress hook is the way to go as you > > > suggest. > > > > Knobs is making users pay, let's do our best to avoid that. > > Could you elaborate? My reading of Daniel's objections was that the layering is incorrect because tc is not exclusively "under" nf. That problem is not solved by adding a knob. The only thing the knob achieves is let someone deploying tc/bpf based solution protect themselves from accidental nf deployment. That's just background / level set. IDK what requires explanation in my statement itself. I thought "admin knobs are bad" is as universally agreed on as, say, "testing is good".
On Thu, Sep 30, 2021 at 07:28:35AM -0700, Jakub Kicinski wrote: > On Thu, 30 Sep 2021 11:21:42 +0200 Pablo Neira Ayuso wrote: > > this is stuffing one more bit into the skbuff > > The lifetime of this information is constrained, can't it be a percpu > flag, like xmit_more? Hm, can't an skb be queued and processed later on a different cpu? E.g. what about fragments? That would rule out a percpu flag, leaving a flag in struct sk_buff as the only option. Thanks, Lukas
On Thu, 30 Sep 2021 19:12:53 +0200 Lukas Wunner wrote: > On Thu, Sep 30, 2021 at 07:28:35AM -0700, Jakub Kicinski wrote: > > On Thu, 30 Sep 2021 11:21:42 +0200 Pablo Neira Ayuso wrote: > > > this is stuffing one more bit into the skbuff > > > > The lifetime of this information is constrained, can't it be a percpu > > flag, like xmit_more? > > Hm, can't an skb be queued and processed later on a different cpu? > E.g. what about fragments? > > That would rule out a percpu flag, leaving a flag in struct sk_buff > as the only option. What queuing do you have in mind? Qdisc is after the egress hook.
On Thu, Sep 30, 2021 at 10:19:20AM -0700, Jakub Kicinski wrote: > On Thu, 30 Sep 2021 19:12:53 +0200 Lukas Wunner wrote: > > On Thu, Sep 30, 2021 at 07:28:35AM -0700, Jakub Kicinski wrote: > > > On Thu, 30 Sep 2021 11:21:42 +0200 Pablo Neira Ayuso wrote: > > > > this is stuffing one more bit into the skbuff > > > > > > The lifetime of this information is constrained, can't it be a percpu > > > flag, like xmit_more? > > > > Hm, can't an skb be queued and processed later on a different cpu? > > E.g. what about fragments? > > > > That would rule out a percpu flag, leaving a flag in struct sk_buff > > as the only option. > > What queuing do you have in mind? Qdisc is after the egress hook. Ingress queueing. E.g. a packet may be redirected or mirrored by tc on ingress to another interface, resulting in a recursive call to netif_receive_skb() or dev_queue_xmit(). The packet may be bounced around an arbitrary number of times this way. Forwarding like that can happen both at the tc and the netfilter "layer". I'm concerned that a packet may be handled by different cpus along the way and queueing might be one possibility how this could happen. Thanks, Lukas
On Thu, Sep 30, 2021 at 09:06:52AM -0700, Jakub Kicinski wrote: > On Thu, 30 Sep 2021 17:13:37 +0200 Pablo Neira Ayuso wrote: > > On Thu, Sep 30, 2021 at 07:28:35AM -0700, Jakub Kicinski wrote: > > > The lifetime of this information is constrained, can't it be a percpu > > > flag, like xmit_more? > > > > It's just one single bit in this case after all. > > ?? There are "escape" points such ifb from ingress, where the packets gets enqueued and then percpu might not help, it might be fragile to use percpu in this case. > > > > Probably the sysctl for this new egress hook is the way to go as you > > > > suggest. > > > > > > Knobs is making users pay, let's do our best to avoid that. > > > > Could you elaborate? > > My reading of Daniel's objections was that the layering is incorrect > because tc is not exclusively "under" nf. That problem is not solved > by adding a knob. The only thing the knob achieves is let someone > deploying tc/bpf based solution protect themselves from accidental > nf deployment. > > That's just background / level set. IDK what requires explanation > in my statement itself. I thought "admin knobs are bad" is as > universally agreed on as, say, "testing is good". Yes, knobs are not ideal but Daniel mentioned it as a posibility, the skbuff bit might not ideal either because it might be not easy to debug the behaviour that it turns on to the user, but it could only be set on from act_mirred, Daniel mentioned to cover only the skb->redirect case. Thanks
On Thu, 30 Sep 2021 20:00:56 +0200 Pablo Neira Ayuso wrote: > On Thu, Sep 30, 2021 at 09:06:52AM -0700, Jakub Kicinski wrote: > > On Thu, 30 Sep 2021 17:13:37 +0200 Pablo Neira Ayuso wrote: > > > It's just one single bit in this case after all. > > > > ?? > > There are "escape" points such ifb from ingress, where the packets gets > enqueued and then percpu might not help, it might be fragile to use > percpu in this case. You still have to scrub the skb mark at the correct points, otherwise the ignoring egress may propagate beyond the "paired hook". I don't see much difference in fragility TBH. Speaking of ifb, doesn't it have an egress hook? And ingress on the way out? IMHO the "ignore egress" mark should not survive going thru ifb. Anyway, that's just my preference. Whatever you, Daniel and Lukas decide together in the end is fine by me.