mbox series

[nf-next,v5,0/6] Netfilter egress hook

Message ID 20210928095538.114207-1-pablo@netfilter.org (mailing list archive)
Headers show
Series Netfilter egress hook | expand

Message

Pablo Neira Ayuso Sept. 28, 2021, 9:55 a.m. UTC
Hi,

This patchset v5 that re-adds the Netfilter egress:

1) Rename linux/netfilter_ingress.h to linux/netfilter_netdev.h
   from Lukas Wunner.

2) Generalize ingress hook file to accomodate egress support,
   from Lukas Wunner.

3) Modularize Netfilter ingress hook into nf_tables_netdev: Daniel
   Borkmann is requesting for a mechanism to allow to blacklist
   Netfilter, this allows users to blacklist this new module that
   includes ingress chain and the new egress chain for the netdev
   family. There is no other in-tree user of the ingress and egress
   hooks than this which might interfer with his matter.

4) Place the egress hook again before the tc egress hook as requested
   by Daniel Borkmann. Patch to add egress hook from Lukas Wunner.
   The Netfilter egress hook remains behind the static key, if unused
   performance degradation is negligible.

5) Add netfilter egress handling to af_packet.

Arguably, distributors might decide to compile nf_tables_netdev
built-in. Traditionally, distributors have compiled their kernels using
the default configuration that Netfilter Kconfig provides (ie. use
modules whenever possible). In any case, I consider that distributor
policy is out of scope in this discussion, providing a mechanism to
allow Daniel to prevent Netfilter ingress and egress chains to be loaded
should be sufficient IMHO.

Joint work with Lukas Wunner.

Please review, thanks.

Lukas Wunner (3):
  netfilter: Rename ingress hook include file
  netfilter: Generalize ingress hook include file
  netfilter: Introduce egress hook

Pablo Neira Ayuso (3):
  netfilter: nf_tables: move netdev ingress filter chain to nf_tables_netdev.c
  af_packet: Introduce egress hook
  netfilter: nf_tables: add egress support

 include/linux/netdevice.h         |   4 +
 include/linux/netfilter_ingress.h |  58 ------------
 include/linux/netfilter_netdev.h  | 112 ++++++++++++++++++++++
 include/uapi/linux/netfilter.h    |   1 +
 net/core/dev.c                    |  15 ++-
 net/netfilter/Kconfig             |  10 +-
 net/netfilter/Makefile            |   1 +
 net/netfilter/core.c              |  34 ++++++-
 net/netfilter/nf_tables_api.c     |   7 +-
 net/netfilter/nf_tables_netdev.c  | 150 ++++++++++++++++++++++++++++++
 net/netfilter/nft_chain_filter.c  | 143 ----------------------------
 net/packet/af_packet.c            |  35 +++++++
 12 files changed, 358 insertions(+), 212 deletions(-)
 delete mode 100644 include/linux/netfilter_ingress.h
 create mode 100644 include/linux/netfilter_netdev.h
 create mode 100644 net/netfilter/nf_tables_netdev.c

Comments

Daniel Borkmann Sept. 30, 2021, 6:08 a.m. UTC | #1
On 9/28/21 11:55 AM, Pablo Neira Ayuso wrote:
> Hi,
> 
> This patchset v5 that re-adds the Netfilter egress:
> 
> 1) Rename linux/netfilter_ingress.h to linux/netfilter_netdev.h
>     from Lukas Wunner.
> 
> 2) Generalize ingress hook file to accomodate egress support,
>     from Lukas Wunner.
> 
> 3) Modularize Netfilter ingress hook into nf_tables_netdev: Daniel
>     Borkmann is requesting for a mechanism to allow to blacklist
>     Netfilter, this allows users to blacklist this new module that
>     includes ingress chain and the new egress chain for the netdev
>     family. There is no other in-tree user of the ingress and egress
>     hooks than this which might interfer with his matter.
> 
> 4) Place the egress hook again before the tc egress hook as requested
>     by Daniel Borkmann. Patch to add egress hook from Lukas Wunner.
>     The Netfilter egress hook remains behind the static key, if unused
>     performance degradation is negligible.
> 
> 5) Add netfilter egress handling to af_packet.
> 
> Arguably, distributors might decide to compile nf_tables_netdev
> built-in. Traditionally, distributors have compiled their kernels using
> the default configuration that Netfilter Kconfig provides (ie. use
> modules whenever possible). In any case, I consider that distributor
> policy is out of scope in this discussion, providing a mechanism to
> allow Daniel to prevent Netfilter ingress and egress chains to be loaded
> should be sufficient IMHO.

Hm, so in the case of SRv6 users were running into a similar issue and commit
7a3f5b0de364 ("netfilter: add netfilter hooks to SRv6 data plane") [0] added
a new hook along with a sysctl which defaults the new hook to off.

The rationale for it was given as "the hooks are enabled via nf_hooks_lwtunnel
sysctl to make sure existing netfilter rulesets do not break." [0,1]

If the suggestion to flag the skb [2] one way or another from the tc forwarding
path (e.g. skb bit or per-cpu marker) is not technically feasible, then why not
do a sysctl toggle like in the SRv6 case?

   [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7a3f5b0de3647c854e34269c3332d7a1e902901a
   [1] https://lore.kernel.org/netdev/20210830093852.21654-1-pablo@netfilter.org/
   [2] https://lore.kernel.org/netdev/11584665-e3b5-3afe-d72c-ef92ad0b9313@iogearbox.net/
Lukas Wunner Sept. 30, 2021, 6:52 a.m. UTC | #2
On Thu, Sep 30, 2021 at 08:08:53AM +0200, Daniel Borkmann wrote:
> Hm, so in the case of SRv6 users were running into a similar issue
> and commit 7a3f5b0de364 ("netfilter: add netfilter hooks to SRv6
> data plane") [0] added a new hook along with a sysctl which defaults
> the new hook to off.
> 
> The rationale for it was given as "the hooks are enabled via
> nf_hooks_lwtunnel sysctl to make sure existing netfilter rulesets
>  do not break." [0,1]
> 
> If the suggestion to flag the skb [2] one way or another from the
> tc forwarding path (e.g. skb bit or per-cpu marker) is not
> technically feasible, then why not do a sysctl toggle like in the
> SRv6 case?

The skb flag *is* technically feasible.  I amended the patches with
the flag and was going to post them this week, but Pablo beat me to
the punch and posted his alternative version, which lacks the flag
but modularizes netfilter ingress/egress processing instead.

Honestly I think a hodge-podge of config options and sysctl toggles
is awful and I would prefer the skb flag you suggested.  I kind of
like your idea of considering tc and netfilter as layers.

FWIW the finished patches *with* the flag are on this branch:
https://github.com/l1k/linux/commits/nft_egress_v5

Below is the "git range-diff" between Pablo's patches and mine
(just the hunks which pertain to the skb flag, plus excerpts
from the commit message).

Would you find the patch set acceptable with this skb flag?

-- >8 --

    +    Alternatively or in addition to netfilter, packets can be classified
    +    with traffic control (tc).  On ingress, packets are classified first by
    +    tc, then by netfilter.  On egress, the order is reversed for symmetry.
    +    Conceptually, tc and netfilter can be thought of as layers, with
    +    netfilter layered above tc.
     
    +    Traffic control is capable of redirecting packets to another interface
    +    (man 8 tc-mirred).  E.g., an ingress packet may be redirected from the
    +    host namespace to a container via a veth connection:
    +    tc ingress (host) -> tc egress (veth host) -> tc ingress (veth container)
     
    +    In this case, netfilter egress classifying is not performed when leaving
    +    the host namespace!  That's because the packet is still on the tc layer.
    +    If tc redirects the packet to a physical interface in the host namespace
    +    such that it leaves the system, the packet is never subjected to
    +    netfilter egress classifying.  That is only logical since it hasn't
    +    passed through netfilter ingress classifying either.
     
    +    Packets can alternatively be redirected at the netfilter layer using
    +    nft fwd.  Such a packet *is* subjected to netfilter egress classifying.
    +    Internally, the skb->nf_skip_egress flag controls whether netfilter is
    +    invoked on egress by __dev_queue_xmit().
    +
    +    Interaction between tc and netfilter is possible by setting and querying
    +    skb->mark.
     
    @@ include/linux/netfilter_netdev.h: static inline int nf_hook_ingress(struct sk_bu
     +static inline struct sk_buff *nf_hook_egress(struct sk_buff *skb, int *rc,
     +					     struct net_device *dev)
     +{
    -+	struct nf_hook_entries *e = rcu_dereference(dev->nf_hooks_egress);
    ++	struct nf_hook_entries *e;
     +	struct nf_hook_state state;
     +	int ret;
     +
    ++	if (skb->nf_skip_egress)
    ++		return skb;
    ++
    ++	e = rcu_dereference(dev->nf_hooks_egress);
     +	if (!e)
     +		return skb;
     +
    @@ include/linux/netfilter_netdev.h: static inline int nf_hook_ingress(struct sk_bu
     +		return NULL;
     +	}
     +}
    ++
    ++static inline void nf_skip_egress(struct sk_buff *skb, bool skip)
    ++{
    ++	skb->nf_skip_egress = skip;
    ++}
     +#else /* CONFIG_NETFILTER_EGRESS */
     +static inline bool nf_hook_egress_active(void)
     +{
    @@ include/linux/netfilter_netdev.h: static inline int nf_hook_ingress(struct sk_bu
     +{
     +	return skb;
     +}
    ++
    ++static inline void nf_skip_egress(struct sk_buff *skb, bool skip)
    ++{
    ++}
     +#endif /* CONFIG_NETFILTER_EGRESS */
     +
      static inline void nf_hook_netdev_init(struct net_device *dev)
    @@ include/linux/netfilter_netdev.h: static inline int nf_hook_ingress(struct sk_bu
      
      #endif /* _NETFILTER_NETDEV_H_ */
     
    + ## include/linux/skbuff.h ##
    +@@ include/linux/skbuff.h: typedef unsigned char *sk_buff_data_t;
    +  *	@tc_at_ingress: used within tc_classify to distinguish in/egress
    +  *	@redirected: packet was redirected by packet classifier
    +  *	@from_ingress: packet was redirected from the ingress path
    ++ *	@nf_skip_egress: packet shall skip netfilter egress processing
    +  *	@peeked: this packet has been seen already, so stats have been
    +  *		done for it, don't do them again
    +  *	@nf_trace: netfilter packet trace flag
    +@@ include/linux/skbuff.h: struct sk_buff {
    + #ifdef CONFIG_NET_REDIRECT
    + 	__u8			from_ingress:1;
    + #endif
    ++#ifdef CONFIG_NETFILTER_EGRESS
    ++	__u8			nf_skip_egress:1;
    ++#endif
    + #ifdef CONFIG_TLS_DEVICE
    + 	__u8			decrypted:1;
    + #endif
    +
      ## include/uapi/linux/netfilter.h ##
     @@ include/uapi/linux/netfilter.h: enum nf_inet_hooks {
      
    @@ net/core/dev.c: static int __dev_queue_xmit(struct sk_buff *skb, struct net_devi
     +			if (!skb)
     +				goto out;
     +		}
    ++		nf_skip_egress(skb, true);
      		skb = sch_handle_egress(skb, &rc, dev);
      		if (!skb)
      			goto out;
    @@ net/core/dev.c: static int __dev_queue_xmit(struct sk_buff *skb, struct net_devi
      #endif
      	/* If device/qdisc don't need skb->dst, release it right now while
      	 * its hot in this cpu cache.
    +@@ net/core/dev.c: static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
    + 	if (static_branch_unlikely(&ingress_needed_key)) {
    + 		bool another = false;
    + 
    ++		nf_skip_egress(skb, true);
    + 		skb = sch_handle_ingress(skb, &pt_prev, &ret, orig_dev,
    + 					 &another);
    + 		if (another)
    +@@ net/core/dev.c: static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
    + 		if (!skb)
    + 			goto out;
    + 
    ++		nf_skip_egress(skb, false);
    + 		if (nf_ingress(skb, &pt_prev, &ret, orig_dev) < 0)
    + 			goto out;
    + 	}
Daniel Borkmann Sept. 30, 2021, 7:10 a.m. UTC | #3
On 9/30/21 8:52 AM, Lukas Wunner wrote:
> On Thu, Sep 30, 2021 at 08:08:53AM +0200, Daniel Borkmann wrote:
>> Hm, so in the case of SRv6 users were running into a similar issue
>> and commit 7a3f5b0de364 ("netfilter: add netfilter hooks to SRv6
>> data plane") [0] added a new hook along with a sysctl which defaults
>> the new hook to off.
>>
>> The rationale for it was given as "the hooks are enabled via
>> nf_hooks_lwtunnel sysctl to make sure existing netfilter rulesets
>>   do not break." [0,1]
>>
>> If the suggestion to flag the skb [2] one way or another from the
>> tc forwarding path (e.g. skb bit or per-cpu marker) is not
>> technically feasible, then why not do a sysctl toggle like in the
>> SRv6 case?
> 
> The skb flag *is* technically feasible.  I amended the patches with
> the flag and was going to post them this week, but Pablo beat me to
> the punch and posted his alternative version, which lacks the flag
> but modularizes netfilter ingress/egress processing instead.
> 
> Honestly I think a hodge-podge of config options and sysctl toggles
> is awful and I would prefer the skb flag you suggested.  I kind of
> like your idea of considering tc and netfilter as layers.
> 
> FWIW the finished patches *with* the flag are on this branch:
> https://github.com/l1k/linux/commits/nft_egress_v5
> 
> Below is the "git range-diff" between Pablo's patches and mine
> (just the hunks which pertain to the skb flag, plus excerpts
> from the commit message).
> 
> Would you find the patch set acceptable with this skb flag?

Yes, that looks good to me, thanks Lukas!
Pablo Neira Ayuso Sept. 30, 2021, 7:19 a.m. UTC | #4
On Thu, Sep 30, 2021 at 08:08:53AM +0200, Daniel Borkmann wrote:
> On 9/28/21 11:55 AM, Pablo Neira Ayuso wrote:
> > Hi,
> > 
> > This patchset v5 that re-adds the Netfilter egress:
> > 
> > 1) Rename linux/netfilter_ingress.h to linux/netfilter_netdev.h
> >     from Lukas Wunner.
> > 
> > 2) Generalize ingress hook file to accomodate egress support,
> >     from Lukas Wunner.
> > 
> > 3) Modularize Netfilter ingress hook into nf_tables_netdev: Daniel
> >     Borkmann is requesting for a mechanism to allow to blacklist
> >     Netfilter, this allows users to blacklist this new module that
> >     includes ingress chain and the new egress chain for the netdev
> >     family. There is no other in-tree user of the ingress and egress
> >     hooks than this which might interfer with his matter.
> > 
> > 4) Place the egress hook again before the tc egress hook as requested
> >     by Daniel Borkmann. Patch to add egress hook from Lukas Wunner.
> >     The Netfilter egress hook remains behind the static key, if unused
> >     performance degradation is negligible.
> > 
> > 5) Add netfilter egress handling to af_packet.
> > 
> > Arguably, distributors might decide to compile nf_tables_netdev
> > built-in. Traditionally, distributors have compiled their kernels using
> > the default configuration that Netfilter Kconfig provides (ie. use
> > modules whenever possible). In any case, I consider that distributor
> > policy is out of scope in this discussion, providing a mechanism to
> > allow Daniel to prevent Netfilter ingress and egress chains to be loaded
> > should be sufficient IMHO.
> 
> Hm, so in the case of SRv6 users were running into a similar issue and commit
> 7a3f5b0de364 ("netfilter: add netfilter hooks to SRv6 data plane") [0] added
> a new hook along with a sysctl which defaults the new hook to off.
> 
> The rationale for it was given as "the hooks are enabled via nf_hooks_lwtunnel
> sysctl to make sure existing netfilter rulesets do not break." [0,1]
> 
> If the suggestion to flag the skb [2] one way or another from the tc forwarding
> path (e.g. skb bit or per-cpu marker) is not technically feasible, then why not
> do a sysctl toggle like in the SRv6 case?

I am already providing a global toggle to disable netdev
ingress/egress hooks?

In the SRv6 case that is not possible.

Why do you need you need a sysctl knob when my proposal is already
addressing your needs?
Pablo Neira Ayuso Sept. 30, 2021, 7:21 a.m. UTC | #5
On Thu, Sep 30, 2021 at 08:52:38AM +0200, Lukas Wunner wrote:
> On Thu, Sep 30, 2021 at 08:08:53AM +0200, Daniel Borkmann wrote:
> > Hm, so in the case of SRv6 users were running into a similar issue
> > and commit 7a3f5b0de364 ("netfilter: add netfilter hooks to SRv6
> > data plane") [0] added a new hook along with a sysctl which defaults
> > the new hook to off.
> > 
> > The rationale for it was given as "the hooks are enabled via
> > nf_hooks_lwtunnel sysctl to make sure existing netfilter rulesets
> >  do not break." [0,1]
> > 
> > If the suggestion to flag the skb [2] one way or another from the
> > tc forwarding path (e.g. skb bit or per-cpu marker) is not
> > technically feasible, then why not do a sysctl toggle like in the
> > SRv6 case?
> 
> The skb flag *is* technically feasible.  I amended the patches with
> the flag and was going to post them this week, but Pablo beat me to
> the punch and posted his alternative version, which lacks the flag
> but modularizes netfilter ingress/egress processing instead.
> 
> Honestly I think a hodge-podge of config options and sysctl toggles
> is awful and I would prefer the skb flag you suggested.  I kind of
> like your idea of considering tc and netfilter as layers.
> 
> FWIW the finished patches *with* the flag are on this branch:
> https://github.com/l1k/linux/commits/nft_egress_v5
> 
> Below is the "git range-diff" between Pablo's patches and mine
> (just the hunks which pertain to the skb flag, plus excerpts
> from the commit message).
> 
> Would you find the patch set acceptable with this skb flag?

Why do you need a programmatic skb flag?

Are you planning to do:

        skb->skip_nf_egress = random();

from the packet path?

Seriously, Daniel is asking for a global toggle to disable Netfilter.

What is wrong with the Netfilter blacklisting approach?
Daniel Borkmann Sept. 30, 2021, 7:33 a.m. UTC | #6
On 9/30/21 9:19 AM, Pablo Neira Ayuso wrote:
> On Thu, Sep 30, 2021 at 08:08:53AM +0200, Daniel Borkmann wrote:
>> On 9/28/21 11:55 AM, Pablo Neira Ayuso wrote:
>>> Hi,
>>>
>>> This patchset v5 that re-adds the Netfilter egress:
>>>
>>> 1) Rename linux/netfilter_ingress.h to linux/netfilter_netdev.h
>>>      from Lukas Wunner.
>>>
>>> 2) Generalize ingress hook file to accomodate egress support,
>>>      from Lukas Wunner.
>>>
>>> 3) Modularize Netfilter ingress hook into nf_tables_netdev: Daniel
>>>      Borkmann is requesting for a mechanism to allow to blacklist
>>>      Netfilter, this allows users to blacklist this new module that
>>>      includes ingress chain and the new egress chain for the netdev
>>>      family. There is no other in-tree user of the ingress and egress
>>>      hooks than this which might interfer with his matter.
>>>
>>> 4) Place the egress hook again before the tc egress hook as requested
>>>      by Daniel Borkmann. Patch to add egress hook from Lukas Wunner.
>>>      The Netfilter egress hook remains behind the static key, if unused
>>>      performance degradation is negligible.
>>>
>>> 5) Add netfilter egress handling to af_packet.
>>>
>>> Arguably, distributors might decide to compile nf_tables_netdev
>>> built-in. Traditionally, distributors have compiled their kernels using
>>> the default configuration that Netfilter Kconfig provides (ie. use
>>> modules whenever possible). In any case, I consider that distributor
>>> policy is out of scope in this discussion, providing a mechanism to
>>> allow Daniel to prevent Netfilter ingress and egress chains to be loaded
>>> should be sufficient IMHO.
>>
>> Hm, so in the case of SRv6 users were running into a similar issue and commit
>> 7a3f5b0de364 ("netfilter: add netfilter hooks to SRv6 data plane") [0] added
>> a new hook along with a sysctl which defaults the new hook to off.
>>
>> The rationale for it was given as "the hooks are enabled via nf_hooks_lwtunnel
>> sysctl to make sure existing netfilter rulesets do not break." [0,1]
>>
>> If the suggestion to flag the skb [2] one way or another from the tc forwarding
>> path (e.g. skb bit or per-cpu marker) is not technically feasible, then why not
>> do a sysctl toggle like in the SRv6 case?
> 
> I am already providing a global toggle to disable netdev
> ingress/egress hooks?
> 
> In the SRv6 case that is not possible.
> 
> Why do you need you need a sysctl knob when my proposal is already
> addressing your needs?

Well, it's not addressing anything ... you even mention it yourself "arguably,
distributors might decide to compile nf_tables_netdev built-in".
Pablo Neira Ayuso Sept. 30, 2021, 9:21 a.m. UTC | #7
On Thu, Sep 30, 2021 at 09:33:23AM +0200, Daniel Borkmann wrote:
> On 9/30/21 9:19 AM, Pablo Neira Ayuso wrote:
[...]
> > Why do you need you need a sysctl knob when my proposal is already
> > addressing your needs?
> 
> Well, it's not addressing anything ... you even mention it yourself "arguably,
> distributors might decide to compile nf_tables_netdev built-in".

I said distributors traditionally select the option that we signal to
them, which is to enable this as module. We can document this in
Kconfig. I think distributors should select whatever is better for
their needs.

Anyway, I'll tell you why module blacklisting is bad: It is a hammer,
it is a band aid to a problem. Blacklisting is just making things
worst because it makes some people believe that something is
unfixable. Yes, it took me a while to figure out.

We already entered the let's bloat the skbuff for many years already,
this is stuffing one more bit into the skbuff just because maybe users
might break an existing setup when they load new rules to the new
netfilter egress hook.

Probably the sysctl for this new egress hook is the way to go as you
suggest.

Thanks.
Jakub Kicinski Sept. 30, 2021, 2:28 p.m. UTC | #8
On Thu, 30 Sep 2021 11:21:42 +0200 Pablo Neira Ayuso wrote:
> On Thu, Sep 30, 2021 at 09:33:23AM +0200, Daniel Borkmann wrote:
> > On 9/30/21 9:19 AM, Pablo Neira Ayuso wrote:  
> > > Why do you need you need a sysctl knob when my proposal is already
> > > addressing your needs?  
> > 
> > Well, it's not addressing anything ... you even mention it yourself "arguably,
> > distributors might decide to compile nf_tables_netdev built-in".  
> 
> I said distributors traditionally select the option that we signal to
> them, which is to enable this as module. We can document this in
> Kconfig. I think distributors should select whatever is better for
> their needs.
> 
> Anyway, I'll tell you why module blacklisting is bad: It is a hammer,
> it is a band aid to a problem. Blacklisting is just making things
> worst because it makes some people believe that something is
> unfixable. Yes, it took me a while to figure out.
> 
> We already entered the let's bloat the skbuff for many years already,
> this is stuffing one more bit into the skbuff just because maybe users
> might break an existing setup when they load new rules to the new
> netfilter egress hook.

The lifetime of this information is constrained, can't it be a percpu
flag, like xmit_more?

> Probably the sysctl for this new egress hook is the way to go as you
> suggest.

Knobs is making users pay, let's do our best to avoid that.
Pablo Neira Ayuso Sept. 30, 2021, 3:13 p.m. UTC | #9
On Thu, Sep 30, 2021 at 07:28:35AM -0700, Jakub Kicinski wrote:
> On Thu, 30 Sep 2021 11:21:42 +0200 Pablo Neira Ayuso wrote:
[...]
> The lifetime of this information is constrained, can't it be a percpu
> flag, like xmit_more?

It's just one single bit in this case after all.

> > Probably the sysctl for this new egress hook is the way to go as you
> > suggest.
> 
> Knobs is making users pay, let's do our best to avoid that.

Could you elaborate?

Thanks.
Jakub Kicinski Sept. 30, 2021, 4:06 p.m. UTC | #10
On Thu, 30 Sep 2021 17:13:37 +0200 Pablo Neira Ayuso wrote:
> On Thu, Sep 30, 2021 at 07:28:35AM -0700, Jakub Kicinski wrote:
> > The lifetime of this information is constrained, can't it be a percpu
> > flag, like xmit_more?  
> 
> It's just one single bit in this case after all.

??

> > > Probably the sysctl for this new egress hook is the way to go as you
> > > suggest.  
> > 
> > Knobs is making users pay, let's do our best to avoid that.  
> 
> Could you elaborate?

My reading of Daniel's objections was that the layering is incorrect
because tc is not exclusively "under" nf. That problem is not solved 
by adding a knob. The only thing the knob achieves is let someone
deploying tc/bpf based solution protect themselves from accidental
nf deployment.

That's just background / level set. IDK what requires explanation 
in my statement itself. I thought "admin knobs are bad" is as
universally agreed on as, say, "testing is good".
Lukas Wunner Sept. 30, 2021, 5:12 p.m. UTC | #11
On Thu, Sep 30, 2021 at 07:28:35AM -0700, Jakub Kicinski wrote:
> On Thu, 30 Sep 2021 11:21:42 +0200 Pablo Neira Ayuso wrote:
> > this is stuffing one more bit into the skbuff
> 
> The lifetime of this information is constrained, can't it be a percpu
> flag, like xmit_more?

Hm, can't an skb be queued and processed later on a different cpu?
E.g. what about fragments?

That would rule out a percpu flag, leaving a flag in struct sk_buff
as the only option.

Thanks,

Lukas
Jakub Kicinski Sept. 30, 2021, 5:19 p.m. UTC | #12
On Thu, 30 Sep 2021 19:12:53 +0200 Lukas Wunner wrote:
> On Thu, Sep 30, 2021 at 07:28:35AM -0700, Jakub Kicinski wrote:
> > On Thu, 30 Sep 2021 11:21:42 +0200 Pablo Neira Ayuso wrote:  
> > > this is stuffing one more bit into the skbuff  
> > 
> > The lifetime of this information is constrained, can't it be a percpu
> > flag, like xmit_more?  
> 
> Hm, can't an skb be queued and processed later on a different cpu?
> E.g. what about fragments?
> 
> That would rule out a percpu flag, leaving a flag in struct sk_buff
> as the only option.

What queuing do you have in mind? Qdisc is after the egress hook.
Lukas Wunner Sept. 30, 2021, 5:36 p.m. UTC | #13
On Thu, Sep 30, 2021 at 10:19:20AM -0700, Jakub Kicinski wrote:
> On Thu, 30 Sep 2021 19:12:53 +0200 Lukas Wunner wrote:
> > On Thu, Sep 30, 2021 at 07:28:35AM -0700, Jakub Kicinski wrote:
> > > On Thu, 30 Sep 2021 11:21:42 +0200 Pablo Neira Ayuso wrote:  
> > > > this is stuffing one more bit into the skbuff  
> > > 
> > > The lifetime of this information is constrained, can't it be a percpu
> > > flag, like xmit_more?  
> > 
> > Hm, can't an skb be queued and processed later on a different cpu?
> > E.g. what about fragments?
> > 
> > That would rule out a percpu flag, leaving a flag in struct sk_buff
> > as the only option.
> 
> What queuing do you have in mind? Qdisc is after the egress hook.

Ingress queueing.  E.g. a packet may be redirected or mirrored by tc
on ingress to another interface, resulting in a recursive call to
netif_receive_skb() or dev_queue_xmit().  The packet may be bounced
around an arbitrary number of times this way.  Forwarding like that
can happen both at the tc and the netfilter "layer".

I'm concerned that a packet may be handled by different cpus along the way
and queueing might be one possibility how this could happen.

Thanks,

Lukas
Pablo Neira Ayuso Sept. 30, 2021, 6 p.m. UTC | #14
On Thu, Sep 30, 2021 at 09:06:52AM -0700, Jakub Kicinski wrote:
> On Thu, 30 Sep 2021 17:13:37 +0200 Pablo Neira Ayuso wrote:
> > On Thu, Sep 30, 2021 at 07:28:35AM -0700, Jakub Kicinski wrote:
> > > The lifetime of this information is constrained, can't it be a percpu
> > > flag, like xmit_more?  
> > 
> > It's just one single bit in this case after all.
> 
> ??

There are "escape" points such ifb from ingress, where the packets gets
enqueued and then percpu might not help, it might be fragile to use
percpu in this case.

> > > > Probably the sysctl for this new egress hook is the way to go as you
> > > > suggest.  
> > > 
> > > Knobs is making users pay, let's do our best to avoid that.  
> > 
> > Could you elaborate?
> 
> My reading of Daniel's objections was that the layering is incorrect
> because tc is not exclusively "under" nf. That problem is not solved 
> by adding a knob. The only thing the knob achieves is let someone
> deploying tc/bpf based solution protect themselves from accidental
> nf deployment.
> 
> That's just background / level set. IDK what requires explanation 
> in my statement itself. I thought "admin knobs are bad" is as
> universally agreed on as, say, "testing is good".

Yes, knobs are not ideal but Daniel mentioned it as a posibility, the
skbuff bit might not ideal either because it might be not easy to
debug the behaviour that it turns on to the user, but it could only be
set on from act_mirred, Daniel mentioned to cover only the
skb->redirect case.

Thanks
Jakub Kicinski Sept. 30, 2021, 7:17 p.m. UTC | #15
On Thu, 30 Sep 2021 20:00:56 +0200 Pablo Neira Ayuso wrote:
> On Thu, Sep 30, 2021 at 09:06:52AM -0700, Jakub Kicinski wrote:
> > On Thu, 30 Sep 2021 17:13:37 +0200 Pablo Neira Ayuso wrote:  
> > > It's just one single bit in this case after all.  
> > 
> > ??  
> 
> There are "escape" points such ifb from ingress, where the packets gets
> enqueued and then percpu might not help, it might be fragile to use
> percpu in this case.

You still have to scrub the skb mark at the correct points, otherwise
the ignoring egress may propagate beyond the "paired hook". I don't see
much difference in fragility TBH.

Speaking of ifb, doesn't it have an egress hook? And ingress on the way
out? IMHO the "ignore egress" mark should not survive going thru ifb.

Anyway, that's just my preference. Whatever you, Daniel and Lukas
decide together in the end is fine by me.