diff mbox series

[net-next] netfilter: conntrack: avoid sending RST to reply out-of-window skb

Message ID 20240307090732.56708-1-kerneljasonxing@gmail.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [net-next] netfilter: conntrack: avoid sending RST to reply out-of-window skb | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit fail Errors and warnings before: 940 this patch: 941
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 9 of 9 maintainers
netdev/build_clang fail Errors and warnings before: 956 this patch: 959
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 956 this patch: 957
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 15 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Jason Xing March 7, 2024, 9:07 a.m. UTC
From: Jason Xing <kernelxing@tencent.com>

Supposing we set DNAT policy converting a_port to b_port on the
server at the beginning, the socket is set up by using 4-turple:

client_ip:client_port <--> server_ip:b_port

Then, some strange skbs from client or gateway, say, out-of-window
skbs are sent to the server_ip:a_port (not b_port) due to DNAT
clearing skb->_nfct value in nf_conntrack_in() function. Why?
Because the tcp_in_window() considers the incoming skb as an
invalid skb by returning NFCT_TCP_INVALID.

At last, the TCP layer process the out-of-window
skb (client_ip,client_port,server_ip,a_port) and try to look up
such an socket in tcp_v4_rcv(), as we can see, it will fail for sure
and send back an RST to the client.

The detailed call graphs go like this:
1)
nf_conntrack_in()
  -> nf_conntrack_handle_packet()
    -> nf_conntrack_tcp_packet()
      -> tcp_in_window() // tests if the skb is out-of-window
      -> return -NF_ACCEPT;
  -> skb->_nfct = 0; // if the above line returns a negative value
2)
tcp_v4_rcv()
  -> __inet_lookup_skb() // fails, then jump to no_tcp_socket
  -> tcp_v4_send_reset()

The moment the client receives the RST, it will drop. So the RST
skb doesn't hurt the client (maybe hurt some gateway which cancels
the session when filtering the RST without validating
the sequence because of performance reason). Well, it doesn't
matter. However, we can see many RST in flight.

The key reason why I wrote this patch is that I don't think
the behaviour is expected because the RFC 793 defines this
case:

"If the connection is in a synchronized state (ESTABLISHED,
 FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT),
 any unacceptable segment (out of window sequence number or
 unacceptible acknowledgment number) must elicit only an empty
 acknowledgment segment containing the current send-sequence number
 and an acknowledgment..."

I think, even we have set DNAT policy, it would be better if the
whole process/behaviour adheres to the original TCP behaviour.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/netfilter/nf_conntrack_proto_tcp.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

Comments

Florian Westphal March 7, 2024, 9:33 a.m. UTC | #1
Jason Xing <kerneljasonxing@gmail.com> wrote:
> client_ip:client_port <--> server_ip:b_port
> 
> Then, some strange skbs from client or gateway, say, out-of-window
> skbs are sent to the server_ip:a_port (not b_port) due to DNAT
> clearing skb->_nfct value in nf_conntrack_in() function. Why?
> Because the tcp_in_window() considers the incoming skb as an
> invalid skb by returning NFCT_TCP_INVALID.

So far everything is as intended.

> I think, even we have set DNAT policy, it would be better if the
> whole process/behaviour adheres to the original TCP behaviour.
> 
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
>  net/netfilter/nf_conntrack_proto_tcp.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
> index ae493599a3ef..3f3e620f3969 100644
> --- a/net/netfilter/nf_conntrack_proto_tcp.c
> +++ b/net/netfilter/nf_conntrack_proto_tcp.c
> @@ -1253,13 +1253,11 @@ int nf_conntrack_tcp_packet(struct nf_conn *ct,
>  	res = tcp_in_window(ct, dir, index,
>  			    skb, dataoff, th, state);
>  	switch (res) {
> -	case NFCT_TCP_IGNORE:
> -		spin_unlock_bh(&ct->lock);
> -		return NF_ACCEPT;
>  	case NFCT_TCP_INVALID:
>  		nf_tcp_handle_invalid(ct, dir, index, skb, state);
> +	case NFCT_TCP_IGNORE:
>  		spin_unlock_bh(&ct->lock);
> -		return -NF_ACCEPT;
> +		return NF_ACCEPT;

This looks wrong.  -NF_ACCEPT means 'pass packet, but its not part
of the connection' (packet will match --ctstate INVALID check).

This change disables most of the tcp_in_window() test, this will
pretend everything is fine even though tcp_in_window says otherwise.

You could:
 - drop invalid tcp packets in input hook
 - set nf_conntrack_tcp_be_liberal=1

both will avoid this 'rst' issue.
Jason Xing March 7, 2024, 11:02 a.m. UTC | #2
Hello Florian,

On Thu, Mar 7, 2024 at 5:33 PM Florian Westphal <fw@strlen.de> wrote:
>
> Jason Xing <kerneljasonxing@gmail.com> wrote:
> > client_ip:client_port <--> server_ip:b_port
> >
> > Then, some strange skbs from client or gateway, say, out-of-window
> > skbs are sent to the server_ip:a_port (not b_port) due to DNAT
> > clearing skb->_nfct value in nf_conntrack_in() function. Why?
> > Because the tcp_in_window() considers the incoming skb as an
> > invalid skb by returning NFCT_TCP_INVALID.
>
> So far everything is as intended.
>
> > I think, even we have set DNAT policy, it would be better if the
> > whole process/behaviour adheres to the original TCP behaviour.
> >
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> >  net/netfilter/nf_conntrack_proto_tcp.c | 6 ++----
> >  1 file changed, 2 insertions(+), 4 deletions(-)
> >
> > diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
> > index ae493599a3ef..3f3e620f3969 100644
> > --- a/net/netfilter/nf_conntrack_proto_tcp.c
> > +++ b/net/netfilter/nf_conntrack_proto_tcp.c
> > @@ -1253,13 +1253,11 @@ int nf_conntrack_tcp_packet(struct nf_conn *ct,
> >       res = tcp_in_window(ct, dir, index,
> >                           skb, dataoff, th, state);
> >       switch (res) {
> > -     case NFCT_TCP_IGNORE:
> > -             spin_unlock_bh(&ct->lock);
> > -             return NF_ACCEPT;
> >       case NFCT_TCP_INVALID:
> >               nf_tcp_handle_invalid(ct, dir, index, skb, state);
> > +     case NFCT_TCP_IGNORE:
> >               spin_unlock_bh(&ct->lock);
> > -             return -NF_ACCEPT;
> > +             return NF_ACCEPT;
>
> This looks wrong.  -NF_ACCEPT means 'pass packet, but its not part
> of the connection' (packet will match --ctstate INVALID check).
>
> This change disables most of the tcp_in_window() test, this will
> pretend everything is fine even though tcp_in_window says otherwise.

Thanks for the information. It does make sense.

What I've done is quite similar to nf_conntrack_tcp_be_liberal sysctl
knob which you also pointed out. It also pretends to ignore those
out-of-window skbs.

>
> You could:
>  - drop invalid tcp packets in input hook

How about changing the return value only as below? Only two cases will
be handled:

diff --git a/net/netfilter/nf_conntrack_proto_tcp.c
b/net/netfilter/nf_conntrack_proto_tcp.c
index ae493599a3ef..c88ce4cd041e 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -1259,7 +1259,7 @@ int nf_conntrack_tcp_packet(struct nf_conn *ct,
        case NFCT_TCP_INVALID:
                nf_tcp_handle_invalid(ct, dir, index, skb, state);
                spin_unlock_bh(&ct->lock);
-               return -NF_ACCEPT;
+               return -NF_DROP;
        case NFCT_TCP_ACCEPT:
                break;
        }

Then it will be dropped naturally in nf_hook_slow().

>  - set nf_conntrack_tcp_be_liberal=1

Sure, it can workaround this case, but I would like to refuse the
out-of-window in netfilter or TCP layer as default instead of turning
on this sysctl knob. If I understand wrong, please correct me.

Thanks,
Jason

>
> both will avoid this 'rst' issue.
Florian Westphal March 7, 2024, noon UTC | #3
Jason Xing <kerneljasonxing@gmail.com> wrote:
> > This change disables most of the tcp_in_window() test, this will
> > pretend everything is fine even though tcp_in_window says otherwise.
> 
> Thanks for the information. It does make sense.
> 
> What I've done is quite similar to nf_conntrack_tcp_be_liberal sysctl
> knob which you also pointed out. It also pretends to ignore those
> out-of-window skbs.
> 
> >
> > You could:
> >  - drop invalid tcp packets in input hook
> 
> How about changing the return value only as below? Only two cases will
> be handled:
> 
> diff --git a/net/netfilter/nf_conntrack_proto_tcp.c
> b/net/netfilter/nf_conntrack_proto_tcp.c
> index ae493599a3ef..c88ce4cd041e 100644
> --- a/net/netfilter/nf_conntrack_proto_tcp.c
> +++ b/net/netfilter/nf_conntrack_proto_tcp.c
> @@ -1259,7 +1259,7 @@ int nf_conntrack_tcp_packet(struct nf_conn *ct,
>         case NFCT_TCP_INVALID:
>                 nf_tcp_handle_invalid(ct, dir, index, skb, state);
>                 spin_unlock_bh(&ct->lock);
> -               return -NF_ACCEPT;
> +               return -NF_DROP;

Lets not do this.  conntrack should never drop packets and defer to ruleset
whereever possible.

> >  - set nf_conntrack_tcp_be_liberal=1
> 
> Sure, it can workaround this case, but I would like to refuse the
> out-of-window in netfilter or TCP layer as default instead of turning
> on this sysctl knob. If I understand wrong, please correct me.

Thats contradictory, you make a patch to always accept, then another
patch to always drop such packets?

You can get the drop behaviour via '-m conntrack --ctstate DROP' in
prerouting or inut hooks.

You can get the 'accept + do nat processing' via
nf_conntrack_tcp_be_liberal=1.
Jason Xing March 7, 2024, 1:33 p.m. UTC | #4
On Thu, Mar 7, 2024 at 8:00 PM Florian Westphal <fw@strlen.de> wrote:
>
> Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > This change disables most of the tcp_in_window() test, this will
> > > pretend everything is fine even though tcp_in_window says otherwise.
> >
> > Thanks for the information. It does make sense.
> >
> > What I've done is quite similar to nf_conntrack_tcp_be_liberal sysctl
> > knob which you also pointed out. It also pretends to ignore those
> > out-of-window skbs.
> >
> > >
> > > You could:
> > >  - drop invalid tcp packets in input hook
> >
> > How about changing the return value only as below? Only two cases will
> > be handled:
> >
> > diff --git a/net/netfilter/nf_conntrack_proto_tcp.c
> > b/net/netfilter/nf_conntrack_proto_tcp.c
> > index ae493599a3ef..c88ce4cd041e 100644
> > --- a/net/netfilter/nf_conntrack_proto_tcp.c
> > +++ b/net/netfilter/nf_conntrack_proto_tcp.c
> > @@ -1259,7 +1259,7 @@ int nf_conntrack_tcp_packet(struct nf_conn *ct,
> >         case NFCT_TCP_INVALID:
> >                 nf_tcp_handle_invalid(ct, dir, index, skb, state);
> >                 spin_unlock_bh(&ct->lock);
> > -               return -NF_ACCEPT;
> > +               return -NF_DROP;
>
> Lets not do this.  conntrack should never drop packets and defer to ruleset
> whereever possible.

Hmm, sorry, it is against my understanding.

If we cannot return -NF_DROP, why have we already added some 'return
NF_DROP' in the nf_conntrack_handle_packet() function? And why does
this test statement exist?

nf_conntrack_in()
  -> nf_conntrack_handle_packet()
  -> if (ret <= 0) {
         if (ret == -NF_DROP) NF_CT_STAT_INC_ATOMIC(state->net, drop);

>
> > >  - set nf_conntrack_tcp_be_liberal=1
> >
> > Sure, it can workaround this case, but I would like to refuse the
> > out-of-window in netfilter or TCP layer as default instead of turning
> > on this sysctl knob. If I understand wrong, please correct me.
>
> Thats contradictory, you make a patch to always accept, then another
> patch to always drop such packets?

My only purpose is not to let the TCP layer sending strange RST to the
right flow.

Besides, resorting to turning on nf_conntrack_tcp_be_liberal sysctl
knob seems odd to me though it can workaround :S

I would like to prevent sending such an RST as default behaviour.

I wonder why we have to send RST at last due to out-of-window skbs. It
should not happen, right? As I said before, It can be set as default
without relying on some sysctl knob.

Forgive my superficial knowledge :(

>
> You can get the drop behaviour via '-m conntrack --ctstate DROP' in
> prerouting or inut hooks.
>
> You can get the 'accept + do nat processing' via
> nf_conntrack_tcp_be_liberal=1.

Sure. Just turning on the sysctl knob can be helpful because I've
tested it in production. After all, it roughly returns NFCT_TCP_ACCEPT
in nf_tcp_log_invalid() without considering those various
out-of-window cases.

Thanks,
Jason
Florian Westphal March 7, 2024, 2:10 p.m. UTC | #5
Jason Xing <kerneljasonxing@gmail.com> wrote:
> On Thu, Mar 7, 2024 at 8:00 PM Florian Westphal <fw@strlen.de> wrote:
> >
> > Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > This change disables most of the tcp_in_window() test, this will
> > > > pretend everything is fine even though tcp_in_window says otherwise.
> > >
> > > Thanks for the information. It does make sense.
> > >
> > > What I've done is quite similar to nf_conntrack_tcp_be_liberal sysctl
> > > knob which you also pointed out. It also pretends to ignore those
> > > out-of-window skbs.
> > >
> > > >
> > > > You could:
> > > >  - drop invalid tcp packets in input hook
> > >
> > > How about changing the return value only as below? Only two cases will
> > > be handled:
> > >
> > > diff --git a/net/netfilter/nf_conntrack_proto_tcp.c
> > > b/net/netfilter/nf_conntrack_proto_tcp.c
> > > index ae493599a3ef..c88ce4cd041e 100644
> > > --- a/net/netfilter/nf_conntrack_proto_tcp.c
> > > +++ b/net/netfilter/nf_conntrack_proto_tcp.c
> > > @@ -1259,7 +1259,7 @@ int nf_conntrack_tcp_packet(struct nf_conn *ct,
> > >         case NFCT_TCP_INVALID:
> > >                 nf_tcp_handle_invalid(ct, dir, index, skb, state);
> > >                 spin_unlock_bh(&ct->lock);
> > > -               return -NF_ACCEPT;
> > > +               return -NF_DROP;
> >
> > Lets not do this.  conntrack should never drop packets and defer to ruleset
> > whereever possible.
> 
> Hmm, sorry, it is against my understanding.
> 
> If we cannot return -NF_DROP, why have we already added some 'return
> NF_DROP' in the nf_conntrack_handle_packet() function? And why does
> this test statement exist?

Sure we can drop.  But we should only do it if there is no better
alternative.

> nf_conntrack_in()
>   -> nf_conntrack_handle_packet()
>   -> if (ret <= 0) {
>          if (ret == -NF_DROP) NF_CT_STAT_INC_ATOMIC(state->net, drop);

AFAICS this only happens when we receive syn for an existing conntrack
that is being removed already so we'd expect next syn to create a new
connection.  Feel free to send patches that replace drop with -accept
where possible/where it makes sense, but I don't think the
TCP_CONNTRACK_SYN_SENT one can reasonably be avoided.

> My only purpose is not to let the TCP layer sending strange RST to the
> right flow.

AFAIU tcp layer is correct, no?  Out of the blue packet to some listener
socket?

> Besides, resorting to turning on nf_conntrack_tcp_be_liberal sysctl
> knob seems odd to me though it can workaround :S

I don't see a better alternative, other than -p tcp -m conntrack
--ctstate INVALID -j DROP rule, if you wish for tcp stack to not see
such packets.

> I would like to prevent sending such an RST as default behaviour.

I don't see a way to make this work out of the box, without possible
unwanted side effects.

MAYBE we could drop IFF we check that the conntrack entry candidate
that fails sequence validation has NAT translation applied to it, and
thus the '-NF_ACCEPT' packet won't be translated.

Not even compile tested:

diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -1256,10 +1256,14 @@ int nf_conntrack_tcp_packet(struct nf_conn *ct,
        case NFCT_TCP_IGNORE:
                spin_unlock_bh(&ct->lock);
                return NF_ACCEPT;
-       case NFCT_TCP_INVALID:
+       case NFCT_TCP_INVALID: {
+               verdict = -NF_ACCEPT;
+               if (ct->status & IPS_NAT_MASK)
+                       res = NF_DROP; /* skb would miss nat transformation */
                nf_tcp_handle_invalid(ct, dir, index, skb, state);
                spin_unlock_bh(&ct->lock);
-               return -NF_ACCEPT;
+               return verdict;
+       }
        case NFCT_TCP_ACCEPT:
                break;
        }

But I don't really see the advantage compared to doing drop decision in
iptables/nftables ruleset.

I also have a hunch that someone will eventually complain about this
change in behavior.
Jason Xing March 7, 2024, 3:11 p.m. UTC | #6
On Thu, Mar 7, 2024 at 10:10 PM Florian Westphal <fw@strlen.de> wrote:
>
> Jason Xing <kerneljasonxing@gmail.com> wrote:
> > On Thu, Mar 7, 2024 at 8:00 PM Florian Westphal <fw@strlen.de> wrote:
> > >
> > > Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > > This change disables most of the tcp_in_window() test, this will
> > > > > pretend everything is fine even though tcp_in_window says otherwise.
> > > >
> > > > Thanks for the information. It does make sense.
> > > >
> > > > What I've done is quite similar to nf_conntrack_tcp_be_liberal sysctl
> > > > knob which you also pointed out. It also pretends to ignore those
> > > > out-of-window skbs.
> > > >
> > > > >
> > > > > You could:
> > > > >  - drop invalid tcp packets in input hook
> > > >
> > > > How about changing the return value only as below? Only two cases will
> > > > be handled:
> > > >
> > > > diff --git a/net/netfilter/nf_conntrack_proto_tcp.c
> > > > b/net/netfilter/nf_conntrack_proto_tcp.c
> > > > index ae493599a3ef..c88ce4cd041e 100644
> > > > --- a/net/netfilter/nf_conntrack_proto_tcp.c
> > > > +++ b/net/netfilter/nf_conntrack_proto_tcp.c
> > > > @@ -1259,7 +1259,7 @@ int nf_conntrack_tcp_packet(struct nf_conn *ct,
> > > >         case NFCT_TCP_INVALID:
> > > >                 nf_tcp_handle_invalid(ct, dir, index, skb, state);
> > > >                 spin_unlock_bh(&ct->lock);
> > > > -               return -NF_ACCEPT;
> > > > +               return -NF_DROP;
> > >
> > > Lets not do this.  conntrack should never drop packets and defer to ruleset
> > > whereever possible.
> >
> > Hmm, sorry, it is against my understanding.
> >
> > If we cannot return -NF_DROP, why have we already added some 'return
> > NF_DROP' in the nf_conntrack_handle_packet() function? And why does
> > this test statement exist?
>
> Sure we can drop.  But we should only do it if there is no better
> alternative.
>
> > nf_conntrack_in()
> >   -> nf_conntrack_handle_packet()
> >   -> if (ret <= 0) {
> >          if (ret == -NF_DROP) NF_CT_STAT_INC_ATOMIC(state->net, drop);
>
> AFAICS this only happens when we receive syn for an existing conntrack
> that is being removed already so we'd expect next syn to create a new

Sorry, I've double-checked this part and found out there is no chance
to return '-NF_DROP' for nf_conntrack_handle_packet(). It might return
'NF_DROP' (see link [1]) instead. The if-else statements seem like
dead code.

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/netfilter/nf_conntrack_proto_tcp.c#:~:text=%2DNF_REPEAT%3B-,return%20NF_DROP%3B,-%7D%0A%09%09fallthrough%3B

> connection.  Feel free to send patches that replace drop with -accept
> where possible/where it makes sense, but I don't think the
> TCP_CONNTRACK_SYN_SENT one can reasonably be avoided.

Oh, are you suggesting replacing NF_DROP with -NF_ACCEPT in
nf_conntrack_dccp_packet()?

There are three points where nf_conntrack_handle_packet() returns NF_DROP:
1) one (syn_sent case) exists in nf_conntrack_tcp_packet(). As you
said, it's not necessary to change.
2) another two exist in nf_conntrack_dccp_packet() which should be the
same as nf_conntrack_tcp_packet() handles.

The patch goes like this:
diff --git a/net/netfilter/nf_conntrack_proto_dccp.c
b/net/netfilter/nf_conntrack_proto_dccp.c
index e2db1f4ec2df..ebc4f733bb2e 100644
--- a/net/netfilter/nf_conntrack_proto_dccp.c
+++ b/net/netfilter/nf_conntrack_proto_dccp.c
@@ -525,7 +525,7 @@ int nf_conntrack_dccp_packet(struct nf_conn *ct,
struct sk_buff *skb,

        dh = skb_header_pointer(skb, dataoff, sizeof(*dh), &_dh.dh);
        if (!dh)
-               return NF_DROP;
+               return -NF_ACCEPT;

        if (dccp_error(dh, skb, dataoff, state))
                return -NF_ACCEPT;
@@ -533,7 +533,7 @@ int nf_conntrack_dccp_packet(struct nf_conn *ct,
struct sk_buff *skb,
        /* pull again, including possible 48 bit sequences and subtype header */
        dh = dccp_header_pointer(skb, dataoff, dh, &_dh);
        if (!dh)
-               return NF_DROP;
+               return -NF_ACCEPT;

        type = dh->dccph_type;
        if (!nf_ct_is_confirmed(ct) && !dccp_new(ct, skb, dh, state))

>
> > My only purpose is not to let the TCP layer sending strange RST to the
> > right flow.
>
> AFAIU tcp layer is correct, no?  Out of the blue packet to some listener
> socket?

Allow me to finish the full sentence: my only purpose is not to let
the TCP layer send strange RST to the _established_ socket due to
receiving strange out-of-window skbs.

>
> > Besides, resorting to turning on nf_conntrack_tcp_be_liberal sysctl
> > knob seems odd to me though it can workaround :S
>
> I don't see a better alternative, other than -p tcp -m conntrack
> --ctstate INVALID -j DROP rule, if you wish for tcp stack to not see
> such packets.
>
> > I would like to prevent sending such an RST as default behaviour.
>
> I don't see a way to make this work out of the box, without possible
> unwanted side effects.
>
> MAYBE we could drop IFF we check that the conntrack entry candidate
> that fails sequence validation has NAT translation applied to it, and
> thus the '-NF_ACCEPT' packet won't be translated.
>
> Not even compile tested:
>
> diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
> --- a/net/netfilter/nf_conntrack_proto_tcp.c
> +++ b/net/netfilter/nf_conntrack_proto_tcp.c
> @@ -1256,10 +1256,14 @@ int nf_conntrack_tcp_packet(struct nf_conn *ct,
>         case NFCT_TCP_IGNORE:
>                 spin_unlock_bh(&ct->lock);
>                 return NF_ACCEPT;
> -       case NFCT_TCP_INVALID:
> +       case NFCT_TCP_INVALID: {
> +               verdict = -NF_ACCEPT;
> +               if (ct->status & IPS_NAT_MASK)
> +                       res = NF_DROP; /* skb would miss nat transformation */

Above line, I guess, should be 'verdict = NF_DROP'? Then this skb
would be dropped in nf_hook_slow() eventually and would not be passed
to the TCP layer.

>                 nf_tcp_handle_invalid(ct, dir, index, skb, state);
>                 spin_unlock_bh(&ct->lock);
> -               return -NF_ACCEPT;
> +               return verdict;
> +       }
>         case NFCT_TCP_ACCEPT:
>                 break;
>         }

Great! I think your draft patch makes sense really, which takes NAT
into consideration.

>
> But I don't really see the advantage compared to doing drop decision in
> iptables/nftables ruleset.

From our views, especially to kernel developers, you're right: we
could easily turn on that knob or add a drop policy to prevent it
happening. Actually I did this in production to prevent such a case.
It surely works.

But from the views of normal users and those who do not understand how
it works in the kernel, it looks strange: people may ask why we get
some unknown RSTs in flight?

>
> I also have a hunch that someone will eventually complain about this
> change in behavior.

Well, I still think the patch you suggested is proper and don't know
why people could complain about it.

Thanks for your patience :)

Thanks,
Jason
Jozsef Kadlecsik March 7, 2024, 3:34 p.m. UTC | #7
On Thu, 7 Mar 2024, Jason Xing wrote:

> On Thu, Mar 7, 2024 at 10:10 PM Florian Westphal <fw@strlen.de> wrote:
> >
> > Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > On Thu, Mar 7, 2024 at 8:00 PM Florian Westphal <fw@strlen.de> wrote:
> > > >
> > > > Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > > > This change disables most of the tcp_in_window() test, this will
> > > > > > pretend everything is fine even though tcp_in_window says otherwise.
> > > > >
> > > > > Thanks for the information. It does make sense.
> > > > >
> > > > > What I've done is quite similar to nf_conntrack_tcp_be_liberal sysctl
> > > > > knob which you also pointed out. It also pretends to ignore those
> > > > > out-of-window skbs.
> > > > >
> > > > > >
> > > > > > You could:
> > > > > >  - drop invalid tcp packets in input hook
> > > > >
> > > > > How about changing the return value only as below? Only two cases will
> > > > > be handled:
> > > > >
> > > > > diff --git a/net/netfilter/nf_conntrack_proto_tcp.c
> > > > > b/net/netfilter/nf_conntrack_proto_tcp.c
> > > > > index ae493599a3ef..c88ce4cd041e 100644
> > > > > --- a/net/netfilter/nf_conntrack_proto_tcp.c
> > > > > +++ b/net/netfilter/nf_conntrack_proto_tcp.c
> > > > > @@ -1259,7 +1259,7 @@ int nf_conntrack_tcp_packet(struct nf_conn *ct,
> > > > >         case NFCT_TCP_INVALID:
> > > > >                 nf_tcp_handle_invalid(ct, dir, index, skb, state);
> > > > >                 spin_unlock_bh(&ct->lock);
> > > > > -               return -NF_ACCEPT;
> > > > > +               return -NF_DROP;
> > > >
> > > > Lets not do this.  conntrack should never drop packets and defer to ruleset
> > > > whereever possible.
> > >
> > > Hmm, sorry, it is against my understanding.
> > >
> > > If we cannot return -NF_DROP, why have we already added some 'return
> > > NF_DROP' in the nf_conntrack_handle_packet() function? And why does
> > > this test statement exist?
> >
> > Sure we can drop.  But we should only do it if there is no better
> > alternative.
> >
> > > nf_conntrack_in()
> > >   -> nf_conntrack_handle_packet()
> > >   -> if (ret <= 0) {
> > >          if (ret == -NF_DROP) NF_CT_STAT_INC_ATOMIC(state->net, drop);
> >
> > AFAICS this only happens when we receive syn for an existing conntrack
> > that is being removed already so we'd expect next syn to create a new
> 
> Sorry, I've double-checked this part and found out there is no chance
> to return '-NF_DROP' for nf_conntrack_handle_packet(). It might return
> 'NF_DROP' (see link [1]) instead. The if-else statements seem like
> dead code.
> 
> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/netfilter/nf_conntrack_proto_tcp.c#:~:text=%2DNF_REPEAT%3B-,return%20NF_DROP%3B,-%7D%0A%09%09fallthrough%3B
> 
> > connection.  Feel free to send patches that replace drop with -accept
> > where possible/where it makes sense, but I don't think the
> > TCP_CONNTRACK_SYN_SENT one can reasonably be avoided.
> 
> Oh, are you suggesting replacing NF_DROP with -NF_ACCEPT in
> nf_conntrack_dccp_packet()?
> 
> There are three points where nf_conntrack_handle_packet() returns NF_DROP:
> 1) one (syn_sent case) exists in nf_conntrack_tcp_packet(). As you
> said, it's not necessary to change.
> 2) another two exist in nf_conntrack_dccp_packet() which should be the
> same as nf_conntrack_tcp_packet() handles.
> 
> The patch goes like this:
> diff --git a/net/netfilter/nf_conntrack_proto_dccp.c
> b/net/netfilter/nf_conntrack_proto_dccp.c
> index e2db1f4ec2df..ebc4f733bb2e 100644
> --- a/net/netfilter/nf_conntrack_proto_dccp.c
> +++ b/net/netfilter/nf_conntrack_proto_dccp.c
> @@ -525,7 +525,7 @@ int nf_conntrack_dccp_packet(struct nf_conn *ct,
> struct sk_buff *skb,
> 
>         dh = skb_header_pointer(skb, dataoff, sizeof(*dh), &_dh.dh);
>         if (!dh)
> -               return NF_DROP;
> +               return -NF_ACCEPT;
> 
>         if (dccp_error(dh, skb, dataoff, state))
>                 return -NF_ACCEPT;
> @@ -533,7 +533,7 @@ int nf_conntrack_dccp_packet(struct nf_conn *ct,
> struct sk_buff *skb,
>         /* pull again, including possible 48 bit sequences and subtype header */
>         dh = dccp_header_pointer(skb, dataoff, dh, &_dh);
>         if (!dh)
> -               return NF_DROP;
> +               return -NF_ACCEPT;
> 
>         type = dh->dccph_type;
>         if (!nf_ct_is_confirmed(ct) && !dccp_new(ct, skb, dh, state))
> 
> >
> > > My only purpose is not to let the TCP layer sending strange RST to the
> > > right flow.
> >
> > AFAIU tcp layer is correct, no?  Out of the blue packet to some listener
> > socket?
> 
> Allow me to finish the full sentence: my only purpose is not to let
> the TCP layer send strange RST to the _established_ socket due to
> receiving strange out-of-window skbs.

I don't understand why do you want to modify conntrack at all: conntrack 
itself does not send RST packets. And the TCP layer don't send RST packets 
to out of window ones either.

The only possibility I see for such packets is an iptables/nftables rule 
which rejects packets classified as INVALID by conntrack.

As Florian suggested, why don't you change that rule?

The conntrack states are not fine-grained to express different TCP states 
which covered with INVALID. It was never a good idea to reject INVALID 
packets or let them through (leaking internal addresses).

Best regards,
Jozsef

> > > Besides, resorting to turning on nf_conntrack_tcp_be_liberal sysctl
> > > knob seems odd to me though it can workaround :S
> >
> > I don't see a better alternative, other than -p tcp -m conntrack
> > --ctstate INVALID -j DROP rule, if you wish for tcp stack to not see
> > such packets.
> >
> > > I would like to prevent sending such an RST as default behaviour.
> >
> > I don't see a way to make this work out of the box, without possible
> > unwanted side effects.
> >
> > MAYBE we could drop IFF we check that the conntrack entry candidate
> > that fails sequence validation has NAT translation applied to it, and
> > thus the '-NF_ACCEPT' packet won't be translated.
> >
> > Not even compile tested:
> >
> > diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
> > --- a/net/netfilter/nf_conntrack_proto_tcp.c
> > +++ b/net/netfilter/nf_conntrack_proto_tcp.c
> > @@ -1256,10 +1256,14 @@ int nf_conntrack_tcp_packet(struct nf_conn *ct,
> >         case NFCT_TCP_IGNORE:
> >                 spin_unlock_bh(&ct->lock);
> >                 return NF_ACCEPT;
> > -       case NFCT_TCP_INVALID:
> > +       case NFCT_TCP_INVALID: {
> > +               verdict = -NF_ACCEPT;
> > +               if (ct->status & IPS_NAT_MASK)
> > +                       res = NF_DROP; /* skb would miss nat transformation */
> 
> Above line, I guess, should be 'verdict = NF_DROP'? Then this skb
> would be dropped in nf_hook_slow() eventually and would not be passed
> to the TCP layer.
> 
> >                 nf_tcp_handle_invalid(ct, dir, index, skb, state);
> >                 spin_unlock_bh(&ct->lock);
> > -               return -NF_ACCEPT;
> > +               return verdict;
> > +       }
> >         case NFCT_TCP_ACCEPT:
> >                 break;
> >         }
> 
> Great! I think your draft patch makes sense really, which takes NAT
> into consideration.
> 
> >
> > But I don't really see the advantage compared to doing drop decision in
> > iptables/nftables ruleset.
> 
> From our views, especially to kernel developers, you're right: we
> could easily turn on that knob or add a drop policy to prevent it
> happening. Actually I did this in production to prevent such a case.
> It surely works.
> 
> But from the views of normal users and those who do not understand how
> it works in the kernel, it looks strange: people may ask why we get
> some unknown RSTs in flight?
>
> > I also have a hunch that someone will eventually complain about this
> > change in behavior.
> 
> Well, I still think the patch you suggested is proper and don't know
> why people could complain about it.
> 
> Thanks for your patience :)
> 
> Thanks,
> Jason
>
Jason Xing March 7, 2024, 3:59 p.m. UTC | #8
[...]
> > Allow me to finish the full sentence: my only purpose is not to let
> > the TCP layer send strange RST to the _established_ socket due to
> > receiving strange out-of-window skbs.
>
> I don't understand why do you want to modify conntrack at all: conntrack
> itself does not send RST packets. And the TCP layer don't send RST packets
> to out of window ones either.

Thanks for your reply.

To normal TCP flow, you're right because the TCP layer doesn't send
RST to out-of-window skbs.

But the DNAT policy on the server should have converted the port of
incoming skb from A_port to B_port as my description in this patch
said.

It actually doesn't happen. The conntrack clears the skb->_nfct value
after returning -NF_ACCEPT in nf_conntrack_tcp_packet() and then DNAT
would not convert the A_port to B_port.

So the TCP layer is not able to look up the correct socket (client_ip,
client_port, server_ip, B_port) because A_port doesn't match B_port,
then an RST would be sent to the client.

>
> The only possibility I see for such packets is an iptables/nftables rule
> which rejects packets classified as INVALID by conntrack.
>
> As Florian suggested, why don't you change that rule?

As I said, just for the workaround method, only turning on that sysctl
knob can solve the issue.

Thanks,
Jason

>
> The conntrack states are not fine-grained to express different TCP states
> which covered with INVALID. It was never a good idea to reject INVALID
> packets or let them through (leaking internal addresses).
>
> Best regards,
> Jozsef
>
> > > > Besides, resorting to turning on nf_conntrack_tcp_be_liberal sysctl
> > > > knob seems odd to me though it can workaround :S
> > >
> > > I don't see a better alternative, other than -p tcp -m conntrack
> > > --ctstate INVALID -j DROP rule, if you wish for tcp stack to not see
> > > such packets.
> > >
> > > > I would like to prevent sending such an RST as default behaviour.
> > >
> > > I don't see a way to make this work out of the box, without possible
> > > unwanted side effects.
> > >
> > > MAYBE we could drop IFF we check that the conntrack entry candidate
> > > that fails sequence validation has NAT translation applied to it, and
> > > thus the '-NF_ACCEPT' packet won't be translated.
> > >
> > > Not even compile tested:
> > >
> > > diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
> > > --- a/net/netfilter/nf_conntrack_proto_tcp.c
> > > +++ b/net/netfilter/nf_conntrack_proto_tcp.c
> > > @@ -1256,10 +1256,14 @@ int nf_conntrack_tcp_packet(struct nf_conn *ct,
> > >         case NFCT_TCP_IGNORE:
> > >                 spin_unlock_bh(&ct->lock);
> > >                 return NF_ACCEPT;
> > > -       case NFCT_TCP_INVALID:
> > > +       case NFCT_TCP_INVALID: {
> > > +               verdict = -NF_ACCEPT;
> > > +               if (ct->status & IPS_NAT_MASK)
> > > +                       res = NF_DROP; /* skb would miss nat transformation */
> >
> > Above line, I guess, should be 'verdict = NF_DROP'? Then this skb
> > would be dropped in nf_hook_slow() eventually and would not be passed
> > to the TCP layer.
> >
> > >                 nf_tcp_handle_invalid(ct, dir, index, skb, state);
> > >                 spin_unlock_bh(&ct->lock);
> > > -               return -NF_ACCEPT;
> > > +               return verdict;
> > > +       }
> > >         case NFCT_TCP_ACCEPT:
> > >                 break;
> > >         }
> >
> > Great! I think your draft patch makes sense really, which takes NAT
> > into consideration.
> >
> > >
> > > But I don't really see the advantage compared to doing drop decision in
> > > iptables/nftables ruleset.
> >
> > From our views, especially to kernel developers, you're right: we
> > could easily turn on that knob or add a drop policy to prevent it
> > happening. Actually I did this in production to prevent such a case.
> > It surely works.
> >
> > But from the views of normal users and those who do not understand how
> > it works in the kernel, it looks strange: people may ask why we get
> > some unknown RSTs in flight?
> >
> > > I also have a hunch that someone will eventually complain about this
> > > change in behavior.
> >
> > Well, I still think the patch you suggested is proper and don't know
> > why people could complain about it.
> >
> > Thanks for your patience :)
> >
> > Thanks,
> > Jason
> >
>
> --
> E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.hu
> PGP key : https://wigner.hu/~kadlec/pgp_public_key.txt
> Address : Wigner Research Centre for Physics
>           H-1525 Budapest 114, POB. 49, Hungary
Jozsef Kadlecsik March 7, 2024, 7 p.m. UTC | #9
On Thu, 7 Mar 2024, Jason Xing wrote:

> > > Allow me to finish the full sentence: my only purpose is not to let
> > > the TCP layer send strange RST to the _established_ socket due to
> > > receiving strange out-of-window skbs.
> >
> > I don't understand why do you want to modify conntrack at all: conntrack
> > itself does not send RST packets. And the TCP layer don't send RST packets
> > to out of window ones either.
> 
> To normal TCP flow, you're right because the TCP layer doesn't send RST 
> to out-of-window skbs.
> 
> But the DNAT policy on the server should have converted the port of 
> incoming skb from A_port to B_port as my description in this patch said.
> 
> It actually doesn't happen. The conntrack clears the skb->_nfct value 
> after returning -NF_ACCEPT in nf_conntrack_tcp_packet() and then DNAT 
> would not convert the A_port to B_port.

The packet is INVALID therefore it's not NATed. I don't think that could 
simply be changed in netfilter.

> So the TCP layer is not able to look up the correct socket (client_ip, 
> client_port, server_ip, B_port) because A_port doesn't match B_port, 
> then an RST would be sent to the client.

Yes, if you let the packet continue to traverse the stack.

> > The only possibility I see for such packets is an iptables/nftables rule
> > which rejects packets classified as INVALID by conntrack.
> >
> > As Florian suggested, why don't you change that rule?
> 
> As I said, just for the workaround method, only turning on that sysctl 
> knob can solve the issue.

Sorry, but why? If you drop the packet by a rule, then there's no need to 
use the sysctl setting and there's no unwanted RST packets.

Best regards,
Jozsef

> > The conntrack states are not fine-grained to express different TCP states
> > which covered with INVALID. It was never a good idea to reject INVALID
> > packets or let them through (leaking internal addresses).
> >
> > Best regards,
> > Jozsef
> >
> > > > > Besides, resorting to turning on nf_conntrack_tcp_be_liberal sysctl
> > > > > knob seems odd to me though it can workaround :S
> > > >
> > > > I don't see a better alternative, other than -p tcp -m conntrack
> > > > --ctstate INVALID -j DROP rule, if you wish for tcp stack to not see
> > > > such packets.
> > > >
> > > > > I would like to prevent sending such an RST as default behaviour.
> > > >
> > > > I don't see a way to make this work out of the box, without possible
> > > > unwanted side effects.
> > > >
> > > > MAYBE we could drop IFF we check that the conntrack entry candidate
> > > > that fails sequence validation has NAT translation applied to it, and
> > > > thus the '-NF_ACCEPT' packet won't be translated.
> > > >
> > > > Not even compile tested:
> > > >
> > > > diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
> > > > --- a/net/netfilter/nf_conntrack_proto_tcp.c
> > > > +++ b/net/netfilter/nf_conntrack_proto_tcp.c
> > > > @@ -1256,10 +1256,14 @@ int nf_conntrack_tcp_packet(struct nf_conn *ct,
> > > >         case NFCT_TCP_IGNORE:
> > > >                 spin_unlock_bh(&ct->lock);
> > > >                 return NF_ACCEPT;
> > > > -       case NFCT_TCP_INVALID:
> > > > +       case NFCT_TCP_INVALID: {
> > > > +               verdict = -NF_ACCEPT;
> > > > +               if (ct->status & IPS_NAT_MASK)
> > > > +                       res = NF_DROP; /* skb would miss nat transformation */
> > >
> > > Above line, I guess, should be 'verdict = NF_DROP'? Then this skb
> > > would be dropped in nf_hook_slow() eventually and would not be passed
> > > to the TCP layer.
> > >
> > > >                 nf_tcp_handle_invalid(ct, dir, index, skb, state);
> > > >                 spin_unlock_bh(&ct->lock);
> > > > -               return -NF_ACCEPT;
> > > > +               return verdict;
> > > > +       }
> > > >         case NFCT_TCP_ACCEPT:
> > > >                 break;
> > > >         }
> > >
> > > Great! I think your draft patch makes sense really, which takes NAT
> > > into consideration.
> > >
> > > >
> > > > But I don't really see the advantage compared to doing drop decision in
> > > > iptables/nftables ruleset.
> > >
> > > From our views, especially to kernel developers, you're right: we
> > > could easily turn on that knob or add a drop policy to prevent it
> > > happening. Actually I did this in production to prevent such a case.
> > > It surely works.
> > >
> > > But from the views of normal users and those who do not understand how
> > > it works in the kernel, it looks strange: people may ask why we get
> > > some unknown RSTs in flight?
> > >
> > > > I also have a hunch that someone will eventually complain about this
> > > > change in behavior.
> > >
> > > Well, I still think the patch you suggested is proper and don't know
> > > why people could complain about it.
> > >
> > > Thanks for your patience :)
> > >
> > > Thanks,
> > > Jason
> > >
> >
> > --
> > E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.hu
> > PGP key : https://wigner.hu/~kadlec/pgp_public_key.txt
> > Address : Wigner Research Centre for Physics
> >           H-1525 Budapest 114, POB. 49, Hungary
>
Jason Xing March 8, 2024, 12:42 a.m. UTC | #10
On Fri, Mar 8, 2024 at 3:00 AM Jozsef Kadlecsik
<kadlec@blackhole.kfki.hu> wrote:
>
> On Thu, 7 Mar 2024, Jason Xing wrote:
>
> > > > Allow me to finish the full sentence: my only purpose is not to let
> > > > the TCP layer send strange RST to the _established_ socket due to
> > > > receiving strange out-of-window skbs.
> > >
> > > I don't understand why do you want to modify conntrack at all: conntrack
> > > itself does not send RST packets. And the TCP layer don't send RST packets
> > > to out of window ones either.
> >
> > To normal TCP flow, you're right because the TCP layer doesn't send RST
> > to out-of-window skbs.
> >
> > But the DNAT policy on the server should have converted the port of
> > incoming skb from A_port to B_port as my description in this patch said.
> >
> > It actually doesn't happen. The conntrack clears the skb->_nfct value
> > after returning -NF_ACCEPT in nf_conntrack_tcp_packet() and then DNAT
> > would not convert the A_port to B_port.
>
> The packet is INVALID therefore it's not NATed. I don't think that could
> simply be changed in netfilter.

Well, what would you suggest ? :)

>
> > So the TCP layer is not able to look up the correct socket (client_ip,
> > client_port, server_ip, B_port) because A_port doesn't match B_port,
> > then an RST would be sent to the client.
>
> Yes, if you let the packet continue to traverse the stack.

Yes!

>
> > > The only possibility I see for such packets is an iptables/nftables rule
> > > which rejects packets classified as INVALID by conntrack.
> > >
> > > As Florian suggested, why don't you change that rule?
> >
> > As I said, just for the workaround method, only turning on that sysctl
> > knob can solve the issue.
>
> Sorry, but why? If you drop the packet by a rule, then there's no need to
> use the sysctl setting and there's no unwanted RST packets.

Oh, I was trying to clarify that using some knob can work around which
is not my original intention, but I don't seek a workaround method. I
didn't use --cstate INVALID to drop those INVALID packets in
production, which I feel should work.

For this particular case, I feel it's not that good to ask
users/customers to add more rules or turn on sysctl knob to prevent
seeing such RSTs.

Instead I thought we could naturally stop sending such RSTs as default
without asking other people to change something. People shouldn't see
the RSTs really. That's why I wrote this patch.

I hope I'm right... :S

Thanks,
Jason

>
> Best regards,
> Jozsef
>
> > > The conntrack states are not fine-grained to express different TCP states
> > > which covered with INVALID. It was never a good idea to reject INVALID
> > > packets or let them through (leaking internal addresses).
> > >
> > > Best regards,
> > > Jozsef
> > >
> > > > > > Besides, resorting to turning on nf_conntrack_tcp_be_liberal sysctl
> > > > > > knob seems odd to me though it can workaround :S
> > > > >
> > > > > I don't see a better alternative, other than -p tcp -m conntrack
> > > > > --ctstate INVALID -j DROP rule, if you wish for tcp stack to not see
> > > > > such packets.
> > > > >
> > > > > > I would like to prevent sending such an RST as default behaviour.
> > > > >
> > > > > I don't see a way to make this work out of the box, without possible
> > > > > unwanted side effects.
> > > > >
> > > > > MAYBE we could drop IFF we check that the conntrack entry candidate
> > > > > that fails sequence validation has NAT translation applied to it, and
> > > > > thus the '-NF_ACCEPT' packet won't be translated.
> > > > >
> > > > > Not even compile tested:
> > > > >
> > > > > diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
> > > > > --- a/net/netfilter/nf_conntrack_proto_tcp.c
> > > > > +++ b/net/netfilter/nf_conntrack_proto_tcp.c
> > > > > @@ -1256,10 +1256,14 @@ int nf_conntrack_tcp_packet(struct nf_conn *ct,
> > > > >         case NFCT_TCP_IGNORE:
> > > > >                 spin_unlock_bh(&ct->lock);
> > > > >                 return NF_ACCEPT;
> > > > > -       case NFCT_TCP_INVALID:
> > > > > +       case NFCT_TCP_INVALID: {
> > > > > +               verdict = -NF_ACCEPT;
> > > > > +               if (ct->status & IPS_NAT_MASK)
> > > > > +                       res = NF_DROP; /* skb would miss nat transformation */
> > > >
> > > > Above line, I guess, should be 'verdict = NF_DROP'? Then this skb
> > > > would be dropped in nf_hook_slow() eventually and would not be passed
> > > > to the TCP layer.
> > > >
> > > > >                 nf_tcp_handle_invalid(ct, dir, index, skb, state);
> > > > >                 spin_unlock_bh(&ct->lock);
> > > > > -               return -NF_ACCEPT;
> > > > > +               return verdict;
> > > > > +       }
> > > > >         case NFCT_TCP_ACCEPT:
> > > > >                 break;
> > > > >         }
> > > >
> > > > Great! I think your draft patch makes sense really, which takes NAT
> > > > into consideration.
> > > >
> > > > >
> > > > > But I don't really see the advantage compared to doing drop decision in
> > > > > iptables/nftables ruleset.
> > > >
> > > > From our views, especially to kernel developers, you're right: we
> > > > could easily turn on that knob or add a drop policy to prevent it
> > > > happening. Actually I did this in production to prevent such a case.
> > > > It surely works.
> > > >
> > > > But from the views of normal users and those who do not understand how
> > > > it works in the kernel, it looks strange: people may ask why we get
> > > > some unknown RSTs in flight?
> > > >
> > > > > I also have a hunch that someone will eventually complain about this
> > > > > change in behavior.
> > > >
> > > > Well, I still think the patch you suggested is proper and don't know
> > > > why people could complain about it.
> > > >
> > > > Thanks for your patience :)
> > > >
> > > > Thanks,
> > > > Jason
> > > >
> > >
> > > --
> > > E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.hu
> > > PGP key : https://wigner.hu/~kadlec/pgp_public_key.txt
> > > Address : Wigner Research Centre for Physics
> > >           H-1525 Budapest 114, POB. 49, Hungary
> >
>
> --
> E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.hu
> PGP key : https://wigner.hu/~kadlec/pgp_public_key.txt
> Address : Wigner Research Centre for Physics
>           H-1525 Budapest 114, POB. 49, Hungary
Jason Xing March 8, 2024, 8:59 a.m. UTC | #11
On Thu, Mar 7, 2024 at 11:11 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Thu, Mar 7, 2024 at 10:10 PM Florian Westphal <fw@strlen.de> wrote:
> >
> > Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > On Thu, Mar 7, 2024 at 8:00 PM Florian Westphal <fw@strlen.de> wrote:
> > > >
> > > > Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > > > This change disables most of the tcp_in_window() test, this will
> > > > > > pretend everything is fine even though tcp_in_window says otherwise.
> > > > >
> > > > > Thanks for the information. It does make sense.
> > > > >
> > > > > What I've done is quite similar to nf_conntrack_tcp_be_liberal sysctl
> > > > > knob which you also pointed out. It also pretends to ignore those
> > > > > out-of-window skbs.
> > > > >
> > > > > >
> > > > > > You could:
> > > > > >  - drop invalid tcp packets in input hook
> > > > >
> > > > > How about changing the return value only as below? Only two cases will
> > > > > be handled:
> > > > >
> > > > > diff --git a/net/netfilter/nf_conntrack_proto_tcp.c
> > > > > b/net/netfilter/nf_conntrack_proto_tcp.c
> > > > > index ae493599a3ef..c88ce4cd041e 100644
> > > > > --- a/net/netfilter/nf_conntrack_proto_tcp.c
> > > > > +++ b/net/netfilter/nf_conntrack_proto_tcp.c
> > > > > @@ -1259,7 +1259,7 @@ int nf_conntrack_tcp_packet(struct nf_conn *ct,
> > > > >         case NFCT_TCP_INVALID:
> > > > >                 nf_tcp_handle_invalid(ct, dir, index, skb, state);
> > > > >                 spin_unlock_bh(&ct->lock);
> > > > > -               return -NF_ACCEPT;
> > > > > +               return -NF_DROP;
> > > >
> > > > Lets not do this.  conntrack should never drop packets and defer to ruleset
> > > > whereever possible.
> > >
> > > Hmm, sorry, it is against my understanding.
> > >
> > > If we cannot return -NF_DROP, why have we already added some 'return
> > > NF_DROP' in the nf_conntrack_handle_packet() function? And why does
> > > this test statement exist?
> >
> > Sure we can drop.  But we should only do it if there is no better
> > alternative.
> >
> > > nf_conntrack_in()
> > >   -> nf_conntrack_handle_packet()
> > >   -> if (ret <= 0) {
> > >          if (ret == -NF_DROP) NF_CT_STAT_INC_ATOMIC(state->net, drop);
> >
> > AFAICS this only happens when we receive syn for an existing conntrack
> > that is being removed already so we'd expect next syn to create a new
>
> Sorry, I've double-checked this part and found out there is no chance
> to return '-NF_DROP' for nf_conntrack_handle_packet(). It might return
> 'NF_DROP' (see link [1]) instead. The if-else statements seem like
> dead code.

My bad. My mind went blank last night, maybe it's too late. Well, It's
not dead code because '-NF_DROP' which is still zero makes me
misunderstand :(

I'll post a patch to change 'if (ret == -NF_DROP)' to 'if (ret ==
NF_DROP)' just for easy reading.

Thanks,
Jason

>
> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/netfilter/nf_conntrack_proto_tcp.c#:~:text=%2DNF_REPEAT%3B-,return%20NF_DROP%3B,-%7D%0A%09%09fallthrough%3B
>
> > connection.  Feel free to send patches that replace drop with -accept
> > where possible/where it makes sense, but I don't think the
> > TCP_CONNTRACK_SYN_SENT one can reasonably be avoided.
>
> Oh, are you suggesting replacing NF_DROP with -NF_ACCEPT in
> nf_conntrack_dccp_packet()?
>
> There are three points where nf_conntrack_handle_packet() returns NF_DROP:
> 1) one (syn_sent case) exists in nf_conntrack_tcp_packet(). As you
> said, it's not necessary to change.
> 2) another two exist in nf_conntrack_dccp_packet() which should be the
> same as nf_conntrack_tcp_packet() handles.
>
> The patch goes like this:
> diff --git a/net/netfilter/nf_conntrack_proto_dccp.c
> b/net/netfilter/nf_conntrack_proto_dccp.c
> index e2db1f4ec2df..ebc4f733bb2e 100644
> --- a/net/netfilter/nf_conntrack_proto_dccp.c
> +++ b/net/netfilter/nf_conntrack_proto_dccp.c
> @@ -525,7 +525,7 @@ int nf_conntrack_dccp_packet(struct nf_conn *ct,
> struct sk_buff *skb,
>
>         dh = skb_header_pointer(skb, dataoff, sizeof(*dh), &_dh.dh);
>         if (!dh)
> -               return NF_DROP;
> +               return -NF_ACCEPT;
>
>         if (dccp_error(dh, skb, dataoff, state))
>                 return -NF_ACCEPT;
> @@ -533,7 +533,7 @@ int nf_conntrack_dccp_packet(struct nf_conn *ct,
> struct sk_buff *skb,
>         /* pull again, including possible 48 bit sequences and subtype header */
>         dh = dccp_header_pointer(skb, dataoff, dh, &_dh);
>         if (!dh)
> -               return NF_DROP;
> +               return -NF_ACCEPT;
>
>         type = dh->dccph_type;
>         if (!nf_ct_is_confirmed(ct) && !dccp_new(ct, skb, dh, state))
>
> >
> > > My only purpose is not to let the TCP layer sending strange RST to the
> > > right flow.
> >
> > AFAIU tcp layer is correct, no?  Out of the blue packet to some listener
> > socket?
>
> Allow me to finish the full sentence: my only purpose is not to let
> the TCP layer send strange RST to the _established_ socket due to
> receiving strange out-of-window skbs.
>
> >
> > > Besides, resorting to turning on nf_conntrack_tcp_be_liberal sysctl
> > > knob seems odd to me though it can workaround :S
> >
> > I don't see a better alternative, other than -p tcp -m conntrack
> > --ctstate INVALID -j DROP rule, if you wish for tcp stack to not see
> > such packets.
> >
> > > I would like to prevent sending such an RST as default behaviour.
> >
> > I don't see a way to make this work out of the box, without possible
> > unwanted side effects.
> >
> > MAYBE we could drop IFF we check that the conntrack entry candidate
> > that fails sequence validation has NAT translation applied to it, and
> > thus the '-NF_ACCEPT' packet won't be translated.
> >
> > Not even compile tested:
> >
> > diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
> > --- a/net/netfilter/nf_conntrack_proto_tcp.c
> > +++ b/net/netfilter/nf_conntrack_proto_tcp.c
> > @@ -1256,10 +1256,14 @@ int nf_conntrack_tcp_packet(struct nf_conn *ct,
> >         case NFCT_TCP_IGNORE:
> >                 spin_unlock_bh(&ct->lock);
> >                 return NF_ACCEPT;
> > -       case NFCT_TCP_INVALID:
> > +       case NFCT_TCP_INVALID: {
> > +               verdict = -NF_ACCEPT;
> > +               if (ct->status & IPS_NAT_MASK)
> > +                       res = NF_DROP; /* skb would miss nat transformation */
>
> Above line, I guess, should be 'verdict = NF_DROP'? Then this skb
> would be dropped in nf_hook_slow() eventually and would not be passed
> to the TCP layer.
>
> >                 nf_tcp_handle_invalid(ct, dir, index, skb, state);
> >                 spin_unlock_bh(&ct->lock);
> > -               return -NF_ACCEPT;
> > +               return verdict;
> > +       }
> >         case NFCT_TCP_ACCEPT:
> >                 break;
> >         }
>
> Great! I think your draft patch makes sense really, which takes NAT
> into consideration.
>
> >
> > But I don't really see the advantage compared to doing drop decision in
> > iptables/nftables ruleset.
>
> From our views, especially to kernel developers, you're right: we
> could easily turn on that knob or add a drop policy to prevent it
> happening. Actually I did this in production to prevent such a case.
> It surely works.
>
> But from the views of normal users and those who do not understand how
> it works in the kernel, it looks strange: people may ask why we get
> some unknown RSTs in flight?
>
> >
> > I also have a hunch that someone will eventually complain about this
> > change in behavior.
>
> Well, I still think the patch you suggested is proper and don't know
> why people could complain about it.
>
> Thanks for your patience :)
>
> Thanks,
> Jason
Florian Westphal March 8, 2024, 10:46 p.m. UTC | #12
Jason Xing <kerneljasonxing@gmail.com> wrote:
> > connection.  Feel free to send patches that replace drop with -accept
> > where possible/where it makes sense, but I don't think the
> > TCP_CONNTRACK_SYN_SENT one can reasonably be avoided.
> 
> Oh, are you suggesting replacing NF_DROP with -NF_ACCEPT in
> nf_conntrack_dccp_packet()?

It would be more consistent with what tcp and sctp trackers are
doing, but this should not matter in practice (the packet is malformed).

> > +       case NFCT_TCP_INVALID: {
> > +               verdict = -NF_ACCEPT;
> > +               if (ct->status & IPS_NAT_MASK)
> > +                       res = NF_DROP; /* skb would miss nat transformation */
> 
> Above line, I guess, should be 'verdict = NF_DROP'?

Yes.

> Great! I think your draft patch makes sense really, which takes NAT
> into consideration.

You could submit this officially and we could give it a try and see if
anyone complains down the road.
Jason Xing March 9, 2024, 12:37 a.m. UTC | #13
On Sat, Mar 9, 2024 at 6:47 AM Florian Westphal <fw@strlen.de> wrote:
>
> Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > connection.  Feel free to send patches that replace drop with -accept
> > > where possible/where it makes sense, but I don't think the
> > > TCP_CONNTRACK_SYN_SENT one can reasonably be avoided.
> >
> > Oh, are you suggesting replacing NF_DROP with -NF_ACCEPT in
> > nf_conntrack_dccp_packet()?
>
> It would be more consistent with what tcp and sctp trackers are
> doing, but this should not matter in practice (the packet is malformed).

Okay, I will take some time to check the sctp part. BTW, just like one
of previous emails said, I noticed there are two points in DCCP part
which is not consistent with TCP part, so I submitted one simple patch
[1] to do it.

[1]: https://lore.kernel.org/all/20240308092915.9751-1-kerneljasonxing@gmail.com/

>
> > > +       case NFCT_TCP_INVALID: {
> > > +               verdict = -NF_ACCEPT;
> > > +               if (ct->status & IPS_NAT_MASK)
> > > +                       res = NF_DROP; /* skb would miss nat transformation */
> >
> > Above line, I guess, should be 'verdict = NF_DROP'?
>
> Yes.
>
> > Great! I think your draft patch makes sense really, which takes NAT
> > into consideration.
>
> You could submit this officially and we could give it a try and see if
> anyone complains down the road.

Great :)

Thanks,
Jason
diff mbox series

Patch

diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
index ae493599a3ef..3f3e620f3969 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -1253,13 +1253,11 @@  int nf_conntrack_tcp_packet(struct nf_conn *ct,
 	res = tcp_in_window(ct, dir, index,
 			    skb, dataoff, th, state);
 	switch (res) {
-	case NFCT_TCP_IGNORE:
-		spin_unlock_bh(&ct->lock);
-		return NF_ACCEPT;
 	case NFCT_TCP_INVALID:
 		nf_tcp_handle_invalid(ct, dir, index, skb, state);
+	case NFCT_TCP_IGNORE:
 		spin_unlock_bh(&ct->lock);
-		return -NF_ACCEPT;
+		return NF_ACCEPT;
 	case NFCT_TCP_ACCEPT:
 		break;
 	}