diff mbox series

[nf-next,v2] netfilter: conntrack: avoid sending RST to reply out-of-window skb

Message ID 20240311070550.7438-1-kerneljasonxing@gmail.com (mailing list archive)
State Awaiting Upstream
Delegated to: Netdev Maintainers
Headers show
Series [nf-next,v2] netfilter: conntrack: avoid sending RST to reply out-of-window skb | expand

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 940 this patch: 940
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 9 of 9 maintainers
netdev/build_clang success Errors and warnings before: 956 this patch: 956
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 956 this patch: 956
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 23 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest fail net-next-2024-03-11--15-00 (tests: 888)

Commit Message

Jason Xing March 11, 2024, 7:05 a.m. UTC
From: Jason Xing <kernelxing@tencent.com>

Supposing we set DNAT policy converting a_port to b_port on the
server at the beginning, the socket is set up by using 4-tuple:

client_ip:client_port <--> server_ip:b_port

Then, some strange skbs from client or gateway, say, out-of-window
skbs are eventually sent to the server_ip:a_port (not b_port)
in TCP layer due to netfilter clearing skb->_nfct value in
nf_conntrack_in() function. Why? Because the tcp_in_window()
considers the incoming skb as an invalid skb by returning
NFCT_TCP_INVALID.

At last, the TCP layer process the out-of-window
skb (client_ip,client_port,server_ip,a_port) and try to look up
such an socket in tcp_v4_rcv(), as we can see, it will fail for sure
because the port is a_port not our expected b_port and then send
back an RST to the client.

The detailed call graphs go like this:
1)
nf_conntrack_in()
  -> nf_conntrack_handle_packet()
    -> nf_conntrack_tcp_packet()
      -> tcp_in_window() // tests if the skb is out-of-window
      -> return -NF_ACCEPT;
  -> skb->_nfct = 0; // if the above line returns a negative value
2)
tcp_v4_rcv()
  -> __inet_lookup_skb() // fails, then jump to no_tcp_socket
  -> tcp_v4_send_reset()

The moment the client receives the RST, it will drop. So the RST
skb doesn't hurt the client (maybe hurt some gateway which cancels
the session when filtering the RST without validating
the sequence because of performance reason). Well, it doesn't
matter. However, we can see many strange RST in flight.

The key reason why I wrote this patch is that I don't think
the behaviour is expected because the RFC 793 defines this
case:

"If the connection is in a synchronized state (ESTABLISHED,
 FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT),
 any unacceptable segment (out of window sequence number or
 unacceptible acknowledgment number) must elicit only an empty
 acknowledgment segment containing the current send-sequence number
 and an acknowledgment..."

I think, even we have set DNAT policy, it would be better if the
whole process/behaviour adheres to the original TCP behaviour as
default.

Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
v2
Link: https://lore.kernel.org/netdev/20240307090732.56708-1-kerneljasonxing@gmail.com/
1. add one more test about NAT and then drop the skb (Florian)
---
 net/netfilter/nf_conntrack_proto_tcp.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

Comments

Florian Westphal March 12, 2024, 12:24 p.m. UTC | #1
Jason Xing <kerneljasonxing@gmail.com> wrote:
> I think, even we have set DNAT policy, it would be better if the
> whole process/behaviour adheres to the original TCP behaviour as
> default.

LGTM.
Acked-by: Florian Westphal <fw@strlen.de>
Jason Xing March 13, 2024, 2:24 a.m. UTC | #2
Hello Florian,

On Tue, Mar 12, 2024 at 8:24 PM Florian Westphal <fw@strlen.de> wrote:
>
> Jason Xing <kerneljasonxing@gmail.com> wrote:
> > I think, even we have set DNAT policy, it would be better if the
> > whole process/behaviour adheres to the original TCP behaviour as
> > default.
>
> LGTM.
> Acked-by: Florian Westphal <fw@strlen.de>

Thanks!

I wonder if I should repost it in two weeks? BTW, what about the
status of the other two patches[1][2]? Should I repost them after
March 25th too?

[1]:  https://lore.kernel.org/netdev/20240308092915.9751-1-kerneljasonxing@gmail.com/
[2]: https://lore.kernel.org/netdev/20240308092915.9751-2-kerneljasonxing@gmail.com/
Simon Horman March 18, 2024, 8:16 p.m. UTC | #3
On Mon, Mar 11, 2024 at 03:05:50PM +0800, Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
> 
> Supposing we set DNAT policy converting a_port to b_port on the
> server at the beginning, the socket is set up by using 4-tuple:
> 
> client_ip:client_port <--> server_ip:b_port
> 
> Then, some strange skbs from client or gateway, say, out-of-window
> skbs are eventually sent to the server_ip:a_port (not b_port)
> in TCP layer due to netfilter clearing skb->_nfct value in
> nf_conntrack_in() function. Why? Because the tcp_in_window()
> considers the incoming skb as an invalid skb by returning
> NFCT_TCP_INVALID.
> 
> At last, the TCP layer process the out-of-window
> skb (client_ip,client_port,server_ip,a_port) and try to look up
> such an socket in tcp_v4_rcv(), as we can see, it will fail for sure
> because the port is a_port not our expected b_port and then send
> back an RST to the client.
> 
> The detailed call graphs go like this:
> 1)
> nf_conntrack_in()
>   -> nf_conntrack_handle_packet()
>     -> nf_conntrack_tcp_packet()
>       -> tcp_in_window() // tests if the skb is out-of-window
>       -> return -NF_ACCEPT;
>   -> skb->_nfct = 0; // if the above line returns a negative value
> 2)
> tcp_v4_rcv()
>   -> __inet_lookup_skb() // fails, then jump to no_tcp_socket
>   -> tcp_v4_send_reset()
> 
> The moment the client receives the RST, it will drop. So the RST
> skb doesn't hurt the client (maybe hurt some gateway which cancels
> the session when filtering the RST without validating
> the sequence because of performance reason). Well, it doesn't
> matter. However, we can see many strange RST in flight.
> 
> The key reason why I wrote this patch is that I don't think
> the behaviour is expected because the RFC 793 defines this
> case:
> 
> "If the connection is in a synchronized state (ESTABLISHED,
>  FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT),
>  any unacceptable segment (out of window sequence number or
>  unacceptible acknowledgment number) must elicit only an empty

Not for those following along, it appears that RFC 793 does misspell
unacceptable as above. Perhaps spelling was different in 1981 :)

>  acknowledgment segment containing the current send-sequence number
>  and an acknowledgment..."
> 
> I think, even we have set DNAT policy, it would be better if the
> whole process/behaviour adheres to the original TCP behaviour as
> default.
> 
> Suggested-by: Florian Westphal <fw@strlen.de>
> Signed-off-by: Jason Xing <kernelxing@tencent.com>

...
Jason Xing March 19, 2024, 2:52 a.m. UTC | #4
Hello Simon,

On Tue, Mar 19, 2024 at 4:16 AM Simon Horman <horms@kernel.org> wrote:
>
> On Mon, Mar 11, 2024 at 03:05:50PM +0800, Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > Supposing we set DNAT policy converting a_port to b_port on the
> > server at the beginning, the socket is set up by using 4-tuple:
> >
> > client_ip:client_port <--> server_ip:b_port
> >
> > Then, some strange skbs from client or gateway, say, out-of-window
> > skbs are eventually sent to the server_ip:a_port (not b_port)
> > in TCP layer due to netfilter clearing skb->_nfct value in
> > nf_conntrack_in() function. Why? Because the tcp_in_window()
> > considers the incoming skb as an invalid skb by returning
> > NFCT_TCP_INVALID.
> >
> > At last, the TCP layer process the out-of-window
> > skb (client_ip,client_port,server_ip,a_port) and try to look up
> > such an socket in tcp_v4_rcv(), as we can see, it will fail for sure
> > because the port is a_port not our expected b_port and then send
> > back an RST to the client.
> >
> > The detailed call graphs go like this:
> > 1)
> > nf_conntrack_in()
> >   -> nf_conntrack_handle_packet()
> >     -> nf_conntrack_tcp_packet()
> >       -> tcp_in_window() // tests if the skb is out-of-window
> >       -> return -NF_ACCEPT;
> >   -> skb->_nfct = 0; // if the above line returns a negative value
> > 2)
> > tcp_v4_rcv()
> >   -> __inet_lookup_skb() // fails, then jump to no_tcp_socket
> >   -> tcp_v4_send_reset()
> >
> > The moment the client receives the RST, it will drop. So the RST
> > skb doesn't hurt the client (maybe hurt some gateway which cancels
> > the session when filtering the RST without validating
> > the sequence because of performance reason). Well, it doesn't
> > matter. However, we can see many strange RST in flight.
> >
> > The key reason why I wrote this patch is that I don't think
> > the behaviour is expected because the RFC 793 defines this
> > case:
> >
> > "If the connection is in a synchronized state (ESTABLISHED,
> >  FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT),
> >  any unacceptable segment (out of window sequence number or
> >  unacceptible acknowledgment number) must elicit only an empty
>
> Not for those following along, it appears that RFC 793 does misspell
> unacceptable as above. Perhaps spelling was different in 1981 :)

Thanks for the check. Yes, it did misspell that word. Should I correct
that word in my quotation?

>
> >  acknowledgment segment containing the current send-sequence number
> >  and an acknowledgment..."
> >
> > I think, even we have set DNAT policy, it would be better if the
> > whole process/behaviour adheres to the original TCP behaviour as
> > default.
> >
> > Suggested-by: Florian Westphal <fw@strlen.de>
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
>
> ...
Simon Horman March 19, 2024, 6:46 p.m. UTC | #5
On Tue, Mar 19, 2024 at 10:52:44AM +0800, Jason Xing wrote:
> Hello Simon,
> 
> On Tue, Mar 19, 2024 at 4:16 AM Simon Horman <horms@kernel.org> wrote:
> >
> > On Mon, Mar 11, 2024 at 03:05:50PM +0800, Jason Xing wrote:
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > Supposing we set DNAT policy converting a_port to b_port on the
> > > server at the beginning, the socket is set up by using 4-tuple:
> > >
> > > client_ip:client_port <--> server_ip:b_port
> > >
> > > Then, some strange skbs from client or gateway, say, out-of-window
> > > skbs are eventually sent to the server_ip:a_port (not b_port)
> > > in TCP layer due to netfilter clearing skb->_nfct value in
> > > nf_conntrack_in() function. Why? Because the tcp_in_window()
> > > considers the incoming skb as an invalid skb by returning
> > > NFCT_TCP_INVALID.
> > >
> > > At last, the TCP layer process the out-of-window
> > > skb (client_ip,client_port,server_ip,a_port) and try to look up
> > > such an socket in tcp_v4_rcv(), as we can see, it will fail for sure
> > > because the port is a_port not our expected b_port and then send
> > > back an RST to the client.
> > >
> > > The detailed call graphs go like this:
> > > 1)
> > > nf_conntrack_in()
> > >   -> nf_conntrack_handle_packet()
> > >     -> nf_conntrack_tcp_packet()
> > >       -> tcp_in_window() // tests if the skb is out-of-window
> > >       -> return -NF_ACCEPT;
> > >   -> skb->_nfct = 0; // if the above line returns a negative value
> > > 2)
> > > tcp_v4_rcv()
> > >   -> __inet_lookup_skb() // fails, then jump to no_tcp_socket
> > >   -> tcp_v4_send_reset()
> > >
> > > The moment the client receives the RST, it will drop. So the RST
> > > skb doesn't hurt the client (maybe hurt some gateway which cancels
> > > the session when filtering the RST without validating
> > > the sequence because of performance reason). Well, it doesn't
> > > matter. However, we can see many strange RST in flight.
> > >
> > > The key reason why I wrote this patch is that I don't think
> > > the behaviour is expected because the RFC 793 defines this
> > > case:
> > >
> > > "If the connection is in a synchronized state (ESTABLISHED,
> > >  FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT),
> > >  any unacceptable segment (out of window sequence number or
> > >  unacceptible acknowledgment number) must elicit only an empty
> >
> > Not for those following along, it appears that RFC 793 does misspell
> > unacceptable as above. Perhaps spelling was different in 1981 :)
> 
> Thanks for the check. Yes, it did misspell that word. Should I correct
> that word in my quotation?

No, I think you should keep the quote the same as the original text.

> > >  acknowledgment segment containing the current send-sequence number
> > >  and an acknowledgment..."
> > >
> > > I think, even we have set DNAT policy, it would be better if the
> > > whole process/behaviour adheres to the original TCP behaviour as
> > > default.
> > >
> > > Suggested-by: Florian Westphal <fw@strlen.de>
> > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> >
> > ...
>
Pablo Neira Ayuso March 21, 2024, 9:06 p.m. UTC | #6
On Mon, Mar 11, 2024 at 03:05:50PM +0800, Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
> 
> Supposing we set DNAT policy converting a_port to b_port on the
> server at the beginning, the socket is set up by using 4-tuple:
> 
> client_ip:client_port <--> server_ip:b_port
> 
> Then, some strange skbs from client or gateway, say, out-of-window
> skbs are eventually sent to the server_ip:a_port (not b_port)
> in TCP layer due to netfilter clearing skb->_nfct value in
> nf_conntrack_in() function. Why? Because the tcp_in_window()
> considers the incoming skb as an invalid skb by returning
> NFCT_TCP_INVALID.
> 
> At last, the TCP layer process the out-of-window
> skb (client_ip,client_port,server_ip,a_port) and try to look up
> such an socket in tcp_v4_rcv(), as we can see, it will fail for sure
> because the port is a_port not our expected b_port and then send
> back an RST to the client.
> 
> The detailed call graphs go like this:
> 1)
> nf_conntrack_in()
>   -> nf_conntrack_handle_packet()
>     -> nf_conntrack_tcp_packet()
>       -> tcp_in_window() // tests if the skb is out-of-window
>       -> return -NF_ACCEPT;
>   -> skb->_nfct = 0; // if the above line returns a negative value
> 2)
> tcp_v4_rcv()
>   -> __inet_lookup_skb() // fails, then jump to no_tcp_socket
>   -> tcp_v4_send_reset()
> 
> The moment the client receives the RST, it will drop. So the RST
> skb doesn't hurt the client (maybe hurt some gateway which cancels
> the session when filtering the RST without validating
> the sequence because of performance reason). Well, it doesn't
> matter. However, we can see many strange RST in flight.
> 
> The key reason why I wrote this patch is that I don't think
> the behaviour is expected because the RFC 793 defines this
> case:
> 
> "If the connection is in a synchronized state (ESTABLISHED,
>  FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT),
>  any unacceptable segment (out of window sequence number or
>  unacceptible acknowledgment number) must elicit only an empty
>  acknowledgment segment containing the current send-sequence number
>  and an acknowledgment..."
> 
> I think, even we have set DNAT policy, it would be better if the
> whole process/behaviour adheres to the original TCP behaviour as
> default.
> 
> Suggested-by: Florian Westphal <fw@strlen.de>
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
> v2
> Link: https://lore.kernel.org/netdev/20240307090732.56708-1-kerneljasonxing@gmail.com/
> 1. add one more test about NAT and then drop the skb (Florian)
> ---
>  net/netfilter/nf_conntrack_proto_tcp.c | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
> index ae493599a3ef..19ddac526ea0 100644
> --- a/net/netfilter/nf_conntrack_proto_tcp.c
> +++ b/net/netfilter/nf_conntrack_proto_tcp.c
> @@ -1256,10 +1256,21 @@ int nf_conntrack_tcp_packet(struct nf_conn *ct,
>  	case NFCT_TCP_IGNORE:
>  		spin_unlock_bh(&ct->lock);
>  		return NF_ACCEPT;
> -	case NFCT_TCP_INVALID:
> +	case NFCT_TCP_INVALID: {
> +		int verdict = -NF_ACCEPT;
> +
> +		if (ct->status & IPS_NAT_MASK)
> +			/* If DNAT is enabled and netfilter receives
> +			 * out-of-window skbs, we should drop it directly,

Yes, if _be_liberal toggle is disabled this can happen.

> +			 * or else skb would miss NAT transformation and
> +			 * trigger corresponding RST sending to the flow
> +			 * in TCP layer, which is not supposed to happen.
> +			 */
> +			verdict = NF_DROP;

One comment for the SNAT case.

nf_conntrack_in() calls this function from the prerouting hook. For
the very first packet, IPS_NAT_MASK might not be yet fully set on
(masquerade/snat happens in postrouting), then still one packet can be
leaked without NAT mangling in the SNAT case.

Rulesets should really need to set default policy to drop in NAT
chains to address this.

And after this update, user has no chance anymore to bump counters at
the end of the policy, to debug issues.

We have relied on the rule that "conntrack should not drop packets"
since the very beginning, instead signal rulesets that something is
invalid, so user decides what to do.

I'm ambivalent about this, Jozsef?

>  		nf_tcp_handle_invalid(ct, dir, index, skb, state);
>  		spin_unlock_bh(&ct->lock);
> -		return -NF_ACCEPT;
> +		return verdict;
> +	}
>  	case NFCT_TCP_ACCEPT:
>  		break;
>  	}
> -- 
> 2.37.3
>
Jason Xing March 22, 2024, 1:06 a.m. UTC | #7
Hello Pablo,

On Fri, Mar 22, 2024 at 5:06 AM Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>
> On Mon, Mar 11, 2024 at 03:05:50PM +0800, Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > Supposing we set DNAT policy converting a_port to b_port on the
> > server at the beginning, the socket is set up by using 4-tuple:
> >
> > client_ip:client_port <--> server_ip:b_port
> >
> > Then, some strange skbs from client or gateway, say, out-of-window
> > skbs are eventually sent to the server_ip:a_port (not b_port)
> > in TCP layer due to netfilter clearing skb->_nfct value in
> > nf_conntrack_in() function. Why? Because the tcp_in_window()
> > considers the incoming skb as an invalid skb by returning
> > NFCT_TCP_INVALID.
> >
> > At last, the TCP layer process the out-of-window
> > skb (client_ip,client_port,server_ip,a_port) and try to look up
> > such an socket in tcp_v4_rcv(), as we can see, it will fail for sure
> > because the port is a_port not our expected b_port and then send
> > back an RST to the client.
> >
> > The detailed call graphs go like this:
> > 1)
> > nf_conntrack_in()
> >   -> nf_conntrack_handle_packet()
> >     -> nf_conntrack_tcp_packet()
> >       -> tcp_in_window() // tests if the skb is out-of-window
> >       -> return -NF_ACCEPT;
> >   -> skb->_nfct = 0; // if the above line returns a negative value
> > 2)
> > tcp_v4_rcv()
> >   -> __inet_lookup_skb() // fails, then jump to no_tcp_socket
> >   -> tcp_v4_send_reset()
> >
> > The moment the client receives the RST, it will drop. So the RST
> > skb doesn't hurt the client (maybe hurt some gateway which cancels
> > the session when filtering the RST without validating
> > the sequence because of performance reason). Well, it doesn't
> > matter. However, we can see many strange RST in flight.
> >
> > The key reason why I wrote this patch is that I don't think
> > the behaviour is expected because the RFC 793 defines this
> > case:
> >
> > "If the connection is in a synchronized state (ESTABLISHED,
> >  FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT),
> >  any unacceptable segment (out of window sequence number or
> >  unacceptible acknowledgment number) must elicit only an empty
> >  acknowledgment segment containing the current send-sequence number
> >  and an acknowledgment..."
> >
> > I think, even we have set DNAT policy, it would be better if the
> > whole process/behaviour adheres to the original TCP behaviour as
> > default.
> >
> > Suggested-by: Florian Westphal <fw@strlen.de>
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> > v2
> > Link: https://lore.kernel.org/netdev/20240307090732.56708-1-kerneljasonxing@gmail.com/
> > 1. add one more test about NAT and then drop the skb (Florian)
> > ---
> >  net/netfilter/nf_conntrack_proto_tcp.c | 15 +++++++++++++--
> >  1 file changed, 13 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
> > index ae493599a3ef..19ddac526ea0 100644
> > --- a/net/netfilter/nf_conntrack_proto_tcp.c
> > +++ b/net/netfilter/nf_conntrack_proto_tcp.c
> > @@ -1256,10 +1256,21 @@ int nf_conntrack_tcp_packet(struct nf_conn *ct,
> >       case NFCT_TCP_IGNORE:
> >               spin_unlock_bh(&ct->lock);
> >               return NF_ACCEPT;
> > -     case NFCT_TCP_INVALID:
> > +     case NFCT_TCP_INVALID: {
> > +             int verdict = -NF_ACCEPT;
> > +
> > +             if (ct->status & IPS_NAT_MASK)
> > +                     /* If DNAT is enabled and netfilter receives
> > +                      * out-of-window skbs, we should drop it directly,
>
> Yes, if _be_liberal toggle is disabled this can happen.
>
> > +                      * or else skb would miss NAT transformation and
> > +                      * trigger corresponding RST sending to the flow
> > +                      * in TCP layer, which is not supposed to happen.
> > +                      */
> > +                     verdict = NF_DROP;
>
> One comment for the SNAT case.

Thanks for the comment :)

>
> nf_conntrack_in() calls this function from the prerouting hook. For
> the very first packet, IPS_NAT_MASK might not be yet fully set on
> (masquerade/snat happens in postrouting), then still one packet can be
> leaked without NAT mangling in the SNAT case.

It's possible if the flag is not set and out-of-window skb comes first...

>
> Rulesets should really need to set default policy to drop in NAT
> chains to address this.
>
> And after this update, user has no chance anymore to bump counters at
> the end of the policy, to debug issues.

You mean 'set default policy' is using iptables command to set, right?
If that's the case, I suspect the word "address" because it just hides
the issue and not lets people see it. I think many users don't know
this case. If I tell them about this "just set one more sysctl knob
and you'll be fine", they will definitely question me... Actually I
was questioned many times last week.

We have a _be_liberal sysctl knob to "address" this, yes, but what I'm
thinking is : the less we resort to sysctl knob, the easier life we
have.

It's very normal to drop an out-of-window skb without S/DNAT enabled.
Naturally, we're supposed to drop it finally with S/DNAT enabled. It
can be the default behaviour. Why would we use a knob to do it
instead? :/

>
> We have relied on the rule that "conntrack should not drop packets"
> since the very beginning, instead signal rulesets that something is
> invalid, so user decides what to do.

Yes, I know that rule, but we already have some exceptions for this:
we dropped the unexpected skb in the netfilter unless there are no
other better alternatives.

My logic in the V1 patch is not setting invalid (in order to not clear
skb->_nfct field) and letting it go until it is passed to the TCP
layer which will drop it finally.

>
> I'm ambivalent about this, Jozsef?

Hope to see more comments and suggestions from you two maintainers :)

Thanks,
Jason

>
> >               nf_tcp_handle_invalid(ct, dir, index, skb, state);
> >               spin_unlock_bh(&ct->lock);
> > -             return -NF_ACCEPT;
> > +             return verdict;
> > +     }
> >       case NFCT_TCP_ACCEPT:
> >               break;
> >       }
> > --
> > 2.37.3
> >
Pablo Neira Ayuso March 22, 2024, 10:40 a.m. UTC | #8
On Fri, Mar 22, 2024 at 09:06:41AM +0800, Jason Xing wrote:
> Hello Pablo,
> 
> On Fri, Mar 22, 2024 at 5:06 AM Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> >
> > On Mon, Mar 11, 2024 at 03:05:50PM +0800, Jason Xing wrote:
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > Supposing we set DNAT policy converting a_port to b_port on the
> > > server at the beginning, the socket is set up by using 4-tuple:
> > >
> > > client_ip:client_port <--> server_ip:b_port
> > >
> > > Then, some strange skbs from client or gateway, say, out-of-window
> > > skbs are eventually sent to the server_ip:a_port (not b_port)
> > > in TCP layer due to netfilter clearing skb->_nfct value in
> > > nf_conntrack_in() function. Why? Because the tcp_in_window()
> > > considers the incoming skb as an invalid skb by returning
> > > NFCT_TCP_INVALID.
> > >
> > > At last, the TCP layer process the out-of-window
> > > skb (client_ip,client_port,server_ip,a_port) and try to look up
> > > such an socket in tcp_v4_rcv(), as we can see, it will fail for sure
> > > because the port is a_port not our expected b_port and then send
> > > back an RST to the client.
> > >
> > > The detailed call graphs go like this:
> > > 1)
> > > nf_conntrack_in()
> > >   -> nf_conntrack_handle_packet()
> > >     -> nf_conntrack_tcp_packet()
> > >       -> tcp_in_window() // tests if the skb is out-of-window
> > >       -> return -NF_ACCEPT;
> > >   -> skb->_nfct = 0; // if the above line returns a negative value
> > > 2)
> > > tcp_v4_rcv()
> > >   -> __inet_lookup_skb() // fails, then jump to no_tcp_socket
> > >   -> tcp_v4_send_reset()
> > >
> > > The moment the client receives the RST, it will drop. So the RST
> > > skb doesn't hurt the client (maybe hurt some gateway which cancels
> > > the session when filtering the RST without validating
> > > the sequence because of performance reason). Well, it doesn't
> > > matter. However, we can see many strange RST in flight.
> > >
> > > The key reason why I wrote this patch is that I don't think
> > > the behaviour is expected because the RFC 793 defines this
> > > case:
> > >
> > > "If the connection is in a synchronized state (ESTABLISHED,
> > >  FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT),
> > >  any unacceptable segment (out of window sequence number or
> > >  unacceptible acknowledgment number) must elicit only an empty
> > >  acknowledgment segment containing the current send-sequence number
> > >  and an acknowledgment..."
> > >
> > > I think, even we have set DNAT policy, it would be better if the
> > > whole process/behaviour adheres to the original TCP behaviour as
> > > default.
> > >
> > > Suggested-by: Florian Westphal <fw@strlen.de>
> > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > ---
> > > v2
> > > Link: https://lore.kernel.org/netdev/20240307090732.56708-1-kerneljasonxing@gmail.com/
> > > 1. add one more test about NAT and then drop the skb (Florian)
> > > ---
> > >  net/netfilter/nf_conntrack_proto_tcp.c | 15 +++++++++++++--
> > >  1 file changed, 13 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
> > > index ae493599a3ef..19ddac526ea0 100644
> > > --- a/net/netfilter/nf_conntrack_proto_tcp.c
> > > +++ b/net/netfilter/nf_conntrack_proto_tcp.c
> > > @@ -1256,10 +1256,21 @@ int nf_conntrack_tcp_packet(struct nf_conn *ct,
> > >       case NFCT_TCP_IGNORE:
> > >               spin_unlock_bh(&ct->lock);
> > >               return NF_ACCEPT;
> > > -     case NFCT_TCP_INVALID:
> > > +     case NFCT_TCP_INVALID: {
> > > +             int verdict = -NF_ACCEPT;
> > > +
> > > +             if (ct->status & IPS_NAT_MASK)
> > > +                     /* If DNAT is enabled and netfilter receives
> > > +                      * out-of-window skbs, we should drop it directly,
> >
> > Yes, if _be_liberal toggle is disabled this can happen.
> >
> > > +                      * or else skb would miss NAT transformation and
> > > +                      * trigger corresponding RST sending to the flow
> > > +                      * in TCP layer, which is not supposed to happen.
> > > +                      */
> > > +                     verdict = NF_DROP;
> >
> > One comment for the SNAT case.
> 
> Thanks for the comment :)
> 
> >
> > nf_conntrack_in() calls this function from the prerouting hook. For
> > the very first packet, IPS_NAT_MASK might not be yet fully set on
> > (masquerade/snat happens in postrouting), then still one packet can be
> > leaked without NAT mangling in the SNAT case.
> 
> It's possible if the flag is not set and out-of-window skb comes first...

Not only out-of-window packets. Any invalid initial packet being sent
that triggers a transition to invalid state will be dropped, eg.
spoofed TCP traffic, right?

> > Rulesets should really need to set default policy to drop in NAT
> > chains to address this.
> >
> > And after this update, user has no chance anymore to bump counters at
> > the end of the policy, to debug issues.
> 
> You mean 'set default policy' is using iptables command to set, right?
> If that's the case, I suspect the word "address" because it just hides
> the issue and not lets people see it. I think many users don't know
> this case. If I tell them about this "just set one more sysctl knob
> and you'll be fine", they will definitely question me... Actually I
> was questioned many times last week.
>
> We have a _be_liberal sysctl knob to "address" this, yes, but what I'm
> thinking is : the less we resort to sysctl knob, the easier life we
> have.
> 
> It's very normal to drop an out-of-window skb without S/DNAT enabled.
> Naturally, we're supposed to drop it finally with S/DNAT enabled. It
> can be the default behaviour. Why would we use a knob to do it
> instead? :/

My concern is that, if conntrack drops invalid traffic by default,
user gets no reports that this is going on other than enabling
conntrack logging, because the rule that matches and drop INVALID
packets does match anymore.

> > We have relied on the rule that "conntrack should not drop packets"
> > since the very beginning, instead signal rulesets that something is
> > invalid, so user decides what to do.
> 
> Yes, I know that rule, but we already have some exceptions for this:
> we dropped the unexpected skb in the netfilter unless there are no
> other better alternatives.
>
> My logic in the V1 patch is not setting invalid (in order to not clear
> skb->_nfct field) and letting it go until it is passed to the TCP
> layer which will drop it finally.
Jozsef Kadlecsik March 22, 2024, 10:50 a.m. UTC | #9
Hi,

On Thu, 21 Mar 2024, Pablo Neira Ayuso wrote:

> On Mon, Mar 11, 2024 at 03:05:50PM +0800, Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> > 
> > Supposing we set DNAT policy converting a_port to b_port on the
> > server at the beginning, the socket is set up by using 4-tuple:
> > 
> > client_ip:client_port <--> server_ip:b_port
> > 
> > Then, some strange skbs from client or gateway, say, out-of-window
> > skbs are eventually sent to the server_ip:a_port (not b_port)
> > in TCP layer due to netfilter clearing skb->_nfct value in
> > nf_conntrack_in() function. Why? Because the tcp_in_window()
> > considers the incoming skb as an invalid skb by returning
> > NFCT_TCP_INVALID.
> > 
> > At last, the TCP layer process the out-of-window
> > skb (client_ip,client_port,server_ip,a_port) and try to look up
> > such an socket in tcp_v4_rcv(), as we can see, it will fail for sure
> > because the port is a_port not our expected b_port and then send
> > back an RST to the client.
> > 
> > The detailed call graphs go like this:
> > 1)
> > nf_conntrack_in()
> >   -> nf_conntrack_handle_packet()
> >     -> nf_conntrack_tcp_packet()
> >       -> tcp_in_window() // tests if the skb is out-of-window
> >       -> return -NF_ACCEPT;
> >   -> skb->_nfct = 0; // if the above line returns a negative value
> > 2)
> > tcp_v4_rcv()
> >   -> __inet_lookup_skb() // fails, then jump to no_tcp_socket
> >   -> tcp_v4_send_reset()
> > 
> > The moment the client receives the RST, it will drop. So the RST
> > skb doesn't hurt the client (maybe hurt some gateway which cancels
> > the session when filtering the RST without validating
> > the sequence because of performance reason). Well, it doesn't
> > matter. However, we can see many strange RST in flight.
> > 
> > The key reason why I wrote this patch is that I don't think
> > the behaviour is expected because the RFC 793 defines this
> > case:
> > 
> > "If the connection is in a synchronized state (ESTABLISHED,
> >  FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT),
> >  any unacceptable segment (out of window sequence number or
> >  unacceptible acknowledgment number) must elicit only an empty
> >  acknowledgment segment containing the current send-sequence number
> >  and an acknowledgment..."
> > 
> > I think, even we have set DNAT policy, it would be better if the
> > whole process/behaviour adheres to the original TCP behaviour as
> > default.
> > 
> > Suggested-by: Florian Westphal <fw@strlen.de>
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> > v2
> > Link: https://lore.kernel.org/netdev/20240307090732.56708-1-kerneljasonxing@gmail.com/
> > 1. add one more test about NAT and then drop the skb (Florian)
> > ---
> >  net/netfilter/nf_conntrack_proto_tcp.c | 15 +++++++++++++--
> >  1 file changed, 13 insertions(+), 2 deletions(-)
> > 
> > diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
> > index ae493599a3ef..19ddac526ea0 100644
> > --- a/net/netfilter/nf_conntrack_proto_tcp.c
> > +++ b/net/netfilter/nf_conntrack_proto_tcp.c
> > @@ -1256,10 +1256,21 @@ int nf_conntrack_tcp_packet(struct nf_conn *ct,
> >  	case NFCT_TCP_IGNORE:
> >  		spin_unlock_bh(&ct->lock);
> >  		return NF_ACCEPT;
> > -	case NFCT_TCP_INVALID:
> > +	case NFCT_TCP_INVALID: {
> > +		int verdict = -NF_ACCEPT;
> > +
> > +		if (ct->status & IPS_NAT_MASK)
> > +			/* If DNAT is enabled and netfilter receives
> > +			 * out-of-window skbs, we should drop it directly,
> 
> Yes, if _be_liberal toggle is disabled this can happen.
> 
> > +			 * or else skb would miss NAT transformation and
> > +			 * trigger corresponding RST sending to the flow
> > +			 * in TCP layer, which is not supposed to happen.
> > +			 */
> > +			verdict = NF_DROP;
> 
> One comment for the SNAT case.
> 
> nf_conntrack_in() calls this function from the prerouting hook. For
> the very first packet, IPS_NAT_MASK might not be yet fully set on
> (masquerade/snat happens in postrouting), then still one packet can be
> leaked without NAT mangling in the SNAT case.
> 
> Rulesets should really need to set default policy to drop in NAT
> chains to address this.
> 
> And after this update, user has no chance anymore to bump counters at
> the end of the policy, to debug issues.
> 
> We have relied on the rule that "conntrack should not drop packets"
> since the very beginning, instead signal rulesets that something is
> invalid, so user decides what to do.
> 
> I'm ambivalent about this, Jozsef?

[I'm putting on my sysadmin hat.]

My personal opinion is that silently dropping packets does not make 
sysadmin's life easier at all. On the contrary, it makes hunting down 
problems harder and more challenging: you have got no indication 
whatsoever why the given packets were dropped.

The proper solution to the problem is to (log and) drop INVALID packets.
That is neither a knob nor a workaround: conntrack cannot handle the 
packets and should only signal it to the rule stack. 

Actually, the few cases where conntrack itself drops (directly causes it) 
packets should be eliminated and not more added.

Do not blind sysadmins by silently dropping packets. 

Jason, the RST packets which triggered you to write your patch are not 
cause but effect. The cause is the INVALID packets.

Best regards,
Jozsef 

> >  		nf_tcp_handle_invalid(ct, dir, index, skb, state);
> >  		spin_unlock_bh(&ct->lock);
> > -		return -NF_ACCEPT;
> > +		return verdict;
> > +	}
> >  	case NFCT_TCP_ACCEPT:
> >  		break;
> >  	}
> > -- 
> > 2.37.3
> > 
>
Jason Xing March 22, 2024, 11:07 a.m. UTC | #10
On Fri, Mar 22, 2024 at 6:50 PM Jozsef Kadlecsik
<kadlec@blackhole.kfki.hu> wrote:
>
> Hi,
>
> On Thu, 21 Mar 2024, Pablo Neira Ayuso wrote:
>
> > On Mon, Mar 11, 2024 at 03:05:50PM +0800, Jason Xing wrote:
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > Supposing we set DNAT policy converting a_port to b_port on the
> > > server at the beginning, the socket is set up by using 4-tuple:
> > >
> > > client_ip:client_port <--> server_ip:b_port
> > >
> > > Then, some strange skbs from client or gateway, say, out-of-window
> > > skbs are eventually sent to the server_ip:a_port (not b_port)
> > > in TCP layer due to netfilter clearing skb->_nfct value in
> > > nf_conntrack_in() function. Why? Because the tcp_in_window()
> > > considers the incoming skb as an invalid skb by returning
> > > NFCT_TCP_INVALID.
> > >
> > > At last, the TCP layer process the out-of-window
> > > skb (client_ip,client_port,server_ip,a_port) and try to look up
> > > such an socket in tcp_v4_rcv(), as we can see, it will fail for sure
> > > because the port is a_port not our expected b_port and then send
> > > back an RST to the client.
> > >
> > > The detailed call graphs go like this:
> > > 1)
> > > nf_conntrack_in()
> > >   -> nf_conntrack_handle_packet()
> > >     -> nf_conntrack_tcp_packet()
> > >       -> tcp_in_window() // tests if the skb is out-of-window
> > >       -> return -NF_ACCEPT;
> > >   -> skb->_nfct = 0; // if the above line returns a negative value
> > > 2)
> > > tcp_v4_rcv()
> > >   -> __inet_lookup_skb() // fails, then jump to no_tcp_socket
> > >   -> tcp_v4_send_reset()
> > >
> > > The moment the client receives the RST, it will drop. So the RST
> > > skb doesn't hurt the client (maybe hurt some gateway which cancels
> > > the session when filtering the RST without validating
> > > the sequence because of performance reason). Well, it doesn't
> > > matter. However, we can see many strange RST in flight.
> > >
> > > The key reason why I wrote this patch is that I don't think
> > > the behaviour is expected because the RFC 793 defines this
> > > case:
> > >
> > > "If the connection is in a synchronized state (ESTABLISHED,
> > >  FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT),
> > >  any unacceptable segment (out of window sequence number or
> > >  unacceptible acknowledgment number) must elicit only an empty
> > >  acknowledgment segment containing the current send-sequence number
> > >  and an acknowledgment..."
> > >
> > > I think, even we have set DNAT policy, it would be better if the
> > > whole process/behaviour adheres to the original TCP behaviour as
> > > default.
> > >
> > > Suggested-by: Florian Westphal <fw@strlen.de>
> > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > ---
> > > v2
> > > Link: https://lore.kernel.org/netdev/20240307090732.56708-1-kerneljasonxing@gmail.com/
> > > 1. add one more test about NAT and then drop the skb (Florian)
> > > ---
> > >  net/netfilter/nf_conntrack_proto_tcp.c | 15 +++++++++++++--
> > >  1 file changed, 13 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
> > > index ae493599a3ef..19ddac526ea0 100644
> > > --- a/net/netfilter/nf_conntrack_proto_tcp.c
> > > +++ b/net/netfilter/nf_conntrack_proto_tcp.c
> > > @@ -1256,10 +1256,21 @@ int nf_conntrack_tcp_packet(struct nf_conn *ct,
> > >     case NFCT_TCP_IGNORE:
> > >             spin_unlock_bh(&ct->lock);
> > >             return NF_ACCEPT;
> > > -   case NFCT_TCP_INVALID:
> > > +   case NFCT_TCP_INVALID: {
> > > +           int verdict = -NF_ACCEPT;
> > > +
> > > +           if (ct->status & IPS_NAT_MASK)
> > > +                   /* If DNAT is enabled and netfilter receives
> > > +                    * out-of-window skbs, we should drop it directly,
> >
> > Yes, if _be_liberal toggle is disabled this can happen.
> >
> > > +                    * or else skb would miss NAT transformation and
> > > +                    * trigger corresponding RST sending to the flow
> > > +                    * in TCP layer, which is not supposed to happen.
> > > +                    */
> > > +                   verdict = NF_DROP;
> >
> > One comment for the SNAT case.
> >
> > nf_conntrack_in() calls this function from the prerouting hook. For
> > the very first packet, IPS_NAT_MASK might not be yet fully set on
> > (masquerade/snat happens in postrouting), then still one packet can be
> > leaked without NAT mangling in the SNAT case.
> >
> > Rulesets should really need to set default policy to drop in NAT
> > chains to address this.
> >
> > And after this update, user has no chance anymore to bump counters at
> > the end of the policy, to debug issues.
> >
> > We have relied on the rule that "conntrack should not drop packets"
> > since the very beginning, instead signal rulesets that something is
> > invalid, so user decides what to do.
> >
> > I'm ambivalent about this, Jozsef?
>
> [I'm putting on my sysadmin hat.]
>
> My personal opinion is that silently dropping packets does not make
> sysadmin's life easier at all. On the contrary, it makes hunting down
> problems harder and more challenging: you have got no indication
> whatsoever why the given packets were dropped.
>
> The proper solution to the problem is to (log and) drop INVALID packets.
> That is neither a knob nor a workaround: conntrack cannot handle the
> packets and should only signal it to the rule stack.
>
> Actually, the few cases where conntrack itself drops (directly causes it)
> packets should be eliminated and not more added.
>
> Do not blind sysadmins by silently dropping packets.

Thanks for the comment.

Though I'm not totally convinced, I can live with it because it seems
there are no other good ways to solve it perfectly, meanwhile
diminishing my confusion (like resorting to more complex
configurations).

>
> Jason, the RST packets which triggered you to write your patch are not
> cause but effect. The cause is the INVALID packets.

You could say that.

I spent a lot of time tracing down to this area and finally found the
out-of-window causing the problem, so I ignorantly think other admins
also may not know about this :(

Anyway, thanks to both of you for so much patience and help :)

Thanks,
Jason

>
> Best regards,
> Jozsef
>
> > >             nf_tcp_handle_invalid(ct, dir, index, skb, state);
> > >             spin_unlock_bh(&ct->lock);
> > > -           return -NF_ACCEPT;
> > > +           return verdict;
> > > +   }
> > >     case NFCT_TCP_ACCEPT:
> > >             break;
> > >     }
> > > --
> > > 2.37.3
> > >
> >
>
> --
> E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.hu
> PGP key : https://wigner.hu/~kadlec/pgp_public_key.txt
> Address : Wigner Research Centre for Physics
>           H-1525 Budapest 114, POB. 49, Hungary
Jozsef Kadlecsik March 22, 2024, 8:16 p.m. UTC | #11
On Fri, 22 Mar 2024, Jason Xing wrote:

> On Fri, Mar 22, 2024 at 6:50 PM Jozsef Kadlecsik
> <kadlec@blackhole.kfki.hu> wrote:
> >
> > Hi,
> >
> > On Thu, 21 Mar 2024, Pablo Neira Ayuso wrote:
> >
> > > On Mon, Mar 11, 2024 at 03:05:50PM +0800, Jason Xing wrote:
> > > > From: Jason Xing <kernelxing@tencent.com>
> > > >
> > > > Supposing we set DNAT policy converting a_port to b_port on the
> > > > server at the beginning, the socket is set up by using 4-tuple:
> > > >
> > > > client_ip:client_port <--> server_ip:b_port
> > > >
> > > > Then, some strange skbs from client or gateway, say, out-of-window
> > > > skbs are eventually sent to the server_ip:a_port (not b_port)
> > > > in TCP layer due to netfilter clearing skb->_nfct value in
> > > > nf_conntrack_in() function. Why? Because the tcp_in_window()
> > > > considers the incoming skb as an invalid skb by returning
> > > > NFCT_TCP_INVALID.
> > > >
> > > > At last, the TCP layer process the out-of-window
> > > > skb (client_ip,client_port,server_ip,a_port) and try to look up
> > > > such an socket in tcp_v4_rcv(), as we can see, it will fail for sure
> > > > because the port is a_port not our expected b_port and then send
> > > > back an RST to the client.
> > > >
> > > > The detailed call graphs go like this:
> > > > 1)
> > > > nf_conntrack_in()
> > > >   -> nf_conntrack_handle_packet()
> > > >     -> nf_conntrack_tcp_packet()
> > > >       -> tcp_in_window() // tests if the skb is out-of-window
> > > >       -> return -NF_ACCEPT;
> > > >   -> skb->_nfct = 0; // if the above line returns a negative value
> > > > 2)
> > > > tcp_v4_rcv()
> > > >   -> __inet_lookup_skb() // fails, then jump to no_tcp_socket
> > > >   -> tcp_v4_send_reset()
> > > >
> > > > The moment the client receives the RST, it will drop. So the RST
> > > > skb doesn't hurt the client (maybe hurt some gateway which cancels
> > > > the session when filtering the RST without validating
> > > > the sequence because of performance reason). Well, it doesn't
> > > > matter. However, we can see many strange RST in flight.
> > > >
> > > > The key reason why I wrote this patch is that I don't think
> > > > the behaviour is expected because the RFC 793 defines this
> > > > case:
> > > >
> > > > "If the connection is in a synchronized state (ESTABLISHED,
> > > >  FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT),
> > > >  any unacceptable segment (out of window sequence number or
> > > >  unacceptible acknowledgment number) must elicit only an empty
> > > >  acknowledgment segment containing the current send-sequence number
> > > >  and an acknowledgment..."
> > > >
> > > > I think, even we have set DNAT policy, it would be better if the
> > > > whole process/behaviour adheres to the original TCP behaviour as
> > > > default.
> > > >
> > > > Suggested-by: Florian Westphal <fw@strlen.de>
> > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > ---
> > > > v2
> > > > Link: https://lore.kernel.org/netdev/20240307090732.56708-1-kerneljasonxing@gmail.com/
> > > > 1. add one more test about NAT and then drop the skb (Florian)
> > > > ---
> > > >  net/netfilter/nf_conntrack_proto_tcp.c | 15 +++++++++++++--
> > > >  1 file changed, 13 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
> > > > index ae493599a3ef..19ddac526ea0 100644
> > > > --- a/net/netfilter/nf_conntrack_proto_tcp.c
> > > > +++ b/net/netfilter/nf_conntrack_proto_tcp.c
> > > > @@ -1256,10 +1256,21 @@ int nf_conntrack_tcp_packet(struct nf_conn *ct,
> > > >     case NFCT_TCP_IGNORE:
> > > >             spin_unlock_bh(&ct->lock);
> > > >             return NF_ACCEPT;
> > > > -   case NFCT_TCP_INVALID:
> > > > +   case NFCT_TCP_INVALID: {
> > > > +           int verdict = -NF_ACCEPT;
> > > > +
> > > > +           if (ct->status & IPS_NAT_MASK)
> > > > +                   /* If DNAT is enabled and netfilter receives
> > > > +                    * out-of-window skbs, we should drop it directly,
> > >
> > > Yes, if _be_liberal toggle is disabled this can happen.
> > >
> > > > +                    * or else skb would miss NAT transformation and
> > > > +                    * trigger corresponding RST sending to the flow
> > > > +                    * in TCP layer, which is not supposed to happen.
> > > > +                    */
> > > > +                   verdict = NF_DROP;
> > >
> > > One comment for the SNAT case.
> > >
> > > nf_conntrack_in() calls this function from the prerouting hook. For
> > > the very first packet, IPS_NAT_MASK might not be yet fully set on
> > > (masquerade/snat happens in postrouting), then still one packet can be
> > > leaked without NAT mangling in the SNAT case.
> > >
> > > Rulesets should really need to set default policy to drop in NAT
> > > chains to address this.
> > >
> > > And after this update, user has no chance anymore to bump counters at
> > > the end of the policy, to debug issues.
> > >
> > > We have relied on the rule that "conntrack should not drop packets"
> > > since the very beginning, instead signal rulesets that something is
> > > invalid, so user decides what to do.
> > >
> > > I'm ambivalent about this, Jozsef?
> >
> > [I'm putting on my sysadmin hat.]
> >
> > My personal opinion is that silently dropping packets does not make
> > sysadmin's life easier at all. On the contrary, it makes hunting down
> > problems harder and more challenging: you have got no indication
> > whatsoever why the given packets were dropped.
> >
> > The proper solution to the problem is to (log and) drop INVALID packets.
> > That is neither a knob nor a workaround: conntrack cannot handle the
> > packets and should only signal it to the rule stack.
> >
> > Actually, the few cases where conntrack itself drops (directly causes it)
> > packets should be eliminated and not more added.
> >
> > Do not blind sysadmins by silently dropping packets.
> 
> Thanks for the comment.
> 
> Though I'm not totally convinced, I can live with it because it seems
> there are no other good ways to solve it perfectly, meanwhile
> diminishing my confusion (like resorting to more complex
> configurations).
> 
> >
> > Jason, the RST packets which triggered you to write your patch are not
> > cause but effect. The cause is the INVALID packets.
> 
> You could say that.
> 
> I spent a lot of time tracing down to this area and finally found the
> out-of-window causing the problem, so I ignorantly think other admins
> also may not know about this :(
> 
> Anyway, thanks to both of you for so much patience and help :)

I understand and appreciate your efforts. But please consider the case 
when one have to diagnose a failing connection and conntrack drops 
packets. What should be suspected? Firewall rules? One can enable TRACE 
and check which rules are hit - but because conntrack drops packet, 
nothing is shown there. Enable and check conntrack events? Because the 
packets are INVALID, checking the events does not help either. Only when 
one runs tcpdump and compares it with the TRACE/NFLOG/LOG entries can one 
spot that some packets "disappeared".

Compare the whole thing with the case when packets are not dropped 
silently but can be logged via checking the INVALID flag. One can directly 
tell that conntrack could not handle the packets and can see all packet 
parameters.

Best regards,
Jozsef

> > > >             nf_tcp_handle_invalid(ct, dir, index, skb, state);
> > > >             spin_unlock_bh(&ct->lock);
> > > > -           return -NF_ACCEPT;
> > > > +           return verdict;
> > > > +   }
> > > >     case NFCT_TCP_ACCEPT:
> > > >             break;
> > > >     }
> > > > --
> > > > 2.37.3
> > > >
> > >
> >
> > --
> > E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.hu
> > PGP key : https://wigner.hu/~kadlec/pgp_public_key.txt
> > Address : Wigner Research Centre for Physics
> >           H-1525 Budapest 114, POB. 49, Hungary
>
Jason Xing March 23, 2024, 12:25 a.m. UTC | #12
> I understand and appreciate your efforts. But please consider the case
> when one have to diagnose a failing connection and conntrack drops
> packets. What should be suspected? Firewall rules? One can enable TRACE
> and check which rules are hit - but because conntrack drops packet,
> nothing is shown there. Enable and check conntrack events? Because the
> packets are INVALID, checking the events does not help either. Only when
> one runs tcpdump and compares it with the TRACE/NFLOG/LOG entries can one
> spot that some packets "disappeared".
>
> Compare the whole thing with the case when packets are not dropped
> silently but can be logged via checking the INVALID flag. One can directly
> tell that conntrack could not handle the packets and can see all packet
> parameters.

Thanks for explaining such importance about why not drop silently. Now
I can see :)

In my first version, I didn't drop it directly but let it go without
clearing skb->_nfct fields and then let the TCP layer handle it. As
you said, the out-of-window case is just one of some INVALID cases
which could also cause RST behaviour, so it seems that the first
version doesn't handle it well either. It has to take all INVALID
cases into account...

Is there anything left I can do like particular tracepoints something
like this? No idea. I only hope somebody who encounters such an issue
can notice this behaviour effortlessly :)

Thanks,
Jason

>
> Best regards,
> Jozsef
diff mbox series

Patch

diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
index ae493599a3ef..19ddac526ea0 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -1256,10 +1256,21 @@  int nf_conntrack_tcp_packet(struct nf_conn *ct,
 	case NFCT_TCP_IGNORE:
 		spin_unlock_bh(&ct->lock);
 		return NF_ACCEPT;
-	case NFCT_TCP_INVALID:
+	case NFCT_TCP_INVALID: {
+		int verdict = -NF_ACCEPT;
+
+		if (ct->status & IPS_NAT_MASK)
+			/* If DNAT is enabled and netfilter receives
+			 * out-of-window skbs, we should drop it directly,
+			 * or else skb would miss NAT transformation and
+			 * trigger corresponding RST sending to the flow
+			 * in TCP layer, which is not supposed to happen.
+			 */
+			verdict = NF_DROP;
 		nf_tcp_handle_invalid(ct, dir, index, skb, state);
 		spin_unlock_bh(&ct->lock);
-		return -NF_ACCEPT;
+		return verdict;
+	}
 	case NFCT_TCP_ACCEPT:
 		break;
 	}