| Message ID | 20210707154630.583448-1-eric.dumazet@gmail.com (mailing list archive) |
|---|---|
| State | Superseded |
| Delegated to | Netdev Maintainers |
| Series | [net] ipv6: tcp: drop silly ICMPv6 packet too big messages |
| Context | Check | Description |
|---|---|---|
| netdev/cover_letter | success | Link |
| netdev/fixes_present | success | Link |
| netdev/patch_count | success | Link |
| netdev/tree_selection | success | Clearly marked for net |
| netdev/subject_prefix | success | Link |
| netdev/cc_maintainers | warning | 2 maintainers not CCed: yoshfuji@linux-ipv6.org dsahern@kernel.org |
| netdev/source_inline | success | Was 0 now: 0 |
| netdev/verify_signedoff | success | Link |
| netdev/module_param | success | Was 0 now: 0 |
| netdev/build_32bit | success | Errors and warnings before: 2 this patch: 2 |
| netdev/kdoc | success | Errors and warnings before: 0 this patch: 0 |
| netdev/verify_fixes | success | Link |
| netdev/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 41 lines checked |
| netdev/build_allmodconfig_warn | success | Errors and warnings before: 2 this patch: 2 |
| netdev/header_inline | success | Link |
Hi Eric,

On Wed, Jul 07, 2021 at 08:46:30AM -0700, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> While TCP stack scales reasonably well, there is still one part that
> can be used to DDOS it.
>
> IPv6 Packet too big messages have to lookup/insert a new route,
> and if abused by attackers, can easily put hosts under high stress,
> with many cpus contending on a spinlock while one is stuck in fib6_run_gc()

Just thinking out loud, wouldn't it make sense to support randomly dropping
such packets on input (or even better, rate-limiting them)? After all, if
a host on the net feels like it will need to send one, it will surely need
to send a few more until one is taken into account, so it's not dramatic.
And this could help significantly reduce their processing cost.

Willy
On Wed, Jul 7, 2021 at 5:59 PM Willy Tarreau <w@1wt.eu> wrote:
>
> Hi Eric,
>
> On Wed, Jul 07, 2021 at 08:46:30AM -0700, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > While TCP stack scales reasonably well, there is still one part that
> > can be used to DDOS it.
> >
> > IPv6 Packet too big messages have to lookup/insert a new route,
> > and if abused by attackers, can easily put hosts under high stress,
> > with many cpus contending on a spinlock while one is stuck in fib6_run_gc()
>
> Just thinking out loud, wouldn't it make sense to support randomly dropping
> such packets on input (or even better, rate-limiting them)? After all, if
> a host on the net feels like it will need to send one, it will surely need
> to send a few more until one is taken into account, so it's not dramatic.
> And this could help significantly reduce their processing cost.

Not sure what you mean by random. We probably want to process valid
packets, if they ever reach us.

In our case, we could simply drop all ICMPv6 "packet too big" messages,
since we clamp TCP/IPv6 MSS to the bare minimum anyway.

Adding a generic check in the TCP/IPv6 stack is cheaper than an iptables
rule (especially if this is the only rule that must be used).
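For context, the userspace alternative Eric compares against would be a single rule like the sketch below. This is illustrative only, not something from the thread; note that it disables classic PMTU discovery for everything, which is acceptable only when the MSS is already clamped to the minimum as Eric describes.

```sh
# Illustrative ip6tables equivalent of "drop all ICMPv6 packet too big".
# Breaks PMTU discovery unless TCP MSS is clamped to the minimum.
ip6tables -A INPUT -p icmpv6 --icmpv6-type packet-too-big -j DROP
```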
On Wed, Jul 07, 2021 at 06:06:21PM +0200, Eric Dumazet wrote:
> On Wed, Jul 7, 2021 at 5:59 PM Willy Tarreau <w@1wt.eu> wrote:
> >
> > Hi Eric,
> >
> > On Wed, Jul 07, 2021 at 08:46:30AM -0700, Eric Dumazet wrote:
> > > From: Eric Dumazet <edumazet@google.com>
> > >
> > > While TCP stack scales reasonably well, there is still one part that
> > > can be used to DDOS it.
> > >
> > > IPv6 Packet too big messages have to lookup/insert a new route,
> > > and if abused by attackers, can easily put hosts under high stress,
> > > with many cpus contending on a spinlock while one is stuck in fib6_run_gc()
> >
> > Just thinking out loud, wouldn't it make sense to support randomly dropping
> > such packets on input (or even better, rate-limiting them)? After all, if
> > a host on the net feels like it will need to send one, it will surely need
> > to send a few more until one is taken into account, so it's not dramatic.
> > And this could help significantly reduce their processing cost.
>
> Not sure what you mean by random.

I just meant statistical randomness, e.g. dropping 9 out of 10 when under
stress.

> We probably want to process valid packets, if they ever reach us.

That's indeed the other side of my question, i.e. if a server gets hit by
such a flood, do we consider it more important to spend the CPU cycles
processing all received packets, or can we afford to drop a lot of them?

> In our case, we could simply drop all ICMPv6 "packet too big" messages,
> since we clamp TCP/IPv6 MSS to the bare minimum anyway.
>
> Adding a generic check in the TCP/IPv6 stack is cheaper than an iptables
> rule (especially if this is the only rule that must be used).

Sure, I was not thinking about iptables here, rather a hard-coded
prandom_u32() call or a per-CPU cycling counter.

Willy
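What Willy sketches might look roughly like the snippet below. It is purely illustrative, not code from the thread: the function name and the 9-out-of-10 ratio are invented for the example; prandom_u32() is the in-kernel pseudo-random helper he names.

```c
#include <linux/prandom.h>

/* Hypothetical sampling drop for ICMPv6 Packet Too Big messages, per
 * Willy's suggestion: keep roughly 1 message in 10 and drop the rest,
 * so a flood only pays a fraction of the route-update cost.
 */
static bool icmpv6_ptb_sample_drop(void)
{
	return prandom_u32() % 10 != 0;	/* true => drop this message */
}
```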
On Wed, Jul 7, 2021 at 6:15 PM Willy Tarreau <w@1wt.eu> wrote:
>
> On Wed, Jul 07, 2021 at 06:06:21PM +0200, Eric Dumazet wrote:
> > On Wed, Jul 7, 2021 at 5:59 PM Willy Tarreau <w@1wt.eu> wrote:
> > >
> > > Hi Eric,
> > >
> > > On Wed, Jul 07, 2021 at 08:46:30AM -0700, Eric Dumazet wrote:
> > > > From: Eric Dumazet <edumazet@google.com>
> > > >
> > > > While TCP stack scales reasonably well, there is still one part that
> > > > can be used to DDOS it.
> > > >
> > > > IPv6 Packet too big messages have to lookup/insert a new route,
> > > > and if abused by attackers, can easily put hosts under high stress,
> > > > with many cpus contending on a spinlock while one is stuck in fib6_run_gc()
> > >
> > > Just thinking out loud, wouldn't it make sense to support randomly dropping
> > > such packets on input (or even better, rate-limiting them)? After all, if
> > > a host on the net feels like it will need to send one, it will surely need
> > > to send a few more until one is taken into account, so it's not dramatic.
> > > And this could help significantly reduce their processing cost.
> >
> > Not sure what you mean by random.
>
> I just meant statistical randomness, e.g. dropping 9 out of 10 when under
> stress.

It is hard to define 'stress'. In our case we were maybe receiving 10
ICMPv6 messages per second "only".

I would rather define the issue as a deficiency in the current IPv6 stack
vs routes.

One can hope that one day the issue will disappear.

> > We probably want to process valid packets, if they ever reach us.
>
> That's indeed the other side of my question, i.e. if a server gets hit by
> such a flood, do we consider it more important to spend the CPU cycles
> processing all received packets, or can we afford to drop a lot of them?
>
> > In our case, we could simply drop all ICMPv6 "packet too big" messages,
> > since we clamp TCP/IPv6 MSS to the bare minimum anyway.
> >
> > Adding a generic check in the TCP/IPv6 stack is cheaper than an iptables
> > rule (especially if this is the only rule that must be used).
>
> Sure, I was not thinking about iptables here, rather a hard-coded
> prandom_u32() call or a per-CPU cycling counter.
>
> Willy
On Wed, Jul 07, 2021 at 06:25:10PM +0200, Eric Dumazet wrote:
> On Wed, Jul 7, 2021 at 6:15 PM Willy Tarreau <w@1wt.eu> wrote:
> >
> > On Wed, Jul 07, 2021 at 06:06:21PM +0200, Eric Dumazet wrote:
> > > On Wed, Jul 7, 2021 at 5:59 PM Willy Tarreau <w@1wt.eu> wrote:
> > > >
> > > > Hi Eric,
> > > >
> > > > On Wed, Jul 07, 2021 at 08:46:30AM -0700, Eric Dumazet wrote:
> > > > > From: Eric Dumazet <edumazet@google.com>
> > > > >
> > > > > While TCP stack scales reasonably well, there is still one part that
> > > > > can be used to DDOS it.
> > > > >
> > > > > IPv6 Packet too big messages have to lookup/insert a new route,
> > > > > and if abused by attackers, can easily put hosts under high stress,
> > > > > with many cpus contending on a spinlock while one is stuck in fib6_run_gc()
> > > >
> > > > Just thinking out loud, wouldn't it make sense to support randomly dropping
> > > > such packets on input (or even better, rate-limiting them)? After all, if
> > > > a host on the net feels like it will need to send one, it will surely need
> > > > to send a few more until one is taken into account, so it's not dramatic.
> > > > And this could help significantly reduce their processing cost.
> > >
> > > Not sure what you mean by random.
> >
> > I just meant statistical randomness, e.g. dropping 9 out of 10 when under
> > stress.
>
> It is hard to define 'stress'. In our case we were maybe receiving 10
> ICMPv6 messages per second "only".
>
> I would rather define the issue as a deficiency in the current IPv6 stack
> vs routes.
>
> One can hope that one day the issue will disappear.

Ouch. I thought you were speaking about millions of PPS. For sure, if
10 pkt/s already cause harm, there's little to nothing that can be
achieved with sampling!

Thanks for the clarifications!

Willy
On Wed, Jul 7, 2021 at 8:46 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> From: Eric Dumazet <edumazet@google.com>
>
> While TCP stack scales reasonably well, there is still one part that
> can be used to DDOS it.
>
> IPv6 Packet too big messages have to lookup/insert a new route,
> and if abused by attackers, can easily put hosts under high stress,
> with many cpus contending on a spinlock while one is stuck in fib6_run_gc()
>
> ip6_protocol_deliver_rcu()
>  icmpv6_rcv()
>   icmpv6_notify()
>    tcp_v6_err()
>     tcp_v6_mtu_reduced()
>      inet6_csk_update_pmtu()
>       ip6_rt_update_pmtu()
>        __ip6_rt_update_pmtu()
>         ip6_rt_cache_alloc()
>          ip6_dst_alloc()
>           dst_alloc()
>            ip6_dst_gc()
>             fib6_run_gc()
>              spin_lock_bh() ...
>
> Some of our servers have been hit by malicious ICMPv6 packets
> trying to _increase_ the MTU/MSS of TCP flows.
>
> We believe these ICMPv6 packets are a result of a bug in one ISP stack,
> since they were blindly sent back for _every_ (small) packet sent to them.
>
> These packets are for one TCP flow:
> 09:24:36.266491 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
> 09:24:36.266509 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
> 09:24:36.316688 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
> 09:24:36.316704 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
> 09:24:36.608151 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
>
> TCP stack can filter some silly requests:
>
> 1) MTU below IPV6_MIN_MTU can be filtered early in tcp_v6_err()
> 2) tcp_v6_mtu_reduced() can drop requests trying to increase current MSS.
>
> These tests happen before the IPv6 routing stack is entered, thus
> removing the potential contention and route exhaustion.
>
> Note that the IPv6 stack was performing these checks, but too late
> (i.e. after the route has been added, and after the potential
> garbage collect war).
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Maciej Żenczykowski <maze@google.com>
> ---
>  net/ipv6/tcp_ipv6.c | 19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 593c32fe57ed13a218492fd6056f2593e601ec79..bc334a6f24992c7b5b2c415eab4b6cf51bf36cb4 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -348,11 +348,20 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
>  static void tcp_v6_mtu_reduced(struct sock *sk)
>  {
>         struct dst_entry *dst;
> +       u32 mtu;
>
>         if ((1 << sk->sk_state) & (TCPF_LISTEN | TCPF_CLOSE))
>                 return;
>
> -       dst = inet6_csk_update_pmtu(sk, READ_ONCE(tcp_sk(sk)->mtu_info));
> +       mtu = READ_ONCE(tcp_sk(sk)->mtu_info);
> +
> +       /* Drop requests trying to increase our current mss.
> +        * Check done in __ip6_rt_update_pmtu() is too late.
> +        */
> +       if (tcp_mss_to_mtu(sk, mtu) >= tcp_sk(sk)->mss_cache)
> +               return;
> +
> +       dst = inet6_csk_update_pmtu(sk, mtu);
>         if (!dst)
>                 return;
>
> @@ -433,6 +442,8 @@ static int tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
>         }
>
>         if (type == ICMPV6_PKT_TOOBIG) {
> +               u32 mtu = ntohl(info);
> +
>                 /* We are not interested in TCP_LISTEN and open_requests
>                  * (SYN-ACKs send out by Linux are always <576bytes so
>                  * they should go through unfragmented).
> @@ -443,7 +454,11 @@ static int tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
>                 if (!ip6_sk_accept_pmtu(sk))
>                         goto out;
>
> -               WRITE_ONCE(tp->mtu_info, ntohl(info));
> +               if (mtu < IPV6_MIN_MTU)
> +                       goto out;
> +
> +               WRITE_ONCE(tp->mtu_info, mtu);
> +
>                 if (!sock_owned_by_user(sk))
>                         tcp_v6_mtu_reduced(sk);
>                 else if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED,
> --
> 2.32.0.93.g670b81a890-goog

Reviewed-by: Maciej Żenczykowski <maze@google.com>
On Wed, Jul 7, 2021 at 6:27 PM Willy Tarreau <w@1wt.eu> wrote:
>
> On Wed, Jul 07, 2021 at 06:25:10PM +0200, Eric Dumazet wrote:
> > On Wed, Jul 7, 2021 at 6:15 PM Willy Tarreau <w@1wt.eu> wrote:
> >
> > One can hope that one day the issue will disappear.
>
> Ouch. I thought you were speaking about millions of PPS. For sure, if
> 10 pkt/s already cause harm, there's little to nothing that can be
> achieved with sampling!

Yes, these routes expire after 600 seconds, and ipv6/max_size defaults
to 4096.

After a few minutes, the route cache is full and bad things happen.
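The two knobs Eric refers to are exposed through sysctl; the names below come from the kernel's ip-sysctl documentation, with the defaults he cites. The arithmetic is what makes 10 packets per second enough: at a 600-second lifetime, that is roughly 6000 cached exception routes, well above the 4096-entry default.

```sh
# Route-cache pressure knobs referenced in the thread (defaults shown):
sysctl net.ipv6.route.max_size      # 4096 entries
sysctl net.ipv6.route.mtu_expires   # 600 seconds before a PMTU route expires
```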
On Wed, Jul 07, 2021 at 08:46:30AM -0700, Eric Dumazet wrote:
[ ... ]

> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 593c32fe57ed13a218492fd6056f2593e601ec79..bc334a6f24992c7b5b2c415eab4b6cf51bf36cb4 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -348,11 +348,20 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
>  static void tcp_v6_mtu_reduced(struct sock *sk)
>  {
>         struct dst_entry *dst;
> +       u32 mtu;
>
>         if ((1 << sk->sk_state) & (TCPF_LISTEN | TCPF_CLOSE))
>                 return;
>
> -       dst = inet6_csk_update_pmtu(sk, READ_ONCE(tcp_sk(sk)->mtu_info));
> +       mtu = READ_ONCE(tcp_sk(sk)->mtu_info);
> +
> +       /* Drop requests trying to increase our current mss.
> +        * Check done in __ip6_rt_update_pmtu() is too late.
> +        */
> +       if (tcp_mss_to_mtu(sk, mtu) >= tcp_sk(sk)->mss_cache)

Instead of tcp_mss_to_mtu, should this be calling tcp_mtu_to_mss to
compare with tcp_sk(sk)->mss_cache?
On Thu, Jul 8, 2021 at 12:26 AM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Wed, Jul 07, 2021 at 08:46:30AM -0700, Eric Dumazet wrote:
> [ ... ]
>
> > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > index 593c32fe57ed13a218492fd6056f2593e601ec79..bc334a6f24992c7b5b2c415eab4b6cf51bf36cb4 100644
> > --- a/net/ipv6/tcp_ipv6.c
> > +++ b/net/ipv6/tcp_ipv6.c
> > @@ -348,11 +348,20 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> >  static void tcp_v6_mtu_reduced(struct sock *sk)
> >  {
> >         struct dst_entry *dst;
> > +       u32 mtu;
> >
> >         if ((1 << sk->sk_state) & (TCPF_LISTEN | TCPF_CLOSE))
> >                 return;
> >
> > -       dst = inet6_csk_update_pmtu(sk, READ_ONCE(tcp_sk(sk)->mtu_info));
> > +       mtu = READ_ONCE(tcp_sk(sk)->mtu_info);
> > +
> > +       /* Drop requests trying to increase our current mss.
> > +        * Check done in __ip6_rt_update_pmtu() is too late.
> > +        */
> > +       if (tcp_mss_to_mtu(sk, mtu) >= tcp_sk(sk)->mss_cache)
>
> Instead of tcp_mss_to_mtu, should this be calling tcp_mtu_to_mss to
> compare with tcp_sk(sk)->mss_cache?

Good catch, Martin. Let me fix that and run a few more tests before
sending a v2.

Thanks!
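Martin's point is that tcp_mss_to_mtu() converts in the wrong direction here: the check as posted compares an MTU-derived value against an MSS. A minimal sketch of the corrected hunk, assuming v2 simply swaps in tcp_mtu_to_mss() as suggested (v2 itself is not shown in this thread):

```c
	mtu = READ_ONCE(tcp_sk(sk)->mtu_info);

	/* Drop requests trying to increase our current mss.
	 * Convert the announced MTU down to an MSS first, so the
	 * comparison is MSS vs mss_cache, per Martin's review.
	 */
	if (tcp_mtu_to_mss(sk, mtu) >= tcp_sk(sk)->mss_cache)
		return;

	dst = inet6_csk_update_pmtu(sk, mtu);
```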
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 593c32fe57ed13a218492fd6056f2593e601ec79..bc334a6f24992c7b5b2c415eab4b6cf51bf36cb4 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -348,11 +348,20 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 static void tcp_v6_mtu_reduced(struct sock *sk)
 {
 	struct dst_entry *dst;
+	u32 mtu;
 
 	if ((1 << sk->sk_state) & (TCPF_LISTEN | TCPF_CLOSE))
 		return;
 
-	dst = inet6_csk_update_pmtu(sk, READ_ONCE(tcp_sk(sk)->mtu_info));
+	mtu = READ_ONCE(tcp_sk(sk)->mtu_info);
+
+	/* Drop requests trying to increase our current mss.
+	 * Check done in __ip6_rt_update_pmtu() is too late.
+	 */
+	if (tcp_mss_to_mtu(sk, mtu) >= tcp_sk(sk)->mss_cache)
+		return;
+
+	dst = inet6_csk_update_pmtu(sk, mtu);
 	if (!dst)
 		return;
 
@@ -433,6 +442,8 @@ static int tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 	}
 
 	if (type == ICMPV6_PKT_TOOBIG) {
+		u32 mtu = ntohl(info);
+
 		/* We are not interested in TCP_LISTEN and open_requests
 		 * (SYN-ACKs send out by Linux are always <576bytes so
 		 * they should go through unfragmented).
@@ -443,7 +454,11 @@ static int tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 		if (!ip6_sk_accept_pmtu(sk))
 			goto out;
 
-		WRITE_ONCE(tp->mtu_info, ntohl(info));
+		if (mtu < IPV6_MIN_MTU)
+			goto out;
+
+		WRITE_ONCE(tp->mtu_info, mtu);
+
 		if (!sock_owned_by_user(sk))
 			tcp_v6_mtu_reduced(sk);
 		else if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED,