diff mbox series

[2/2] Rescan the hash2 list if the hash chains have got cross-linked.

Message ID c45337a3d46641dc8c4c66bd49fb55b6@AcuMS.aculab.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series udp: rescan hash2 list if chains crossed. | expand

Checks

Context Check Description
netdev/series_format warning Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1350 this patch: 1350
netdev/cc_maintainers success CCed 7 of 7 maintainers
netdev/build_clang success Errors and warnings before: 1365 this patch: 1365
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 1373 this patch: 1373
netdev/checkpatch warning CHECK: From:/Signed-off-by: email comments mismatch: 'From: David Laight <David.Laight@ACULAB.COM>' != 'Signed-off-by: David Laight <david.laight@aculab.com>' WARNING: Please use correct Fixes: style 'Fixes: <12 chars of sha1> ("<title line>")' - ie: 'Fixes: ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU")'
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

David Laight July 26, 2023, 12:05 p.m. UTC
udp_lib_rehash() can get called at any time and will move a
socket to a different hash2 chain.
This can cause udp4_lib_lookup2() (processing incoming UDP) to
fail to find a socket and an ICMP port unreachable be sent.

Prior to ca065d0cf80fa the lookup used 'hlist_nulls' and checked
that the 'end if list' marker was on the correct list.

Rather than re-instate the 'nulls' list just check that the final
socket is on the correct list.

The cross-linking can definitely happen (see earlier issues with
it looping forever because gcc cached the list head).

Fixes: ca065d0cf80fa ("udp: no longer use SLAB_DESTROY_BY_RCU")
Signed-off-by: David Laight <david.laight@aculab.com>
---
 net/ipv4/udp.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

Comments

Paolo Abeni July 26, 2023, 1:36 p.m. UTC | #1
Hi,

On Wed, 2023-07-26 at 12:05 +0000, David Laight wrote:
> udp_lib_rehash() can get called at any time and will move a
> socket to a different hash2 chain.
> This can cause udp4_lib_lookup2() (processing incoming UDP) to
> fail to find a socket and an ICMP port unreachable be sent.
> 
> Prior to ca065d0cf80fa the lookup used 'hlist_nulls' and checked
> that the 'end if list' marker was on the correct list.
> 
> Rather than re-instate the 'nulls' list just check that the final
> socket is on the correct list.
> 
> The cross-linking can definitely happen (see earlier issues with
> it looping forever because gcc cached the list head).
> 
> Fixes: ca065d0cf80fa ("udp: no longer use SLAB_DESTROY_BY_RCU")
> Signed-off-by: David Laight <david.laight@aculab.com>
> ---
>  net/ipv4/udp.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index ad64d6c4cd99..ed92ba7610b0 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -443,6 +443,7 @@ static struct sock *udp4_lib_lookup2(struct net *net,
>  				     struct sk_buff *skb)
>  {
>  	unsigned int hash2, slot2;
> +	unsigned int hash2_rescan;
>  	struct udp_hslot *hslot2;
>  	struct sock *sk, *result;
>  	int score, badness;
> @@ -451,9 +452,12 @@ static struct sock *udp4_lib_lookup2(struct net *net,
>  	slot2 = hash2 & udptable->mask;
>  	hslot2 = &udptable->hash2[slot2];
>  
> +rescan:
> +	hash2_rescan = hash2;
>  	result = NULL;
>  	badness = 0;
>  	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
> +		hash2_rescan = udp_sk(sk)->udp_portaddr_hash;
>  		score = compute_score(sk, net, saddr, sport,
>  				      daddr, hnum, dif, sdif);
>  		if (score > badness) {
> @@ -467,6 +471,16 @@ static struct sock *udp4_lib_lookup2(struct net *net,
>  			badness = score;
>  		}
>  	}
> +
> +	/* udp sockets can get moved to a different hash chain.
> +	 * If the chains have got crossed then rescan.
> +	 */
> +	if ((hash2_rescan & udptable->mask) != slot2) {
> +		/* Ensure hslot2->head is reread */
> +		barrier();

udp_portaddr_for_each_entry_rcu() uses (indirectly)
rcu_dereference_raw() to access the head. That implies a READ_ONCE().
Additional barriers for re-read should not be necessary.

What about IPV6?

Cheers,

Paolo
Eric Dumazet July 26, 2023, 1:37 p.m. UTC | #2
On Wed, Jul 26, 2023 at 2:06 PM David Laight <David.Laight@aculab.com> wrote:
>
> udp_lib_rehash() can get called at any time and will move a
> socket to a different hash2 chain.
> This can cause udp4_lib_lookup2() (processing incoming UDP) to
> fail to find a socket and an ICMP port unreachable be sent.
>
> Prior to ca065d0cf80fa the lookup used 'hlist_nulls' and checked
> that the 'end if list' marker was on the correct list.
>
> Rather than re-instate the 'nulls' list just check that the final
> socket is on the correct list.
>
> The cross-linking can definitely happen (see earlier issues with
> it looping forever because gcc cached the list head).
>
> Fixes: ca065d0cf80fa ("udp: no longer use SLAB_DESTROY_BY_RCU")
> Signed-off-by: David Laight <david.laight@aculab.com>
> ---

Hi David, thanks a lot for the investigations.

I do not think this is the proper fix.

UDP rehash has always been buggy, because we lack an rcu grace period
between the removal of the socket
from the old hash bucket to the new one.

We need to stuff a synchronize_rcu() somewhere in udp_lib_rehash(),
and that might not be easy [1]
and might add unexpected latency to some real time applications.
([1] : Not sure if we are allowed to sleep in udp_lib_rehash())

Also note that adding a synchronize_rcu() would mean the socket would
not be found anyway by some incoming packets.

I think that rehash is tricky to implement if you expect that all
incoming packets must find the socket, wherever it is located.


>  net/ipv4/udp.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index ad64d6c4cd99..ed92ba7610b0 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -443,6 +443,7 @@ static struct sock *udp4_lib_lookup2(struct net *net,
>                                      struct sk_buff *skb)
>  {
>         unsigned int hash2, slot2;
> +       unsigned int hash2_rescan;
>         struct udp_hslot *hslot2;
>         struct sock *sk, *result;
>         int score, badness;
> @@ -451,9 +452,12 @@ static struct sock *udp4_lib_lookup2(struct net *net,
>         slot2 = hash2 & udptable->mask;
>         hslot2 = &udptable->hash2[slot2];
>
> +rescan:
> +       hash2_rescan = hash2;
>         result = NULL;
>         badness = 0;
>         udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
> +               hash2_rescan = udp_sk(sk)->udp_portaddr_hash;
>                 score = compute_score(sk, net, saddr, sport,
>                                       daddr, hnum, dif, sdif);
>                 if (score > badness) {
> @@ -467,6 +471,16 @@ static struct sock *udp4_lib_lookup2(struct net *net,
>                         badness = score;
>                 }
>         }
> +
> +       /* udp sockets can get moved to a different hash chain.
> +        * If the chains have got crossed then rescan.
> +        */
> +       if ((hash2_rescan & udptable->mask) != slot2) {

This is only going to catch one of the possible cases.

If we really want to add extra checks in this fast path, we would have
to check all found sockets,
not only the last one.
David Laight July 26, 2023, 2:13 p.m. UTC | #3
From: Eric Dumazet
> Sent: 26 July 2023 14:37
> 
> On Wed, Jul 26, 2023 at 2:06 PM David Laight <David.Laight@aculab.com> wrote:
> >
> > udp_lib_rehash() can get called at any time and will move a
> > socket to a different hash2 chain.
> > This can cause udp4_lib_lookup2() (processing incoming UDP) to
> > fail to find a socket and an ICMP port unreachable be sent.
> >
> > Prior to ca065d0cf80fa the lookup used 'hlist_nulls' and checked
> > that the 'end if list' marker was on the correct list.
> >
> > Rather than re-instate the 'nulls' list just check that the final
> > socket is on the correct list.
> >
> > The cross-linking can definitely happen (see earlier issues with
> > it looping forever because gcc cached the list head).
> >
> > Fixes: ca065d0cf80fa ("udp: no longer use SLAB_DESTROY_BY_RCU")
> > Signed-off-by: David Laight <david.laight@aculab.com>
> > ---
> 
> Hi David, thanks a lot for the investigations.
> 
> I do not think this is the proper fix.
> 
> UDP rehash has always been buggy, because we lack an rcu grace period
> between the removal of the socket
> from the old hash bucket to the new one.
> 
> We need to stuff a synchronize_rcu() somewhere in udp_lib_rehash(),
> and that might not be easy [1]
> and might add unexpected latency to some real time applications.
> ([1] : Not sure if we are allowed to sleep in udp_lib_rehash())

I'm also not sure that the callers would always like the potentially
long rcu sleep.

> Also note that adding a synchronize_rcu() would mean the socket would
> not be found anyway by some incoming packets.
> 
> I think that rehash is tricky to implement if you expect that all
> incoming packets must find the socket, wherever it is located.

I thought about something like the checks done for reading
multi-word counters.
Something like requiring the updater to increment a count on
entry and exit and have the reader rescan with the lock held
if the count is odd or changes.

The problem is that a single 'port unreachable' can be treated
as a fatal error by the receiving application.
So you really don't want to be sending them.

> 
> 
> >  net/ipv4/udp.c | 14 ++++++++++++++
> >  1 file changed, 14 insertions(+)
> >
> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > index ad64d6c4cd99..ed92ba7610b0 100644
> > --- a/net/ipv4/udp.c
> > +++ b/net/ipv4/udp.c
> > @@ -443,6 +443,7 @@ static struct sock *udp4_lib_lookup2(struct net *net,
> >                                      struct sk_buff *skb)
> >  {
> >         unsigned int hash2, slot2;
> > +       unsigned int hash2_rescan;
> >         struct udp_hslot *hslot2;
> >         struct sock *sk, *result;
> >         int score, badness;
> > @@ -451,9 +452,12 @@ static struct sock *udp4_lib_lookup2(struct net *net,
> >         slot2 = hash2 & udptable->mask;
> >         hslot2 = &udptable->hash2[slot2];
> >
> > +rescan:
> > +       hash2_rescan = hash2;
> >         result = NULL;
> >         badness = 0;
> >         udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
> > +               hash2_rescan = udp_sk(sk)->udp_portaddr_hash;
> >                 score = compute_score(sk, net, saddr, sport,
> >                                       daddr, hnum, dif, sdif);
> >                 if (score > badness) {
> > @@ -467,6 +471,16 @@ static struct sock *udp4_lib_lookup2(struct net *net,
> >                         badness = score;
> >                 }
> >         }
> > +
> > +       /* udp sockets can get moved to a different hash chain.
> > +        * If the chains have got crossed then rescan.
> > +        */
> > +       if ((hash2_rescan & udptable->mask) != slot2) {
> 
> This is only going to catch one of the possible cases.
> 
> If we really want to add extra checks in this fast path, we would have
> to check all found sockets,
> not only the last one.

I did think about that.
Being hit by a single rehash is very unlikely.
Being hit by two that also put you back onto the original
chain really isn't going to happen.

Putting the check inside the loop would save the test when the
hash list is empty - probably common for the first lookup.

The code in compute_score() is pretty horrid so maybe it
wouldn't really be noticeable.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Eric Dumazet July 26, 2023, 2:21 p.m. UTC | #4
On Wed, Jul 26, 2023 at 4:13 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Eric Dumazet
> > Sent: 26 July 2023 14:37
> >
> > On Wed, Jul 26, 2023 at 2:06 PM David Laight <David.Laight@aculab.com> wrote:
> > >
> > > udp_lib_rehash() can get called at any time and will move a
> > > socket to a different hash2 chain.
> > > This can cause udp4_lib_lookup2() (processing incoming UDP) to
> > > fail to find a socket and an ICMP port unreachable be sent.
> > >
> > > Prior to ca065d0cf80fa the lookup used 'hlist_nulls' and checked
> > > that the 'end if list' marker was on the correct list.
> > >
> > > Rather than re-instate the 'nulls' list just check that the final
> > > socket is on the correct list.
> > >
> > > The cross-linking can definitely happen (see earlier issues with
> > > it looping forever because gcc cached the list head).
> > >
> > > Fixes: ca065d0cf80fa ("udp: no longer use SLAB_DESTROY_BY_RCU")
> > > Signed-off-by: David Laight <david.laight@aculab.com>
> > > ---
> >
> > Hi David, thanks a lot for the investigations.
> >
> > I do not think this is the proper fix.
> >
> > UDP rehash has always been buggy, because we lack an rcu grace period
> > between the removal of the socket
> > from the old hash bucket to the new one.
> >
> > We need to stuff a synchronize_rcu() somewhere in udp_lib_rehash(),
> > and that might not be easy [1]
> > and might add unexpected latency to some real time applications.
> > ([1] : Not sure if we are allowed to sleep in udp_lib_rehash())
>
> I'm also not sure that the callers would always like the potentially
> long rcu sleep.
>
> > Also note that adding a synchronize_rcu() would mean the socket would
> > not be found anyway by some incoming packets.
> >
> > I think that rehash is tricky to implement if you expect that all
> > incoming packets must find the socket, wherever it is located.
>
> I thought about something like the checks done for reading
> multi-word counters.
> Something like requiring the updater to increment a count on
> entry and exit and have the reader rescan with the lock held
> if the count is odd or changes.
>

Can you describe what user space operation is done by your precious application,
triggering a rehash in the first place ?

Maybe we can think of something less disruptive in the kernel.
(For instance, you could have a second socket, insert it in the new bucket,
then remove the old socket)

> The problem is that a single 'port unreachable' can be treated
> as a fatal error by the receiving application.
> So you really don't want to be sending them.

Well, if your application needs to run with old kernels, and or
transient netfilter changes (some firewall setups do not use
iptables-restore)
better be more resilient to transient ICMP messages anyway.


>
> >
> >
> > >  net/ipv4/udp.c | 14 ++++++++++++++
> > >  1 file changed, 14 insertions(+)
> > >
> > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > index ad64d6c4cd99..ed92ba7610b0 100644
> > > --- a/net/ipv4/udp.c
> > > +++ b/net/ipv4/udp.c
> > > @@ -443,6 +443,7 @@ static struct sock *udp4_lib_lookup2(struct net *net,
> > >                                      struct sk_buff *skb)
> > >  {
> > >         unsigned int hash2, slot2;
> > > +       unsigned int hash2_rescan;
> > >         struct udp_hslot *hslot2;
> > >         struct sock *sk, *result;
> > >         int score, badness;
> > > @@ -451,9 +452,12 @@ static struct sock *udp4_lib_lookup2(struct net *net,
> > >         slot2 = hash2 & udptable->mask;
> > >         hslot2 = &udptable->hash2[slot2];
> > >
> > > +rescan:
> > > +       hash2_rescan = hash2;
> > >         result = NULL;
> > >         badness = 0;
> > >         udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
> > > +               hash2_rescan = udp_sk(sk)->udp_portaddr_hash;
> > >                 score = compute_score(sk, net, saddr, sport,
> > >                                       daddr, hnum, dif, sdif);
> > >                 if (score > badness) {
> > > @@ -467,6 +471,16 @@ static struct sock *udp4_lib_lookup2(struct net *net,
> > >                         badness = score;
> > >                 }
> > >         }
> > > +
> > > +       /* udp sockets can get moved to a different hash chain.
> > > +        * If the chains have got crossed then rescan.
> > > +        */
> > > +       if ((hash2_rescan & udptable->mask) != slot2) {
> >
> > This is only going to catch one of the possible cases.
> >
> > If we really want to add extra checks in this fast path, we would have
> > to check all found sockets,
> > not only the last one.
>
> I did think about that.
> Being hit by a single rehash is very unlikely.
> Being hit by two that also put you back onto the original
> chain really isn't going to happen.
>
> Putting the check inside the loop would save the test when the
> hash list is empty - probably common for the first lookup.
>
> The code in compute_score() is pretty horrid so maybe it
> wouldn't really be noticeable.
>
>         David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
David Laight July 26, 2023, 2:39 p.m. UTC | #5
From: Eric Dumazet
> Sent: 26 July 2023 15:22
...
> Can you describe what user space operation is done by your precious application,
> triggering a rehash in the first place ?

We've no idea what is causing the rehash.
There are a lot of sockets that are receiving RTP audio.
But they are only created, bound and then deleted.

The 'best guess' is something to do with ipsec tunnels
being created, deleted or rehashed.

> 
> Maybe we can think of something less disruptive in the kernel.
> (For instance, you could have a second socket, insert it in the new bucket,
> then remove the old socket)
> 
> > The problem is that a single 'port unreachable' can be treated
> > as a fatal error by the receiving application.
> > So you really don't want to be sending them.
> 
> Well, if your application needs to run with old kernels, and or
> transient netfilter changes (some firewall setups do not use
> iptables-restore)
> better be more resilient to transient ICMP messages anyway.

This is being done for the specific pair of sockets that caused grief.
For this setup they were on 127.0.0.1 but that isn't always true.
But they would be expected to be on a local network.

Reading between the lines of the comment in ipv4/icmp.c
it is reasonable to assume that ICMP_PORT_UNREACH be treated
as a fatal error (ie not a transient one).
So really the Linux kernel ought to try quite hard to not
generate them when the port exists.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Eric Dumazet July 26, 2023, 2:48 p.m. UTC | #6
On Wed, Jul 26, 2023 at 4:39 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Eric Dumazet
> > Sent: 26 July 2023 15:22
> ...
> > Can you describe what user space operation is done by your precious application,
> > triggering a rehash in the first place ?
>
> We've no idea what is causing the rehash.
> There are a lot of sockets that are receiving RTP audio.
> But they are only created, bound and then deleted.
>
> The 'best guess' is something to do with ipsec tunnels
> being created, deleted or rehashed.
>
> >
> > Maybe we can think of something less disruptive in the kernel.
> > (For instance, you could have a second socket, insert it in the new bucket,
> > then remove the old socket)
> >
> > > The problem is that a single 'port unreachable' can be treated
> > > as a fatal error by the receiving application.
> > > So you really don't want to be sending them.
> >
> > Well, if your application needs to run with old kernels, and or
> > transient netfilter changes (some firewall setups do not use
> > iptables-restore)
> > better be more resilient to transient ICMP messages anyway.
>
> This is being done for the specific pair of sockets that caused grief.
> For this setup they were on 127.0.0.1 but that isn't always true.
> But they would be expected to be on a local network.
>
> Reading between the lines of the comment in ipv4/icmp.c
> it is reasonable to assume that ICMP_PORT_UNREACH be treated
> as a fatal error (ie not a transient one).
> So really the Linux kernel ought to try quite hard to not
> generate them when the port exists.

Sure, then please add the synchronize_rcu() call, because it won't affect you.

You could add a probe to try to identify what is causing a rehash.

perf probe -a udp_lib_rehash
perf record -a -g -e probe:udp_lib_rehash sleep 60
...
perf script
diff mbox series

Patch

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ad64d6c4cd99..ed92ba7610b0 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -443,6 +443,7 @@  static struct sock *udp4_lib_lookup2(struct net *net,
 				     struct sk_buff *skb)
 {
 	unsigned int hash2, slot2;
+	unsigned int hash2_rescan;
 	struct udp_hslot *hslot2;
 	struct sock *sk, *result;
 	int score, badness;
@@ -451,9 +452,12 @@  static struct sock *udp4_lib_lookup2(struct net *net,
 	slot2 = hash2 & udptable->mask;
 	hslot2 = &udptable->hash2[slot2];
 
+rescan:
+	hash2_rescan = hash2;
 	result = NULL;
 	badness = 0;
 	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
+		hash2_rescan = udp_sk(sk)->udp_portaddr_hash;
 		score = compute_score(sk, net, saddr, sport,
 				      daddr, hnum, dif, sdif);
 		if (score > badness) {
@@ -467,6 +471,16 @@  static struct sock *udp4_lib_lookup2(struct net *net,
 			badness = score;
 		}
 	}
+
+	/* udp sockets can get moved to a different hash chain.
+	 * If the chains have got crossed then rescan.
+	 */
+	if ((hash2_rescan & udptable->mask) != slot2) {
+		/* Ensure hslot2->head is reread */
+		barrier();
+		goto rescan;
+	}
+
 	return result;
 }