Message ID | 5eb8631d430248999116ce8ced13e4b2@AcuMS.aculab.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | udp: rescan hash2 list if chains crossed. | expand |
On Wed, 2023-07-26 at 12:05 +0000, David Laight wrote: > Pass the udptable address into udp4_lib_lookup2() instead of the hash slot. > > While ipv4_portaddr_hash(net, IP_ADDR_ANY, 0) is constant for each net > (the port is an xor) the value isn't saved. > Since the hash function doesn't get simplified when passed zero the hash Are you sure? could you please objdump and compare the binary code generated before and after the patch? In theory all the callers up to __jhash_final() included should be inlined, and the compiler should be able to optimze at least rol32(0, <n>). Cheers, Paolo
From: Paolo Abeni > Sent: 26 July 2023 14:44 > > On Wed, 2023-07-26 at 12:05 +0000, David Laight wrote: > > Pass the udptable address into udp4_lib_lookup2() instead of the hash slot. > > > > While ipv4_portaddr_hash(net, IP_ADDR_ANY, 0) is constant for each net > > (the port is an xor) the value isn't saved. > > Since the hash function doesn't get simplified when passed zero the hash > > Are you sure? could you please objdump and compare the binary code > generated before and after the patch? In theory all the callers up to > __jhash_final() included should be inlined, and the compiler should be > able to optimze at least rol32(0, <n>). I looked the hash is 20+ instructions and pretty much all of them appeared twice. (I'm in the wrong building to have a buildable kernel tree.) It has to be said that ipv4_portaddr_hash(net, IPADDR_ANY, port) could just be net_hash_mix(net) ^ port. (Or maybe you could use a different random value.) I'm not even sure the relatively expensive mixing of 'saddr' is needed - it is one of the local IPv4 addresses. Mixing in the remote address for connected sockets might be useful for any server code that uses a lot of connected udp sockets - but that isn't done. We will have hundreds of udp sockets with different ports that are not connected (I don't know if they get bound to a local address). So a remote address hash wouldn't help. If you look at the generated code for __udp4_lib_lookup() it is actually quite horrid. Too many called functions with too many parameters. Things spill to the stack all the time. The reuse_port code made it a whole lot worse. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Wed, Jul 26, 2023 at 4:02 PM David Laight <David.Laight@aculab.com> wrote: > > From: Paolo Abeni > > Sent: 26 July 2023 14:44 > > > > On Wed, 2023-07-26 at 12:05 +0000, David Laight wrote: > > > Pass the udptable address into udp4_lib_lookup2() instead of the hash slot. > > > > > > While ipv4_portaddr_hash(net, IP_ADDR_ANY, 0) is constant for each net > > > (the port is an xor) the value isn't saved. > > > Since the hash function doesn't get simplified when passed zero the hash > > > > Are you sure? could you please objdump and compare the binary code > > generated before and after the patch? In theory all the callers up to > > __jhash_final() included should be inlined, and the compiler should be > > able to optimze at least rol32(0, <n>). > > I looked the hash is 20+ instructions and pretty much all of > them appeared twice. > (I'm in the wrong building to have a buildable kernel tree.) > > It has to be said that ipv4_portaddr_hash(net, IPADDR_ANY, port) > could just be net_hash_mix(net) ^ port. > (Or maybe you could use a different random value.) > > I'm not even sure the relatively expensive mixing of 'saddr' > is needed - it is one of the local IPv4 addresses. > Mixing in the remote address for connected sockets might > be useful for any server code that uses a lot of connected > udp sockets - but that isn't done. > > We will have hundreds of udp sockets with different ports that > are not connected (I don't know if they get bound to a local > address). So a remote address hash wouldn't help. > > If you look at the generated code for __udp4_lib_lookup() > it is actually quite horrid. > Too many called functions with too many parameters. > Things spill to the stack all the time. > > The reuse_port code made it a whole lot worse. > Maybe ... Please show us performance numbers. If less than 1%, I would not bother changing this code, making future backports more risky.
From: Eric Dumazet > Sent: 26 July 2023 15:04 > > On Wed, Jul 26, 2023 at 4:02 PM David Laight <David.Laight@aculab.com> wrote: > > > > From: Paolo Abeni > > > Sent: 26 July 2023 14:44 > > > > > > On Wed, 2023-07-26 at 12:05 +0000, David Laight wrote: > > > > Pass the udptable address into udp4_lib_lookup2() instead of the hash slot. > > > > > > > > While ipv4_portaddr_hash(net, IP_ADDR_ANY, 0) is constant for each net > > > > (the port is an xor) the value isn't saved. > > > > Since the hash function doesn't get simplified when passed zero the hash > > > > > > Are you sure? could you please objdump and compare the binary code > > > generated before and after the patch? In theory all the callers up to > > > __jhash_final() included should be inlined, and the compiler should be > > > able to optimze at least rol32(0, <n>). > > > > I looked the hash is 20+ instructions and pretty much all of > > them appeared twice. > > (I'm in the wrong building to have a buildable kernel tree.) > > > > It has to be said that ipv4_portaddr_hash(net, IPADDR_ANY, port) > > could just be net_hash_mix(net) ^ port. > > (Or maybe you could use a different random value.) > > > > I'm not even sure the relatively expensive mixing of 'saddr' > > is needed - it is one of the local IPv4 addresses. > > Mixing in the remote address for connected sockets might > > be useful for any server code that uses a lot of connected > > udp sockets - but that isn't done. > > > > We will have hundreds of udp sockets with different ports that > > are not connected (I don't know if they get bound to a local > > address). So a remote address hash wouldn't help. > > > > If you look at the generated code for __udp4_lib_lookup() > > it is actually quite horrid. > > Too many called functions with too many parameters. > > Things spill to the stack all the time. > > > > The reuse_port code made it a whole lot worse. > > > > Maybe ... Please show us performance numbers. > > If less than 1%, I would not bother changing this code, making future > backports more risky. I didn't :-) David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
From: Paolo Abeni > Sent: 26 July 2023 14:44 > > On Wed, 2023-07-26 at 12:05 +0000, David Laight wrote: > > Pass the udptable address into udp4_lib_lookup2() instead of the hash slot. > > > > While ipv4_portaddr_hash(net, IP_ADDR_ANY, 0) is constant for each net > > (the port is an xor) the value isn't saved. > > Since the hash function doesn't get simplified when passed zero the hash > > Are you sure? could you please objdump and compare the binary code > generated before and after the patch? In theory all the callers up to > __jhash_final() included should be inlined, and the compiler should be > able to optimze at least rol32(0, <n>). From an oldish build I have lurking: 0000000000003960 <__udp4_lib_lookup>: 3960: e8 00 00 00 00 callq 3965 <__udp4_lib_lookup+0x5> 3961: R_X86_64_PC32 __fentry__-0x4 3965: 48 83 ec 48 sub $0x48,%rsp 3969: 66 41 c1 c0 08 rol $0x8,%r8w 396e: 48 89 5c 24 18 mov %rbx,0x18(%rsp) 3973: 48 89 6c 24 20 mov %rbp,0x20(%rsp) 3978: 89 f5 mov %esi,%ebp 397a: 4c 89 64 24 28 mov %r12,0x28(%rsp) 397f: 4c 89 6c 24 30 mov %r13,0x30(%rsp) 3984: 44 0f b7 e2 movzwl %dx,%r12d 3988: 4c 89 74 24 38 mov %r14,0x38(%rsp) 398d: 4c 89 7c 24 40 mov %r15,0x40(%rsp) 3992: 49 89 fe mov %rdi,%r14 3995: 8b bf 40 01 00 00 mov 0x140(%rdi),%edi 399b: 45 0f b7 f8 movzwl %r8w,%r15d 399f: 48 8b 5c 24 58 mov 0x58(%rsp),%rbx 39a4: 48 8b 54 24 60 mov 0x60(%rsp),%rdx 39a9: 45 89 cd mov %r9d,%r13d 39ac: 81 ef 0d 41 52 21 sub $0x2152410d,%edi 39b2: 89 fe mov %edi,%esi 39b4: 8d 04 39 lea (%rcx,%rdi,1),%eax 39b7: 48 89 54 24 10 mov %rdx,0x10(%rsp) 39bc: c1 c6 0e rol $0xe,%esi 39bf: 44 89 e2 mov %r12d,%edx 39c2: f7 de neg %esi 39c4: 41 89 f0 mov %esi,%r8d 39c7: 31 f0 xor %esi,%eax 39c9: 41 c1 c0 0b rol $0xb,%r8d 39cd: 44 29 c0 sub %r8d,%eax 39d0: 41 89 c0 mov %eax,%r8d 39d3: 31 c7 xor %eax,%edi 39d5: 41 c1 c0 19 rol $0x19,%r8d 39d9: 44 29 c7 sub %r8d,%edi 39dc: 41 89 f8 mov %edi,%r8d 39df: 31 fe xor %edi,%esi 39e1: 41 c1 c0 10 rol $0x10,%r8d 39e5: 44 29 c6 sub %r8d,%esi 39e8: 41 89 f0 mov %esi,%r8d 39eb: 31 f0 xor %esi,%eax 39ed: 41 c1 c0 04 rol $0x4,%r8d 39f1: 44 29 c0 sub %r8d,%eax 39f4: 45 89 f8 mov %r15d,%r8d 39f7: 31 c7 xor %eax,%edi 39f9: c1 c0 0e rol $0xe,%eax 39fc: 29 c7 sub %eax,%edi 39fe: 89 f8 mov %edi,%eax 3a00: c1 c7 18 rol $0x18,%edi 3a03: 31 f0 xor %esi,%eax 3a05: 89 ee mov %ebp,%esi 3a07: 29 f8 sub %edi,%eax 3a09: 4c 89 f7 mov %r14,%rdi 3a0c: 44 31 f8 xor %r15d,%eax 3a0f: 23 43 10 and 0x10(%rbx),%eax 3a12: 48 c1 e0 04 shl $0x4,%rax 3a16: 48 03 43 08 add 0x8(%rbx),%rax 3a1a: 48 89 44 24 08 mov %rax,0x8(%rsp) 3a1f: 8b 44 24 50 mov 0x50(%rsp),%eax 3a23: 89 04 24 mov %eax,(%rsp) 3a26: e8 05 f7 ff ff callq 3130 <udp4_lib_lookup2> 3a2b: 48 85 c0 test %rax,%rax 3a2e: 74 32 je 3a62 <__udp4_lib_lookup+0x102> 3a30: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax 3a36: ba 00 00 00 00 mov $0x0,%edx 3a3b: 48 8b 5c 24 18 mov 0x18(%rsp),%rbx 3a40: 48 0f 43 c2 cmovae %rdx,%rax 3a44: 48 8b 6c 24 20 mov 0x20(%rsp),%rbp 3a49: 4c 8b 64 24 28 mov 0x28(%rsp),%r12 3a4e: 4c 8b 6c 24 30 mov 0x30(%rsp),%r13 3a53: 4c 8b 74 24 38 mov 0x38(%rsp),%r14 3a58: 4c 8b 7c 24 40 mov 0x40(%rsp),%r15 3a5d: 48 83 c4 48 add $0x48,%rsp 3a61: c3 retq 3a62: 41 8b 96 40 01 00 00 mov 0x140(%r14),%edx 3a69: 45 89 e9 mov %r13d,%r9d 3a6c: 45 89 f8 mov %r15d,%r8d 3a6f: 4c 89 f7 mov %r14,%rdi 3a72: 81 ea 0d 41 52 21 sub $0x2152410d,%edx 3a78: 89 d0 mov %edx,%eax 3a7a: c1 c0 0e rol $0xe,%eax 3a7d: f7 d8 neg %eax 3a7f: 89 c1 mov %eax,%ecx 3a81: 89 c6 mov %eax,%esi 3a83: 31 d1 xor %edx,%ecx 3a85: c1 c6 0b rol $0xb,%esi 3a88: 29 f1 sub %esi,%ecx 3a8a: 89 ce mov %ecx,%esi 3a8c: 31 ca xor %ecx,%edx 3a8e: c1 c6 19 rol $0x19,%esi 3a91: 29 f2 sub %esi,%edx 3a93: 89 d6 mov %edx,%esi 3a95: 31 d0 xor %edx,%eax 3a97: c1 c6 10 rol $0x10,%esi 3a9a: 29 f0 sub %esi,%eax 3a9c: 89 c6 mov %eax,%esi 3a9e: 31 c1 xor %eax,%ecx 3aa0: c1 c6 04 rol $0x4,%esi 3aa3: 29 f1 sub %esi,%ecx 3aa5: 89 ee mov %ebp,%esi 3aa7: 31 ca xor %ecx,%edx 3aa9: c1 c1 0e rol $0xe,%ecx 3aac: 29 ca sub %ecx,%edx 3aae: 48 8b 4c 24 60 mov 0x60(%rsp),%rcx 3ab3: 31 d0 xor %edx,%eax 3ab5: c1 c2 18 rol $0x18,%edx 3ab8: 29 d0 sub %edx,%eax 3aba: 44 89 e2 mov %r12d,%edx 3abd: 48 89 4c 24 10 mov %rcx,0x10(%rsp) 3ac2: 44 31 f8 xor %r15d,%eax 3ac5: 23 43 10 and 0x10(%rbx),%eax 3ac8: 31 c9 xor %ecx,%ecx 3aca: 48 c1 e0 04 shl $0x4,%rax 3ace: 48 03 43 08 add 0x8(%rbx),%rax 3ad2: 48 89 44 24 08 mov %rax,0x8(%rsp) 3ad7: 8b 44 24 50 mov 0x50(%rsp),%eax 3adb: 89 04 24 mov %eax,(%rsp) 3ade: e8 4d f6 ff ff callq 3130 <udp4_lib_lookup2> 3ae3: e9 48 ff ff ff jmpq 3a30 <__udp4_lib_lookup+0xd0> The two copies of the hash are pretty much the same. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 42a96b3547c9..ad64d6c4cd99 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -439,12 +439,18 @@ static struct sock *udp4_lib_lookup2(struct net *net, __be32 saddr, __be16 sport, __be32 daddr, unsigned int hnum, int dif, int sdif, - struct udp_hslot *hslot2, + struct udp_table *udptable, struct sk_buff *skb) { + unsigned int hash2, slot2; + struct udp_hslot *hslot2; struct sock *sk, *result; int score, badness; + hash2 = ipv4_portaddr_hash(net, daddr, hnum); + slot2 = hash2 & udptable->mask; + hslot2 = &udptable->hash2[slot2]; + result = NULL; badness = 0; udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) { @@ -495,18 +501,12 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr, int sdif, struct udp_table *udptable, struct sk_buff *skb) { unsigned short hnum = ntohs(dport); - unsigned int hash2, slot2; - struct udp_hslot *hslot2; struct sock *result, *sk; - hash2 = ipv4_portaddr_hash(net, daddr, hnum); - slot2 = hash2 & udptable->mask; - hslot2 = &udptable->hash2[slot2]; - /* Lookup connected or non-wildcard socket */ result = udp4_lib_lookup2(net, saddr, sport, daddr, hnum, dif, sdif, - hslot2, skb); + udptable, skb); if (!IS_ERR_OR_NULL(result) && result->sk_state == TCP_ESTABLISHED) goto done; @@ -525,13 +525,9 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr, goto done; /* Lookup wildcard sockets */ - hash2 = ipv4_portaddr_hash(net, htonl(INADDR_ANY), hnum); - slot2 = hash2 & udptable->mask; - hslot2 = &udptable->hash2[slot2]; - result = udp4_lib_lookup2(net, saddr, sport, htonl(INADDR_ANY), hnum, dif, sdif, - hslot2, skb); + udptable, skb); done: if (IS_ERR(result)) return NULL;
Pass the udptable address into udp4_lib_lookup2() instead of the hash slot. While ipv4_portaddr_hash(net, IP_ADDR_ANY, 0) is constant for each net (the port is an xor) the value isn't saved. Since the hash function doesn't get simplified when passed zero the hash might as well be computed inside udp4_lib_lookup2(). This reduces the cache footprint and allows additional checks. Signed-off-by: David Laight <david.laight@aculab.com> --- net/ipv4/udp.c | 22 +++++++++------------- 1 file changed, 9 insertions(+), 13 deletions(-)