[RFC,net-next] net/udp: Add 4-tuple hash for connected socket

Message ID 20240913100941.8565-1-lulie@linux.alibaba.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers

Checks

Context                        Check    Description
netdev/series_format           success  Single patches do not need cover letters
netdev/tree_selection          success  Clearly marked for net-next
netdev/ynl                     success  Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present           success  Fixes tag not required for -next series
netdev/header_inline           success  No static functions without inline keyword in header files
netdev/build_32bit             success  Errors and warnings before: 26 this patch: 26
netdev/build_tools             success  Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers          success  CCed 9 of 9 maintainers
netdev/build_clang             success  Errors and warnings before: 41 this patch: 41
netdev/verify_signedoff        success  Signed-off-by tag matches author and committer
netdev/deprecated_api          success  None detected
netdev/check_selftest          success  No net selftest shell script
netdev/verify_fixes            success  No Fixes tag
netdev/build_allmodconfig_warn success  Errors and warnings before: 1967 this patch: 1967
netdev/checkpatch              warning  WARNING: line length of 83 exceeds 80 columns
                                        WARNING: line length of 86 exceeds 80 columns
                                        WARNING: line length of 87 exceeds 80 columns
                                        WARNING: line length of 92 exceeds 80 columns
                                        WARNING: line length of 95 exceeds 80 columns
netdev/build_clang_rust        success  No Rust files in patch. Skipping build
netdev/kdoc                    success  Errors and warnings before: 4 this patch: 4
netdev/source_inline           success  Was 0 now: 0

Commit Message

Philo Lu Sept. 13, 2024, 10:09 a.m. UTC
This RFC patch introduces a 4-tuple hash for connected UDP sockets to
make UDP lookup faster. It is a tentative proposal, and any comments are
welcome.

Currently, the udp_table has two hash tables, the port hash and the
portaddr hash. But for a UDP server, all sockets have the same local port
and addr, so they all fall into the same hash slot within a reuseport
group, and the target sock is selected by scoring.

In some applications, the UDP server calls connect() for each incoming
client, and the socket (fd) is then used exclusively by that client. In
such scenarios, the current scoring method can be inefficient with a
large number of connections, resulting in high softirq overhead.
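
To illustrate the effect described above, here is a small user-space model,
not kernel code. It counts the longest hash chain when many connected
sockets share one local (addr, port). The table size, the mix32() mixer
and the toy_* names are assumptions made for the illustration; the kernel
uses its own hash functions and table sizing.

```c
#include <stdint.h>

#define NSLOTS 1024u	/* illustrative table size, power of two */

/* Illustrative 32-bit mixer (murmur3 finalizer), not the kernel's hash. */
uint32_t mix32(uint32_t x)
{
	x ^= x >> 16;
	x *= 0x85ebca6bu;
	x ^= x >> 13;
	x *= 0xc2b2ae35u;
	x ^= x >> 16;
	return x;
}

struct tuple4 {
	uint32_t laddr, raddr;
	uint16_t lport, rport;
};

/* hash2 analogue: keyed on (local addr, local port) only. */
uint32_t toy_hash2(const struct tuple4 *t)
{
	return mix32(t->laddr ^ mix32(t->lport)) & (NSLOTS - 1);
}

/* hash4 analogue: keyed on all four fields. */
uint32_t toy_hash4(const struct tuple4 *t)
{
	uint32_t h = mix32(t->laddr ^ mix32(t->lport));

	h = mix32(h ^ t->raddr);
	h = mix32(h ^ t->rport);
	return h & (NSLOTS - 1);
}

/* Longest chain when n peers are connected to one local (addr, port). */
unsigned int longest_chain(unsigned int n, int use_hash4)
{
	unsigned int chain[NSLOTS] = { 0 };
	unsigned int i, max = 0;

	for (i = 0; i < n; i++) {
		struct tuple4 t = {
			.laddr = 0x0a000001u,		/* 10.0.0.1 */
			.lport = 443,
			.raddr = 0xc0a80000u + i,	/* one distinct peer per socket */
			.rport = (uint16_t)(1024 + i % 60000),
		};
		uint32_t slot = use_hash4 ? toy_hash4(&t) : toy_hash2(&t);

		if (++chain[slot] > max)
			max = chain[slot];
	}
	return max;
}
```

With the 2-tuple key every socket lands in the same slot, so the chain
length equals the number of connections; with the 4-tuple key the sockets
spread across the table.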

To solve the problem, a 4-tuple hash list is added to udp_table and is
updated when connect() is called. __udp4_lib_lookup() then searches the
4-tuple hash list first and returns directly on success. A new sockopt,
UDP_HASH4, is added to enable it. So the usage is:
1. socket()
2. bind()
3. setsockopt(UDP_HASH4)
4. connect()
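
The four steps above might look like this from user space. This is a
sketch under stated assumptions: UDP_HASH4 is the value proposed by this
RFC (105) and is not defined in mainline headers, the option is assumed to
take an int flag at level IPPROTO_UDP like other boolean UDP options, and
udp_connect_hash4() is a hypothetical helper name.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef UDP_HASH4
#define UDP_HASH4 105	/* value proposed by this RFC, not in mainline headers */
#endif

/*
 * Hypothetical helper: returns a connected UDP fd, or -1 on error.
 * *hash4_on is set to 1 if the kernel accepted UDP_HASH4, else 0.
 */
int udp_connect_hash4(const char *local_ip, uint16_t local_port,
		      const char *peer_ip, uint16_t peer_port,
		      int *hash4_on)
{
	struct sockaddr_in local, peer;
	int one = 1;
	int fd;

	memset(&local, 0, sizeof(local));
	local.sin_family = AF_INET;
	local.sin_port = htons(local_port);
	memset(&peer, 0, sizeof(peer));
	peer.sin_family = AF_INET;
	peer.sin_port = htons(peer_port);
	if (inet_pton(AF_INET, local_ip, &local.sin_addr) != 1 ||
	    inet_pton(AF_INET, peer_ip, &peer.sin_addr) != 1)
		return -1;

	fd = socket(AF_INET, SOCK_DGRAM, 0);		/* 1. socket() */
	if (fd < 0)
		return -1;

	/* 2. bind() */
	if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0)
		goto err;

	/* 3. setsockopt(UDP_HASH4); a kernel without the patch rejects it. */
	*hash4_on = setsockopt(fd, IPPROTO_UDP, UDP_HASH4,
			       &one, sizeof(one)) == 0;

	/* 4. connect() */
	if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0)
		goto err;
	return fd;
err:
	close(fd);
	return -1;
}
```

Note that udp4_hash4() in the patch returns early when inet_rcv_saddr is
INADDR_ANY, so the socket should be bound to a specific local address for
the 4-tuple hash to take effect.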

AFAICT the patch (if useful) can be further improved by:
(a) Supporting disabling via sockopt UDP_HASH4. Currently it cannot be
disabled once turned on until the socket is closed.
(b) Interacting better with hash2/reuseport. Currently hash4 hardly
affects other mechanisms, but maintaining sockets in both the hash4 and
hash2 lists seems unnecessary.
(c) Supporting early demux and IPv6.

Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
---
 include/linux/udp.h      |   8 +++
 include/net/udp.h        |  17 +++++-
 include/uapi/linux/udp.h |   1 +
 net/ipv4/udp.c           | 127 +++++++++++++++++++++++++++++++++++++--
 net/ipv6/udp.c           |   2 +-
 5 files changed, 147 insertions(+), 8 deletions(-)

Comments

Eric Dumazet Sept. 13, 2024, 11:49 a.m. UTC | #1
On Fri, Sep 13, 2024 at 12:09 PM Philo Lu <lulie@linux.alibaba.com> wrote:
> [...]

Adding a 4-tuple hash for UDP has been discussed in the past.

Main issue is that this is adding one cache line miss per incoming packet.

Most heavy duty UDP servers (DNS, QUIC), use non connected sockets,
because having one million UDP sockets has huge kernel memory cost,
not counting poor cache locality.
Dust Li Sept. 13, 2024, 2:21 p.m. UTC | #2
On 2024-09-13 13:49:03, Eric Dumazet wrote:
>> [...]
>
>Adding a 4-tuple hash for UDP has been discussed in the past.

Thanks for the information! We didn't know the history.

>
>Main issue is that this is adding one cache line miss per incoming packet.

What about adding something like a refcnt in 'struct udp_hslot'?
If someone enables uhash4 on the port, we increase the refcnt.
Then we can check whether that port has uhash4 enabled. If the refcnt is
zero, we can just bypass the uhash4 lookup and go to the current
udp4_lib_lookup2().

That should avoid the cache line miss per incoming packet?
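
A user-space sketch of this gating idea, to make the control flow
concrete. The struct layout, the hash4_users field and the toy_* names are
illustrative assumptions, not the kernel's definitions; the stand-in
4-tuple lookup always misses and only counts how often it is invoked.

```c
#include <stdbool.h>
#include <stddef.h>

/* User-space stand-in for a hash slot; field names are illustrative. */
struct toy_hslot2 {
	int count;			/* sockets hashed on (local port, local addr) */
	unsigned int hash4_users;	/* how many of them enabled the 4-tuple hash */
};

unsigned int hash4_probes;	/* how often the 4-tuple table was touched */

/* Stand-in for the 4-tuple lookup; always misses in this sketch. */
const void *toy_lookup4(void)
{
	hash4_probes++;
	return NULL;
}

/* Returns true when the packet was resolved by the 4-tuple table. */
bool toy_lookup(const struct toy_hslot2 *hslot2)
{
	/* Only probe the 4-tuple table when someone on this slot uses it. */
	if (hslot2->hash4_users && toy_lookup4())
		return true;
	/* ... otherwise fall through to the existing scoring lookup ... */
	return false;
}
```

Servers that never enable the feature keep a zero counter, so their
packets skip the extra table entirely.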

Best regards,
Dust
Eric Dumazet Sept. 13, 2024, 2:39 p.m. UTC | #3
On Fri, Sep 13, 2024 at 4:22 PM Dust Li <dust.li@linux.alibaba.com> wrote:
>
> On 2024-09-13 13:49:03, Eric Dumazet wrote:
> >> [...]
> >
> >Adding a 4-tuple hash for UDP has been discussed in the past.
>
> Thanks for the information! We didn't know the history.
>
> >
> >Main issue is that this is adding one cache line miss per incoming packet.
>
> What about adding something like a refcnt in 'struct udp_hslot'?
> If someone enables uhash4 on the port, we increase the refcnt.
> Then we can check whether that port has uhash4 enabled. If the refcnt is
> zero, we can just bypass the uhash4 lookup and go to the current
> udp4_lib_lookup2().
>

Reading anything (thus a refcnt) in 'struct udp_hslot' will need the
same cache line miss.

Note that udp_hslot already has a 'count' field
Dust Li Sept. 13, 2024, 3:06 p.m. UTC | #4
On 2024-09-13 16:39:33, Eric Dumazet wrote:
>On Fri, Sep 13, 2024 at 4:22 PM Dust Li <dust.li@linux.alibaba.com> wrote:
>> [...]
>
>Reading anything (thus a refcnt) in 'struct udp_hslot' will need the
>same cache line miss.

hslot2->head in 'struct udp_hslot' will be read right away in
udp4_lib_lookup2() in any case, just a few instructions later
(about 20). So I think the cache miss should not be a problem
in this case.

>
>Note that udp_hslot already has a 'count' field

Yes, but that's for uhash/uhash2. I'm thinking of adding something
to indicate that uhash4 was enabled on this port. So we can avoid
the extra memory footprint on some cold memory. Maybe 'struct udp_hslot'
is not a good place.

Best regards,
Dust
Eric Dumazet Sept. 13, 2024, 3:39 p.m. UTC | #5
On Fri, Sep 13, 2024 at 5:07 PM Dust Li <dust.li@linux.alibaba.com> wrote:
>
> On 2024-09-13 16:39:33, Eric Dumazet wrote:
> >> [...]
> >
> >Reading anything (thus a refcnt) in 'struct udp_hslot' will need the
> >same cache line miss.
>
> hslot2->head in 'struct udp_hslot' will be read right away in
> udp4_lib_lookup2() in any case, just a few instructions later
> (about 20). So I think the cache miss should not be a problem
> in this case.

I guess this could work.
Philo Lu Sept. 23, 2024, 8:40 a.m. UTC | #6
Hi Eric, sorry for the late response.

On 2024/9/13 19:49, Eric Dumazet wrote:
>> [...]
>
> Adding a 4-tuple hash for UDP has been discussed in the past.
> 
> Main issue is that this is adding one cache line miss per incoming packet.
> 

Thanks to Dust's idea, we can create a new field for hslot2 (or create a 
new struct for hslot2), indicating whether there are connected sockets 
in this hslot (i.e., local port and local address), and run hash4 lookup 
only when it's true. Then there would be no cache line miss.

The detailed patch is attached below.

> Most heavy duty UDP servers (DNS, QUIC), use non connected sockets,
> because having one million UDP sockets has huge kernel memory cost,
> not counting poor cache locality.

Some of our applications do use connected UDP sockets (~10,000 conns), 
and will get significant benefits from hash4. We use connect() to 
separate receiving sockets and listening ones, and then it's easier to 
manage them (just like TCP), especially during live-upgrading, such as 
nginx reload. Besides, I believe hash4 is harmless to those servers 
without connected sockets.

Suggestions are always welcome, and I'll keep improving this patch.

Thanks.

---
  include/net/udp.h |  3 +++
  net/ipv4/udp.c    | 17 ++++++++++++-----
  2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index a05d79d35fbba..bec04c0e753d0 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -54,11 +54,14 @@ struct udp_skb_cb {
   *
   *	@head:	head of list of sockets
   *	@count:	number of sockets in 'head' list
+ *	@hash4_cnt: number of sockets in 'hash4' table of the same (local port, local address),
+ *		    Only used by hash2.
   *	@lock:	spinlock protecting changes to head/count
   */
  struct udp_hslot {
  	struct hlist_head	head;
  	int			count;
+	u32			hash4_cnt;
  	spinlock_t		lock;
  } __attribute__((aligned(2 * sizeof(long))));

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index aac0251ff6fac..dfa8b3c091def 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -511,14 +511,16 @@ struct sock *__udp4_lib_lookup(const struct net *net, __be32 saddr,
  	struct udp_hslot *hslot2;
  	struct sock *result, *sk;

-	result = udp4_lib_lookup4(net, saddr, sport, daddr, hnum, dif, sdif, udptable);
-	if (result)
-		return result;
-
  	hash2 = ipv4_portaddr_hash(net, daddr, hnum);
  	slot2 = hash2 & udptable->mask;
  	hslot2 = &udptable->hash2[slot2];

+	if (hslot2->hash4_cnt) {
+		result = udp4_lib_lookup4(net, saddr, sport, daddr, hnum, dif, sdif, udptable);
+		if (result)
+			return result;
+	}
+
  	/* Lookup connected or non-wildcard socket */
  	result = udp4_lib_lookup2(net, saddr, sport,
  				  daddr, hnum, dif, sdif,
@@ -1961,7 +1963,7 @@ EXPORT_SYMBOL(udp_pre_connect);
  /* call with sock lock */
  static void udp4_hash4(struct sock *sk)
  {
-	struct udp_hslot *hslot, *hslot4;
+	struct udp_hslot *hslot, *hslot2, *hslot4;
  	struct net *net = sock_net(sk);
  	struct udp_table *udptable;
  	unsigned int hash;
@@ -1975,6 +1977,7 @@ static void udp4_hash4(struct sock *sk)

  	udptable = net->ipv4.udp_table;
  	hslot = udp_hashslot(udptable, net, udp_sk(sk)->udp_port_hash);
+	hslot2 = udp_hashslot2(udptable, udp_sk(sk)->udp_portaddr_hash);
  	hslot4 = udp_hashslot4(udptable, hash);
  	udp_sk(sk)->udp_lrpa_hash = hash;

@@ -1985,6 +1988,7 @@ static void udp4_hash4(struct sock *sk)
  	spin_lock(&hslot4->lock);
  	hlist_add_head_rcu(&udp_sk(sk)->udp_lrpa_node, &hslot4->head);
  	hslot4->count++;
+	hslot2->hash4_cnt++;
  	spin_unlock(&hslot4->lock);

  	spin_unlock_bh(&hslot->lock);
@@ -2068,6 +2072,7 @@ void udp_lib_unhash(struct sock *sk)
  				spin_lock(&hslot4->lock);
  				hlist_del_init_rcu(&udp_sk(sk)->udp_lrpa_node);
  				hslot4->count--;
+				hslot2->hash4_cnt--;
  				spin_unlock(&hslot4->lock);
  			}
  		}
@@ -2119,11 +2124,13 @@ void udp_lib_rehash(struct sock *sk, u16 newhash, u16 newhash4)
  				spin_lock(&hslot4->lock);
  				hlist_del_init_rcu(&udp_sk(sk)->udp_lrpa_node);
  				hslot4->count--;
+				hslot2->hash4_cnt--;
  				spin_unlock(&hslot4->lock);

  				spin_lock(&nhslot4->lock);
  				hlist_add_head_rcu(&udp_sk(sk)->udp_lrpa_node, &nhslot4->head);
  				nhslot4->count++;
+				nhslot2->hash4_cnt++;
  				spin_unlock(&nhslot4->lock);
  			}
Eric Dumazet Sept. 23, 2024, 9:19 a.m. UTC | #7
On Mon, Sep 23, 2024 at 10:40 AM Philo Lu <lulie@linux.alibaba.com> wrote:
>
> Hi Eric, sorry for the late response.
>
> On 2024/9/13 19:49, Eric Dumazet wrote:
> >> [...]
> >
> > Adding a 4-tuple hash for UDP has been discussed in the past.
> >
> > Main issue is that this is adding one cache line miss per incoming packet.
> >
>
> Thanks to Dust's idea, we can create a new field for hslot2 (or create a
> new struct for hslot2), indicating whether there are connected sockets
> in this hslot (i.e., local port and local address), and run hash4 lookup
> only when it's true. Then there would be no cache line miss.
>
> The detailed patch is attached below.
>
> > Most heavy duty UDP servers (DNS, QUIC), use non connected sockets,
> > because having one million UDP sockets has huge kernel memory cost,
> > not counting poor cache locality.
>
> Some of our applications do use connected UDP sockets (~10,000 conns),
> and will get significant benefits from hash4. We use connect() to
> separate receiving sockets and listening ones, and then it's easier to
> manage them (just like TCP), especially during live-upgrading, such as
> nginx reload. Besides, I believe hash4 is harmless to those servers
> without connected sockets.
>
> Suggestions are always welcome, and I'll keep improving this patch.
>
> Thanks.
>
> ---
>   include/net/udp.h |  3 +++
>   net/ipv4/udp.c    | 17 ++++++++++++-----
>   2 files changed, 15 insertions(+), 5 deletions(-)
>
> diff --git a/include/net/udp.h b/include/net/udp.h
> index a05d79d35fbba..bec04c0e753d0 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -54,11 +54,14 @@ struct udp_skb_cb {
>    *
>    *    @head:  head of list of sockets
>    *    @count: number of sockets in 'head' list
> + *     @hash4_cnt: number of sockets in 'hash4' table of the same (local port, local address),
> + *                 Only used by hash2.
>    *    @lock:  spinlock protecting changes to head/count
>    */
>   struct udp_hslot {
>         struct hlist_head       head;
>         int                     count;
> +       u32                     hash4_cnt;
>         spinlock_t              lock;
>   } __attribute__((aligned(2 * sizeof(long))));

This would double the size of this structure (from 16 to 32 bytes)

Perhaps add another 'struct udp_hslot_main' for the relevant hash table,
and keep the smaller 'struct udp_hslot' for the two others.

Current cumulative size of the hash tables is 2 MB.

Alternatively, we could move the locks out of the structure, since they
are not used in the fast path.
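
The size arithmetic can be checked with user-space stand-ins. This is a
sketch, not the kernel's definitions: it assumes a 64-bit build, a 4-byte
non-debug spinlock, and 65536 slots per table (a slot count chosen here
because it reproduces the 2 MB figure for the two existing tables). All
toy_* names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for kernel types; sizes assume a non-debug 4-byte spinlock. */
struct toy_hlist_head { void *first; };
typedef uint32_t toy_spinlock_t;

/* Current layout: 8 + 4 + 4 = 16 bytes on 64-bit, already 16-byte aligned. */
struct toy_hslot {
	struct toy_hlist_head head;
	int count;
	toy_spinlock_t lock;
} __attribute__((aligned(2 * sizeof(long))));

/* With the extra counter: 8 + 4 + 4 + 4 = 20 bytes, padded up to 32. */
struct toy_hslot_cnt {
	struct toy_hlist_head head;
	int count;
	uint32_t hash4_cnt;
	toy_spinlock_t lock;
} __attribute__((aligned(2 * sizeof(long))));

/* Total bytes for n_tables tables of n_slots slots each. */
size_t toy_tables_bytes(size_t slot_size, size_t n_slots, size_t n_tables)
{
	return slot_size * n_slots * n_tables;
}
```

Under these assumptions, toy_tables_bytes(16, 65536, 2) gives 2 MB for the
two existing tables. Keeping the small slot for two tables and using the
larger one only where the counter is needed limits the growth to one
table.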

Patch

diff --git a/include/linux/udp.h b/include/linux/udp.h
index 3eb3f2b9a2a0..c7b28e52fc49 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -42,6 +42,7 @@  enum {
 	UDP_FLAGS_ENCAP_ENABLED, /* This socket enabled encap */
 	UDP_FLAGS_UDPLITE_SEND_CC, /* set via udplite setsockopt */
 	UDP_FLAGS_UDPLITE_RECV_CC, /* set via udplite setsockopt */
+	UDP_FLAGS_HASH4_ENABLED, /* Use 4-tuple hash */
 };
 
 struct udp_sock {
@@ -56,6 +57,10 @@  struct udp_sock {
 	int		 pending;	/* Any pending frames ? */
 	__u8		 encap_type;	/* Is this an Encapsulation socket? */
 
+	/* For UDP 4-tuple hash */
+	__u16 udp_lrpa_hash;
+	struct hlist_node udp_lrpa_node;
+
 	/*
 	 * Following member retains the information to create a UDP header
 	 * when the socket is uncorked.
@@ -206,6 +211,9 @@  static inline void udp_allow_gso(struct sock *sk)
 #define udp_portaddr_for_each_entry_rcu(__sk, list) \
 	hlist_for_each_entry_rcu(__sk, list, __sk_common.skc_portaddr_node)
 
+#define udp_lrpa_for_each_entry_rcu(__up, list) \
+	hlist_for_each_entry_rcu(__up, list, udp_lrpa_node)
+
 #define IS_UDPLITE(__sk) (__sk->sk_protocol == IPPROTO_UDPLITE)
 
 #endif	/* _LINUX_UDP_H */
diff --git a/include/net/udp.h b/include/net/udp.h
index 61222545ab1c..a05d79d35fbb 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -67,12 +67,15 @@  struct udp_hslot {
  *
  *	@hash:	hash table, sockets are hashed on (local port)
  *	@hash2:	hash table, sockets are hashed on (local port, local address)
+ *	@hash4:	hash table, sockets are hashed on
+ *		(local port, local address, remote port, remote address)
  *	@mask:	number of slots in hash tables, minus 1
  *	@log:	log2(number of slots in hash table)
  */
 struct udp_table {
 	struct udp_hslot	*hash;
 	struct udp_hslot	*hash2;
+	struct udp_hslot	*hash4;
 	unsigned int		mask;
 	unsigned int		log;
 };
@@ -94,6 +97,17 @@  static inline struct udp_hslot *udp_hashslot2(struct udp_table *table,
 	return &table->hash2[hash & table->mask];
 }
 
+static inline struct udp_hslot *udp_hashslot4(struct udp_table *table,
+					      unsigned int hash)
+{
+	return &table->hash4[hash & table->mask];
+}
+
+static inline bool udp_hashed4(const struct sock *sk)
+{
+	return !hlist_unhashed(&udp_sk(sk)->udp_lrpa_node);
+}
+
 extern struct proto udp_prot;
 
 extern atomic_long_t udp_memory_allocated;
@@ -193,7 +207,7 @@  static inline int udp_lib_hash(struct sock *sk)
 }
 
 void udp_lib_unhash(struct sock *sk);
-void udp_lib_rehash(struct sock *sk, u16 new_hash);
+void udp_lib_rehash(struct sock *sk, u16 new_hash, u16 new_hash4);
 
 static inline void udp_lib_close(struct sock *sk, long timeout)
 {
@@ -286,6 +300,7 @@  int udp_rcv(struct sk_buff *skb);
 int udp_ioctl(struct sock *sk, int cmd, int *karg);
 int udp_init_sock(struct sock *sk);
 int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len);
+int udp_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len);
 int __udp_disconnect(struct sock *sk, int flags);
 int udp_disconnect(struct sock *sk, int flags);
 __poll_t udp_poll(struct file *file, struct socket *sock, poll_table *wait);
diff --git a/include/uapi/linux/udp.h b/include/uapi/linux/udp.h
index 1a0fe8b151fb..5b9ecbbec144 100644
--- a/include/uapi/linux/udp.h
+++ b/include/uapi/linux/udp.h
@@ -34,6 +34,7 @@  struct udphdr {
 #define UDP_NO_CHECK6_RX 102	/* Disable accpeting checksum for UDP6 */
 #define UDP_SEGMENT	103	/* Set GSO segmentation size */
 #define UDP_GRO		104	/* This socket can receive UDP GRO packets */
+#define UDP_HASH4	105	/* Enable 4-tuple hash with connect() */
 
 /* UDP encapsulation types */
 #define UDP_ENCAP_ESPINUDP_NON_IKE	1 /* unused  draft-ietf-ipsec-nat-t-ike-00/01 */
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 8accbf4cb295..aac0251ff6fa 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -478,6 +478,27 @@  static struct sock *udp4_lib_lookup2(const struct net *net,
 	return result;
 }
 
+static struct sock *udp4_lib_lookup4(const struct net *net,
+				     __be32 saddr, __be16 sport,
+				     __be32 daddr, unsigned int hnum,
+				     int dif, int sdif,
+				     struct udp_table *udptable)
+{
+	unsigned int hash4 = udp_ehashfn(net, daddr, hnum, saddr, sport);
+	const __portpair ports = INET_COMBINED_PORTS(sport, hnum);
+	struct udp_hslot *hslot4 = udp_hashslot4(udptable, hash4);
+	struct udp_sock *up;
+	struct sock *sk;
+
+	INET_ADDR_COOKIE(acookie, saddr, daddr);
+	udp_lrpa_for_each_entry_rcu(up, &hslot4->head) {
+		sk = (struct sock *)up;
+		if (inet_match(net, sk, acookie, ports, dif, sdif))
+			return sk;
+	}
+	return NULL;
+}
+
 /* UDP is nearly always wildcards out the wazoo, it makes no sense to try
  * harder than this. -DaveM
  */
@@ -490,6 +511,10 @@  struct sock *__udp4_lib_lookup(const struct net *net, __be32 saddr,
 	struct udp_hslot *hslot2;
 	struct sock *result, *sk;
 
+	result = udp4_lib_lookup4(net, saddr, sport, daddr, hnum, dif, sdif, udptable);
+	if (result)
+		return result;
+
 	hash2 = ipv4_portaddr_hash(net, daddr, hnum);
 	slot2 = hash2 & udptable->mask;
 	hslot2 = &udptable->hash2[slot2];
@@ -1933,6 +1958,51 @@  int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 }
 EXPORT_SYMBOL(udp_pre_connect);
 
+/* call with sock lock */
+static void udp4_hash4(struct sock *sk)
+{
+	struct udp_hslot *hslot, *hslot4;
+	struct net *net = sock_net(sk);
+	struct udp_table *udptable;
+	unsigned int hash;
+
+	if (sk_unhashed(sk) || udp_hashed4(sk) ||
+	    inet_sk(sk)->inet_rcv_saddr == htonl(INADDR_ANY))
+		return;
+
+	hash = udp_ehashfn(net, inet_sk(sk)->inet_rcv_saddr, inet_sk(sk)->inet_num,
+			   inet_sk(sk)->inet_daddr, inet_sk(sk)->inet_dport);
+
+	udptable = net->ipv4.udp_table;
+	hslot = udp_hashslot(udptable, net, udp_sk(sk)->udp_port_hash);
+	hslot4 = udp_hashslot4(udptable, hash);
+	udp_sk(sk)->udp_lrpa_hash = hash;
+
+	spin_lock_bh(&hslot->lock);
+	if (rcu_access_pointer(sk->sk_reuseport_cb))
+		reuseport_detach_sock(sk);
+
+	spin_lock(&hslot4->lock);
+	hlist_add_head_rcu(&udp_sk(sk)->udp_lrpa_node, &hslot4->head);
+	hslot4->count++;
+	spin_unlock(&hslot4->lock);
+
+	spin_unlock_bh(&hslot->lock);
+}
+
+int udp_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
+{
+	int res;
+
+	lock_sock(sk);
+	res = __ip4_datagram_connect(sk, uaddr, addr_len);
+	if (!res && udp_test_bit(HASH4_ENABLED, sk))
+		udp4_hash4(sk);
+	release_sock(sk);
+	return res;
+}
+EXPORT_SYMBOL(udp_connect);
+
 int __udp_disconnect(struct sock *sk, int flags)
 {
 	struct inet_sock *inet = inet_sk(sk);
@@ -1974,7 +2044,7 @@  void udp_lib_unhash(struct sock *sk)
 {
 	if (sk_hashed(sk)) {
 		struct udp_table *udptable = udp_get_table_prot(sk);
-		struct udp_hslot *hslot, *hslot2;
+		struct udp_hslot *hslot, *hslot2, *hslot4;
 
 		hslot  = udp_hashslot(udptable, sock_net(sk),
 				      udp_sk(sk)->udp_port_hash);
@@ -1992,6 +2062,14 @@  void udp_lib_unhash(struct sock *sk)
 			hlist_del_init_rcu(&udp_sk(sk)->udp_portaddr_node);
 			hslot2->count--;
 			spin_unlock(&hslot2->lock);
+
+			if (udp_hashed4(sk)) {
+				hslot4 = udp_hashslot4(udptable, udp_sk(sk)->udp_lrpa_hash);
+				spin_lock(&hslot4->lock);
+				hlist_del_init_rcu(&udp_sk(sk)->udp_lrpa_node);
+				hslot4->count--;
+				spin_unlock(&hslot4->lock);
+			}
 		}
 		spin_unlock_bh(&hslot->lock);
 	}
@@ -2001,16 +2079,20 @@  EXPORT_SYMBOL(udp_lib_unhash);
 /*
  * inet_rcv_saddr was changed, we must rehash secondary hash
  */
-void udp_lib_rehash(struct sock *sk, u16 newhash)
+void udp_lib_rehash(struct sock *sk, u16 newhash, u16 newhash4)
 {
 	if (sk_hashed(sk)) {
+		struct udp_hslot *hslot, *hslot2, *nhslot2, *hslot4, *nhslot4;
 		struct udp_table *udptable = udp_get_table_prot(sk);
-		struct udp_hslot *hslot, *hslot2, *nhslot2;
 
 		hslot2 = udp_hashslot2(udptable, udp_sk(sk)->udp_portaddr_hash);
 		nhslot2 = udp_hashslot2(udptable, newhash);
 		udp_sk(sk)->udp_portaddr_hash = newhash;
 
+		hslot4 = udp_hashslot4(udptable, udp_sk(sk)->udp_lrpa_hash);
+		nhslot4 = udp_hashslot4(udptable, newhash4);
+		udp_sk(sk)->udp_lrpa_hash = newhash4;
+
 		if (hslot2 != nhslot2 ||
 		    rcu_access_pointer(sk->sk_reuseport_cb)) {
 			hslot = udp_hashslot(udptable, sock_net(sk),
@@ -2033,6 +2115,18 @@  void udp_lib_rehash(struct sock *sk, u16 newhash)
 				spin_unlock(&nhslot2->lock);
 			}
 
+			if (udp_hashed4(sk) && hslot4 != nhslot4) {
+				spin_lock(&hslot4->lock);
+				hlist_del_init_rcu(&udp_sk(sk)->udp_lrpa_node);
+				hslot4->count--;
+				spin_unlock(&hslot4->lock);
+
+				spin_lock(&nhslot4->lock);
+				hlist_add_head_rcu(&udp_sk(sk)->udp_lrpa_node, &nhslot4->head);
+				nhslot4->count++;
+				spin_unlock(&nhslot4->lock);
+			}
+
 			spin_unlock_bh(&hslot->lock);
 		}
 	}
@@ -2044,7 +2138,10 @@  void udp_v4_rehash(struct sock *sk)
 	u16 new_hash = ipv4_portaddr_hash(sock_net(sk),
 					  inet_sk(sk)->inet_rcv_saddr,
 					  inet_sk(sk)->inet_num);
-	udp_lib_rehash(sk, new_hash);
+	u16 new_hash4 = udp_ehashfn(sock_net(sk),
+				    inet_sk(sk)->inet_rcv_saddr, inet_sk(sk)->inet_num,
+				    inet_sk(sk)->inet_daddr, inet_sk(sk)->inet_dport);
+	udp_lib_rehash(sk, new_hash, new_hash4);
 }
 
 static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
@@ -2757,6 +2854,14 @@  int udp_lib_setsockopt(struct sock *sk, int level, int optname,
 		udp_assign_bit(ACCEPT_L4, sk, valbool);
 		set_xfrm_gro_udp_encap_rcv(up->encap_type, sk->sk_family, sk);
 		break;
+	case UDP_HASH4:
+		/* Clearing HASH4_ENABLED once it is set is not supported yet */
+		if (!valbool && udp_test_bit(HASH4_ENABLED, sk))
+			return -EPERM;
+
+		if (valbool && !udp_test_bit(HASH4_ENABLED, sk))
+			udp_set_bit(HASH4_ENABLED, sk);
+		break;
 
 	/*
 	 * 	UDP-Lite's partial checksum coverage (RFC 3828).
@@ -2846,6 +2951,10 @@  int udp_lib_getsockopt(struct sock *sk, int level, int optname,
 		val = udp_test_bit(GRO_ENABLED, sk);
 		break;
 
+	case UDP_HASH4:
+		val = udp_test_bit(HASH4_ENABLED, sk);
+		break;
+
 	/* The following two cannot be changed on UDP sockets, the return is
 	 * always 0 (which corresponds to the full checksum coverage of UDP). */
 	case UDPLITE_SEND_CSCOV:
@@ -2938,7 +3047,7 @@  struct proto udp_prot = {
 	.owner			= THIS_MODULE,
 	.close			= udp_lib_close,
 	.pre_connect		= udp_pre_connect,
-	.connect		= ip4_datagram_connect,
+	.connect		= udp_connect,
 	.disconnect		= udp_disconnect,
 	.ioctl			= udp_ioctl,
 	.init			= udp_init_sock,
@@ -3429,7 +3538,7 @@  void __init udp_table_init(struct udp_table *table, const char *name)
 	unsigned int i;
 
 	table->hash = alloc_large_system_hash(name,
-					      2 * sizeof(struct udp_hslot),
+					      3 * sizeof(struct udp_hslot),
 					      uhash_entries,
 					      21, /* one slot per 2 MB */
 					      0,
@@ -3449,6 +3558,12 @@  void __init udp_table_init(struct udp_table *table, const char *name)
 		table->hash2[i].count = 0;
 		spin_lock_init(&table->hash2[i].lock);
 	}
+	table->hash4 = table->hash2 + (table->mask + 1);
+	for (i = 0; i <= table->mask; i++) {
+		INIT_HLIST_HEAD(&table->hash4[i].head);
+		table->hash4[i].count = 0;
+		spin_lock_init(&table->hash4[i].lock);
+	}
 }
 
 u32 udp_flow_hashrnd(void)
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 52dfbb2ff1a8..47659381222d 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -111,7 +111,7 @@  void udp_v6_rehash(struct sock *sk)
 					  &sk->sk_v6_rcv_saddr,
 					  inet_sk(sk)->inet_num);
 
-	udp_lib_rehash(sk, new_hash);
+	udp_lib_rehash(sk, new_hash, 0); /* 4-tuple hash not implemented */
 }
 
 static int compute_score(struct sock *sk, const struct net *net,