diff mbox series

[net-next,v3] net: enable SOCK_NOSPACE for UDP

Message ID 0e2077519aafb2a47b6a6f25532bfd43c8b931aa.1712581881.git.asml.silence@gmail.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [net-next,v3] net: enable SOCK_NOSPACE for UDP | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 3407 this patch: 3407
netdev/build_tools success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 1 maintainers not CCed: willemdebruijn.kernel@gmail.com
netdev/build_clang success Errors and warnings before: 997 this patch: 997
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 3596 this patch: 3596
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 50 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 16 this patch: 16
netdev/source_inline success Was 0 now: 0
netdev/contest fail net-next-2024-04-09--00-00 (tests: 953)

Commit Message

Pavel Begunkov April 8, 2024, 2:16 p.m. UTC
wake_up_poll() and variants can be expensive even if they don't actually
wake anything up as it involves disabling irqs, taking a spinlock and
walking through the poll list, which is fraught with cache bounces.
That might happen when someone waits for POLLOUT or even POLLIN as the
waitqueue is shared, even though we should be able to skip these
false positive calls when the tx queue is not full.

Add support for SOCK_NOSPACE for UDP sockets. The udp_poll() change is
straightforward and repeats after tcp_poll() and others. In sock_wfree()
it's done as an optional feature because it requires support from the
poll handlers, however there are users of sock_wfree() that might be
unprepared to that.

Note, it optimises the sock_wfree() path but not sock_def_write_space().
That's fine because it leads to more false positive wake ups, which is
tolerable and not performance critical.

It wins +5% to throughput testing with a CPU bound tx only io_uring
based benchmark and showed 0.5-3% in more realistic workloads.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---

v3: fix a race in udp_poll() (Eric)
    clear SOCK_NOSPACE in sock_wfree()

v2: implement it in sock_wfree instead of adding a UDP specific
    free callback.

 include/net/sock.h |  1 +
 net/core/sock.c    |  9 +++++++++
 net/ipv4/udp.c     | 15 ++++++++++++++-
 3 files changed, 24 insertions(+), 1 deletion(-)

Comments

Eric Dumazet April 8, 2024, 2:31 p.m. UTC | #1
On Mon, Apr 8, 2024 at 4:16 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> wake_up_poll() and variants can be expensive even if they don't actually
> wake anything up as it involves disabling irqs, taking a spinlock and
> walking through the poll list, which is fraught with cache bounces.
> That might happen when someone waits for POLLOUT or even POLLIN as the
> waitqueue is shared, even though we should be able to skip these
> false positive calls when the tx queue is not full.
>
> Add support for SOCK_NOSPACE for UDP sockets. The udp_poll() change is
> straightforward and repeats after tcp_poll() and others. In sock_wfree()
> it's done as an optional feature because it requires support from the
> poll handlers, however there are users of sock_wfree() that might be
> unprepared to that.
>
> Note, it optimises the sock_wfree() path but not sock_def_write_space().
> That's fine because it leads to more false positive wake ups, which is
> tolerable and not performance critical.
>
> It wins +5% to throughput testing with a CPU bound tx only io_uring
> based benchmark and showed 0.5-3% in more realistic workloads.
>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>
> v3: fix a race in udp_poll() (Eric)
>     clear SOCK_NOSPACE in sock_wfree()
>
> v2: implement it in sock_wfree instead of adding a UDP specific
>     free callback.
>
>  include/net/sock.h |  1 +
>  net/core/sock.c    |  9 +++++++++
>  net/ipv4/udp.c     | 15 ++++++++++++++-
>  3 files changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 2253eefe2848..027a398471c4 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -944,6 +944,7 @@ enum sock_flags {
>         SOCK_XDP, /* XDP is attached */
>         SOCK_TSTAMP_NEW, /* Indicates 64 bit timestamps always */
>         SOCK_RCVMARK, /* Receive SO_MARK  ancillary data with packet */
> +       SOCK_NOSPACE_SUPPORTED, /* socket supports the SOCK_NOSPACE flag */
>  };
>
>  #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 5ed411231fc7..ae7446570726 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -3393,6 +3393,15 @@ static void sock_def_write_space_wfree(struct sock *sk)
>
>                 /* rely on refcount_sub from sock_wfree() */
>                 smp_mb__after_atomic();
> +
> +               if (sock_flag(sk, SOCK_NOSPACE_SUPPORTED)) {
> +                       struct socket *sock = sk->sk_socket;

It seems sk->sk_socket could be NULL, according to similar helpers
like sk_stream_write_space()

udp_lib_close() -> sk_common_release() -> sock_orphan() ...
Pavel Begunkov April 8, 2024, 3:06 p.m. UTC | #2
On 4/8/24 15:31, Eric Dumazet wrote:
> On Mon, Apr 8, 2024 at 4:16 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> wake_up_poll() and variants can be expensive even if they don't actually
>> wake anything up as it involves disabling irqs, taking a spinlock and
>> walking through the poll list, which is fraught with cache bounces.
>> That might happen when someone waits for POLLOUT or even POLLIN as the
>> waitqueue is shared, even though we should be able to skip these
>> false positive calls when the tx queue is not full.
>>
>> Add support for SOCK_NOSPACE for UDP sockets. The udp_poll() change is
>> straightforward and repeats after tcp_poll() and others. In sock_wfree()
>> it's done as an optional feature because it requires support from the
>> poll handlers, however there are users of sock_wfree() that might be
>> unprepared to that.
>>
>> Note, it optimises the sock_wfree() path but not sock_def_write_space().
>> That's fine because it leads to more false positive wake ups, which is
>> tolerable and not performance critical.
>>
>> It wins +5% to throughput testing with a CPU bound tx only io_uring
>> based benchmark and showed 0.5-3% in more realistic workloads.
>>
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
>>
>> v3: fix a race in udp_poll() (Eric)
>>      clear SOCK_NOSPACE in sock_wfree()
>>
>> v2: implement it in sock_wfree instead of adding a UDP specific
>>      free callback.
>>
>>   include/net/sock.h |  1 +
>>   net/core/sock.c    |  9 +++++++++
>>   net/ipv4/udp.c     | 15 ++++++++++++++-
>>   3 files changed, 24 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/net/sock.h b/include/net/sock.h
>> index 2253eefe2848..027a398471c4 100644
>> --- a/include/net/sock.h
>> +++ b/include/net/sock.h
>> @@ -944,6 +944,7 @@ enum sock_flags {
>>          SOCK_XDP, /* XDP is attached */
>>          SOCK_TSTAMP_NEW, /* Indicates 64 bit timestamps always */
>>          SOCK_RCVMARK, /* Receive SO_MARK  ancillary data with packet */
>> +       SOCK_NOSPACE_SUPPORTED, /* socket supports the SOCK_NOSPACE flag */
>>   };
>>
>>   #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
>> diff --git a/net/core/sock.c b/net/core/sock.c
>> index 5ed411231fc7..ae7446570726 100644
>> --- a/net/core/sock.c
>> +++ b/net/core/sock.c
>> @@ -3393,6 +3393,15 @@ static void sock_def_write_space_wfree(struct sock *sk)
>>
>>                  /* rely on refcount_sub from sock_wfree() */
>>                  smp_mb__after_atomic();
>> +
>> +               if (sock_flag(sk, SOCK_NOSPACE_SUPPORTED)) {
>> +                       struct socket *sock = sk->sk_socket;
> 
> It seems sk->sk_socket could be NULL, according to similar helpers
> like sk_stream_write_space()
> 
> udp_lib_close() -> sk_common_release() -> sock_orphan() ...

Yeah, thanks. So sk_socket stays alive even after it's removed,
makes sense, but I wonder why there is no READ_ONCE in likes of
sk_stream_write_space() as seems sk_socket can just randomly
change?
Eric Dumazet April 8, 2024, 4:05 p.m. UTC | #3
On Mon, Apr 8, 2024 at 5:06 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 4/8/24 15:31, Eric Dumazet wrote:
> > On Mon, Apr 8, 2024 at 4:16 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >>
> >> wake_up_poll() and variants can be expensive even if they don't actually
> >> wake anything up as it involves disabling irqs, taking a spinlock and
> >> walking through the poll list, which is fraught with cache bounces.
> >> That might happen when someone waits for POLLOUT or even POLLIN as the
> >> waitqueue is shared, even though we should be able to skip these
> >> false positive calls when the tx queue is not full.
> >>
> >> Add support for SOCK_NOSPACE for UDP sockets. The udp_poll() change is
> >> straightforward and repeats after tcp_poll() and others. In sock_wfree()
> >> it's done as an optional feature because it requires support from the
> >> poll handlers, however there are users of sock_wfree() that might be
> >> unprepared to that.
> >>
> >> Note, it optimises the sock_wfree() path but not sock_def_write_space().
> >> That's fine because it leads to more false positive wake ups, which is
> >> tolerable and not performance critical.
> >>
> >> It wins +5% to throughput testing with a CPU bound tx only io_uring
> >> based benchmark and showed 0.5-3% in more realistic workloads.
> >>
> >> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> >> ---
> >>
> >> v3: fix a race in udp_poll() (Eric)
> >>      clear SOCK_NOSPACE in sock_wfree()
> >>
> >> v2: implement it in sock_wfree instead of adding a UDP specific
> >>      free callback.
> >>
> >>   include/net/sock.h |  1 +
> >>   net/core/sock.c    |  9 +++++++++
> >>   net/ipv4/udp.c     | 15 ++++++++++++++-
> >>   3 files changed, 24 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/include/net/sock.h b/include/net/sock.h
> >> index 2253eefe2848..027a398471c4 100644
> >> --- a/include/net/sock.h
> >> +++ b/include/net/sock.h
> >> @@ -944,6 +944,7 @@ enum sock_flags {
> >>          SOCK_XDP, /* XDP is attached */
> >>          SOCK_TSTAMP_NEW, /* Indicates 64 bit timestamps always */
> >>          SOCK_RCVMARK, /* Receive SO_MARK  ancillary data with packet */
> >> +       SOCK_NOSPACE_SUPPORTED, /* socket supports the SOCK_NOSPACE flag */
> >>   };
> >>
> >>   #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
> >> diff --git a/net/core/sock.c b/net/core/sock.c
> >> index 5ed411231fc7..ae7446570726 100644
> >> --- a/net/core/sock.c
> >> +++ b/net/core/sock.c
> >> @@ -3393,6 +3393,15 @@ static void sock_def_write_space_wfree(struct sock *sk)
> >>
> >>                  /* rely on refcount_sub from sock_wfree() */
> >>                  smp_mb__after_atomic();
> >> +
> >> +               if (sock_flag(sk, SOCK_NOSPACE_SUPPORTED)) {
> >> +                       struct socket *sock = sk->sk_socket;
> >
> > It seems sk->sk_socket could be NULL, according to similar helpers
> > like sk_stream_write_space()
> >
> > udp_lib_close() -> sk_common_release() -> sock_orphan() ...
>
> Yeah, thanks. So sk_socket stays alive even after it's removed,
> makes sense, but I wonder why there is no READ_ONCE in likes of
> sk_stream_write_space() as seems sk_socket can just randomly
> change?

sk_stream_write_space() is called with socket lock held (so far)

But for lockless UDP tx path, things are a bit different.
diff mbox series

Patch

diff --git a/include/net/sock.h b/include/net/sock.h
index 2253eefe2848..027a398471c4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -944,6 +944,7 @@  enum sock_flags {
 	SOCK_XDP, /* XDP is attached */
 	SOCK_TSTAMP_NEW, /* Indicates 64 bit timestamps always */
 	SOCK_RCVMARK, /* Receive SO_MARK  ancillary data with packet */
+	SOCK_NOSPACE_SUPPORTED, /* socket supports the SOCK_NOSPACE flag */
 };
 
 #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
diff --git a/net/core/sock.c b/net/core/sock.c
index 5ed411231fc7..ae7446570726 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3393,6 +3393,15 @@  static void sock_def_write_space_wfree(struct sock *sk)
 
 		/* rely on refcount_sub from sock_wfree() */
 		smp_mb__after_atomic();
+
+		if (sock_flag(sk, SOCK_NOSPACE_SUPPORTED)) {
+			struct socket *sock = sk->sk_socket;
+
+			if (!test_bit(SOCK_NOSPACE, &sock->flags))
+				return;
+			clear_bit(SOCK_NOSPACE, &sock->flags);
+		}
+
 		if (wq && waitqueue_active(&wq->wait))
 			wake_up_interruptible_sync_poll(&wq->wait, EPOLLOUT |
 						EPOLLWRNORM | EPOLLWRBAND);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 11460d751e73..e23973bafed6 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -342,6 +342,7 @@  int udp_lib_get_port(struct sock *sk, unsigned short snum,
 		hslot2->count++;
 		spin_unlock(&hslot2->lock);
 	}
+	sock_set_flag(sk, SOCK_NOSPACE_SUPPORTED);
 	sock_set_flag(sk, SOCK_RCU_FREE);
 	error = 0;
 fail_unlock:
@@ -2885,8 +2886,20 @@  __poll_t udp_poll(struct file *file, struct socket *sock, poll_table *wait)
 	/* psock ingress_msg queue should not contain any bad checksum frames */
 	if (sk_is_readable(sk))
 		mask |= EPOLLIN | EPOLLRDNORM;
-	return mask;
 
+	if (!(mask & (EPOLLOUT | EPOLLWRNORM | EPOLLWRBAND))) {
+		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
+
+		/* Order with the wspace read so either we observe it
+		 * writeable or udp_sock_wfree() would find SOCK_NOSPACE and
+		 * wake us up.
+		 */
+		smp_mb__after_atomic();
+
+		if (sock_writeable(sk))
+			mask |= EPOLLOUT | EPOLLWRNORM | EPOLLWRBAND;
+	}
+	return mask;
 }
 EXPORT_SYMBOL(udp_poll);