
net: skip RPS if packet is already on target CPU

Message ID 20241029182703.2698171-1-csander@purestorage.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series net: skip RPS if packet is already on target CPU

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 5 this patch: 5
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 5 of 5 maintainers
netdev/build_clang success Errors and warnings before: 3 this patch: 3
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 9 this patch: 9
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 22 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 79 this patch: 79
netdev/source_inline success Was 0 now: 0

Commit Message

Caleb Sander Oct. 29, 2024, 6:26 p.m. UTC
If RPS is enabled, all packets with a CPU flow hint are enqueued to the
target CPU's input_pkt_queue and process_backlog() is scheduled on that
CPU to dequeue and process the packets. If ARFS has already steered the
packets to the correct CPU, this additional queuing is unnecessary and
the spinlocks involved incur significant CPU overhead.

In netif_receive_skb_internal() and netif_receive_skb_list_internal(),
check if the CPU flow hint get_rps_cpu() returns is the current CPU. If
so, bypass input_pkt_queue and immediately process the packet(s) on the
current CPU.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
---
 net/core/dev.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
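
For quick reference, the core of the change (the full diff appears under "Patch" at the bottom of this page) is an extra current-CPU check before falling back to the backlog:

	if (cpu >= 0 && cpu != smp_processor_id()) {
		ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
		rcu_read_unlock();
		return ret;
	}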

Comments

Eric Dumazet Oct. 29, 2024, 7:02 p.m. UTC | #1
On Tue, Oct 29, 2024 at 7:27 PM Caleb Sander Mateos
<csander@purestorage.com> wrote:
>
> If RPS is enabled, all packets with a CPU flow hint are enqueued to the
> target CPU's input_pkt_queue and process_backlog() is scheduled on that
> CPU to dequeue and process the packets. If ARFS has already steered the
> packets to the correct CPU, this additional queuing is unnecessary and
> the spinlocks involved incur significant CPU overhead.
>
> In netif_receive_skb_internal() and netif_receive_skb_list_internal(),
> check if the CPU flow hint get_rps_cpu() returns is the current CPU. If
> so, bypass input_pkt_queue and immediately process the packet(s) on the
> current CPU.
>
> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>

Current implementation was a conscious choice. This has been discussed
several times.

By processing packets inline, you are actually increasing latencies of
packets queued to other cpus.
Caleb Sander Oct. 29, 2024, 8:38 p.m. UTC | #2
On Tue, Oct 29, 2024 at 12:02 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Oct 29, 2024 at 7:27 PM Caleb Sander Mateos
> <csander@purestorage.com> wrote:
> >
> > If RPS is enabled, all packets with a CPU flow hint are enqueued to the
> > target CPU's input_pkt_queue and process_backlog() is scheduled on that
> > CPU to dequeue and process the packets. If ARFS has already steered the
> > packets to the correct CPU, this additional queuing is unnecessary and
> > the spinlocks involved incur significant CPU overhead.
> >
> > In netif_receive_skb_internal() and netif_receive_skb_list_internal(),
> > check if the CPU flow hint get_rps_cpu() returns is the current CPU. If
> > so, bypass input_pkt_queue and immediately process the packet(s) on the
> > current CPU.
> >
> > Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
>
> Current implementation was a conscious choice. This has been discussed
> several times.
>
> By processing packets inline, you are actually increasing latencies of
> packets queued to other cpus.

Sorry, I wasn't aware of these prior discussions. I take it you are
referring to threads like
https://lore.kernel.org/netdev/20230322072142.32751-1-xu.xin16@zte.com.cn/T/
? I see what you mean about the latency penalty for packets that do
require cross-CPU steering.

Do you have an alternate suggestion for how to avoid the overhead of
acquiring a spinlock for every packet? The atomic instruction in
rps_lock_irq_disable() called from process_backlog() is consuming 5%
of our CPU time. For our use case, we don't really want software RPS;
we are expecting ARFS to steer all high-bandwidth traffic to the
desired CPUs. We would happily turn off software RPS entirely if we
could, which seems like it would avoid the concerns about higher
latency for packets that need to be steered to a different CPU. But
my understanding is that using ARFS requires RPS to be enabled
(rps_sock_flow_entries set globally and rps_flow_cnt set on each
queue), which enables these rps_needed static branches. Is that
correct? If so, would you be open to adding a sysctl that disables
software RPS and relies upon ARFS to do the packet steering?

Thanks,
Caleb
Eric Dumazet Oct. 30, 2024, 12:55 p.m. UTC | #3
On Tue, Oct 29, 2024 at 9:38 PM Caleb Sander <csander@purestorage.com> wrote:
>
> On Tue, Oct 29, 2024 at 12:02 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Tue, Oct 29, 2024 at 7:27 PM Caleb Sander Mateos
> > <csander@purestorage.com> wrote:
> > >
> > > If RPS is enabled, all packets with a CPU flow hint are enqueued to the
> > > target CPU's input_pkt_queue and process_backlog() is scheduled on that
> > > CPU to dequeue and process the packets. If ARFS has already steered the
> > > packets to the correct CPU, this additional queuing is unnecessary and
> > > the spinlocks involved incur significant CPU overhead.
> > >
> > > In netif_receive_skb_internal() and netif_receive_skb_list_internal(),
> > > check if the CPU flow hint get_rps_cpu() returns is the current CPU. If
> > > so, bypass input_pkt_queue and immediately process the packet(s) on the
> > > current CPU.
> > >
> > > Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> >
> > Current implementation was a conscious choice. This has been discussed
> > several times.
> >
> > By processing packets inline, you are actually increasing latencies of
> > packets queued to other cpus.
>
> Sorry, I wasn't aware of these prior discussions. I take it you are
> referring to threads like
> https://lore.kernel.org/netdev/20230322072142.32751-1-xu.xin16@zte.com.cn/T/
> ? I see what you mean about the latency penalty for packets that do
> require cross-CPU steering.
>
> Do you have an alternate suggestion for how to avoid the overhead of
> acquiring a spinlock for every packet? The atomic instruction in
> rps_lock_irq_disable() called from process_backlog() is consuming 5%
> of our CPU time. For our use case, we don't really want software RPS;
> we are expecting ARFS to steer all high-bandwidth traffic to the
> desired CPUs. We would happily turn off software RPS entirely if we
> could, which seems like it would avoid the concerns about higher
> latency for packets that need to be steering to a different CPU. But
> my understanding is that using ARFS requires RPS to be enabled
> (rps_sock_flow_entries set globally and rps_flow_cnt set on each
> queue), which enables these rps_needed static branches. Is that
> correct? If so, would you be open to adding a sysctl that disables
> software RPS and relies upon ARFS to do the packet steering?

A sysctl will not avoid the fundamental issue.
Why not instead address the past feedback ?
Can you test the following ?

diff --git a/net/core/dev.c b/net/core/dev.c
index c682173a76424d7dadcc8374aa5b11dff44a4b46..7a5a7f1a4b7c3cbd105ecfc076377f25929729eb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5842,6 +5842,21 @@ static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
        return ret;
 }

+#ifdef CONFIG_RPS
+static bool net_must_use_backlog(int tcpu)
+{
+       if (tcpu < 0)
+               return false;
+       if (tcpu != smp_processor_id())
+               return true;
+       /* target cpu is ourself. We must use our backlog
+        * if we have deferred IPI or packets.
+        */
+       return this_cpu_read(softnet_data.rps_ipi_list) != NULL ||
+              this_cpu_read(softnet_data.input_pkt_queue.qlen) != 0;
+}
+#endif
+
 static int netif_receive_skb_internal(struct sk_buff *skb)
 {
        int ret;
@@ -5857,7 +5872,7 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
                struct rps_dev_flow voidflow, *rflow = &voidflow;
                int cpu = get_rps_cpu(skb->dev, skb, &rflow);

-               if (cpu >= 0) {
+               if (net_must_use_backlog(cpu)) {
                        ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
                        rcu_read_unlock();
                        return ret;
@@ -5890,7 +5905,7 @@ void netif_receive_skb_list_internal(struct list_head *head)
                        struct rps_dev_flow voidflow, *rflow = &voidflow;
                        int cpu = get_rps_cpu(skb->dev, skb, &rflow);

-                       if (cpu >= 0) {
+                       if (net_must_use_backlog(cpu)) {
                                /* Will be handled, remove from list */
                                skb_list_del_init(skb);
                                enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
Caleb Sander Oct. 30, 2024, 8:26 p.m. UTC | #4
On Wed, Oct 30, 2024 at 5:55 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Oct 29, 2024 at 9:38 PM Caleb Sander <csander@purestorage.com> wrote:
> >
> > On Tue, Oct 29, 2024 at 12:02 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Tue, Oct 29, 2024 at 7:27 PM Caleb Sander Mateos
> > > <csander@purestorage.com> wrote:
> > > >
> > > > If RPS is enabled, all packets with a CPU flow hint are enqueued to the
> > > > target CPU's input_pkt_queue and process_backlog() is scheduled on that
> > > > CPU to dequeue and process the packets. If ARFS has already steered the
> > > > packets to the correct CPU, this additional queuing is unnecessary and
> > > > the spinlocks involved incur significant CPU overhead.
> > > >
> > > > In netif_receive_skb_internal() and netif_receive_skb_list_internal(),
> > > > check if the CPU flow hint get_rps_cpu() returns is the current CPU. If
> > > > so, bypass input_pkt_queue and immediately process the packet(s) on the
> > > > current CPU.
> > > >
> > > > Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> > >
> > > Current implementation was a conscious choice. This has been discussed
> > > several times.
> > >
> > > By processing packets inline, you are actually increasing latencies of
> > > packets queued to other cpus.
> >
> > Sorry, I wasn't aware of these prior discussions. I take it you are
> > referring to threads like
> > https://lore.kernel.org/netdev/20230322072142.32751-1-xu.xin16@zte.com.cn/T/
> > ? I see what you mean about the latency penalty for packets that do
> > require cross-CPU steering.
> >
> > Do you have an alternate suggestion for how to avoid the overhead of
> > acquiring a spinlock for every packet? The atomic instruction in
> > rps_lock_irq_disable() called from process_backlog() is consuming 5%
> > of our CPU time. For our use case, we don't really want software RPS;
> > we are expecting ARFS to steer all high-bandwidth traffic to the
> > desired CPUs. We would happily turn off software RPS entirely if we
> > could, which seems like it would avoid the concerns about higher
> > latency for packets that need to be steering to a different CPU. But
> > my understanding is that using ARFS requires RPS to be enabled
> > (rps_sock_flow_entries set globally and rps_flow_cnt set on each
> > queue), which enables these rps_needed static branches. Is that
> > correct? If so, would you be open to adding a sysctl that disables
> > software RPS and relies upon ARFS to do the packet steering?
>
> A sysctl will not avoid the fundamental issue.

Sorry if my suggestion was unclear. I mean that we would ideally like
to use only hardware ARFS for packet steering, and disable software
RPS.
In our testing, ARFS reliably steers packets to the desired CPUs. (Our
application has long-lived TCP sockets, each processed on a single
thread affinitized to one of the interrupt CPUs for the Ethernet
device.) On the off chance that ARFS doesn't steer the packet to the
correct CPU, we would rather just process it on the CPU that receives
it instead of going through the RPS queues. If software RPS is never
used, then there wouldn't be any concerns about higher latency for
RPS-steered vs. non-RPS-steered packets, right? The get_rps_cpu()
computation is also not cheap, so it would be nice to skip it too.
Basically, we want to program ARFS but skip these
static_branch_unlikely(&rps_needed) branches. But I'm not aware of a
way to do that currently. (Please let me know if it's already possible
to do that.)

> Why not instead address the past feedback ?
> Can you test the following ?

Sure, I will test the performance on our setup with this patch. Still,
we would prefer to skip the get_rps_cpu() computation and these extra
checks, and just process the packets on the CPUs they arrive on.

Thanks for your help,
Caleb
Caleb Sander Nov. 4, 2024, 6:42 p.m. UTC | #5
On Wed, Oct 30, 2024 at 1:26 PM Caleb Sander <csander@purestorage.com> wrote:
>
> On Wed, Oct 30, 2024 at 5:55 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Tue, Oct 29, 2024 at 9:38 PM Caleb Sander <csander@purestorage.com> wrote:
> > >
> > > On Tue, Oct 29, 2024 at 12:02 PM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Tue, Oct 29, 2024 at 7:27 PM Caleb Sander Mateos
> > > > <csander@purestorage.com> wrote:
> > > > >
> > > > > If RPS is enabled, all packets with a CPU flow hint are enqueued to the
> > > > > target CPU's input_pkt_queue and process_backlog() is scheduled on that
> > > > > CPU to dequeue and process the packets. If ARFS has already steered the
> > > > > packets to the correct CPU, this additional queuing is unnecessary and
> > > > > the spinlocks involved incur significant CPU overhead.
> > > > >
> > > > > In netif_receive_skb_internal() and netif_receive_skb_list_internal(),
> > > > > check if the CPU flow hint get_rps_cpu() returns is the current CPU. If
> > > > > so, bypass input_pkt_queue and immediately process the packet(s) on the
> > > > > current CPU.
> > > > >
> > > > > Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> > > >
> > > > Current implementation was a conscious choice. This has been discussed
> > > > several times.
> > > >
> > > > By processing packets inline, you are actually increasing latencies of
> > > > packets queued to other cpus.
> > >
> > > Sorry, I wasn't aware of these prior discussions. I take it you are
> > > referring to threads like
> > > https://lore.kernel.org/netdev/20230322072142.32751-1-xu.xin16@zte.com.cn/T/
> > > ? I see what you mean about the latency penalty for packets that do
> > > require cross-CPU steering.
> > >
> > > Do you have an alternate suggestion for how to avoid the overhead of
> > > acquiring a spinlock for every packet? The atomic instruction in
> > > rps_lock_irq_disable() called from process_backlog() is consuming 5%
> > > of our CPU time. For our use case, we don't really want software RPS;
> > > we are expecting ARFS to steer all high-bandwidth traffic to the
> > > desired CPUs. We would happily turn off software RPS entirely if we
> > > could, which seems like it would avoid the concerns about higher
> > > latency for packets that need to be steering to a different CPU. But
> > > my understanding is that using ARFS requires RPS to be enabled
> > > (rps_sock_flow_entries set globally and rps_flow_cnt set on each
> > > queue), which enables these rps_needed static branches. Is that
> > > correct? If so, would you be open to adding a sysctl that disables
> > > software RPS and relies upon ARFS to do the packet steering?
> >
> > A sysctl will not avoid the fundamental issue.
>
> Sorry if my suggestion was unclear. I mean that we would ideally like
> to use only hardware ARFS for packet steering, and disable software
> RPS.
> In our testing, ARFS reliably steers packets to the desired CPUs. (Our
> application has long-lived TCP sockets, each processed on a single
> thread affinitized to one of the interrupt CPUs for the Ethernet
> device.) In the off chance that ARFS doesn't steer the packet to the
> correct CPU, we would rather just process it on the CPU that receives
> it instead of going through the RPS queues. If software RPS is never
> used, then there wouldn't be any concerns about higher latency for
> RPS-steered vs. non-RPS-steered packets, right? The get_rps_cpu()
> computation is also not cheap, so it would be nice to skip it too.
> Basically, we want to program ARFS but skip these
> static_branch_unlikely(&rps_needed) branches. But I'm not aware of a
> way to do that currently. (Please let me know if it's already possible
> to do that.)

I see now that get_rps_cpu() still needs to be called even if software
RPS isn't going to be used, because it's also responsible for calling
set_rps_cpu() to program ARFS. So it looks like avoiding the calls to
get_rps_cpu() entirely is not possible.

> > Why not instead address the past feedback ?
> > Can you test the following ?
>
> Sure, I will test the performance on our setup with this patch. Still,
> we would prefer to skip the get_rps_cpu() computation and these extra
> checks, and just process the packets on the CPUs they arrive on.

Hi Eric,
I tried out your patch and it seems to work equally well at
eliminating the process_backlog() -> _raw_spin_lock_irq() hotspot. The
added checks contribute a bit of overhead in
netif_receive_skb_list_internal(): it has increased from 0.15% to
0.28% of CPU time. But that's definitely worth it to remove the 5%
hotspot acquiring the RPS spinlock.

Feel free to add:
Tested-by: Caleb Sander Mateos <csander@purestorage.com>

Some minor comments on the patch:
- Probably should be using either skb_queue_len_lockless() or
skb_queue_empty_lockless() to check input_pkt_queue.qlen != 0 because
the RPS lock is not held at this point
- Could consolidate the 2 this_cpu_read(softnet_data...) lookups into
a single this_cpu_ptr(&softnet_data) call

Thanks,
Caleb
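
For concreteness, Caleb's two suggestions combined would look roughly like this (an untested sketch, not a posted patch; Eric's reply below explains why he prefers the this_cpu_read() form):

	struct softnet_data *sd = this_cpu_ptr(&softnet_data);

	/* lockless checks of the deferred-IPI list and the backlog queue */
	return sd->rps_ipi_list != NULL ||
	       !skb_queue_empty_lockless(&sd->input_pkt_queue);
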
Eric Dumazet Nov. 4, 2024, 6:58 p.m. UTC | #6
On Mon, Nov 4, 2024 at 7:42 PM Caleb Sander <csander@purestorage.com> wrote:
>
> On Wed, Oct 30, 2024 at 1:26 PM Caleb Sander <csander@purestorage.com> wrote:
> >
> > On Wed, Oct 30, 2024 at 5:55 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Tue, Oct 29, 2024 at 9:38 PM Caleb Sander <csander@purestorage.com> wrote:
> > > >
> > > > On Tue, Oct 29, 2024 at 12:02 PM Eric Dumazet <edumazet@google.com> wrote:
> > > > >
> > > > > On Tue, Oct 29, 2024 at 7:27 PM Caleb Sander Mateos
> > > > > <csander@purestorage.com> wrote:
> > > > > >
> > > > > > If RPS is enabled, all packets with a CPU flow hint are enqueued to the
> > > > > > target CPU's input_pkt_queue and process_backlog() is scheduled on that
> > > > > > CPU to dequeue and process the packets. If ARFS has already steered the
> > > > > > packets to the correct CPU, this additional queuing is unnecessary and
> > > > > > the spinlocks involved incur significant CPU overhead.
> > > > > >
> > > > > > In netif_receive_skb_internal() and netif_receive_skb_list_internal(),
> > > > > > check if the CPU flow hint get_rps_cpu() returns is the current CPU. If
> > > > > > so, bypass input_pkt_queue and immediately process the packet(s) on the
> > > > > > current CPU.
> > > > > >
> > > > > > Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> > > > >
> > > > > Current implementation was a conscious choice. This has been discussed
> > > > > several times.
> > > > >
> > > > > By processing packets inline, you are actually increasing latencies of
> > > > > packets queued to other cpus.
> > > >
> > > > Sorry, I wasn't aware of these prior discussions. I take it you are
> > > > referring to threads like
> > > > https://lore.kernel.org/netdev/20230322072142.32751-1-xu.xin16@zte.com.cn/T/
> > > > ? I see what you mean about the latency penalty for packets that do
> > > > require cross-CPU steering.
> > > >
> > > > Do you have an alternate suggestion for how to avoid the overhead of
> > > > acquiring a spinlock for every packet? The atomic instruction in
> > > > rps_lock_irq_disable() called from process_backlog() is consuming 5%
> > > > of our CPU time. For our use case, we don't really want software RPS;
> > > > we are expecting ARFS to steer all high-bandwidth traffic to the
> > > > desired CPUs. We would happily turn off software RPS entirely if we
> > > > could, which seems like it would avoid the concerns about higher
> > > > latency for packets that need to be steering to a different CPU. But
> > > > my understanding is that using ARFS requires RPS to be enabled
> > > > (rps_sock_flow_entries set globally and rps_flow_cnt set on each
> > > > queue), which enables these rps_needed static branches. Is that
> > > > correct? If so, would you be open to adding a sysctl that disables
> > > > software RPS and relies upon ARFS to do the packet steering?
> > >
> > > A sysctl will not avoid the fundamental issue.
> >
> > Sorry if my suggestion was unclear. I mean that we would ideally like
> > to use only hardware ARFS for packet steering, and disable software
> > RPS.
> > In our testing, ARFS reliably steers packets to the desired CPUs. (Our
> > application has long-lived TCP sockets, each processed on a single
> > thread affinitized to one of the interrupt CPUs for the Ethernet
> > device.) In the off chance that ARFS doesn't steer the packet to the
> > correct CPU, we would rather just process it on the CPU that receives
> > it instead of going through the RPS queues. If software RPS is never
> > used, then there wouldn't be any concerns about higher latency for
> > RPS-steered vs. non-RPS-steered packets, right? The get_rps_cpu()
> > computation is also not cheap, so it would be nice to skip it too.
> > Basically, we want to program ARFS but skip these
> > static_branch_unlikely(&rps_needed) branches. But I'm not aware of a
> > way to do that currently. (Please let me know if it's already possible
> > to do that.)
>
> I see now that get_rps_cpu() still needs to be called even if software
> RPS isn't going to be used, because it's also responsible for calling
> set_rps_cpu() to program ARFS. So looks like avoiding the calls to
> get_rps_cpu() entirely is not possible.

Yes, this is the reason. Things would be different if ARFS was done at
dev_queue_xmit() time.

>
> > > Why not instead address the past feedback ?
> > > Can you test the following ?
> >
> > Sure, I will test the performance on our setup with this patch. Still,
> > we would prefer to skip the get_rps_cpu() computation and these extra
> > checks, and just process the packets on the CPUs they arrive on.
>
> Hi Eric,
> I tried out your patch and it seems to work equally well at
> eliminating the process_backlog() -> _raw_spin_lock_irq() hotspot. The
> added checks contribute a bit of overhead in
> netif_receive_skb_list_internal(): it has increased from 0.15% to
> 0.28% of CPU time. But that's definitely worth it to remove the 5%
> hotspot acquiring the RPS spinlock.
>
> Feel free to add:
> Tested-by: Caleb Sander Mateos <csander@purestorage.com>
>
> Some minor comments on the patch:
> - Probably should be using either skb_queue_len_lockless() or
> skb_queue_empty_lockless() to check input_pkt_queue.qlen != 0 because
> the RPS lock is not held at this point

this_cpu_read() has implicit READ_ONCE() semantic.

> - Could consolidate the 2 this_cpu_read(softnet_data...) lookups into
> a single this_cpu_ptr(&softnet_data) call

Double check the generated assembly first :)

this_cpu_read() is faster than going through
this_cpu_ptr(&softnet_data), at least on x86,
thanks to %gs: prefix.

No temporary register is needed to compute and hold this_cpu_ptr(&softnet_data).
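
Illustrating the point schematically (a kernel-context sketch; the codegen notes describe typical x86-64 output and are not guaranteed):

	/* Form 1: this_cpu_read() per field. Each read compiles to a single
	 * %gs-prefixed load; no per-cpu pointer has to be computed or held
	 * in a temporary register.
	 */
	bool use_backlog = this_cpu_read(softnet_data.rps_ipi_list) != NULL ||
			   this_cpu_read(softnet_data.input_pkt_queue.qlen) != 0;

	/* Form 2: this_cpu_ptr() once, then plain dereferences. The per-cpu
	 * address (&softnet_data plus this CPU's offset) is first computed
	 * into a register, which is the extra cost Eric refers to.
	 */
	struct softnet_data *sd = this_cpu_ptr(&softnet_data);
	bool use_backlog2 = sd->rps_ipi_list != NULL ||
			    sd->input_pkt_queue.qlen != 0;
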
Caleb Sander Nov. 4, 2024, 7:18 p.m. UTC | #7
On Mon, Nov 4, 2024 at 10:58 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Nov 4, 2024 at 7:42 PM Caleb Sander <csander@purestorage.com> wrote:
> >
> > On Wed, Oct 30, 2024 at 1:26 PM Caleb Sander <csander@purestorage.com> wrote:
> > >
> > > On Wed, Oct 30, 2024 at 5:55 AM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Tue, Oct 29, 2024 at 9:38 PM Caleb Sander <csander@purestorage.com> wrote:
> > > > >
> > > > > On Tue, Oct 29, 2024 at 12:02 PM Eric Dumazet <edumazet@google.com> wrote:
> > > > > >
> > > > > > On Tue, Oct 29, 2024 at 7:27 PM Caleb Sander Mateos
> > > > > > <csander@purestorage.com> wrote:
> > > > > > >
> > > > > > > If RPS is enabled, all packets with a CPU flow hint are enqueued to the
> > > > > > > target CPU's input_pkt_queue and process_backlog() is scheduled on that
> > > > > > > CPU to dequeue and process the packets. If ARFS has already steered the
> > > > > > > packets to the correct CPU, this additional queuing is unnecessary and
> > > > > > > the spinlocks involved incur significant CPU overhead.
> > > > > > >
> > > > > > > In netif_receive_skb_internal() and netif_receive_skb_list_internal(),
> > > > > > > check if the CPU flow hint get_rps_cpu() returns is the current CPU. If
> > > > > > > so, bypass input_pkt_queue and immediately process the packet(s) on the
> > > > > > > current CPU.
> > > > > > >
> > > > > > > Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> > > > > >
> > > > > > Current implementation was a conscious choice. This has been discussed
> > > > > > several times.
> > > > > >
> > > > > > By processing packets inline, you are actually increasing latencies of
> > > > > > packets queued to other cpus.
> > > > >
> > > > > Sorry, I wasn't aware of these prior discussions. I take it you are
> > > > > referring to threads like
> > > > > https://lore.kernel.org/netdev/20230322072142.32751-1-xu.xin16@zte.com.cn/T/
> > > > > ? I see what you mean about the latency penalty for packets that do
> > > > > require cross-CPU steering.
> > > > >
> > > > > Do you have an alternate suggestion for how to avoid the overhead of
> > > > > acquiring a spinlock for every packet? The atomic instruction in
> > > > > rps_lock_irq_disable() called from process_backlog() is consuming 5%
> > > > > of our CPU time. For our use case, we don't really want software RPS;
> > > > > we are expecting ARFS to steer all high-bandwidth traffic to the
> > > > > desired CPUs. We would happily turn off software RPS entirely if we
> > > > > could, which seems like it would avoid the concerns about higher
> > > > > latency for packets that need to be steering to a different CPU. But
> > > > > my understanding is that using ARFS requires RPS to be enabled
> > > > > (rps_sock_flow_entries set globally and rps_flow_cnt set on each
> > > > > queue), which enables these rps_needed static branches. Is that
> > > > > correct? If so, would you be open to adding a sysctl that disables
> > > > > software RPS and relies upon ARFS to do the packet steering?
> > > >
> > > > A sysctl will not avoid the fundamental issue.
> > >
> > > Sorry if my suggestion was unclear. I mean that we would ideally like
> > > to use only hardware ARFS for packet steering, and disable software
> > > RPS.
> > > In our testing, ARFS reliably steers packets to the desired CPUs. (Our
> > > application has long-lived TCP sockets, each processed on a single
> > > thread affinitized to one of the interrupt CPUs for the Ethernet
> > > device.) In the off chance that ARFS doesn't steer the packet to the
> > > correct CPU, we would rather just process it on the CPU that receives
> > > it instead of going through the RPS queues. If software RPS is never
> > > used, then there wouldn't be any concerns about higher latency for
> > > RPS-steered vs. non-RPS-steered packets, right? The get_rps_cpu()
> > > computation is also not cheap, so it would be nice to skip it too.
> > > Basically, we want to program ARFS but skip these
> > > static_branch_unlikely(&rps_needed) branches. But I'm not aware of a
> > > way to do that currently. (Please let me know if it's already possible
> > > to do that.)
> >
> > I see now that get_rps_cpu() still needs to be called even if software
> > RPS isn't going to be used, because it's also responsible for calling
> > set_rps_cpu() to program ARFS. So looks like avoiding the calls to
> > get_rps_cpu() entirely is not possible.
>
> Yes, this is the reason. Things would be different if ARFS was done at
> dev_queue_xmit() time.
>
> >
> > > > Why not instead address the past feedback ?
> > > > Can you test the following ?
> > >
> > > Sure, I will test the performance on our setup with this patch. Still,
> > > we would prefer to skip the get_rps_cpu() computation and these extra
> > > checks, and just process the packets on the CPUs they arrive on.
> >
> > Hi Eric,
> > I tried out your patch and it seems to work equally well at
> > eliminating the process_backlog() -> _raw_spin_lock_irq() hotspot. The
> > added checks contribute a bit of overhead in
> > netif_receive_skb_list_internal(): it has increased from 0.15% to
> > 0.28% of CPU time. But that's definitely worth it to remove the 5%
> > hotspot acquiring the RPS spinlock.
> >
> > Feel free to add:
> > Tested-by: Caleb Sander Mateos <csander@purestorage.com>
> >
> > Some minor comments on the patch:
> > - Probably should be using either skb_queue_len_lockless() or
> > skb_queue_empty_lockless() to check input_pkt_queue.qlen != 0 because
> > the RPS lock is not held at this point
>
> this_cpu_read() has implicit READ_ONCE() semantic.

Okay. Seems like it would be nicer not to reach into the sk_buff_head
internals, but up to you.

>
> > - Could consolidate the 2 this_cpu_read(softnet_data...) lookups into
> > a single this_cpu_ptr(&softnet_data) call
>
> Double check the generated assembly first :)
>
> this_cpu_read() is faster than going through
> this_cpu_ptr(&softnet_data), at least on x86,
> thanks to %gs: prefix.
>
> No temporary register is needed to compute and hold this_cpu_ptr(&softnet_data).

Got it, thanks.
Caleb Sander Nov. 14, 2024, 9:35 p.m. UTC | #8
On Wed, Oct 30, 2024 at 5:55 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Oct 29, 2024 at 9:38 PM Caleb Sander <csander@purestorage.com> wrote:
> >
> > On Tue, Oct 29, 2024 at 12:02 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Tue, Oct 29, 2024 at 7:27 PM Caleb Sander Mateos
> > > <csander@purestorage.com> wrote:
> > > >
> > > > If RPS is enabled, all packets with a CPU flow hint are enqueued to the
> > > > target CPU's input_pkt_queue and process_backlog() is scheduled on that
> > > > CPU to dequeue and process the packets. If ARFS has already steered the
> > > > packets to the correct CPU, this additional queuing is unnecessary and
> > > > the spinlocks involved incur significant CPU overhead.
> > > >
> > > > In netif_receive_skb_internal() and netif_receive_skb_list_internal(),
> > > > check if the CPU flow hint get_rps_cpu() returns is the current CPU. If
> > > > so, bypass input_pkt_queue and immediately process the packet(s) on the
> > > > current CPU.
> > > >
> > > > Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> > >
> > > Current implementation was a conscious choice. This has been discussed
> > > several times.
> > >
> > > By processing packets inline, you are actually increasing latencies of
> > > packets queued to other cpus.
> >
> > Sorry, I wasn't aware of these prior discussions. I take it you are
> > referring to threads like
> > https://lore.kernel.org/netdev/20230322072142.32751-1-xu.xin16@zte.com.cn/T/
> > ? I see what you mean about the latency penalty for packets that do
> > require cross-CPU steering.
> >
> > Do you have an alternate suggestion for how to avoid the overhead of
> > acquiring a spinlock for every packet? The atomic instruction in
> > rps_lock_irq_disable() called from process_backlog() is consuming 5%
> > of our CPU time. For our use case, we don't really want software RPS;
> > we are expecting ARFS to steer all high-bandwidth traffic to the
> > desired CPUs. We would happily turn off software RPS entirely if we
> > could, which seems like it would avoid the concerns about higher
> > latency for packets that need to be steering to a different CPU. But
> > my understanding is that using ARFS requires RPS to be enabled
> > (rps_sock_flow_entries set globally and rps_flow_cnt set on each
> > queue), which enables these rps_needed static branches. Is that
> > correct? If so, would you be open to adding a sysctl that disables
> > software RPS and relies upon ARFS to do the packet steering?
>
> A sysctl will not avoid the fundamental issue.
> Why not instead address the past feedback ?
> Can you test the following ?
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index c682173a76424d7dadcc8374aa5b11dff44a4b46..7a5a7f1a4b7c3cbd105ecfc076377f25929729eb 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5842,6 +5842,21 @@ static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
>         return ret;
>  }
>
> +#ifdef CONFIG_RPS
> +static bool net_must_use_backlog(int tcpu)
> +{
> +       if (tcpu < 0)
> +               return false;
> +       if (tcpu != smp_processor_id())
> +               return true;
> +       /* target cpu is ourself. We must use our backlog
> +        * if we have deferred IPI or packets.
> +        */
> +       return this_cpu_read(softnet_data.rps_ipi_list) != NULL ||
> +              this_cpu_read(softnet_data.input_pkt_queue.qlen) != 0;
> +}
> +#endif

Hi Eric,
Looking at this patch again, I am wondering why the tcpu < 0 case is
treated differently from the tcpu == smp_processor_id() case. If I
understand correctly, packets without a CPU flow hint are allowed to
be processed immediately on the current CPU. They will leapfrog
packets that other CPUs have queued onto this CPU via RPS and any RPS
IPIs waiting to be issued to other CPUs. I see this is the behavior in
the current code too, but why don't the same concerns about higher
latency for packets steered cross-CPU apply?

Thanks,
Caleb

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index c682173a7642..714a47897c75 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5855,11 +5855,11 @@  static int netif_receive_skb_internal(struct sk_buff *skb)
 #ifdef CONFIG_RPS
 	if (static_branch_unlikely(&rps_needed)) {
 		struct rps_dev_flow voidflow, *rflow = &voidflow;
 		int cpu = get_rps_cpu(skb->dev, skb, &rflow);
 
-		if (cpu >= 0) {
+		if (cpu >= 0 && cpu != smp_processor_id()) {
 			ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
 			rcu_read_unlock();
 			return ret;
 		}
 	}
@@ -5884,15 +5884,17 @@  void netif_receive_skb_list_internal(struct list_head *head)
 	list_splice_init(&sublist, head);
 
 	rcu_read_lock();
 #ifdef CONFIG_RPS
 	if (static_branch_unlikely(&rps_needed)) {
+		int curr_cpu = smp_processor_id();
+
 		list_for_each_entry_safe(skb, next, head, list) {
 			struct rps_dev_flow voidflow, *rflow = &voidflow;
 			int cpu = get_rps_cpu(skb->dev, skb, &rflow);
 
-			if (cpu >= 0) {
+			if (cpu >= 0 && cpu != curr_cpu) {
 				/* Will be handled, remove from list */
 				skb_list_del_init(skb);
 				enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
 			}
 		}