
net: skip RPS if packet is already on target CPU

Message ID 20241029182703.2698171-1-csander@purestorage.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series net: skip RPS if packet is already on target CPU

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 5 this patch: 5
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 5 of 5 maintainers
netdev/build_clang success Errors and warnings before: 3 this patch: 3
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 9 this patch: 9
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 22 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 79 this patch: 79
netdev/source_inline success Was 0 now: 0

Commit Message

Caleb Sander Oct. 29, 2024, 6:26 p.m. UTC
If RPS is enabled, all packets with a CPU flow hint are enqueued to the
target CPU's input_pkt_queue and process_backlog() is scheduled on that
CPU to dequeue and process the packets. If ARFS has already steered the
packets to the correct CPU, this additional queuing is unnecessary and
the spinlocks involved incur significant CPU overhead.

In netif_receive_skb_internal() and netif_receive_skb_list_internal(),
check if the CPU flow hint get_rps_cpu() returns is the current CPU. If
so, bypass input_pkt_queue and immediately process the packet(s) on the
current CPU.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
---
 net/core/dev.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
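
For quick reference, the core of the change (the full diff appears under "Patch" at the bottom of this page) is an extra current-CPU check before falling back to the backlog:

	if (cpu >= 0 && cpu != smp_processor_id()) {
		ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
		rcu_read_unlock();
		return ret;
	}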

Comments

Eric Dumazet Oct. 29, 2024, 7:02 p.m. UTC | #1
On Tue, Oct 29, 2024 at 7:27 PM Caleb Sander Mateos
<csander@purestorage.com> wrote:
>
> If RPS is enabled, all packets with a CPU flow hint are enqueued to the
> target CPU's input_pkt_queue and process_backlog() is scheduled on that
> CPU to dequeue and process the packets. If ARFS has already steered the
> packets to the correct CPU, this additional queuing is unnecessary and
> the spinlocks involved incur significant CPU overhead.
>
> In netif_receive_skb_internal() and netif_receive_skb_list_internal(),
> check if the CPU flow hint get_rps_cpu() returns is the current CPU. If
> so, bypass input_pkt_queue and immediately process the packet(s) on the
> current CPU.
>
> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>

Current implementation was a conscious choice. This has been discussed
several times.

By processing packets inline, you are actually increasing latencies of
packets queued to other cpus.
Caleb Sander Oct. 29, 2024, 8:38 p.m. UTC | #2
On Tue, Oct 29, 2024 at 12:02 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Oct 29, 2024 at 7:27 PM Caleb Sander Mateos
> <csander@purestorage.com> wrote:
> >
> > If RPS is enabled, all packets with a CPU flow hint are enqueued to the
> > target CPU's input_pkt_queue and process_backlog() is scheduled on that
> > CPU to dequeue and process the packets. If ARFS has already steered the
> > packets to the correct CPU, this additional queuing is unnecessary and
> > the spinlocks involved incur significant CPU overhead.
> >
> > In netif_receive_skb_internal() and netif_receive_skb_list_internal(),
> > check if the CPU flow hint get_rps_cpu() returns is the current CPU. If
> > so, bypass input_pkt_queue and immediately process the packet(s) on the
> > current CPU.
> >
> > Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
>
> Current implementation was a conscious choice. This has been discussed
> several times.
>
> By processing packets inline, you are actually increasing latencies of
> packets queued to other cpus.

Sorry, I wasn't aware of these prior discussions. I take it you are
referring to threads like
https://lore.kernel.org/netdev/20230322072142.32751-1-xu.xin16@zte.com.cn/T/
? I see what you mean about the latency penalty for packets that do
require cross-CPU steering.

Do you have an alternate suggestion for how to avoid the overhead of
acquiring a spinlock for every packet? The atomic instruction in
rps_lock_irq_disable() called from process_backlog() is consuming 5%
of our CPU time. For our use case, we don't really want software RPS;
we are expecting ARFS to steer all high-bandwidth traffic to the
desired CPUs. We would happily turn off software RPS entirely if we
could, which seems like it would avoid the concerns about higher
latency for packets that need to be steered to a different CPU. But
my understanding is that using ARFS requires RPS to be enabled
(rps_sock_flow_entries set globally and rps_flow_cnt set on each
queue), which enables these rps_needed static branches. Is that
correct? If so, would you be open to adding a sysctl that disables
software RPS and relies upon ARFS to do the packet steering?

Thanks,
Caleb
Eric Dumazet Oct. 30, 2024, 12:55 p.m. UTC | #3
On Tue, Oct 29, 2024 at 9:38 PM Caleb Sander <csander@purestorage.com> wrote:
>
> On Tue, Oct 29, 2024 at 12:02 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Tue, Oct 29, 2024 at 7:27 PM Caleb Sander Mateos
> > <csander@purestorage.com> wrote:
> > >
> > > If RPS is enabled, all packets with a CPU flow hint are enqueued to the
> > > target CPU's input_pkt_queue and process_backlog() is scheduled on that
> > > CPU to dequeue and process the packets. If ARFS has already steered the
> > > packets to the correct CPU, this additional queuing is unnecessary and
> > > the spinlocks involved incur significant CPU overhead.
> > >
> > > In netif_receive_skb_internal() and netif_receive_skb_list_internal(),
> > > check if the CPU flow hint get_rps_cpu() returns is the current CPU. If
> > > so, bypass input_pkt_queue and immediately process the packet(s) on the
> > > current CPU.
> > >
> > > Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> >
> > Current implementation was a conscious choice. This has been discussed
> > several times.
> >
> > By processing packets inline, you are actually increasing latencies of
> > packets queued to other cpus.
>
> Sorry, I wasn't aware of these prior discussions. I take it you are
> referring to threads like
> https://lore.kernel.org/netdev/20230322072142.32751-1-xu.xin16@zte.com.cn/T/
> ? I see what you mean about the latency penalty for packets that do
> require cross-CPU steering.
>
> Do you have an alternate suggestion for how to avoid the overhead of
> acquiring a spinlock for every packet? The atomic instruction in
> rps_lock_irq_disable() called from process_backlog() is consuming 5%
> of our CPU time. For our use case, we don't really want software RPS;
> we are expecting ARFS to steer all high-bandwidth traffic to the
> desired CPUs. We would happily turn off software RPS entirely if we
> could, which seems like it would avoid the concerns about higher
> latency for packets that need to be steering to a different CPU. But
> my understanding is that using ARFS requires RPS to be enabled
> (rps_sock_flow_entries set globally and rps_flow_cnt set on each
> queue), which enables these rps_needed static branches. Is that
> correct? If so, would you be open to adding a sysctl that disables
> software RPS and relies upon ARFS to do the packet steering?

A sysctl will not avoid the fundamental issue.
Why not instead address the past feedback ?
Can you test the following ?

diff --git a/net/core/dev.c b/net/core/dev.c
index c682173a76424d7dadcc8374aa5b11dff44a4b46..7a5a7f1a4b7c3cbd105ecfc076377f25929729eb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5842,6 +5842,21 @@ static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
        return ret;
 }

+#ifdef CONFIG_RPS
+static bool net_must_use_backlog(int tcpu)
+{
+       if (tcpu < 0)
+               return false;
+       if (tcpu != smp_processor_id())
+               return true;
+       /* target cpu is ourself. We must use our backlog
+        * if we have deferred IPI or packets.
+        */
+       return this_cpu_read(softnet_data.rps_ipi_list) != NULL ||
+              this_cpu_read(softnet_data.input_pkt_queue.qlen) != 0;
+}
+#endif
+
 static int netif_receive_skb_internal(struct sk_buff *skb)
 {
        int ret;
@@ -5857,7 +5872,7 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
                struct rps_dev_flow voidflow, *rflow = &voidflow;
                int cpu = get_rps_cpu(skb->dev, skb, &rflow);

-               if (cpu >= 0) {
+               if (net_must_use_backlog(cpu)) {
                        ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
                        rcu_read_unlock();
                        return ret;
@@ -5890,7 +5905,7 @@ void netif_receive_skb_list_internal(struct list_head *head)
                        struct rps_dev_flow voidflow, *rflow = &voidflow;
                        int cpu = get_rps_cpu(skb->dev, skb, &rflow);

-                       if (cpu >= 0) {
+                       if (net_must_use_backlog(cpu)) {
                                /* Will be handled, remove from list */
                                skb_list_del_init(skb);
                                enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
Caleb Sander Oct. 30, 2024, 8:26 p.m. UTC | #4
On Wed, Oct 30, 2024 at 5:55 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Oct 29, 2024 at 9:38 PM Caleb Sander <csander@purestorage.com> wrote:
> >
> > On Tue, Oct 29, 2024 at 12:02 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Tue, Oct 29, 2024 at 7:27 PM Caleb Sander Mateos
> > > <csander@purestorage.com> wrote:
> > > >
> > > > If RPS is enabled, all packets with a CPU flow hint are enqueued to the
> > > > target CPU's input_pkt_queue and process_backlog() is scheduled on that
> > > > CPU to dequeue and process the packets. If ARFS has already steered the
> > > > packets to the correct CPU, this additional queuing is unnecessary and
> > > > the spinlocks involved incur significant CPU overhead.
> > > >
> > > > In netif_receive_skb_internal() and netif_receive_skb_list_internal(),
> > > > check if the CPU flow hint get_rps_cpu() returns is the current CPU. If
> > > > so, bypass input_pkt_queue and immediately process the packet(s) on the
> > > > current CPU.
> > > >
> > > > Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> > >
> > > Current implementation was a conscious choice. This has been discussed
> > > several times.
> > >
> > > By processing packets inline, you are actually increasing latencies of
> > > packets queued to other cpus.
> >
> > Sorry, I wasn't aware of these prior discussions. I take it you are
> > referring to threads like
> > https://lore.kernel.org/netdev/20230322072142.32751-1-xu.xin16@zte.com.cn/T/
> > ? I see what you mean about the latency penalty for packets that do
> > require cross-CPU steering.
> >
> > Do you have an alternate suggestion for how to avoid the overhead of
> > acquiring a spinlock for every packet? The atomic instruction in
> > rps_lock_irq_disable() called from process_backlog() is consuming 5%
> > of our CPU time. For our use case, we don't really want software RPS;
> > we are expecting ARFS to steer all high-bandwidth traffic to the
> > desired CPUs. We would happily turn off software RPS entirely if we
> > could, which seems like it would avoid the concerns about higher
> > latency for packets that need to be steering to a different CPU. But
> > my understanding is that using ARFS requires RPS to be enabled
> > (rps_sock_flow_entries set globally and rps_flow_cnt set on each
> > queue), which enables these rps_needed static branches. Is that
> > correct? If so, would you be open to adding a sysctl that disables
> > software RPS and relies upon ARFS to do the packet steering?
>
> A sysctl will not avoid the fundamental issue.

Sorry if my suggestion was unclear. I mean that we would ideally like
to use only hardware ARFS for packet steering, and disable software
RPS.
In our testing, ARFS reliably steers packets to the desired CPUs. (Our
application has long-lived TCP sockets, each processed on a single
thread affinitized to one of the interrupt CPUs for the Ethernet
device.) On the off chance that ARFS doesn't steer the packet to the
correct CPU, we would rather just process it on the CPU that receives
it instead of going through the RPS queues. If software RPS is never
used, then there wouldn't be any concerns about higher latency for
RPS-steered vs. non-RPS-steered packets, right? The get_rps_cpu()
computation is also not cheap, so it would be nice to skip it too.
Basically, we want to program ARFS but skip these
static_branch_unlikely(&rps_needed) branches. But I'm not aware of a
way to do that currently. (Please let me know if it's already possible
to do that.)

> Why not instead address the past feedback ?
> Can you test the following ?

Sure, I will test the performance on our setup with this patch. Still,
we would prefer to skip the get_rps_cpu() computation and these extra
checks, and just process the packets on the CPUs they arrive on.

Thanks for your help,
Caleb
Caleb Sander Nov. 4, 2024, 6:42 p.m. UTC | #5
On Wed, Oct 30, 2024 at 1:26 PM Caleb Sander <csander@purestorage.com> wrote:
>
> On Wed, Oct 30, 2024 at 5:55 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Tue, Oct 29, 2024 at 9:38 PM Caleb Sander <csander@purestorage.com> wrote:
> > >
> > > On Tue, Oct 29, 2024 at 12:02 PM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Tue, Oct 29, 2024 at 7:27 PM Caleb Sander Mateos
> > > > <csander@purestorage.com> wrote:
> > > > >
> > > > > If RPS is enabled, all packets with a CPU flow hint are enqueued to the
> > > > > target CPU's input_pkt_queue and process_backlog() is scheduled on that
> > > > > CPU to dequeue and process the packets. If ARFS has already steered the
> > > > > packets to the correct CPU, this additional queuing is unnecessary and
> > > > > the spinlocks involved incur significant CPU overhead.
> > > > >
> > > > > In netif_receive_skb_internal() and netif_receive_skb_list_internal(),
> > > > > check if the CPU flow hint get_rps_cpu() returns is the current CPU. If
> > > > > so, bypass input_pkt_queue and immediately process the packet(s) on the
> > > > > current CPU.
> > > > >
> > > > > Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> > > >
> > > > Current implementation was a conscious choice. This has been discussed
> > > > several times.
> > > >
> > > > By processing packets inline, you are actually increasing latencies of
> > > > packets queued to other cpus.
> > >
> > > Sorry, I wasn't aware of these prior discussions. I take it you are
> > > referring to threads like
> > > https://lore.kernel.org/netdev/20230322072142.32751-1-xu.xin16@zte.com.cn/T/
> > > ? I see what you mean about the latency penalty for packets that do
> > > require cross-CPU steering.
> > >
> > > Do you have an alternate suggestion for how to avoid the overhead of
> > > acquiring a spinlock for every packet? The atomic instruction in
> > > rps_lock_irq_disable() called from process_backlog() is consuming 5%
> > > of our CPU time. For our use case, we don't really want software RPS;
> > > we are expecting ARFS to steer all high-bandwidth traffic to the
> > > desired CPUs. We would happily turn off software RPS entirely if we
> > > could, which seems like it would avoid the concerns about higher
> > > latency for packets that need to be steering to a different CPU. But
> > > my understanding is that using ARFS requires RPS to be enabled
> > > (rps_sock_flow_entries set globally and rps_flow_cnt set on each
> > > queue), which enables these rps_needed static branches. Is that
> > > correct? If so, would you be open to adding a sysctl that disables
> > > software RPS and relies upon ARFS to do the packet steering?
> >
> > A sysctl will not avoid the fundamental issue.
>
> Sorry if my suggestion was unclear. I mean that we would ideally like
> to use only hardware ARFS for packet steering, and disable software
> RPS.
> In our testing, ARFS reliably steers packets to the desired CPUs. (Our
> application has long-lived TCP sockets, each processed on a single
> thread affinitized to one of the interrupt CPUs for the Ethernet
> device.) In the off chance that ARFS doesn't steer the packet to the
> correct CPU, we would rather just process it on the CPU that receives
> it instead of going through the RPS queues. If software RPS is never
> used, then there wouldn't be any concerns about higher latency for
> RPS-steered vs. non-RPS-steered packets, right? The get_rps_cpu()
> computation is also not cheap, so it would be nice to skip it too.
> Basically, we want to program ARFS but skip these
> static_branch_unlikely(&rps_needed) branches. But I'm not aware of a
> way to do that currently. (Please let me know if it's already possible
> to do that.)

I see now that get_rps_cpu() still needs to be called even if software
RPS isn't going to be used, because it's also responsible for calling
set_rps_cpu() to program ARFS. So it looks like avoiding the calls to
get_rps_cpu() entirely is not possible.

> > Why not instead address the past feedback ?
> > Can you test the following ?
>
> Sure, I will test the performance on our setup with this patch. Still,
> we would prefer to skip the get_rps_cpu() computation and these extra
> checks, and just process the packets on the CPUs they arrive on.

Hi Eric,
I tried out your patch and it seems to work equally well at
eliminating the process_backlog() -> _raw_spin_lock_irq() hotspot. The
added checks contribute a bit of overhead in
netif_receive_skb_list_internal(): it has increased from 0.15% to
0.28% of CPU time. But that's definitely worth it to remove the 5%
hotspot acquiring the RPS spinlock.

Feel free to add:
Tested-by: Caleb Sander Mateos <csander@purestorage.com>

Some minor comments on the patch:
- Probably should be using either skb_queue_len_lockless() or
skb_queue_empty_lockless() to check input_pkt_queue.qlen != 0 because
the RPS lock is not held at this point
- Could consolidate the 2 this_cpu_read(softnet_data...) lookups into
a single this_cpu_ptr(&softnet_data) call

Thanks,
Caleb
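
For concreteness, Caleb's two suggestions combined would look roughly like this (an untested sketch, not a posted patch; Eric's reply below explains why he prefers the this_cpu_read() form):

	struct softnet_data *sd = this_cpu_ptr(&softnet_data);

	/* lockless checks of the deferred-IPI list and the backlog queue */
	return sd->rps_ipi_list != NULL ||
	       !skb_queue_empty_lockless(&sd->input_pkt_queue);
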
Eric Dumazet Nov. 4, 2024, 6:58 p.m. UTC | #6
On Mon, Nov 4, 2024 at 7:42 PM Caleb Sander <csander@purestorage.com> wrote:
>
> On Wed, Oct 30, 2024 at 1:26 PM Caleb Sander <csander@purestorage.com> wrote:
> >
> > On Wed, Oct 30, 2024 at 5:55 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Tue, Oct 29, 2024 at 9:38 PM Caleb Sander <csander@purestorage.com> wrote:
> > > >
> > > > On Tue, Oct 29, 2024 at 12:02 PM Eric Dumazet <edumazet@google.com> wrote:
> > > > >
> > > > > On Tue, Oct 29, 2024 at 7:27 PM Caleb Sander Mateos
> > > > > <csander@purestorage.com> wrote:
> > > > > >
> > > > > > If RPS is enabled, all packets with a CPU flow hint are enqueued to the
> > > > > > target CPU's input_pkt_queue and process_backlog() is scheduled on that
> > > > > > CPU to dequeue and process the packets. If ARFS has already steered the
> > > > > > packets to the correct CPU, this additional queuing is unnecessary and
> > > > > > the spinlocks involved incur significant CPU overhead.
> > > > > >
> > > > > > In netif_receive_skb_internal() and netif_receive_skb_list_internal(),
> > > > > > check if the CPU flow hint get_rps_cpu() returns is the current CPU. If
> > > > > > so, bypass input_pkt_queue and immediately process the packet(s) on the
> > > > > > current CPU.
> > > > > >
> > > > > > Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> > > > >
> > > > > Current implementation was a conscious choice. This has been discussed
> > > > > several times.
> > > > >
> > > > > By processing packets inline, you are actually increasing latencies of
> > > > > packets queued to other cpus.
> > > >
> > > > Sorry, I wasn't aware of these prior discussions. I take it you are
> > > > referring to threads like
> > > > https://lore.kernel.org/netdev/20230322072142.32751-1-xu.xin16@zte.com.cn/T/
> > > > ? I see what you mean about the latency penalty for packets that do
> > > > require cross-CPU steering.
> > > >
> > > > Do you have an alternate suggestion for how to avoid the overhead of
> > > > acquiring a spinlock for every packet? The atomic instruction in
> > > > rps_lock_irq_disable() called from process_backlog() is consuming 5%
> > > > of our CPU time. For our use case, we don't really want software RPS;
> > > > we are expecting ARFS to steer all high-bandwidth traffic to the
> > > > desired CPUs. We would happily turn off software RPS entirely if we
> > > > could, which seems like it would avoid the concerns about higher
> > > > latency for packets that need to be steering to a different CPU. But
> > > > my understanding is that using ARFS requires RPS to be enabled
> > > > (rps_sock_flow_entries set globally and rps_flow_cnt set on each
> > > > queue), which enables these rps_needed static branches. Is that
> > > > correct? If so, would you be open to adding a sysctl that disables
> > > > software RPS and relies upon ARFS to do the packet steering?
> > >
> > > A sysctl will not avoid the fundamental issue.
> >
> > Sorry if my suggestion was unclear. I mean that we would ideally like
> > to use only hardware ARFS for packet steering, and disable software
> > RPS.
> > In our testing, ARFS reliably steers packets to the desired CPUs. (Our
> > application has long-lived TCP sockets, each processed on a single
> > thread affinitized to one of the interrupt CPUs for the Ethernet
> > device.) In the off chance that ARFS doesn't steer the packet to the
> > correct CPU, we would rather just process it on the CPU that receives
> > it instead of going through the RPS queues. If software RPS is never
> > used, then there wouldn't be any concerns about higher latency for
> > RPS-steered vs. non-RPS-steered packets, right? The get_rps_cpu()
> > computation is also not cheap, so it would be nice to skip it too.
> > Basically, we want to program ARFS but skip these
> > static_branch_unlikely(&rps_needed) branches. But I'm not aware of a
> > way to do that currently. (Please let me know if it's already possible
> > to do that.)
>
> I see now that get_rps_cpu() still needs to be called even if software
> RPS isn't going to be used, because it's also responsible for calling
> set_rps_cpu() to program ARFS. So looks like avoiding the calls to
> get_rps_cpu() entirely is not possible.

Yes, this is the reason. Things would be different if ARFS was done at
dev_queue_xmit() time.

>
> > > Why not instead address the past feedback ?
> > > Can you test the following ?
> >
> > Sure, I will test the performance on our setup with this patch. Still,
> > we would prefer to skip the get_rps_cpu() computation and these extra
> > checks, and just process the packets on the CPUs they arrive on.
>
> Hi Eric,
> I tried out your patch and it seems to work equally well at
> eliminating the process_backlog() -> _raw_spin_lock_irq() hotspot. The
> added checks contribute a bit of overhead in
> netif_receive_skb_list_internal(): it has increased from 0.15% to
> 0.28% of CPU time. But that's definitely worth it to remove the 5%
> hotspot acquiring the RPS spinlock.
>
> Feel free to add:
> Tested-by: Caleb Sander Mateos <csander@purestorage.com>
>
> Some minor comments on the patch:
> - Probably should be using either skb_queue_len_lockless() or
> skb_queue_empty_lockless() to check input_pkt_queue.qlen != 0 because
> the RPS lock is not held at this point

this_cpu_read() has implicit READ_ONCE() semantic.

> - Could consolidate the 2 this_cpu_read(softnet_data...) lookups into
> a single this_cpu_ptr(&softnet_data) call

Double check the generated assembly first :)

this_cpu_read() is faster than going through
this_cpu_ptr(&softnet_data), at least on x86,
thanks to %gs: prefix.

No temporary register is needed to compute and hold this_cpu_ptr(&softnet_data).
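
Illustrating the point schematically (a kernel-context sketch; the codegen notes describe typical x86-64 output and are not guaranteed):

	/* Form 1: this_cpu_read() per field. Each read compiles to a single
	 * %gs-prefixed load; no per-cpu pointer has to be computed or held
	 * in a temporary register.
	 */
	bool use_backlog = this_cpu_read(softnet_data.rps_ipi_list) != NULL ||
			   this_cpu_read(softnet_data.input_pkt_queue.qlen) != 0;

	/* Form 2: this_cpu_ptr() once, then plain dereferences. The per-cpu
	 * address (&softnet_data plus this CPU's offset) is first computed
	 * into a register, which is the extra cost Eric refers to.
	 */
	struct softnet_data *sd = this_cpu_ptr(&softnet_data);
	bool use_backlog2 = sd->rps_ipi_list != NULL ||
			    sd->input_pkt_queue.qlen != 0;
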
Caleb Sander Nov. 4, 2024, 7:18 p.m. UTC | #7
On Mon, Nov 4, 2024 at 10:58 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Nov 4, 2024 at 7:42 PM Caleb Sander <csander@purestorage.com> wrote:
> >
> > On Wed, Oct 30, 2024 at 1:26 PM Caleb Sander <csander@purestorage.com> wrote:
> > >
> > > On Wed, Oct 30, 2024 at 5:55 AM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Tue, Oct 29, 2024 at 9:38 PM Caleb Sander <csander@purestorage.com> wrote:
> > > > >
> > > > > On Tue, Oct 29, 2024 at 12:02 PM Eric Dumazet <edumazet@google.com> wrote:
> > > > > >
> > > > > > On Tue, Oct 29, 2024 at 7:27 PM Caleb Sander Mateos
> > > > > > <csander@purestorage.com> wrote:
> > > > > > >
> > > > > > > If RPS is enabled, all packets with a CPU flow hint are enqueued to the
> > > > > > > target CPU's input_pkt_queue and process_backlog() is scheduled on that
> > > > > > > CPU to dequeue and process the packets. If ARFS has already steered the
> > > > > > > packets to the correct CPU, this additional queuing is unnecessary and
> > > > > > > the spinlocks involved incur significant CPU overhead.
> > > > > > >
> > > > > > > In netif_receive_skb_internal() and netif_receive_skb_list_internal(),
> > > > > > > check if the CPU flow hint get_rps_cpu() returns is the current CPU. If
> > > > > > > so, bypass input_pkt_queue and immediately process the packet(s) on the
> > > > > > > current CPU.
> > > > > > >
> > > > > > > Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> > > > > >
> > > > > > Current implementation was a conscious choice. This has been discussed
> > > > > > several times.
> > > > > >
> > > > > > By processing packets inline, you are actually increasing latencies of
> > > > > > packets queued to other cpus.
> > > > >
> > > > > Sorry, I wasn't aware of these prior discussions. I take it you are
> > > > > referring to threads like
> > > > > https://lore.kernel.org/netdev/20230322072142.32751-1-xu.xin16@zte.com.cn/T/
> > > > > ? I see what you mean about the latency penalty for packets that do
> > > > > require cross-CPU steering.
> > > > >
> > > > > Do you have an alternate suggestion for how to avoid the overhead of
> > > > > acquiring a spinlock for every packet? The atomic instruction in
> > > > > rps_lock_irq_disable() called from process_backlog() is consuming 5%
> > > > > of our CPU time. For our use case, we don't really want software RPS;
> > > > > we are expecting ARFS to steer all high-bandwidth traffic to the
> > > > > desired CPUs. We would happily turn off software RPS entirely if we
> > > > > could, which seems like it would avoid the concerns about higher
> > > > > latency for packets that need to be steering to a different CPU. But
> > > > > my understanding is that using ARFS requires RPS to be enabled
> > > > > (rps_sock_flow_entries set globally and rps_flow_cnt set on each
> > > > > queue), which enables these rps_needed static branches. Is that
> > > > > correct? If so, would you be open to adding a sysctl that disables
> > > > > software RPS and relies upon ARFS to do the packet steering?
> > > >
> > > > A sysctl will not avoid the fundamental issue.
> > >
> > > Sorry if my suggestion was unclear. I mean that we would ideally like
> > > to use only hardware ARFS for packet steering, and disable software
> > > RPS.
> > > In our testing, ARFS reliably steers packets to the desired CPUs. (Our
> > > application has long-lived TCP sockets, each processed on a single
> > > thread affinitized to one of the interrupt CPUs for the Ethernet
> > > device.) In the off chance that ARFS doesn't steer the packet to the
> > > correct CPU, we would rather just process it on the CPU that receives
> > > it instead of going through the RPS queues. If software RPS is never
> > > used, then there wouldn't be any concerns about higher latency for
> > > RPS-steered vs. non-RPS-steered packets, right? The get_rps_cpu()
> > > computation is also not cheap, so it would be nice to skip it too.
> > > Basically, we want to program ARFS but skip these
> > > static_branch_unlikely(&rps_needed) branches. But I'm not aware of a
> > > way to do that currently. (Please let me know if it's already possible
> > > to do that.)
> >
> > I see now that get_rps_cpu() still needs to be called even if software
> > RPS isn't going to be used, because it's also responsible for calling
> > set_rps_cpu() to program ARFS. So looks like avoiding the calls to
> > get_rps_cpu() entirely is not possible.
>
> Yes, this is the reason. Things would be different if ARFS was done at
> dev_queue_xmit() time.
>
> >
> > > > Why not instead address the past feedback ?
> > > > Can you test the following ?
> > >
> > > Sure, I will test the performance on our setup with this patch. Still,
> > > we would prefer to skip the get_rps_cpu() computation and these extra
> > > checks, and just process the packets on the CPUs they arrive on.
> >
> > Hi Eric,
> > I tried out your patch and it seems to work equally well at
> > eliminating the process_backlog() -> _raw_spin_lock_irq() hotspot. The
> > added checks contribute a bit of overhead in
> > netif_receive_skb_list_internal(): it has increased from 0.15% to
> > 0.28% of CPU time. But that's definitely worth it to remove the 5%
> > hotspot acquiring the RPS spinlock.
> >
> > Feel free to add:
> > Tested-by: Caleb Sander Mateos <csander@purestorage.com>
> >
> > Some minor comments on the patch:
> > - Probably should be using either skb_queue_len_lockless() or
> > skb_queue_empty_lockless() to check input_pkt_queue.qlen != 0 because
> > the RPS lock is not held at this point
>
> this_cpu_read() has implicit READ_ONCE() semantic.

Okay. Seems like it would be nicer not to reach into the sk_buff_head
internals, but up to you.

>
> > - Could consolidate the 2 this_cpu_read(softnet_data...) lookups into
> > a single this_cpu_ptr(&softnet_data) call
>
> Double check the generated assembly first :)
>
> this_cpu_read() is faster than going through
> this_cpu_ptr(&softnet_data), at least on x86,
> thanks to %gs: prefix.
>
> No temporary register is needed to compute and hold this_cpu_ptr(&softnet_data).

Got it, thanks.
Caleb Sander Nov. 14, 2024, 9:35 p.m. UTC | #8
On Wed, Oct 30, 2024 at 5:55 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Oct 29, 2024 at 9:38 PM Caleb Sander <csander@purestorage.com> wrote:
> >
> > On Tue, Oct 29, 2024 at 12:02 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Tue, Oct 29, 2024 at 7:27 PM Caleb Sander Mateos
> > > <csander@purestorage.com> wrote:
> > > >
> > > > If RPS is enabled, all packets with a CPU flow hint are enqueued to the
> > > > target CPU's input_pkt_queue and process_backlog() is scheduled on that
> > > > CPU to dequeue and process the packets. If ARFS has already steered the
> > > > packets to the correct CPU, this additional queuing is unnecessary and
> > > > the spinlocks involved incur significant CPU overhead.
> > > >
> > > > In netif_receive_skb_internal() and netif_receive_skb_list_internal(),
> > > > check if the CPU flow hint get_rps_cpu() returns is the current CPU. If
> > > > so, bypass input_pkt_queue and immediately process the packet(s) on the
> > > > current CPU.
> > > >
> > > > Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> > >
> > > Current implementation was a conscious choice. This has been discussed
> > > several times.
> > >
> > > By processing packets inline, you are actually increasing latencies of
> > > packets queued to other cpus.
> >
> > Sorry, I wasn't aware of these prior discussions. I take it you are
> > referring to threads like
> > https://lore.kernel.org/netdev/20230322072142.32751-1-xu.xin16@zte.com.cn/T/
> > ? I see what you mean about the latency penalty for packets that do
> > require cross-CPU steering.
> >
> > Do you have an alternate suggestion for how to avoid the overhead of
> > acquiring a spinlock for every packet? The atomic instruction in
> > rps_lock_irq_disable() called from process_backlog() is consuming 5%
> > of our CPU time. For our use case, we don't really want software RPS;
> > we are expecting ARFS to steer all high-bandwidth traffic to the
> > desired CPUs. We would happily turn off software RPS entirely if we
> > could, which seems like it would avoid the concerns about higher
> > latency for packets that need to be steering to a different CPU. But
> > my understanding is that using ARFS requires RPS to be enabled
> > (rps_sock_flow_entries set globally and rps_flow_cnt set on each
> > queue), which enables these rps_needed static branches. Is that
> > correct? If so, would you be open to adding a sysctl that disables
> > software RPS and relies upon ARFS to do the packet steering?
>
> A sysctl will not avoid the fundamental issue.
> Why not instead address the past feedback ?
> Can you test the following ?
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index c682173a76424d7dadcc8374aa5b11dff44a4b46..7a5a7f1a4b7c3cbd105ecfc076377f25929729eb 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5842,6 +5842,21 @@ static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
>         return ret;
>  }
>
> +#ifdef CONFIG_RPS
> +static bool net_must_use_backlog(int tcpu)
> +{
> +       if (tcpu < 0)
> +               return false;
> +       if (tcpu != smp_processor_id())
> +               return true;
> +       /* target cpu is ourself. We must use our backlog
> +        * if we have deferred IPI or packets.
> +        */
> +       return this_cpu_read(softnet_data.rps_ipi_list) != NULL ||
> +              this_cpu_read(softnet_data.input_pkt_queue.qlen) != 0;
> +}
> +#endif

Hi Eric,
Looking at this patch again, I am wondering why the tcpu < 0 case is
treated differently from the tcpu == smp_processor_id() case. If I
understand correctly, packets without a CPU flow hint are allowed to
be processed immediately on the current CPU. They will leapfrog
packets that other CPUs have queued onto this CPU via RPS and any RPS
IPIs waiting to be issued to other CPUs. I see this is the behavior in
the current code too, but why don't the same concerns about higher
latency for packets steered cross-CPU apply?

Thanks,
Caleb

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index c682173a7642..714a47897c75 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5855,11 +5855,11 @@  static int netif_receive_skb_internal(struct sk_buff *skb)
 #ifdef CONFIG_RPS
 	if (static_branch_unlikely(&rps_needed)) {
 		struct rps_dev_flow voidflow, *rflow = &voidflow;
 		int cpu = get_rps_cpu(skb->dev, skb, &rflow);
 
-		if (cpu >= 0) {
+		if (cpu >= 0 && cpu != smp_processor_id()) {
 			ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
 			rcu_read_unlock();
 			return ret;
 		}
 	}
@@ -5884,15 +5884,17 @@  void netif_receive_skb_list_internal(struct list_head *head)
 	list_splice_init(&sublist, head);
 
 	rcu_read_lock();
 #ifdef CONFIG_RPS
 	if (static_branch_unlikely(&rps_needed)) {
+		int curr_cpu = smp_processor_id();
+
 		list_for_each_entry_safe(skb, next, head, list) {
 			struct rps_dev_flow voidflow, *rflow = &voidflow;
 			int cpu = get_rps_cpu(skb->dev, skb, &rflow);
 
-			if (cpu >= 0) {
+			if (cpu >= 0 && cpu != curr_cpu) {
 				/* Will be handled, remove from list */
 				skb_list_del_init(skb);
 				enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
 			}
 		}