Message ID | 20230324171314.73537-1-nbd@nbd.name (mailing list archive)
State | Superseded
Delegated to: | Netdev Maintainers
Series | [net-next] net/core: add optional threading for backlog processing
On Fri, 24 Mar 2023 18:13:14 +0100 Felix Fietkau wrote:
> When dealing with few flows or an imbalance on CPU utilization, static RPS
> CPU assignment can be too inflexible. Add support for enabling threaded NAPI
> for backlog processing in order to allow the scheduler to better balance
> processing. This helps better spread the load across idle CPUs.

Can you explain the use case a little bit more?
On 24.03.23 18:20, Jakub Kicinski wrote:
> On Fri, 24 Mar 2023 18:13:14 +0100 Felix Fietkau wrote:
>> When dealing with few flows or an imbalance on CPU utilization, static RPS
>> CPU assignment can be too inflexible. Add support for enabling threaded NAPI
>> for backlog processing in order to allow the scheduler to better balance
>> processing. This helps better spread the load across idle CPUs.
>
> Can you explain the use case a little bit more?

I'm primarily testing this on routers with 2 or 4 CPUs and limited
processing power, handling routing/NAT. RPS is typically needed to
properly distribute the load across all available CPUs. When there is
only a small number of flows that are pushing a lot of traffic, a static
RPS assignment often leaves some CPUs idle, whereas others become a
bottleneck by being fully loaded. Threaded NAPI reduces this a bit, but
CPUs can become bottlenecked and fully loaded by a NAPI thread alone.

Making backlog processing threaded helps split up the processing work
even more and distribute it onto remaining idle CPUs.

It can basically be used to make RPS a bit more dynamic and
configurable, because you can assign multiple backlog threads to a set
of CPUs and selectively steer packets from specific devices / rx queues
to them and allow the scheduler to take care of the rest.

- Felix
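[Editor's note: the configuration flow described above can be sketched as a small
userspace program. This is only an illustration assuming the series is applied;
the interface name "eth0", rx queue 0 and the mask 0xe (threads 1-3) are
placeholders, not anything mandated by the patch.]

/* backlog_steer.c - minimal sketch of the setup described above.
 * Assumptions: this series is applied; "eth0", rx queue 0 and the
 * mask 0xe are illustrative placeholders.
 */
#include <stdio.h>
#include <stdlib.h>

static void write_str(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f || fputs(val, f) == EOF) {
                perror(path);
                exit(1);
        }
        fclose(f);
}

int main(void)
{
        /* run the device's NAPI poll in a kthread instead of softirq context */
        write_str("/sys/class/net/eth0/threaded", "1");

        /* proposed sysctl from this patch: one napi/backlog-<n> kthread per CPU */
        write_str("/proc/sys/net/core/backlog_threaded", "1");

        /* steer packets from eth0 rx queue 0 to backlog threads 1-3; with
         * backlog_threaded enabled, these bits select backlog threads
         * rather than fixed RPS target CPUs
         */
        write_str("/sys/class/net/eth0/queues/rx-0/rps_cpus", "e");

        return 0;
}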
On Fri, 24 Mar 2023 18:35:00 +0100 Felix Fietkau wrote:
> I'm primarily testing this on routers with 2 or 4 CPUs and limited
> processing power, handling routing/NAT. RPS is typically needed to
> properly distribute the load across all available CPUs. When there is
> only a small number of flows that are pushing a lot of traffic, a static
> RPS assignment often leaves some CPUs idle, whereas others become a
> bottleneck by being fully loaded. Threaded NAPI reduces this a bit, but
> CPUs can become bottlenecked and fully loaded by a NAPI thread alone.

The NAPI thread becomes a bottleneck with RPS enabled?

> Making backlog processing threaded helps split up the processing work
> even more and distribute it onto remaining idle CPUs.

You'd want to have both threaded NAPI and threaded backlog enabled?

> It can basically be used to make RPS a bit more dynamic and
> configurable, because you can assign multiple backlog threads to a set
> of CPUs and selectively steer packets from specific devices / rx queues

Can you give an example?

With the 4 CPU example, in case 2 queues are very busy - you're trying
to make sure that the RPS does not end up landing on the same CPU as
the other busy queue?

> to them and allow the scheduler to take care of the rest.

You trust the scheduler much more than I do, I think :)
On 24.03.23 18:47, Jakub Kicinski wrote:
> On Fri, 24 Mar 2023 18:35:00 +0100 Felix Fietkau wrote:
>> I'm primarily testing this on routers with 2 or 4 CPUs and limited
>> processing power, handling routing/NAT. RPS is typically needed to
>> properly distribute the load across all available CPUs. When there is
>> only a small number of flows that are pushing a lot of traffic, a static
>> RPS assignment often leaves some CPUs idle, whereas others become a
>> bottleneck by being fully loaded. Threaded NAPI reduces this a bit, but
>> CPUs can become bottlenecked and fully loaded by a NAPI thread alone.
>
> The NAPI thread becomes a bottleneck with RPS enabled?

The devices that I work with often only have a single rx queue. That can
easily become a bottleneck.

>> Making backlog processing threaded helps split up the processing work
>> even more and distribute it onto remaining idle CPUs.
>
> You'd want to have both threaded NAPI and threaded backlog enabled?

Yes

>> It can basically be used to make RPS a bit more dynamic and
>> configurable, because you can assign multiple backlog threads to a set
>> of CPUs and selectively steer packets from specific devices / rx queues
>
> Can you give an example?
>
> With the 4 CPU example, in case 2 queues are very busy - you're trying
> to make sure that the RPS does not end up landing on the same CPU as
> the other busy queue?

In this part I'm thinking about bigger systems where you want to have a
group of CPUs dedicated to dealing with network traffic without
assigning a fixed function (e.g. NAPI processing or RPS target) to each
one, allowing for more dynamic processing.

>> to them and allow the scheduler to take care of the rest.
>
> You trust the scheduler much more than I do, I think :)

In my tests it brings down latency (both avg and p99) considerably in
some cases. I posted some numbers here:
https://lore.kernel.org/netdev/e317d5bc-cc26-8b1b-ca4b-66b5328683c4@nbd.name/

- Felix
On Fri, 24 Mar 2023 18:57:03 +0100 Felix Fietkau wrote:
> >> It can basically be used to make RPS a bit more dynamic and
> >> configurable, because you can assign multiple backlog threads to a set
> >> of CPUs and selectively steer packets from specific devices / rx queues
> >
> > Can you give an example?
> >
> > With the 4 CPU example, in case 2 queues are very busy - you're trying
> > to make sure that the RPS does not end up landing on the same CPU as
> > the other busy queue?
>
> In this part I'm thinking about bigger systems where you want to have a
> group of CPUs dedicated to dealing with network traffic without
> assigning a fixed function (e.g. NAPI processing or RPS target) to each
> one, allowing for more dynamic processing.

I tried the threaded NAPI on larger systems and helped others try,
and so far it's not been beneficial :( Even the load balancing
improvements are not significant enough to use it, and there
is a large risk of scheduler making the wrong decision.

Hence my questioning - I'm trying to understand what you're doing
differently.

> >> to them and allow the scheduler to take care of the rest.
> >
> > You trust the scheduler much more than I do, I think :)
>
> In my tests it brings down latency (both avg and p99) considerably in
> some cases. I posted some numbers here:
> https://lore.kernel.org/netdev/e317d5bc-cc26-8b1b-ca4b-66b5328683c4@nbd.name/

Could you provide the full configuration for this test?
In non-threaded mode the RPS is enabled to spread over remaining
3 cores?
On 25.03.23 04:19, Jakub Kicinski wrote:
> On Fri, 24 Mar 2023 18:57:03 +0100 Felix Fietkau wrote:
>> >> It can basically be used to make RPS a bit more dynamic and
>> >> configurable, because you can assign multiple backlog threads to a set
>> >> of CPUs and selectively steer packets from specific devices / rx queues
>> >
>> > Can you give an example?
>> >
>> > With the 4 CPU example, in case 2 queues are very busy - you're trying
>> > to make sure that the RPS does not end up landing on the same CPU as
>> > the other busy queue?
>>
>> In this part I'm thinking about bigger systems where you want to have a
>> group of CPUs dedicated to dealing with network traffic without
>> assigning a fixed function (e.g. NAPI processing or RPS target) to each
>> one, allowing for more dynamic processing.
>
> I tried the threaded NAPI on larger systems and helped others try,
> and so far it's not been beneficial :( Even the load balancing
> improvements are not significant enough to use it, and there
> is a large risk of scheduler making the wrong decision.
>
> Hence my questioning - I'm trying to understand what you're doing
> differently.

I didn't actually run any tests on bigger systems myself, so I don't
know how to tune it for those.

>> >> to them and allow the scheduler to take care of the rest.
>> >
>> > You trust the scheduler much more than I do, I think :)
>>
>> In my tests it brings down latency (both avg and p99) considerably in
>> some cases. I posted some numbers here:
>> https://lore.kernel.org/netdev/e317d5bc-cc26-8b1b-ca4b-66b5328683c4@nbd.name/
>
> Could you provide the full configuration for this test?
> In non-threaded mode the RPS is enabled to spread over remaining
> 3 cores?

In this test I'm using threaded NAPI and backlog_threaded without any
fixed core assignment.

- Felix
On Sat, 25 Mar 2023 06:42:43 +0100 Felix Fietkau wrote:
> >> In my tests it brings down latency (both avg and p99) considerably in
> >> some cases. I posted some numbers here:
> >> https://lore.kernel.org/netdev/e317d5bc-cc26-8b1b-ca4b-66b5328683c4@nbd.name/
> >
> > Could you provide the full configuration for this test?
> > In non-threaded mode the RPS is enabled to spread over remaining
> > 3 cores?
>
> In this test I'm using threaded NAPI and backlog_threaded without any
> fixed core assignment.

I was asking about the rps_threaded=0 side of the comparison.
So you're saying on that side you were using threaded NAPI with
no pinning and RPS across all cores?
On Fri, 2023-03-24 at 18:57 +0100, Felix Fietkau wrote:
> On 24.03.23 18:47, Jakub Kicinski wrote:
> > On Fri, 24 Mar 2023 18:35:00 +0100 Felix Fietkau wrote:
> > > I'm primarily testing this on routers with 2 or 4 CPUs and limited
> > > processing power, handling routing/NAT. RPS is typically needed to
> > > properly distribute the load across all available CPUs. When there is
> > > only a small number of flows that are pushing a lot of traffic, a static
> > > RPS assignment often leaves some CPUs idle, whereas others become a
> > > bottleneck by being fully loaded. Threaded NAPI reduces this a bit, but
> > > CPUs can become bottlenecked and fully loaded by a NAPI thread alone.
> >
> > The NAPI thread becomes a bottleneck with RPS enabled?
>
> The devices that I work with often only have a single rx queue. That can
> easily become a bottleneck.
>
> > > Making backlog processing threaded helps split up the processing work
> > > even more and distribute it onto remaining idle CPUs.
> >
> > You'd want to have both threaded NAPI and threaded backlog enabled?
>
> Yes
>
> > > It can basically be used to make RPS a bit more dynamic and
> > > configurable, because you can assign multiple backlog threads to a set
> > > of CPUs and selectively steer packets from specific devices / rx queues
> >
> > Can you give an example?
> >
> > With the 4 CPU example, in case 2 queues are very busy - you're trying
> > to make sure that the RPS does not end up landing on the same CPU as
> > the other busy queue?
>
> In this part I'm thinking about bigger systems where you want to have a
> group of CPUs dedicated to dealing with network traffic without
> assigning a fixed function (e.g. NAPI processing or RPS target) to each
> one, allowing for more dynamic processing.
>
> > > to them and allow the scheduler to take care of the rest.
> >
> > You trust the scheduler much more than I do, I think :)
>
> In my tests it brings down latency (both avg and p99) considerably in
> some cases. I posted some numbers here:
> https://lore.kernel.org/netdev/e317d5bc-cc26-8b1b-ca4b-66b5328683c4@nbd.name/

It's still not 110% clear to me why/how this additional thread could
reduce latency. What/which threads are competing for the busy CPU[s]? I
suspect it could be easier/cleaner move away the others (non RPS)
threads.

Cheers,

Paolo
On 28.03.23 11:29, Paolo Abeni wrote:
> On Fri, 2023-03-24 at 18:57 +0100, Felix Fietkau wrote:
>> On 24.03.23 18:47, Jakub Kicinski wrote:
>> > On Fri, 24 Mar 2023 18:35:00 +0100 Felix Fietkau wrote:
>> > > I'm primarily testing this on routers with 2 or 4 CPUs and limited
>> > > processing power, handling routing/NAT. RPS is typically needed to
>> > > properly distribute the load across all available CPUs. When there is
>> > > only a small number of flows that are pushing a lot of traffic, a static
>> > > RPS assignment often leaves some CPUs idle, whereas others become a
>> > > bottleneck by being fully loaded. Threaded NAPI reduces this a bit, but
>> > > CPUs can become bottlenecked and fully loaded by a NAPI thread alone.
>> >
>> > The NAPI thread becomes a bottleneck with RPS enabled?
>>
>> The devices that I work with often only have a single rx queue. That can
>> easily become a bottleneck.
>>
>> > > Making backlog processing threaded helps split up the processing work
>> > > even more and distribute it onto remaining idle CPUs.
>> >
>> > You'd want to have both threaded NAPI and threaded backlog enabled?
>>
>> Yes
>>
>> > > It can basically be used to make RPS a bit more dynamic and
>> > > configurable, because you can assign multiple backlog threads to a set
>> > > of CPUs and selectively steer packets from specific devices / rx queues
>> >
>> > Can you give an example?
>> >
>> > With the 4 CPU example, in case 2 queues are very busy - you're trying
>> > to make sure that the RPS does not end up landing on the same CPU as
>> > the other busy queue?
>>
>> In this part I'm thinking about bigger systems where you want to have a
>> group of CPUs dedicated to dealing with network traffic without
>> assigning a fixed function (e.g. NAPI processing or RPS target) to each
>> one, allowing for more dynamic processing.
>>
>> > > to them and allow the scheduler to take care of the rest.
>> >
>> > You trust the scheduler much more than I do, I think :)
>>
>> In my tests it brings down latency (both avg and p99) considerably in
>> some cases. I posted some numbers here:
>> https://lore.kernel.org/netdev/e317d5bc-cc26-8b1b-ca4b-66b5328683c4@nbd.name/
>
> It's still not 110% clear to me why/how this additional thread could
> reduce latency. What/which threads are competing for the busy CPU[s]? I
> suspect it could be easier/cleaner move away the others (non RPS)
> threads.

In the tests that I'm doing, network processing load from routing/NAT is
enough to occupy all available CPUs.
If I dedicate the NAPI thread to one core and use RPS to steer packet
processing to the other cores, the core taking care of NAPI has some
idle cycles that go to waste, while the other cores are busy.
If I include the core in the RPS mask, it can take too much away from
the NAPI thread.
Also, with a smaller number of flows and depending on the hash
distribution of those flows, there can be an imbalance between how much
processing each RPS instance gets, making this even more difficult to
get right.
With the extra threading it works out better in my tests, because the
scheduler can deal with these kinds of imbalances.

- Felix
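[Editor's note: the affinity tuning mentioned here (and in the scaling.rst text
added later in the patch) boils down to sched_setaffinity() on the backlog
kthreads. A minimal sketch, assuming the thread PID has already been looked up
(e.g. by matching the napi/backlog-<n> comm name under /proc) and is passed on
the command line:]

/* pin_backlog.c - restrict one napi/backlog-<n> kthread to given CPUs.
 * Usage: pin_backlog <pid> <cpu> [<cpu>...]
 * The PID lookup is left out; this only shows the affinity call.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
        cpu_set_t set;
        int i;

        if (argc < 3) {
                fprintf(stderr, "usage: %s <pid> <cpu>...\n", argv[0]);
                return 1;
        }

        CPU_ZERO(&set);
        for (i = 2; i < argc; i++)
                CPU_SET(atoi(argv[i]), &set);   /* allow only these CPUs */

        if (sched_setaffinity(atoi(argv[1]), sizeof(set), &set)) {
                perror("sched_setaffinity");
                return 1;
        }

        return 0;
}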
On 28.03.23 04:06, Jakub Kicinski wrote:
> On Sat, 25 Mar 2023 06:42:43 +0100 Felix Fietkau wrote:
>> >> In my tests it brings down latency (both avg and p99) considerably in
>> >> some cases. I posted some numbers here:
>> >> https://lore.kernel.org/netdev/e317d5bc-cc26-8b1b-ca4b-66b5328683c4@nbd.name/
>> >
>> > Could you provide the full configuration for this test?
>> > In non-threaded mode the RPS is enabled to spread over remaining
>> > 3 cores?
>>
>> In this test I'm using threaded NAPI and backlog_threaded without any
>> fixed core assignment.
>
> I was asking about the rps_threaded=0 side of the comparison.
> So you're saying on that side you were using threaded NAPI with
> no pinning and RPS across all cores?

Yes.

- Felix
On Tue, 2023-03-28 at 11:45 +0200, Felix Fietkau wrote:
> On 28.03.23 11:29, Paolo Abeni wrote:
> > On Fri, 2023-03-24 at 18:57 +0100, Felix Fietkau wrote:
> > > On 24.03.23 18:47, Jakub Kicinski wrote:
> > > > On Fri, 24 Mar 2023 18:35:00 +0100 Felix Fietkau wrote:
> > > > > I'm primarily testing this on routers with 2 or 4 CPUs and limited
> > > > > processing power, handling routing/NAT. RPS is typically needed to
> > > > > properly distribute the load across all available CPUs. When there is
> > > > > only a small number of flows that are pushing a lot of traffic, a static
> > > > > RPS assignment often leaves some CPUs idle, whereas others become a
> > > > > bottleneck by being fully loaded. Threaded NAPI reduces this a bit, but
> > > > > CPUs can become bottlenecked and fully loaded by a NAPI thread alone.
> > > >
> > > > The NAPI thread becomes a bottleneck with RPS enabled?
> > >
> > > The devices that I work with often only have a single rx queue. That can
> > > easily become a bottleneck.
> > >
> > > > > Making backlog processing threaded helps split up the processing work
> > > > > even more and distribute it onto remaining idle CPUs.
> > > >
> > > > You'd want to have both threaded NAPI and threaded backlog enabled?
> > >
> > > Yes
> > >
> > > > > It can basically be used to make RPS a bit more dynamic and
> > > > > configurable, because you can assign multiple backlog threads to a set
> > > > > of CPUs and selectively steer packets from specific devices / rx queues
> > > >
> > > > Can you give an example?
> > > >
> > > > With the 4 CPU example, in case 2 queues are very busy - you're trying
> > > > to make sure that the RPS does not end up landing on the same CPU as
> > > > the other busy queue?
> > >
> > > In this part I'm thinking about bigger systems where you want to have a
> > > group of CPUs dedicated to dealing with network traffic without
> > > assigning a fixed function (e.g. NAPI processing or RPS target) to each
> > > one, allowing for more dynamic processing.
> > >
> > > > > to them and allow the scheduler to take care of the rest.
> > > >
> > > > You trust the scheduler much more than I do, I think :)
> > >
> > > In my tests it brings down latency (both avg and p99) considerably in
> > > some cases. I posted some numbers here:
> > > https://lore.kernel.org/netdev/e317d5bc-cc26-8b1b-ca4b-66b5328683c4@nbd.name/
> >
> > It's still not 110% clear to me why/how this additional thread could
> > reduce latency. What/which threads are competing for the busy CPU[s]? I
> > suspect it could be easier/cleaner move away the others (non RPS)
> > threads.
> In the tests that I'm doing, network processing load from routing/NAT is
> enough to occupy all available CPUs.
> If I dedicate the NAPI thread to one core and use RPS to steer packet
> processing to the other cores, the core taking care of NAPI has some
> idle cycles that go to waste, while the other cores are busy.
> If I include the core in the RPS mask, it can take too much away from
> the NAPI thread.

I feel like I'm missing some relevant points.

If RPS keeps the target CPU fully busy, moving RPS processing in a
separate thread still will not allow using more CPU time.

Which NIC driver are you using?

thanks!

Paolo
On 28.03.23 17:13, Paolo Abeni wrote:
> On Tue, 2023-03-28 at 11:45 +0200, Felix Fietkau wrote:
>> On 28.03.23 11:29, Paolo Abeni wrote:
>> > On Fri, 2023-03-24 at 18:57 +0100, Felix Fietkau wrote:
>> > > On 24.03.23 18:47, Jakub Kicinski wrote:
>> > > > On Fri, 24 Mar 2023 18:35:00 +0100 Felix Fietkau wrote:
>> > > > > I'm primarily testing this on routers with 2 or 4 CPUs and limited
>> > > > > processing power, handling routing/NAT. RPS is typically needed to
>> > > > > properly distribute the load across all available CPUs. When there is
>> > > > > only a small number of flows that are pushing a lot of traffic, a static
>> > > > > RPS assignment often leaves some CPUs idle, whereas others become a
>> > > > > bottleneck by being fully loaded. Threaded NAPI reduces this a bit, but
>> > > > > CPUs can become bottlenecked and fully loaded by a NAPI thread alone.
>> > > >
>> > > > The NAPI thread becomes a bottleneck with RPS enabled?
>> > >
>> > > The devices that I work with often only have a single rx queue. That can
>> > > easily become a bottleneck.
>> > >
>> > > > > Making backlog processing threaded helps split up the processing work
>> > > > > even more and distribute it onto remaining idle CPUs.
>> > > >
>> > > > You'd want to have both threaded NAPI and threaded backlog enabled?
>> > >
>> > > Yes
>> > >
>> > > > > It can basically be used to make RPS a bit more dynamic and
>> > > > > configurable, because you can assign multiple backlog threads to a set
>> > > > > of CPUs and selectively steer packets from specific devices / rx queues
>> > > >
>> > > > Can you give an example?
>> > > >
>> > > > With the 4 CPU example, in case 2 queues are very busy - you're trying
>> > > > to make sure that the RPS does not end up landing on the same CPU as
>> > > > the other busy queue?
>> > >
>> > > In this part I'm thinking about bigger systems where you want to have a
>> > > group of CPUs dedicated to dealing with network traffic without
>> > > assigning a fixed function (e.g. NAPI processing or RPS target) to each
>> > > one, allowing for more dynamic processing.
>> > >
>> > > > > to them and allow the scheduler to take care of the rest.
>> > > >
>> > > > You trust the scheduler much more than I do, I think :)
>> > >
>> > > In my tests it brings down latency (both avg and p99) considerably in
>> > > some cases. I posted some numbers here:
>> > > https://lore.kernel.org/netdev/e317d5bc-cc26-8b1b-ca4b-66b5328683c4@nbd.name/
>> >
>> > It's still not 110% clear to me why/how this additional thread could
>> > reduce latency. What/which threads are competing for the busy CPU[s]? I
>> > suspect it could be easier/cleaner move away the others (non RPS)
>> > threads.
>> In the tests that I'm doing, network processing load from routing/NAT is
>> enough to occupy all available CPUs.
>> If I dedicate the NAPI thread to one core and use RPS to steer packet
>> processing to the other cores, the core taking care of NAPI has some
>> idle cycles that go to waste, while the other cores are busy.
>> If I include the core in the RPS mask, it can take too much away from
>> the NAPI thread.
>
> I feel like I'm missing some relevant points.
>
> If RPS keeps the target CPU fully busy, moving RPS processing in a
> separate thread still will not allow using more CPU time.

RPS doesn't always keep the target CPU fully busy. The combination of
NAPI thread + RPS threads is enough to keep the system busy.
The number of flows is often small enough, that some (but not all) RPS
instances could be keeping their target CPU busy.
With my patch, one CPU could be busy with a NAPI thread + remaining
cycles allocated to a bit of extra RPS work, while the other CPUs handle
the rest of the RPS load.
In reality it bounces around CPUs a bit more instead of sticking to this
assigned setup, but it allows cores to be more fully utilized by the
network processing load, and the resulting throughput/latency improves
because of that.

> Which NIC driver are you using?

I've been testing on multiple platforms. Mainly mtk_eth_soc, but also
bgmac.

- Felix
On 24/03/2023 18.35, Felix Fietkau wrote:
> On 24.03.23 18:20, Jakub Kicinski wrote:
>> On Fri, 24 Mar 2023 18:13:14 +0100 Felix Fietkau wrote:
>>> When dealing with few flows or an imbalance on CPU utilization, static RPS
>>> CPU assignment can be too inflexible. Add support for enabling threaded NAPI
>>> for backlog processing in order to allow the scheduler to better balance
>>> processing. This helps better spread the load across idle CPUs.
>>
>> Can you explain the use case a little bit more?
>
> I'm primarily testing this on routers with 2 or 4 CPUs and limited
> processing power, handling routing/NAT. RPS is typically needed to
> properly distribute the load across all available CPUs. When there is
> only a small number of flows that are pushing a lot of traffic, a static
> RPS assignment often leaves some CPUs idle, whereas others become a
> bottleneck by being fully loaded. Threaded NAPI reduces this a bit, but
> CPUs can become bottlenecked and fully loaded by a NAPI thread alone.
>
> Making backlog processing threaded helps split up the processing work
> even more and distribute it onto remaining idle CPUs.
>
> It can basically be used to make RPS a bit more dynamic and
> configurable, because you can assign multiple backlog threads to a set
> of CPUs and selectively steer packets from specific devices / rx queues
> to them and allow the scheduler to take care of the rest.
>

My experience with RPS was that it was too slow on the RX-CPU. Meaning
that it doesn't really scale, because the RX-CPU becomes the scaling
bottleneck. (My data is old and it might scale differently on your ARM
boards).

This is why I/we created the XDP "cpumap". It also creates a kernel
threaded model via a kthread on "map-configured" CPUs. It scales
significantly better than RPS, but it doesn't handle flows and packet
Out-of-Order (OoO) situations automatically like RPS. That is left up
to the BPF-programmer. The kernel samples/bpf xdp_redirect_cpu[0] have
code that shows strategies of load-balancing flows.

The project xdp-cpumap-tc[1] runs in production (3 ISPs using this) and
works in concert with netstack Traffic Control (TC) for scaling
bandwidth shaping at the ISPs. OoO is solved by redirecting all
customers IPs to the same TX/egress CPU. As the README[1] describes it
is recommended to reduce the number of RX-CPUs processing packets, and
have more TX-CPUs that basically runs netstack/TC. One ISP with
2x25Gbit/s is using 2 CPUs for RX and 6 CPUs for TX.

--Jesper

[0] https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_redirect_cpu.bpf.c
[1] https://github.com/xdp-project/xdp-cpumap-tc
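[Editor's note: the cpumap approach Jesper describes amounts to an XDP program
that picks a target CPU and calls bpf_redirect_map() on a BPF_MAP_TYPE_CPUMAP.
The sketch below is not taken from the linked samples; the map name, the fixed
CPU count of 4, and hashing on the destination IP (so one customer IP always
lands on the same CPU, avoiding reordering) are illustrative assumptions, and
the map entries still have to be populated with queue sizes from user space.]

// SPDX-License-Identifier: GPL-2.0
/* Illustrative XDP cpumap redirect, hashing on the destination IP so a
 * given customer IP always hits the same CPU. Map name and the CPU
 * count of 4 are assumptions; entries must be filled from user space.
 */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct {
        __uint(type, BPF_MAP_TYPE_CPUMAP);
        __uint(max_entries, 4);
        __type(key, __u32);
        __type(value, struct bpf_cpumap_val);
} cpu_map SEC(".maps");

SEC("xdp")
int xdp_cpu_hash(struct xdp_md *ctx)
{
        void *data_end = (void *)(long)ctx->data_end;
        void *data = (void *)(long)ctx->data;
        struct ethhdr *eth = data;
        struct iphdr *iph;
        __u32 cpu;

        if ((void *)(eth + 1) > data_end ||
            eth->h_proto != bpf_htons(ETH_P_IP))
                return XDP_PASS;

        iph = (void *)(eth + 1);
        if ((void *)(iph + 1) > data_end)
                return XDP_PASS;

        /* same destination IP -> same CPU, which keeps a flow in order */
        cpu = bpf_ntohl(iph->daddr) % 4;
        return bpf_redirect_map(&cpu_map, cpu, 0);
}

char _license[] SEC("license") = "GPL";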
diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
index 466c560b0c30..6d037633a52f 100644
--- a/Documentation/admin-guide/sysctl/net.rst
+++ b/Documentation/admin-guide/sysctl/net.rst
@@ -47,6 +47,15 @@ Table : Subdirectories in /proc/sys/net
 1. /proc/sys/net/core - Network core options
 ============================================
 
+backlog_threaded
+----------------
+
+This offloads processing of backlog (input packets steered by RPS, or
+queued because the kernel is receiving more than it can handle on the
+incoming CPU) to threads (one for each CPU) instead of processing them
+in softirq context. This can improve load balancing by allowing the
+scheduler to better spread the load across idle CPUs.
+
 bpf_jit_enable
 --------------
 
diff --git a/Documentation/networking/scaling.rst b/Documentation/networking/scaling.rst
index 3d435caa3ef2..ded6fc713304 100644
--- a/Documentation/networking/scaling.rst
+++ b/Documentation/networking/scaling.rst
@@ -244,6 +244,26 @@ Setting net.core.netdev_max_backlog to either 1000 or 10000
 performed well in experiments.
 
 
+Threaded Backlog
+~~~~~~~~~~~~~~~~
+
+When dealing with few flows or an imbalance on CPU utilization, static
+RPS CPU assignment can be too inflexible. Making backlog processing
+threaded can improve load balancing by allowing the scheduler to spread
+the load across idle CPUs.
+
+
+Suggested Configuration
+~~~~~~~~~~~~~~~~~~~~~~~
+
+If you have CPUs fully utilized with network processing, you can enable
+threaded backlog processing by setting /proc/sys/net/core/backlog_threaded
+to 1. Afterwards, RPS CPU configuration bits no longer refer to CPU
+numbers, but to backlog threads named napi/backlog-<n>.
+If necessary, you can change the CPU affinity of these threads to limit
+them to specific CPU cores.
+
+
 RFS: Receive Flow Steering
 ==========================
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 674ee5daa7b1..1f67a8c1349e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -524,6 +524,7 @@ static inline bool napi_complete(struct napi_struct *n)
 }
 
 int dev_set_threaded(struct net_device *dev, bool threaded);
+int backlog_set_threaded(bool threaded);
 
 /**
  *	napi_disable - prevent NAPI from scheduling
@@ -3214,6 +3215,7 @@ struct softnet_data {
 	unsigned int		cpu;
 	unsigned int		input_queue_tail;
 #endif
+	unsigned int		process_queue_empty;
 	unsigned int		received_rps;
 	unsigned int		dropped;
 	struct sk_buff_head	input_pkt_queue;
diff --git a/net/core/dev.c b/net/core/dev.c
index 7172334a418f..b029a374b5f2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4591,7 +4591,7 @@ static int napi_schedule_rps(struct softnet_data *sd)
 	struct softnet_data *mysd = this_cpu_ptr(&softnet_data);
 
 #ifdef CONFIG_RPS
-	if (sd != mysd) {
+	if (sd != mysd && !test_bit(NAPI_STATE_THREADED, &sd->backlog.state)) {
 		sd->rps_ipi_next = mysd->rps_ipi_list;
 		mysd->rps_ipi_list = sd;
 
@@ -5772,6 +5772,8 @@ static DEFINE_PER_CPU(struct work_struct, flush_works);
 /* Network device is going away, flush any packets still pending */
 static void flush_backlog(struct work_struct *work)
 {
+	unsigned int process_queue_empty;
+	bool threaded, flush_processq;
 	struct sk_buff *skb, *tmp;
 	struct softnet_data *sd;
 
@@ -5786,8 +5788,17 @@ static void flush_backlog(struct work_struct *work)
 			input_queue_head_incr(sd);
 		}
 	}
+
+	threaded = test_bit(NAPI_STATE_THREADED, &sd->backlog.state);
+	flush_processq = threaded &&
+			 !skb_queue_empty_lockless(&sd->process_queue);
+	if (flush_processq)
+		process_queue_empty = sd->process_queue_empty;
 	rps_unlock_irq_enable(sd);
 
+	if (threaded)
+		goto out;
+
 	skb_queue_walk_safe(&sd->process_queue, skb, tmp) {
 		if (skb->dev->reg_state == NETREG_UNREGISTERING) {
 			__skb_unlink(skb, &sd->process_queue);
@@ -5795,7 +5806,16 @@ static void flush_backlog(struct work_struct *work)
 			input_queue_head_incr(sd);
 		}
 	}
+
+out:
 	local_bh_enable();
+
+	while (flush_processq) {
+		msleep(1);
+		rps_lock_irq_disable(sd);
+		flush_processq = process_queue_empty == sd->process_queue_empty;
+		rps_unlock_irq_enable(sd);
+	}
 }
 
 static bool flush_required(int cpu)
@@ -5927,16 +5947,16 @@
 		}
 
 		rps_lock_irq_disable(sd);
+		sd->process_queue_empty++;
 		if (skb_queue_empty(&sd->input_pkt_queue)) {
 			/*
 			 * Inline a custom version of __napi_complete().
-			 * only current cpu owns and manipulates this napi,
-			 * and NAPI_STATE_SCHED is the only possible flag set
-			 * on backlog.
+			 * only current cpu owns and manipulates this napi.
 			 * We can use a plain write instead of clear_bit(),
 			 * and we dont need an smp_mb() memory barrier.
 			 */
-			napi->state = 0;
+			napi->state &= ~(NAPIF_STATE_SCHED |
+					 NAPIF_STATE_SCHED_THREADED);
 			again = false;
 		} else {
 			skb_queue_splice_tail_init(&sd->input_pkt_queue,
@@ -6350,6 +6370,55 @@ int dev_set_threaded(struct net_device *dev, bool threaded)
 }
 EXPORT_SYMBOL(dev_set_threaded);
 
+int backlog_set_threaded(bool threaded)
+{
+	static bool backlog_threaded;
+	int err = 0;
+	int i;
+
+	if (backlog_threaded == threaded)
+		return 0;
+
+	for_each_possible_cpu(i) {
+		struct softnet_data *sd = &per_cpu(softnet_data, i);
+		struct napi_struct *n = &sd->backlog;
+
+		if (n->thread)
+			continue;
+		n->thread = kthread_run(napi_threaded_poll, n, "napi/backlog-%d", i);
+		if (IS_ERR(n->thread)) {
+			err = PTR_ERR(n->thread);
+			pr_err("kthread_run failed with err %d\n", err);
+			n->thread = NULL;
+			threaded = false;
+			break;
+		}
+
+	}
+
+	backlog_threaded = threaded;
+
+	/* Make sure kthread is created before THREADED bit
+	 * is set.
+	 */
+	smp_mb__before_atomic();
+
+	for_each_possible_cpu(i) {
+		struct softnet_data *sd = &per_cpu(softnet_data, i);
+		struct napi_struct *n = &sd->backlog;
+		unsigned long flags;
+
+		rps_lock_irqsave(sd, &flags);
+		if (threaded)
+			n->state |= NAPIF_STATE_THREADED;
+		else
+			n->state &= ~NAPIF_STATE_THREADED;
+		rps_unlock_irq_restore(sd, &flags);
+	}
+
+	return err;
+}
+
 void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
 			   int (*poll)(struct napi_struct *, int), int weight)
 {
@@ -11108,6 +11177,9 @@ static int dev_cpu_dead(unsigned int oldcpu)
 	raise_softirq_irqoff(NET_TX_SOFTIRQ);
 	local_irq_enable();
 
+	if (test_bit(NAPI_STATE_THREADED, &oldsd->backlog.state))
+		return 0;
+
 #ifdef CONFIG_RPS
 	remsd = oldsd->rps_ipi_list;
 	oldsd->rps_ipi_list = NULL;
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 74842b453407..77114cd0b021 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -30,6 +30,7 @@ static int int_3600 = 3600;
 static int min_sndbuf = SOCK_MIN_SNDBUF;
 static int min_rcvbuf = SOCK_MIN_RCVBUF;
 static int max_skb_frags = MAX_SKB_FRAGS;
+static int backlog_threaded;
 
 static int net_msg_warn;	/* Unused, but still a sysctl */
 
@@ -188,6 +189,23 @@ static int rps_sock_flow_sysctl(struct ctl_table *table, int write,
 }
 #endif /* CONFIG_RPS */
 
+static int backlog_threaded_sysctl(struct ctl_table *table, int write,
+				   void *buffer, size_t *lenp, loff_t *ppos)
+{
+	static DEFINE_MUTEX(backlog_threaded_mutex);
+	int ret;
+
+	mutex_lock(&backlog_threaded_mutex);
+
+	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+	if (write && !ret)
+		ret = backlog_set_threaded(backlog_threaded);
+
+	mutex_unlock(&backlog_threaded_mutex);
+
+	return ret;
+}
+
 #ifdef CONFIG_NET_FLOW_LIMIT
 static DEFINE_MUTEX(flow_limit_update_mutex);
 
@@ -532,6 +550,15 @@ static struct ctl_table net_core_table[] = {
 		.proc_handler	= rps_sock_flow_sysctl
 	},
 #endif
+	{
+		.procname	= "backlog_threaded",
+		.data		= &backlog_threaded,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= backlog_threaded_sysctl,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE
+	},
 #ifdef CONFIG_NET_FLOW_LIMIT
 	{
 		.procname	= "flow_limit_cpu_bitmap",
When dealing with few flows or an imbalance on CPU utilization, static RPS
CPU assignment can be too inflexible. Add support for enabling threaded NAPI
for backlog processing in order to allow the scheduler to better balance
processing. This helps better spread the load across idle CPUs.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
---
PATCH:
 - add missing process_queue_empty initialization
 - fix kthread leak
 - add documentation
RFC v3:
 - make patch more generic, applies to backlog processing in general
 - fix process queue access on flush
RFC v2:
 - fix rebase error in rps locking

 Documentation/admin-guide/sysctl/net.rst |  9 +++
 Documentation/networking/scaling.rst     | 20 ++++++
 include/linux/netdevice.h                |  2 +
 net/core/dev.c                           | 82 ++++++++++++++++++++++--
 net/core/sysctl_net_core.c               | 27 ++++++++
 5 files changed, 135 insertions(+), 5 deletions(-)
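[Editor's note: as a closing illustration (not part of the patch), the per-CPU
kthreads spawned by backlog_set_threaded() can be spotted from user space by
their comm names; a quick scan of /proc like the one below, which simply looks
for names starting with "napi/backlog-", can confirm the sysctl took effect.]

/* list_backlog_threads.c - print PIDs of napi/backlog-<n> kthreads. */
#define _GNU_SOURCE
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        DIR *proc = opendir("/proc");
        struct dirent *de;
        char path[64], comm[64];

        if (!proc) {
                perror("/proc");
                return 1;
        }
        while ((de = readdir(proc))) {
                FILE *f;

                if (!isdigit((unsigned char)de->d_name[0]))
                        continue;       /* only PID directories */
                snprintf(path, sizeof(path), "/proc/%s/comm", de->d_name);
                f = fopen(path, "r");
                if (!f)
                        continue;
                if (fgets(comm, sizeof(comm), f) &&
                    !strncmp(comm, "napi/backlog-", 13))
                        printf("pid %s: %s", de->d_name, comm);
                fclose(f);
        }
        closedir(proc);
        return 0;
}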