Message ID | 20230814093528.117342-3-bigeasy@linutronix.de (mailing list archive)
State      | RFC
Series     | net: Use SMP threads for backlog NAPI.
On 14/08/2023 11.35, Sebastian Andrzej Siewior wrote:
> This is an undesired situation and it has been attempted to avoid the
> situation in which ksoftirqd becomes scheduled. This changed since
> commit d15121be74856 ("Revert "softirq: Let ksoftirqd do its job"")
> and now a threaded interrupt handler will handle soft interrupts at its
> end even if ksoftirqd is pending. That means that they will be processed
> in the context in which they were raised.

$ git describe --contains d15121be74856
v6.5-rc1~232^2~4

That revert basically removes the "overload" protection that was added
to cope with DDoS situations in Aug 2016 (Cc. Cloudflare). As described
in https://git.kernel.org/torvalds/c/4cd13c21b207 ("softirq: Let
ksoftirqd do its job") in UDP overload situations when UDP socket
receiver runs on same CPU as ksoftirqd it "falls-off-an-edge" and almost
doesn't process packets (because softirq steals CPU/sched time from UDP
pid). Warning Cloudflare (Cc) as this might affect their production
use-cases, and I recommend getting involved to evaluate the effect of
these changes.

I do realize/acknowledge that the reverted patch caused other latency
issues, given it was a "big-hammer" approach affecting other softirq
processing (as can be seen by e.g. the watchdog fixes patches).
Thus, the revert makes sense, but how to regain the "overload"
protection such that RX networking cannot starve processes reading from
the socket? (is this what Sebastian's patchset does?)

--Jesper

Thread link for people Cc'ed:
https://lore.kernel.org/all/20230814093528.117342-1-bigeasy@linutronix.de/#r
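To make the mechanics concrete: a simplified, hand-condensed sketch of the
post-revert bh-enable path (not the verbatim kernel source; the real
__local_bh_enable_ip() carries extra accounting and lockdep instrumentation)
showing why pending softirqs now run in the raising task's context:

/* Simplified sketch of the post-revert bh-enable behavior. */
static inline void local_bh_enable_sketch(void)
{
	/* drop the softirq-disable count, keeping preemption off */
	preempt_count_sub(SOFTIRQ_DISABLE_OFFSET - 1);

	/*
	 * Pending softirqs are flushed right here, in the current
	 * task's context -- even if ksoftirqd has already been woken.
	 * The reverted 4cd13c21b207 instead deferred to ksoftirqd
	 * whenever it was pending/running.
	 */
	if (!in_interrupt() && local_softirq_pending())
		do_softirq();

	preempt_count_dec();
}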
On Tue, Aug 15, 2023 at 7:08 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>
> On 14/08/2023 11.35, Sebastian Andrzej Siewior wrote:
> > This is an undesired situation and it has been attempted to avoid the
> > situation in which ksoftirqd becomes scheduled. This changed since
> > commit d15121be74856 ("Revert "softirq: Let ksoftirqd do its job"")
> > and now a threaded interrupt handler will handle soft interrupts at its
> > end even if ksoftirqd is pending. That means that they will be processed
> > in the context in which they were raised.
>
> $ git describe --contains d15121be74856
> v6.5-rc1~232^2~4
>
> That revert basically removes the "overload" protection that was added
> to cope with DDoS situations in Aug 2016 (Cc. Cloudflare). As described
> in https://git.kernel.org/torvalds/c/4cd13c21b207 ("softirq: Let
> ksoftirqd do its job") in UDP overload situations when UDP socket
> receiver runs on same CPU as ksoftirqd it "falls-off-an-edge" and almost
> doesn't process packets (because softirq steals CPU/sched time from UDP
> pid). Warning Cloudflare (Cc) as this might affect their production
> use-cases, and I recommend getting involved to evaluate the effect of
> these changes.
>
> I do realize/acknowledge that the reverted patch caused other latency
> issues, given it was a "big-hammer" approach affecting other softirq
> processing (as can be seen by e.g. the watchdog fixes patches).
> Thus, the revert makes sense, but how to regain the "overload"
> protection such that RX networking cannot starve processes reading from
> the socket? (is this what Sebastian's patchset does?)

Thanks for notifying us. We will need to evaluate if this is going to
change the picture under serious floods.

Yan

> --Jesper
>
> Thread link for people Cc'ed:
> https://lore.kernel.org/all/20230814093528.117342-1-bigeasy@linutronix.de/#r
On 15/08/2023 14.08, Jesper Dangaard Brouer wrote:
>
>
> On 14/08/2023 11.35, Sebastian Andrzej Siewior wrote:
>> This is an undesired situation and it has been attempted to avoid the
>> situation in which ksoftirqd becomes scheduled. This changed since
>> commit d15121be74856 ("Revert "softirq: Let ksoftirqd do its job"")
>> and now a threaded interrupt handler will handle soft interrupts at its
>> end even if ksoftirqd is pending. That means that they will be processed
>> in the context in which they were raised.
>
> $ git describe --contains d15121be74856
> v6.5-rc1~232^2~4
>
> That revert basically removes the "overload" protection that was added
> to cope with DDoS situations in Aug 2016 (Cc. Cloudflare). As described
> in https://git.kernel.org/torvalds/c/4cd13c21b207 ("softirq: Let
> ksoftirqd do its job") in UDP overload situations when UDP socket
> receiver runs on same CPU as ksoftirqd it "falls-off-an-edge" and almost
> doesn't process packets (because softirq steals CPU/sched time from UDP
> pid). Warning Cloudflare (Cc) as this might affect their production
> use-cases, and I recommend getting involved to evaluate the effect of
> these changes.
>

I did some testing on net-next (with commit d15121be74856 ("Revert
"softirq: Let ksoftirqd do its job"")) using UDP pktgen + udp_sink.

And I observe the old overload issue occur again, where the userspace
process (udp_sink) processes very few packets when running on the *same*
CPU as the NAPI-RX/IRQ processing. The perf report "comm" clearly shows
that NAPI runs in the context of the "udp_sink" process, stealing its
sched time. (Same CPU around 3Kpps and diff CPU 1722Kpps, see details
below).
What happens is that NAPI takes 64 packets and queues them to the
udp_sink process *socket*; the udp_sink process *wakes up*, processes 1
packet from the socket queue, and on exit (__local_bh_enable_ip) runs
softirq that starts NAPI (to again process 64 packets... repeat).

> I do realize/acknowledge that the reverted patch caused other latency
> issues, given it was a "big-hammer" approach affecting other softirq
> processing (as can be seen by e.g. the watchdog fixes patches).
> Thus, the revert makes sense, but how to regain the "overload"
> protection such that RX networking cannot starve processes reading from
> the socket? (is this what Sebastian's patchset does?)
>

I'm no expert in the sched / softirq area of the kernel, but I'm willing
to help out testing different solutions that can regain the "overload"
protection, e.g. avoid packet processing "falls-off-an-edge" (and thus
opens the kernel to be DDoS'ed easily).
Is this what Sebastian's patchset does?

>
> Thread link for people Cc'ed:
> https://lore.kernel.org/all/20230814093528.117342-1-bigeasy@linutronix.de/#r

--Jesper
(some testlab results below)

[udp_sink]
https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c


When udp_sink runs on same CPU as NAPI/softirq
- UdpInDatagrams: 2,948 packets/sec

$ nstat -n && sleep 1 && nstat
#kernel
IpInReceives                    2831056            0.0
IpInDelivers                    2831053            0.0
UdpInDatagrams                  2948               0.0
UdpInErrors                     2828118            0.0
UdpRcvbufErrors                 2828118            0.0
IpExtInOctets                   130206496          0.0
IpExtInNoECTPkts                2830576            0.0

When udp_sink runs on another CPU than NAPI-RX.
- UdpInDatagrams: 1,722,307 pps

$ nstat -n && sleep 1 && nstat
#kernel
IpInReceives                    2318560            0.0
IpInDelivers                    2318562            0.0
UdpInDatagrams                  1722307            0.0
UdpInErrors                     596280             0.0
UdpRcvbufErrors                 596280             0.0
IpExtInOctets                   106634256          0.0
IpExtInNoECTPkts                2318136            0.0
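The receive pattern being measured above boils down to a loop of this shape.
A minimal sketch, not the actual udp_sink source (linked above); the port
number and buffer size are arbitrary illustrative choices:

/* One recvfrom() per datagram: the kernel dequeues a single skb under
 * the socket lock, so every unlock can end up in
 * __local_bh_enable_ip() -> do_softirq() -> NAPI. */
#include <stdio.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port   = htons(9999),	/* arbitrary test port */
	};
	char buf[2048];
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("socket/bind");
		return 1;
	}
	for (;;) {
		/* one syscall, one packet */
		ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
		if (n < 0) {
			perror("recvfrom");
			return 1;
		}
	}
	return 0;
}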
On Wed, Aug 16, 2023 at 9:49 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>
>
> On 15/08/2023 14.08, Jesper Dangaard Brouer wrote:
> >
> >
> > On 14/08/2023 11.35, Sebastian Andrzej Siewior wrote:
> >> This is an undesired situation and it has been attempted to avoid the
> >> situation in which ksoftirqd becomes scheduled. This changed since
> >> commit d15121be74856 ("Revert "softirq: Let ksoftirqd do its job"")
> >> and now a threaded interrupt handler will handle soft interrupts at its
> >> end even if ksoftirqd is pending. That means that they will be processed
> >> in the context in which they were raised.
> >
> > $ git describe --contains d15121be74856
> > v6.5-rc1~232^2~4
> >
> > That revert basically removes the "overload" protection that was added
> > to cope with DDoS situations in Aug 2016 (Cc. Cloudflare). As described
> > in https://git.kernel.org/torvalds/c/4cd13c21b207 ("softirq: Let
> > ksoftirqd do its job") in UDP overload situations when UDP socket
> > receiver runs on same CPU as ksoftirqd it "falls-off-an-edge" and almost
> > doesn't process packets (because softirq steals CPU/sched time from UDP
> > pid). Warning Cloudflare (Cc) as this might affect their production
> > use-cases, and I recommend getting involved to evaluate the effect of
> > these changes.
> >
>
> I did some testing on net-next (with commit d15121be74856 ("Revert
> "softirq: Let ksoftirqd do its job"")) using UDP pktgen + udp_sink.
>
> And I observe the old overload issue occur again, where the userspace
> process (udp_sink) processes very few packets when running on the *same*
> CPU as the NAPI-RX/IRQ processing. The perf report "comm" clearly shows
> that NAPI runs in the context of the "udp_sink" process, stealing its
> sched time. (Same CPU around 3Kpps and diff CPU 1722Kpps, see details
> below).
> What happens is that NAPI takes 64 packets and queues them to the
> udp_sink process *socket*; the udp_sink process *wakes up*, processes 1
> packet from the socket queue, and on exit (__local_bh_enable_ip) runs
> softirq that starts NAPI (to again process 64 packets... repeat).
>
I think there are two scenarios to consider:

1. Actual DoS scenario. In this case, we would drop DoS packets
through XDP, which might actually relieve the stress. According to
Marek's blog, XDP can indeed drop 10M pps [1], so it might not steal too
much time. This is also something I would like to validate again since
I cannot tell if those tests were performed before or after the
reverted commit.

2. Legit elephant flows (so it should not be just dropped). This one
is closer to what you tested above, and it is a much harder issue
since packets are legit and should not be dropped early at XDP. Letting
the scheduler move affected processes away seems to be the non-optimal
but straight answer for now. However, I suspect this would impose an
overload issue for those programmed with RFS or ARFS, since flows
would "follow" the processes. They probably have to force threaded
NAPI for tuning.

[1] https://blog.cloudflare.com/how-to-drop-10-million-packets/

> > I do realize/acknowledge that the reverted patch caused other latency
> > issues, given it was a "big-hammer" approach affecting other softirq
> > processing (as can be seen by e.g. the watchdog fixes patches).
> > Thus, the revert makes sense, but how to regain the "overload"
> > protection such that RX networking cannot starve processes reading from
> > the socket? (is this what Sebastian's patchset does?)
> >
>
> I'm no expert in the sched / softirq area of the kernel, but I'm willing
> to help out testing different solutions that can regain the "overload"
> protection, e.g. avoid packet processing "falls-off-an-edge" (and thus
> opens the kernel to be DDoS'ed easily).
> Is this what Sebastian's patchset does?
>
>
> >
> > Thread link for people Cc'ed:
> > https://lore.kernel.org/all/20230814093528.117342-1-bigeasy@linutronix.de/#r
>
> --Jesper
> (some testlab results below)
>
> [udp_sink]
> https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c
>
> When udp_sink runs on same CPU as NAPI/softirq
> - UdpInDatagrams: 2,948 packets/sec
>
> $ nstat -n && sleep 1 && nstat
> #kernel
> IpInReceives                    2831056            0.0
> IpInDelivers                    2831053            0.0
> UdpInDatagrams                  2948               0.0
> UdpInErrors                     2828118            0.0
> UdpRcvbufErrors                 2828118            0.0
> IpExtInOctets                   130206496          0.0
> IpExtInNoECTPkts                2830576            0.0
>
> When udp_sink runs on another CPU than NAPI-RX.
> - UdpInDatagrams: 1,722,307 pps
>
> $ nstat -n && sleep 1 && nstat
> #kernel
> IpInReceives                    2318560            0.0
> IpInDelivers                    2318562            0.0
> UdpInDatagrams                  1722307            0.0
> UdpInErrors                     596280             0.0
> UdpRcvbufErrors                 596280             0.0
> IpExtInOctets                   106634256          0.0
> IpExtInNoECTPkts                2318136            0.0
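For scenario 1, the XDP drop path referred to above is, at its core, a
program of this shape. A minimal sketch (libbpf-style build tooling assumed;
this is not Cloudflare's actual filter, which matches specific flows before
dropping):

/* Minimal XDP program: every packet reaching this hook on the attached
 * device is dropped before any skb is allocated, which is what makes
 * the drop path cheap. Real filters put match logic before XDP_DROP. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_drop(struct xdp_md *ctx)
{
	return XDP_DROP;
}

char _license[] SEC("license") = "GPL";

Attachable with e.g. "ip link set dev eth0 xdp obj drop.o sec xdp" (device
name illustrative).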
On 2023-08-16 16:48:56 [+0200], Jesper Dangaard Brouer wrote:
> I'm no expert in the sched / softirq area of the kernel, but I'm willing
> to help out testing different solutions that can regain the "overload"
> protection, e.g. avoid packet processing "falls-off-an-edge" (and thus
> opens the kernel to be DDoS'ed easily).
> Is this what Sebastian's patchset does?

I was going to respond but didn't know what to say so far. Can you
figure out if you are using the backlog or not? If you do, could you try
my patch? If not, could you try to enable threaded NAPI and see what
happens? Either way I will re-read it again.

Sebastian
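For the second suggestion: threaded NAPI can be toggled per device through
sysfs on kernels that expose the knob (v5.12+). A small helper sketch; the
device name is whatever interface is under test:

/* Toggle threaded NAPI via /sys/class/net/<dev>/threaded.
 * Usage: ./napi_threaded <dev> <0|1> */
#include <stdio.h>

int main(int argc, char **argv)
{
	char path[256];
	FILE *f;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <dev> <0|1>\n", argv[0]);
		return 1;
	}
	snprintf(path, sizeof(path), "/sys/class/net/%s/threaded", argv[1]);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return 1;
	}
	fputs(argv[2], f);	/* 0: softirq NAPI, 1: per-NAPI kthread */
	fclose(f);
	return 0;
}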
On 16/08/2023 17.15, Yan Zhai wrote:
> On Wed, Aug 16, 2023 at 9:49 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>>
>> On 15/08/2023 14.08, Jesper Dangaard Brouer wrote:
>>>
>>>
>>> On 14/08/2023 11.35, Sebastian Andrzej Siewior wrote:
>>>> This is an undesired situation and it has been attempted to avoid the
>>>> situation in which ksoftirqd becomes scheduled. This changed since
>>>> commit d15121be74856 ("Revert "softirq: Let ksoftirqd do its job"")
>>>> and now a threaded interrupt handler will handle soft interrupts at its
>>>> end even if ksoftirqd is pending. That means that they will be processed
>>>> in the context in which they were raised.
>>>
>>> $ git describe --contains d15121be74856
>>> v6.5-rc1~232^2~4
>>>
>>> That revert basically removes the "overload" protection that was added
>>> to cope with DDoS situations in Aug 2016 (Cc. Cloudflare). As described
>>> in https://git.kernel.org/torvalds/c/4cd13c21b207 ("softirq: Let
>>> ksoftirqd do its job") in UDP overload situations when UDP socket
>>> receiver runs on same CPU as ksoftirqd it "falls-off-an-edge" and almost
>>> doesn't process packets (because softirq steals CPU/sched time from UDP
>>> pid). Warning Cloudflare (Cc) as this might affect their production
>>> use-cases, and I recommend getting involved to evaluate the effect of
>>> these changes.
>>>
>>
>> I did some testing on net-next (with commit d15121be74856 ("Revert
>> "softirq: Let ksoftirqd do its job"")) using UDP pktgen + udp_sink.
>>
>> And I observe the old overload issue occur again, where the userspace
>> process (udp_sink) processes very few packets when running on the *same*
>> CPU as the NAPI-RX/IRQ processing. The perf report "comm" clearly shows
>> that NAPI runs in the context of the "udp_sink" process, stealing its
>> sched time. (Same CPU around 3Kpps and diff CPU 1722Kpps, see details
>> below).
>> What happens is that NAPI takes 64 packets and queues them to the
>> udp_sink process *socket*; the udp_sink process *wakes up*, processes 1
>> packet from the socket queue, and on exit (__local_bh_enable_ip) runs
>> softirq that starts NAPI (to again process 64 packets... repeat).
>>
> I think there are two scenarios to consider:
>
> 1. Actual DoS scenario. In this case, we would drop DoS packets
> through XDP, which might actually relieve the stress. According to
> Marek's blog, XDP can indeed drop 10M pps [1], so it might not steal too
> much time. This is also something I would like to validate again since

Yes, using XDP to drop packets will/should relieve the stress, as it
basically can discard some of the 64 packets processed by NAPI vs the 1
packet received by userspace (that re-triggers NAPI), giving a better
balance.

> I cannot tell if those tests were performed before or after the
> reverted commit.

Marek's tests will likely contain the patch 4cd13c21b207 ("softirq: Let
ksoftirqd do its job") as the blog is from 2018 and the patch from 2016,
but it shouldn't matter much.

> 2. Legit elephant flows (so it should not be just dropped). This one
> is closer to what you tested above, and it is a much harder issue
> since packets are legit and should not be dropped early at XDP. Letting
> the scheduler move affected processes away seems to be the non-optimal
> but straight answer for now. However, I suspect this would impose an
> overload issue for those programmed with RFS or ARFS, since flows
> would "follow" the processes. They probably have to force threaded
> NAPI for tuning.
>

True, this is the case I don't know how to solve.

For UDP packets it is NOT optimal to let the process "follow"/run on the
NAPI-RX CPU. For TCP traffic it is faster to run on the same CPU, which
could be related to the GRO effect, or simply that tcp_recvmsg gets a
stream of data (before it invokes __local_bh_enable_ip causing
do_softirq).

I have also tested with netperf UDP packets [2] in a scenario that
doesn't cause "overload" and the CPU has idle cycles. When the
UDP-netserver is running on the same CPU as NAPI then I see approx 38%
(82020/216362) UdpRcvbufErrors [3] (and on separate CPUs 2.8%). Sure, I
could increase the buffer size, but the point is NAPI can enqueue 64
packets and the UDP receiver dequeues 1 packet.

This reminded me that the kernel has a recvmmsg (extra "m") syscall for
multiple packets. I tested this (as udp_sink has support), but no luck.
This is because internally in the kernel (do_recvmmsg) it is just a loop
over ___sys_recvmsg/__skb_recv_udp, which has a BH-spinlock per packet
that invokes __local_bh_enable_ip/do_softirq. I guess, we/netdev could
fix recvmmsg() to bulk-dequeue from the socket queue (the BH-socket
unlock is triggering __local_bh_enable_ip/do_softirq) and then have a
solution for UDP(?).

[2] netperf -H 198.18.1.1 -D1 -l 1200 -t UDP_STREAM -T 0,0 -- -m 1472 -N -n

[3]
$ nstat -n && sleep 1 && nstat
#kernel
IpInReceives                    216362             0.0
IpInDelivers                    216354             0.0
UdpInDatagrams                  134356             0.0
UdpInErrors                     82020              0.0
UdpRcvbufErrors                 82020              0.0
IpExtInOctets                   324600000          0.0
IpExtInNoECTPkts                216400             0.0

> [1] https://blog.cloudflare.com/how-to-drop-10-million-packets/
>
>>
>>> I do realize/acknowledge that the reverted patch caused other latency
>>> issues, given it was a "big-hammer" approach affecting other softirq
>>> processing (as can be seen by e.g. the watchdog fixes patches).
>>> Thus, the revert makes sense, but how to regain the "overload"
>>> protection such that RX networking cannot starve processes reading from
>>> the socket? (is this what Sebastian's patchset does?)
>>>
>>
>> I'm no expert in the sched / softirq area of the kernel, but I'm willing
>> to help out testing different solutions that can regain the "overload"
>> protection, e.g. avoid packet processing "falls-off-an-edge" (and thus
>> opens the kernel to be DDoS'ed easily).
>> Is this what Sebastian's patchset does?
>>
>>
>>>
>>> Thread link for people Cc'ed:
>>> https://lore.kernel.org/all/20230814093528.117342-1-bigeasy@linutronix.de/#r
>>
>> --Jesper
>> (some testlab results below)
>>
>> [udp_sink]
>> https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c
>>
>> When udp_sink runs on same CPU as NAPI/softirq
>> - UdpInDatagrams: 2,948 packets/sec
>>
>> $ nstat -n && sleep 1 && nstat
>> #kernel
>> IpInReceives                    2831056            0.0
>> IpInDelivers                    2831053            0.0
>> UdpInDatagrams                  2948               0.0
>> UdpInErrors                     2828118            0.0
>> UdpRcvbufErrors                 2828118            0.0
>> IpExtInOctets                   130206496          0.0
>> IpExtInNoECTPkts                2830576            0.0
>>
>> When udp_sink runs on another CPU than NAPI-RX.
>> - UdpInDatagrams: 1,722,307 pps
>>
>> $ nstat -n && sleep 1 && nstat
>> #kernel
>> IpInReceives                    2318560            0.0
>> IpInDelivers                    2318562            0.0
>> UdpInDatagrams                  1722307            0.0
>> UdpInErrors                     596280             0.0
>> UdpRcvbufErrors                 596280             0.0
>> IpExtInOctets                   106634256          0.0
>> IpExtInNoECTPkts                2318136            0.0
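For reference, the recvmmsg() batching tested above looks roughly like this
from userspace. A hedged sketch (port, batch size, and buffer sizes are
illustrative); note the point made above still applies: the kernel-side
do_recvmmsg() loop dequeues one skb at a time, so batching the syscall alone
does not remove the per-packet BH processing:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define VLEN  64	/* mirror NAPI's 64-packet budget */
#define BUFSZ 2048

int main(void)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port   = htons(9999),	/* arbitrary test port */
	};
	static char bufs[VLEN][BUFSZ];
	struct iovec iov[VLEN];
	struct mmsghdr msgs[VLEN];
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	int i;

	if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("socket/bind");
		return 1;
	}
	memset(msgs, 0, sizeof(msgs));
	for (i = 0; i < VLEN; i++) {
		iov[i].iov_base = bufs[i];
		iov[i].iov_len  = BUFSZ;
		msgs[i].msg_hdr.msg_iov    = &iov[i];
		msgs[i].msg_hdr.msg_iovlen = 1;
	}
	for (;;) {
		/* one syscall returns up to VLEN datagrams, but the
		 * kernel still takes/releases the BH lock per packet */
		int n = recvmmsg(fd, msgs, VLEN, 0, NULL);
		if (n < 0) {
			perror("recvmmsg");
			return 1;
		}
	}
	return 0;
}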
On Wed, Aug 16, 2023 at 4:02 PM Jesper Dangaard Brouer <jbrouer@redhat.com> wrote:
>
>
> On 16/08/2023 17.15, Yan Zhai wrote:
> > On Wed, Aug 16, 2023 at 9:49 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
> >>
> >> On 15/08/2023 14.08, Jesper Dangaard Brouer wrote:
> >>>
> >>>
> >>> On 14/08/2023 11.35, Sebastian Andrzej Siewior wrote:
> >>>> This is an undesired situation and it has been attempted to avoid the
> >>>> situation in which ksoftirqd becomes scheduled. This changed since
> >>>> commit d15121be74856 ("Revert "softirq: Let ksoftirqd do its job"")
> >>>> and now a threaded interrupt handler will handle soft interrupts at its
> >>>> end even if ksoftirqd is pending. That means that they will be processed
> >>>> in the context in which they were raised.
> >>>
> >>> $ git describe --contains d15121be74856
> >>> v6.5-rc1~232^2~4
> >>>
> >>> That revert basically removes the "overload" protection that was added
> >>> to cope with DDoS situations in Aug 2016 (Cc. Cloudflare). As described
> >>> in https://git.kernel.org/torvalds/c/4cd13c21b207 ("softirq: Let
> >>> ksoftirqd do its job") in UDP overload situations when UDP socket
> >>> receiver runs on same CPU as ksoftirqd it "falls-off-an-edge" and almost
> >>> doesn't process packets (because softirq steals CPU/sched time from UDP
> >>> pid). Warning Cloudflare (Cc) as this might affect their production
> >>> use-cases, and I recommend getting involved to evaluate the effect of
> >>> these changes.
> >>>
> >>
> >> I did some testing on net-next (with commit d15121be74856 ("Revert
> >> "softirq: Let ksoftirqd do its job"")) using UDP pktgen + udp_sink.
> >>
> >> And I observe the old overload issue occur again, where the userspace
> >> process (udp_sink) processes very few packets when running on the *same*
> >> CPU as the NAPI-RX/IRQ processing. The perf report "comm" clearly shows
> >> that NAPI runs in the context of the "udp_sink" process, stealing its
> >> sched time. (Same CPU around 3Kpps and diff CPU 1722Kpps, see details
> >> below).
> >> What happens is that NAPI takes 64 packets and queues them to the
> >> udp_sink process *socket*; the udp_sink process *wakes up*, processes 1
> >> packet from the socket queue, and on exit (__local_bh_enable_ip) runs
> >> softirq that starts NAPI (to again process 64 packets... repeat).
> >>
> > I think there are two scenarios to consider:
> >
> > 1. Actual DoS scenario. In this case, we would drop DoS packets
> > through XDP, which might actually relieve the stress. According to
> > Marek's blog, XDP can indeed drop 10M pps [1], so it might not steal too
> > much time. This is also something I would like to validate again since
>
> Yes, using XDP to drop packets will/should relieve the stress, as it
> basically can discard some of the 64 packets processed by NAPI vs the 1
> packet received by userspace (that re-triggers NAPI), giving a better
> balance.
>
> > I cannot tell if those tests were performed before or after the
> > reverted commit.
>
> Marek's tests will likely contain the patch 4cd13c21b207 ("softirq: Let
> ksoftirqd do its job") as the blog is from 2018 and the patch from 2016,
> but it shouldn't matter much.
>
> > 2. Legit elephant flows (so it should not be just dropped). This one
> > is closer to what you tested above, and it is a much harder issue
> > since packets are legit and should not be dropped early at XDP. Letting
> > the scheduler move affected processes away seems to be the non-optimal
> > but straight answer for now. However, I suspect this would impose an
> > overload issue for those programmed with RFS or ARFS, since flows
> > would "follow" the processes. They probably have to force threaded
> > NAPI for tuning.
> >
>
> True, this is the case I don't know how to solve.
>
> For UDP packets it is NOT optimal to let the process "follow"/run on the
> NAPI-RX CPU. For TCP traffic it is faster to run on the same CPU, which
> could be related to the GRO effect, or simply that tcp_recvmsg gets a
> stream of data (before it invokes __local_bh_enable_ip causing
> do_softirq).
>

To maximize single flow throughput, it is not optimal to run RX on the
same CPU as the receiver, regardless of TCP or UDP. The difference is
that TCP does not have a tput issue until 10+ Gbps thanks to GRO. In
some internal benchmarking effort, I found that pinning the iperf server
on the same CPU as RX would yield ~13-14 Gbps TCP, while running on
different cores would have the 25G NIC saturated (for both the same and
different NUMA case). Despite the single-flow throughput upper bound
being hit when running on the same core, the CPU cycles to process each
packet are actually reduced. So it is likely more friendly to the
production environment we are dealing with, where there are a lot more
smaller flows. It is something I planned to test more (in the past we
had major services pinned on dedicated cores, but recently we started to
unpin to improve tail latency of other services). But with the
protection gone, it adds quite some uncertainty to the picture.

> I have also tested with netperf UDP packets [2] in a scenario that
> doesn't cause "overload" and the CPU has idle cycles. When the
> UDP-netserver is running on the same CPU as NAPI then I see approx 38%
> (82020/216362) UdpRcvbufErrors [3] (and on separate CPUs 2.8%). Sure, I
> could increase the buffer size, but the point is NAPI can enqueue 64
> packets and the UDP receiver dequeues 1 packet.
>
> This reminded me that the kernel has a recvmmsg (extra "m") syscall for
> multiple packets. I tested this (as udp_sink has support), but no luck.
> This is because internally in the kernel (do_recvmmsg) it is just a loop
> over ___sys_recvmsg/__skb_recv_udp, which has a BH-spinlock per packet
> that invokes __local_bh_enable_ip/do_softirq. I guess, we/netdev could
> fix recvmmsg() to bulk-dequeue from the socket queue (the BH-socket
> unlock is triggering __local_bh_enable_ip/do_softirq) and then have a
> solution for UDP(?).
>

recvmmsg does help getting more packets in a batch, but it has an issue
of upfront buffer allocation: when there are millions of connections,
preallocating too many buffers can add a lot of memory pressure. On an
alternative view, enabling UDP GRO seems like a direct help in this
context, to bring UDP on par with TCP and reduce the RX overhead. We
already have quite some UDP GRO/GSO use cases for virtual machines and
QUIC handling, time to persuade engineers to add more maybe.

Yan

> [2] netperf -H 198.18.1.1 -D1 -l 1200 -t UDP_STREAM -T 0,0 -- -m 1472 -N -n
>
> [3]
> $ nstat -n && sleep 1 && nstat
> #kernel
> IpInReceives                    216362             0.0
> IpInDelivers                    216354             0.0
> UdpInDatagrams                  134356             0.0
> UdpInErrors                     82020              0.0
> UdpRcvbufErrors                 82020              0.0
> IpExtInOctets                   324600000          0.0
> IpExtInNoECTPkts                216400             0.0
>
> > [1] https://blog.cloudflare.com/how-to-drop-10-million-packets/
> >
> >>
> >>> I do realize/acknowledge that the reverted patch caused other latency
> >>> issues, given it was a "big-hammer" approach affecting other softirq
> >>> processing (as can be seen by e.g. the watchdog fixes patches).
> >>> Thus, the revert makes sense, but how to regain the "overload"
> >>> protection such that RX networking cannot starve processes reading from
> >>> the socket? (is this what Sebastian's patchset does?)
> >>>
> >>
> >> I'm no expert in the sched / softirq area of the kernel, but I'm willing
> >> to help out testing different solutions that can regain the "overload"
> >> protection, e.g. avoid packet processing "falls-off-an-edge" (and thus
> >> opens the kernel to be DDoS'ed easily).
> >> Is this what Sebastian's patchset does?
> >>
> >>
> >>>
> >>> Thread link for people Cc'ed:
> >>> https://lore.kernel.org/all/20230814093528.117342-1-bigeasy@linutronix.de/#r
> >>
> >> --Jesper
> >> (some testlab results below)
> >>
> >> [udp_sink]
> >> https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c
> >>
> >> When udp_sink runs on same CPU as NAPI/softirq
> >> - UdpInDatagrams: 2,948 packets/sec
> >>
> >> $ nstat -n && sleep 1 && nstat
> >> #kernel
> >> IpInReceives                    2831056            0.0
> >> IpInDelivers                    2831053            0.0
> >> UdpInDatagrams                  2948               0.0
> >> UdpInErrors                     2828118            0.0
> >> UdpRcvbufErrors                 2828118            0.0
> >> IpExtInOctets                   130206496          0.0
> >> IpExtInNoECTPkts                2830576            0.0
> >>
> >> When udp_sink runs on another CPU than NAPI-RX.
> >> - UdpInDatagrams: 1,722,307 pps
> >>
> >> $ nstat -n && sleep 1 && nstat
> >> #kernel
> >> IpInReceives                    2318560            0.0
> >> IpInDelivers                    2318562            0.0
> >> UdpInDatagrams                  1722307            0.0
> >> UdpInErrors                     596280             0.0
> >> UdpRcvbufErrors                 596280             0.0
> >> IpExtInOctets                   106634256          0.0
> >> IpExtInNoECTPkts                2318136            0.0
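The UDP GRO option mentioned above is enabled per socket with setsockopt().
A hedged sketch, assuming a kernel with receive-side UDP GRO (v5.0+); the
fallback #defines mirror values from the Linux UAPI header and are noted as
assumptions in case the libc headers predate them:

#include <netinet/in.h>		/* IPPROTO_UDP */
#include <sys/socket.h>

#ifndef SOL_UDP
#define SOL_UDP IPPROTO_UDP	/* 17; assumption if headers lack it */
#endif
#ifndef UDP_GRO
#define UDP_GRO 104		/* from linux/udp.h; assumption as above */
#endif

/* Ask the kernel to coalesce UDP datagrams before delivering them;
 * recvmsg() can then return one buffer holding several segments. */
int enable_udp_gro(int fd)
{
	int one = 1;

	return setsockopt(fd, SOL_UDP, UDP_GRO, &one, sizeof(one));
}

/* After recvmsg() with msg_control set up, the original segment size
 * is reported in a UDP_GRO cmsg; -1 means a single plain datagram. */
int gro_segment_size(struct msghdr *msg)
{
	struct cmsghdr *cm;

	for (cm = CMSG_FIRSTHDR(msg); cm; cm = CMSG_NXTHDR(msg, cm)) {
		if (cm->cmsg_level == SOL_UDP && cm->cmsg_type == UDP_GRO)
			return *(int *)CMSG_DATA(cm);
	}
	return -1;
}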
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index a92bce40b04b3..5143ae0ea9356 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -590,9 +590,9 @@ asmlinkage void do_softirq(void);
 asmlinkage void __do_softirq(void);
 
 #ifdef CONFIG_PREEMPT_RT
-extern void do_softirq_post_smp_call_flush(unsigned int was_pending);
+extern void do_softirq_post_smp_call_flush(void);
 #else
-static inline void do_softirq_post_smp_call_flush(unsigned int unused)
+static inline void do_softirq_post_smp_call_flush(void)
 {
 	do_softirq();
 }
diff --git a/kernel/smp.c b/kernel/smp.c
index 385179dae360e..cd7db5ffe95ab 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -554,7 +554,6 @@ static void __flush_smp_call_function_queue(bool warn_cpu_offline)
  */
 void flush_smp_call_function_queue(void)
 {
-	unsigned int was_pending;
 	unsigned long flags;
 
 	if (llist_empty(this_cpu_ptr(&call_single_queue)))
@@ -562,10 +561,9 @@ void flush_smp_call_function_queue(void)
 
 	local_irq_save(flags);
 	/* Get the already pending soft interrupts for RT enabled kernels */
-	was_pending = local_softirq_pending();
 	__flush_smp_call_function_queue(true);
 	if (local_softirq_pending())
-		do_softirq_post_smp_call_flush(was_pending);
+		do_softirq_post_smp_call_flush();
 
 	local_irq_restore(flags);
 }
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 807b34ccd7973..aa299cb3ff47b 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -281,15 +281,13 @@ static inline void invoke_softirq(void)
 
 /*
  * flush_smp_call_function_queue() can raise a soft interrupt in a function
- * call. On RT kernels this is undesired and the only known functionality
- * in the block layer which does this is disabled on RT. If soft interrupts
- * get raised which haven't been raised before the flush, warn so it can be
- * investigated.
+ * call. On RT kernels this is undesired because the work is no longer processed
+ * in the context where it originated. It is not especially harmfull but best to
+ * be avoided.
  */
-void do_softirq_post_smp_call_flush(unsigned int was_pending)
+void do_softirq_post_smp_call_flush(void)
 {
-	if (WARN_ON_ONCE(was_pending != local_softirq_pending()))
-		invoke_softirq();
+	invoke_softirq();
 }
 
 #else /* CONFIG_PREEMPT_RT */
Once ksoftirqd becomes active, all softirqs which were raised would not
be processed immediately but delayed to ksoftirqd. On PREEMPT_RT this
means softirqs, which were raised in a threaded interrupt (at elevated
process priority), would not be served after the interrupt handler
completed its work but will wait until ksoftirqd (normal priority)
becomes running on the CPU. On a busy system with plenty of RT tasks
this could be delayed for quite some time and leads to problems in
general.

This is an undesired situation and it has been attempted to avoid the
situation in which ksoftirqd becomes scheduled. This changed since
commit d15121be74856 ("Revert "softirq: Let ksoftirqd do its job"")
and now a threaded interrupt handler will handle soft interrupts at its
end even if ksoftirqd is pending. That means that they will be processed
in the context in which they were raised. Unfortunately this also
includes all other soft interrupts which were raised (or enqueued)
earlier and are not yet handled. This happens if a thread with higher
priority is raised and has to catch up. This isn't a new problem and the
new high priority thread will PI-boost the current softirq owner or
start from scratch if ksoftirqd wasn't running yet.

Since pending ksoftirqd no longer blocks other interrupt threads from
handling soft interrupts, I believe the warning can be disabled. The
pending softirq work has to be solved differently.

Remove the warning and update the comment.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/interrupt.h |  4 ++--
 kernel/smp.c              |  4 +---
 kernel/softirq.c          | 12 +++++-------
 3 files changed, 8 insertions(+), 12 deletions(-)