Message ID | trinity-c7ec7cbd-a186-4a2a-bcb6-cce8993d6a90-1515428770628@3c-app-gmx-bs32 (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Mon, 8 Jan 2018, Josef Griebichler wrote: > Hi Maro, > > I tried your mentioned patch but unfortunately no real improvement for me. > dmesg http://ix.io/DOg > tvheadend service log http://ix.io/DOi > Errors during recording are still there. > Errors increase if there is additional tcp load on raspberry. > > Unfortunately there's no usbmon or tshark on libreelec so I can't provide further logs. Can you try running the same test on an x86_64 system? Alan Stern
No I can't sorry. There's no sat connection near to my workstation. Gesendet: Montag, 08. Januar 2018 um 17:31 Uhr Von: "Alan Stern" <stern@rowland.harvard.edu> An: "Josef Griebichler" <griebichler.josef@gmx.at> Cc: "Mauro Carvalho Chehab" <mchehab@s-opensource.com>, "Greg Kroah-Hartman" <gregkh@linuxfoundation.org>, linux-usb@vger.kernel.org, "Eric Dumazet" <edumazet@google.com>, "Rik van Riel" <riel@redhat.com>, "Paolo Abeni" <pabeni@redhat.com>, "Hannes Frederic Sowa" <hannes@redhat.com>, "Jesper Dangaard Brouer" <jbrouer@redhat.com>, linux-kernel <linux-kernel@vger.kernel.org>, netdev <netdev@vger.kernel.org>, "Jonathan Corbet" <corbet@lwn.net>, LMML <linux-media@vger.kernel.org>, "Peter Zijlstra" <peterz@infradead.org>, "David Miller" <davem@davemloft.net>, torvalds@linux-foundation.org Betreff: Re: Aw: Re: dvb usb issues since kernel 4.9 On Mon, 8 Jan 2018, Josef Griebichler wrote: > Hi Maro, > > I tried your mentioned patch but unfortunately no real improvement for me. > dmesg http://ix.io/DOg > tvheadend service log http://ix.io/DOi[http://ix.io/DOi] > Errors during recording are still there. > Errors increase if there is additional tcp load on raspberry. > > Unfortunately there's no usbmon or tshark on libreelec so I can't provide further logs. Can you try running the same test on an x86_64 system? Alan Stern
On Mon, 8 Jan 2018, Josef Griebichler wrote: > No I can't sorry. There's no sat connection near to my workstation. Can we ask the person who made this post: https://forum.libreelec.tv/thread/4235-dvb-issue-since-le-switched-to-kernel-4-9-x/?postID=75965#post75965 to run the test? The post says that the testing was done on an x86_64 machine. > Gesendet: Montag, 08. Januar 2018 um 17:31 Uhr > Von: "Alan Stern" <stern@rowland.harvard.edu> > An: "Josef Griebichler" <griebichler.josef@gmx.at> > Cc: "Mauro Carvalho Chehab" <mchehab@s-opensource.com>, "Greg Kroah-Hartman" <gregkh@linuxfoundation.org>, linux-usb@vger.kernel.org, "Eric Dumazet" <edumazet@google.com>, "Rik van Riel" <riel@redhat.com>, "Paolo Abeni" <pabeni@redhat.com>, "Hannes Frederic Sowa" <hannes@redhat.com>, "Jesper Dangaard Brouer" <jbrouer@redhat.com>, linux-kernel <linux-kernel@vger.kernel.org>, netdev <netdev@vger.kernel.org>, "Jonathan Corbet" <corbet@lwn.net>, LMML <linux-media@vger.kernel.org>, "Peter Zijlstra" <peterz@infradead.org>, "David Miller" <davem@davemloft.net>, torvalds@linux-foundation.org > Betreff: Re: Aw: Re: dvb usb issues since kernel 4.9 > On Mon, 8 Jan 2018, Josef Griebichler wrote: > Hi Maro, > > I tried your mentioned patch but unfortunately no real improvement for me. > dmesg http://ix.io/DOg > tvheadend service log http://ix.io/DOi[http://ix.io/DOi] > Errors during recording are still there. > Errors increase if there is additional tcp load on raspberry. > > Unfortunately there's no usbmon or tshark on libreelec so I can't provide further logs. Can you try running the same test on an x86_64 system? Alan Stern It appears that you are using a non-standard kernel. The vanilla kernel does not include any "dwc_otg_hcd" driver. Alan Stern
On Mon, 8 Jan 2018 12:35:08 -0500 (EST) Alan Stern <stern@rowland.harvard.edu> wrote: > On Mon, 8 Jan 2018, Josef Griebichler wrote: > > > No I can't sorry. There's no sat connection near to my workstation. > > Can we ask the person who made this post: > https://forum.libreelec.tv/thread/4235-dvb-issue-since-le-switched-to-kernel-4-9-x/?postID=75965#post75965 > > to run the test? The post says that the testing was done on an x86_64 > machine. For >5 years ago I used to play a lot with IPTV multicast MPEG2-TS streams (I implemented the wireshark mp2ts drop detecting, and a out-of-tree netfilter kernel module to detect drops[1]). The web-site is dead, but archive.org have a copy[2]. Let me quote my own Lab-setup documentation[3]. You don't need a live IPTV MPEG2TS signal, you can simply generate your own using VLC: $ vlc ~/Videos/test_video.mkv -I rc --sout '#duplicate{dst=std{access=udp,mux=ts,dst=239.254.1.1:5500}}' Viewing your own signal: You can view your own generated signal, again, by using VLC. $ vlc udp/ts://@239.254.1.1:5500 I hope the vlc syntax is still valid. And remember to join the multicast channels, if you don't have an application requesting the stream, as desc in [4]. [1] https://github.com/netoptimizer/IPTV-Analyzer [2] http://web.archive.org/web/20150328200122/http://www.iptv-analyzer.org:80/wiki/index.php/Main_Page [3] http://web.archive.org/web/20150329095538/http://www.iptv-analyzer.org:80/wiki/index.php/Lab_Setup [4] http://web.archive.org/web/20150328234459/http://www.iptv-analyzer.org:80/wiki/index.php/Multicast_Signal_on_Linux
On Mon, 8 Jan 2018 17:26:10 +0100 "Josef Griebichler" <griebichler.josef@gmx.at> wrote: > I tried your mentioned patch but unfortunately no real improvement for me. > dmesg http://ix.io/DOg > tvheadend service log http://ix.io/DOi > > Errors during recording are still there. Are you _also_ recording the stream on the Raspberry Pi? It seems to me, that you are expecting too much from this small device. > Errors increase if there is additional tcp load on raspberry. I did expected the issue to get worse, when you load the Pi with network traffic, as now the softirq time-budget have to be shared between networking and USB/DVB. Thus, I guess you are running TCP and USB/mpeg2ts on the same CPU (why when you have 4 CPUs?...) If you expect/want to get stable performance out of such a small box, then you (or LibreELEC) need to tune the box for this usage. And it does not have to be that complicated. First step is to move IRQ handling for the NIC to another CPU and than the USB port handling the DVB signal (/proc/irq/*/smp_affinity_list). And then pin the userspace process (taskset) to another CPU than the one handling USB-softirq. > Unfortunately there's no usbmon or tshark on libreelec so I can't > provide further logs. Do you have perf or trace-cmd on the box? Maybe we could come up with some kernel functions to trace, to measure/show the latency spikes?
On Mon, Jan 08, 2018 at 10:31:09PM +0100, Jesper Dangaard Brouer wrote: > I did expected the issue to get worse, when you load the Pi with > network traffic, as now the softirq time-budget have to be shared > between networking and USB/DVB. Thus, I guess you are running TCP and > USB/mpeg2ts on the same CPU (why when you have 4 CPUs?...) Isn't networking also over USB on the Pi ?
On Mon, 8 Jan 2018 22:44:27 +0100 Peter Zijlstra <peterz@infradead.org> wrote: > On Mon, Jan 08, 2018 at 10:31:09PM +0100, Jesper Dangaard Brouer wrote: > > I did expected the issue to get worse, when you load the Pi with > > network traffic, as now the softirq time-budget have to be shared > > between networking and USB/DVB. Thus, I guess you are running TCP and > > USB/mpeg2ts on the same CPU (why when you have 4 CPUs?...) > > Isn't networking also over USB on the Pi ? Darn, that is true. Looking at the dmesg output in http://ix.io/DOg: [ 0.405942] usbcore: registered new interface driver smsc95xx [ 5.821104] smsc95xx 1-1.1:1.0 eth0: link up, 100Mbps, full-duplex, lpa 0x45E1 I don't know enough about USB... is it possible to control which CPU handles the individual USB ports, or on some other level (than ports)?
Hi Linus, your patch works very good for me and others (please see https://forum.libreelec.tv/thread/4235-dvb-issue-since-le-switched-to-kernel-4-9-x/?postID=77006#post77006). No errors in recordings any more. The patch was also tested on x86_64 (Revo 3700) with positive effect. I agree with the forum poster, that there's still an issue when recording and watching livetv at same time. I also get audio dropouts and audio is out of sync. According to user smp kernel 4.9.73 with your patch on rpi and according to user jahutchi kernel 4.11.12 on x86_64 have no such issues. I don't know if this dropouts are related to this topic. If of any help I could provide perf output on raspberry with libreelec and tvheadend. Regards, Josef Gesendet: Montag, 08. Januar 2018 um 23:16 Uhr Von: "Jesper Dangaard Brouer" <jbrouer@redhat.com> An: "Peter Zijlstra" <peterz@infradead.org> Cc: "Josef Griebichler" <griebichler.josef@gmx.at>, "Mauro Carvalho Chehab" <mchehab@s-opensource.com>, "Alan Stern" <stern@rowland.harvard.edu>, "Greg Kroah-Hartman" <gregkh@linuxfoundation.org>, linux-usb@vger.kernel.org, "Eric Dumazet" <edumazet@google.com>, "Rik van Riel" <riel@redhat.com>, "Paolo Abeni" <pabeni@redhat.com>, "Hannes Frederic Sowa" <hannes@redhat.com>, linux-kernel <linux-kernel@vger.kernel.org>, netdev <netdev@vger.kernel.org>, "Jonathan Corbet" <corbet@lwn.net>, LMML <linux-media@vger.kernel.org>, "David Miller" <davem@davemloft.net>, torvalds@linux-foundation.org Betreff: Re: dvb usb issues since kernel 4.9 On Mon, 8 Jan 2018 22:44:27 +0100 Peter Zijlstra <peterz@infradead.org> wrote: > On Mon, Jan 08, 2018 at 10:31:09PM +0100, Jesper Dangaard Brouer wrote: > > I did expected the issue to get worse, when you load the Pi with > > network traffic, as now the softirq time-budget have to be shared > > between networking and USB/DVB. Thus, I guess you are running TCP and > > USB/mpeg2ts on the same CPU (why when you have 4 CPUs?...) > > Isn't networking also over USB on the Pi ? Darn, that is true. Looking at the dmesg output in http://ix.io/DOg: [ 0.405942] usbcore: registered new interface driver smsc95xx [ 5.821104] smsc95xx 1-1.1:1.0 eth0: link up, 100Mbps, full-duplex, lpa 0x45E1 I don't know enough about USB... is it possible to control which CPU handles the individual USB ports, or on some other level (than ports)? -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer[http://www.linkedin.com/in/brouer]
On Tue, Jan 9, 2018 at 8:51 AM, Josef Griebichler <griebichler.josef@gmx.at> wrote: > Hi Linus, > > your patch works very good for me and others (please see https://forum.libreelec.tv/thread/4235-dvb-issue-since-le-switched-to-kernel-4-9-x/?postID=77006#post77006). No errors in recordings any more. > The patch was also tested on x86_64 (Revo 3700) with positive effect. > I agree with the forum poster, that there's still an issue when recording and watching livetv at same time. I also get audio dropouts and audio is out of sync. > According to user smp kernel 4.9.73 with your patch on rpi and according to user jahutchi kernel 4.11.12 on x86_64 have no such issues. > I don't know if this dropouts are related to this topic. > > If of any help I could provide perf output on raspberry with libreelec and tvheadend. > Sorry to come late to the party. It seems problem comes from some piece of hardware/driver having some precise timing prereq, and opportunistic use of softirq/tasklet (instead maybe of hard irq handlers ) While it is true that softirq might do the job in most cases, we already have cases where this can be easily defeated, say if one cpu has suddenly to handle multiple sources of interrupts for various devices. NET_RX can easily lock the cpu for 10ms (on HZ=100 builds) So yes, commit 4cd13c21b207 ("softirq: Let ksoftirqd do its job") has shown up multiple times in various 'regressions' simply because it could surface the problem more often. But even if you revert it, you can still make the faulty driver/subsystem misbehave by adding more stress to the cpu handling the IRQ. Note that networking lacks fine control of its softirq processing. Some people found/complained that relying more on ksoftirqd was potentially adding tail latencies. Maybe the answer is to tune the kernel for small latencies at the price of small throughput (situation before the patch) 1) Revert the patch 2) get rid of ksoftirqd since it adds unexpected latencies. 3) Let applications that expect to have high throughput make sure to pin their threads on cpus that are not processing IRQ. (And make sure to not use irqbalance, and setup IRQ cpu affinities)
On Tue, Jan 9, 2018 at 9:27 AM, Eric Dumazet <edumazet@google.com> wrote: > > So yes, commit 4cd13c21b207 ("softirq: Let ksoftirqd do its job") has > shown up multiple times in various 'regressions' > simply because it could surface the problem more often. > But even if you revert it, you can still make the faulty > driver/subsystem misbehave by adding more stress to the cpu handling > the IRQ. ..but that's always true. People sometimes live on the edge - often by design (ie hardware has been designed/selected to be the crappiest possible that still work). That doesn't change anything. A patch that takes "bad things can happen" to "bad things DO happen" is a bad patch. > Maybe the answer is to tune the kernel for small latencies at the > price of small throughput (situation before the patch) Generally we always want to tune for latency. Throughput is "easy", but almost never interesting. Sure, people do batch jobs. And yes, people often _benchmark_ throughput, because it's easy to benchmark. It's much harder to benchmark latency, even when it's often much more important. A prime example is the SSD benchmarks in the last few years - they improved _dramatically_ when people noticed that the real problem was latency, not the idiotic maximum big-block bandwidth numbers that have almost zero impact on most people. Put another way: we already have a very strong implicit bias towards bandwidth just because it's easier to see and measure. That means that we generally should strive to have a explicit bias towards optimizing for latency when that choice comes up. Just to balance things out (and just to not take the easy way out: bandwidth can often be improved by adding more layers of buffering and bigger buffers, and that often ends up really hurting latency). > 1) Revert the patch Well, we can revert it only partially - limiting it to just networking for example. Just saying "act the way you used to for tasklets" already seems to have fixed the issue in DVB. > 2) get rid of ksoftirqd since it adds unexpected latencies. We can't get rid of it entirely, since the synchronous softirq code can cause problems too. It's why we have that "maximum of ten synchronous events" in __do_softirq(). And we don't *want* to get rid of it. We've _always_ had that small-scale "at some point we can't do it synchronously any more". That is a small-scale "don't have horrible latency for _other_ things" protection. So it's about latency too, it's just about protecting latency of the rest of the system. The problem with commit 4cd13c21b207 is that it turns the small-scale latency issues in softirq handling (they get larger latencies for lots of hardware interrupts or even from non-preemptible kernel code) into the _huge_ scale latency of scheduling, and does so in a racy way too. > 3) Let applications that expect to have high throughput make sure to > pin their threads on cpus that are not processing IRQ. > (And make sure to not use irqbalance, and setup IRQ cpu affinities) The only people that really deal in "thoughput only" tend to be the HPC people, and they already do things like this. (The other end of the spectrum is the realtime people that have extreme latency requirements, who do things like that for the reverse reason: keeping one or more CPU's reserved for the particular low-latency realtime job). Linus
On Tue, Jan 9, 2018 at 9:48 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, Jan 9, 2018 at 9:27 AM, Eric Dumazet <edumazet@google.com> wrote: >> >> So yes, commit 4cd13c21b207 ("softirq: Let ksoftirqd do its job") has >> shown up multiple times in various 'regressions' >> simply because it could surface the problem more often. >> But even if you revert it, you can still make the faulty >> driver/subsystem misbehave by adding more stress to the cpu handling >> the IRQ. > > ..but that's always true. People sometimes live on the edge - often by > design (ie hardware has been designed/selected to be the crappiest > possible that still work). > > That doesn't change anything. A patch that takes "bad things can > happen" to "bad things DO happen" is a bad patch. I was expecting that people could get a chance to fix the root cause, instead of trying to keep status quo. Strangely, it took 18 months for someone to complain enough and 'bisect to this commit' Your patch considers TASKLET_SOFTIRQ being a candidate for 'immediate handling', but TCP Small queues heavily use TASKLET, so as far as I am concerned a revert would have the same effect.
On Tue, Jan 9, 2018 at 9:57 AM, Eric Dumazet <edumazet@google.com> wrote: > > Your patch considers TASKLET_SOFTIRQ being a candidate for 'immediate > handling', but TCP Small queues heavily use TASKLET, > so as far as I am concerned a revert would have the same effect. Does it actually? TCP ends up dropping packets outside of the window etc, so flooding a machine with TCP packets and causing some further processing up the stack sounds very different from the basic packet flooding thing that happens with NET_RX_SOFTIRQ. Also, honestly, the kinds of people who really worry about flooding tend to have packet filtering in the receive path etc. So I really think "you can use up 90% of CPU time with a UDP packet flood from the same network" is very very very different - and honestly not at all as important - as "you want to be able to use a USB DVB receiver and watch/record TV". Because that whole "UDP packet flood from the same network" really is something you _fundamentally_ have other mitigations for. I bet that whole commit was introduced because of a benchmark test, rather than real life. No? In contrast, now people are complaining about real loads not working. Linus
On Tue, Jan 9, 2018 at 10:58 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, Jan 9, 2018 at 9:57 AM, Eric Dumazet <edumazet@google.com> wrote: >> >> Your patch considers TASKLET_SOFTIRQ being a candidate for 'immediate >> handling', but TCP Small queues heavily use TASKLET, >> so as far as I am concerned a revert would have the same effect. > > Does it actually? > > TCP ends up dropping packets outside of the window etc, so flooding a > machine with TCP packets and causing some further processing up the > stack sounds very different from the basic packet flooding thing that > happens with NET_RX_SOFTIRQ. > > Also, honestly, the kinds of people who really worry about flooding > tend to have packet filtering in the receive path etc. > > So I really think "you can use up 90% of CPU time with a UDP packet > flood from the same network" is very very very different - and > honestly not at all as important - as "you want to be able to use a > USB DVB receiver and watch/record TV". > > Because that whole "UDP packet flood from the same network" really is > something you _fundamentally_ have other mitigations for. > > I bet that whole commit was introduced because of a benchmark test, > rather than real life. No? > > In contrast, now people are complaining about real loads not working. > > Linus I said that a revert was fine, maybe I was not clear. Clearly we can not touch anything scheduler related without breaking someone workload/assumptions on how system behaved at some point. Your patch wont solve other workloads that might have been impacted by my patch, so in one year (or next week), we will have to cope with another device driver not using tasklet but still relying on immediate softirq processing. Apparently, we have to live with softirq model forever, or switch to RT kernels. Note that we have no mitigation for something that involve flood of valid packets that no firewall can drop (without dropping legitimate packets). The 'benchmark' here is not really the trigger, only a tool validating an idea/patch.
On Tue, 9 Jan 2018 10:58:30 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > So I really think "you can use up 90% of CPU time with a UDP packet > flood from the same network" is very very very different - and > honestly not at all as important - as "you want to be able to use a > USB DVB receiver and watch/record TV". > > Because that whole "UDP packet flood from the same network" really is > something you _fundamentally_ have other mitigations for. > > I bet that whole commit was introduced because of a benchmark test, > rather than real life. No? I believe this have happened in real-life. In the form of DNS servers not being able to recover after long outage, where DNS-TTL had timeout causing legitimate traffic to overload their DNS servers. The goodput answers/sec from their DNS servers were too low, when bringing them online again. (Based on talk over beer at NetDevConf from a guy claiming they ran DNS for AWS). The commit 4cd13c21b207 ("softirq: Let ksoftirqd do its job") tries to address a fundamental problem that the network stack have when interacting with softirq in overload situations. (Maybe we can come up with a better solution?) Before this commit, when application run on same CPU as softirq, the kernel have a bad "drop off cliff" behavior, when reaching above the saturation point. This is confirmed in CloudFlare blogpost[1], which used a kernel that predates this commit. From[1] section: "A note on NUMA performance" Quote:" 1. Run receiver on another CPU, but on the same NUMA node as the RX queue. The performance as we saw above is around 360kpps. 2. With receiver on exactly same CPU as the RX queue we can get up to ~430kpps. But it creates high variability. The performance drops down to zero if the NIC is overwhelmed with packets." The behavior problem here is "performance drops down to zero if the NIC is overwhelmed with packets". That is a bad way to handle overload. Not only when attacked, but also when bringing a service online after an outage. What essentially happens is that: 1. softirq NAPI enqueue 64 packets into socket. 2. application dequeue 1 packet and invoke local_bh_enable() 3. causing softirq to run in app-timeslice, again enq 64 packets 4. app only see goodput of 1/128 of packets That is essentially what Eric solved with his commit, avoiding (3) local_bh_enable() to invoke softirq if ksoftirqd is already running. Maybe we can come up with a better solution? (as I do agree this was a too big-hammer affecting other use-cases) [1] https://blog.cloudflare.com/how-to-receive-a-million-packets/ p.s. Regarding quote[1] point "1.", after Paolo Abeni optimized the UDP code, that statement is no longer true. It now (significantly) faster to run/pin your UDP application to another CPU than the RX-CPU.
Em Tue, 9 Jan 2018 09:48:47 -0800 Linus Torvalds <torvalds@linux-foundation.org> escreveu: > On Tue, Jan 9, 2018 at 9:27 AM, Eric Dumazet <edumazet@google.com> wrote: > > > > So yes, commit 4cd13c21b207 ("softirq: Let ksoftirqd do its job") has > > shown up multiple times in various 'regressions' > > simply because it could surface the problem more often. > > But even if you revert it, you can still make the faulty > > driver/subsystem misbehave by adding more stress to the cpu handling > > the IRQ. > > ..but that's always true. People sometimes live on the edge - often by > design (ie hardware has been designed/selected to be the crappiest > possible that still work). > > That doesn't change anything. A patch that takes "bad things can > happen" to "bad things DO happen" is a bad patch. > > > Maybe the answer is to tune the kernel for small latencies at the > > price of small throughput (situation before the patch) > > Generally we always want to tune for latency. Throughput is "easy", > but almost never interesting. > > Sure, people do batch jobs. And yes, people often _benchmark_ > throughput, because it's easy to benchmark. It's much harder to > benchmark latency, even when it's often much more important. > > A prime example is the SSD benchmarks in the last few years - they > improved _dramatically_ when people noticed that the real problem was > latency, not the idiotic maximum big-block bandwidth numbers that have > almost zero impact on most people. > > Put another way: we already have a very strong implicit bias towards > bandwidth just because it's easier to see and measure. > > That means that we generally should strive to have a explicit bias > towards optimizing for latency when that choice comes up. Just to > balance things out (and just to not take the easy way out: bandwidth > can often be improved by adding more layers of buffering and bigger > buffers, and that often ends up really hurting latency). > > > 1) Revert the patch > > Well, we can revert it only partially - limiting it to just networking > for example. > > Just saying "act the way you used to for tasklets" already seems to > have fixed the issue in DVB. > > > 2) get rid of ksoftirqd since it adds unexpected latencies. > > We can't get rid of it entirely, since the synchronous softirq code > can cause problems too. It's why we have that "maximum of ten > synchronous events" in __do_softirq(). > > And we don't *want* to get rid of it. > > We've _always_ had that small-scale "at some point we can't do it > synchronously any more". > > That is a small-scale "don't have horrible latency for _other_ things" > protection. So it's about latency too, it's just about protecting > latency of the rest of the system. > > The problem with commit 4cd13c21b207 is that it turns the small-scale > latency issues in softirq handling (they get larger latencies for lots > of hardware interrupts or even from non-preemptible kernel code) into > the _huge_ scale latency of scheduling, and does so in a racy way too. > > > 3) Let applications that expect to have high throughput make sure to > > pin their threads on cpus that are not processing IRQ. > > (And make sure to not use irqbalance, and setup IRQ cpu affinities) > > The only people that really deal in "thoughput only" tend to be the > HPC people, and they already do things like this. > > (The other end of the spectrum is the realtime people that have > extreme latency requirements, who do things like that for the reverse > reason: keeping one or more CPU's reserved for the particular > low-latency realtime job). Ok, it took me some time - and a faster microSD - in order to be sure that the data loss weren't due to bad storage performance, but I have now some test results. In summary, indeed the ksoftirq commit 4cd13c21b207 ("softirq: Let ksoftirqd do its job") is causing data losses. On my tests, it generate at least one continuity error on every 1-5 minutes. Either reverting it or applying Linus proposal of partially reverting it fixes the issues. Increasing the number of URBs doesn't seem to help. I'm enclosing the dirty details below. Linus/Eric, Now that I have an environment setup, I can test whatever other alternative that would fix the UDP packet flow attack while won't break the softirq handling code. Regards, Mauro --- All tests below were done on a Raspberry Pi3 with a SanDisk Extreme U3 microSD card with 32GB and a DVBSky S960C DVB-S2 tuner with an external power supply, connected to a TCP/IP network via Ethernet (with uses USB on RPi). It also have a serial cable connected to it. It was installed with LibreELEC 8.2.2, using tvheadend backend. I'm recording one MPEG-TS service/"channel" composed of one audio and one video stream, The total traffic collected by tvheadend was about 4 Mbits/s (audio+video+EPG tables). It is part of a 58 mbits/s MPEG Transport stream, with 23 TV service/"channels" on it. While handling this issue, I found one unrelated bug, fixed on this patch: https://git.linuxtv.org/mchehab/experimental.git/commit/?h=softirq_fixup&id=afb6c749c9da6e661335bc059f2b117421c09f77 This bug has no effect on DVB streaming. It only causes the signal strength to be reported wrongly on 32 bit Kernels. On all tests below I had this patch applied. Test 1 ====== Kernel (e. g. Raspbian Kernel), recording and watching the video at the same time on Kodi, plus one VLC client, on an interval of time of 5 minutes, it had 4 MPEG continuity errors on video (and one on audio): Jan 12 15:05:39 rpi3 tvheadend[285]: TS: DVB-S Network/12090H/TV Senado: H264 @ #1601 Continuity counter error (total 1) Jan 12 15:06:20 rpi3 tvheadend[285]: TS: DVB-S Network/12090H/TV Senado: H264 @ #1601 Continuity counter error (total 2) Jan 12 15:07:36 rpi3 tvheadend[285]: TS: DVB-S Network/12090H/TV Senado: H264 @ #1601 Continuity counter error (total 3) Jan 12 15:07:36 rpi3 tvheadend[285]: TS: DVB-S Network/12090H/TV Senado: MPEG2AUDIO @ #1602 Continuity counter error (total 1) Jan 12 15:10:28 rpi3 tvheadend[285]: TS: DVB-S Network/12090H/TV Senado: H264 @ #1601 Continuity counter error (total 4) With upstream Kernels, Kodi stops working (it depends on Raspbian video driver). So, I opened two VLC players, on separate machines, in order to also have 2 clients watching, plus the record task. That increased the network and USB traffic as well. All the next tests were on such scenario. Test 2 ====== With Kernel 4.14.12 vanilla with just one extra patch increasing the number of URB buffers from 8 to 16, it got 2 video errors on a 6 minutes interval: Jan 12 15:56:09 rpi3 tvheadend[222]: TS: DVB-S Network/12090H/TV Senado: H264 @ #1601 Continuity counter error (total 2) Jan 12 15:56:09 rpi3 tvheadend[222]: TS: DVB-S Network/12090H/TV Senado: MPEG2AUDIO @ #1602 Continuity counter error (total 1) Jan 12 16:03:05 rpi3 tvheadend[222]: TS: DVB-S Network/12090H/TV Senado: H264 @ #1601 Continuity counter error (total 3) Jan 12 16:03:05 rpi3 tvheadend[222]: TS: DVB-S Network/12090H/TV Senado: MPEG2AUDIO @ #1602 Continuity counter error (total 2) Test 3 ====== With upstream Kernel 4.14.12 + 16 buffers patch + commit 4cd13c21b207 reverted, I kept it running for about 15-30 mins. No continuity errors. Test 4 ====== With upstream Kernel 4.14.12 with the partial softirq revert made by Linus test patch[1], running for about 20 mins, it got just one continuity error: Jan 12 16:51:31 rpi3 tvheadend[237]: TS: DVB-S Network/12090H/TV Senado: H264 @ #1601 Continuity counter error (total 1) Jan 12 16:51:31 rpi3 tvheadend[237]: TS: DVB-S Network/12090H/TV Senado: MPEG2AUDIO @ #1602 Continuity counter error (total 1) [1] https://git.linuxtv.org/mchehab/experimental.git/commit/?h=softirq_fixup&id=7996c39af87d329f64e6b1b2af120d6ce11ede29 Test 5 ====== I then moved to Kernel 4.15-rc7. In this case, I had to add an extra patch, as the USB controller is currently broken upstream: https://git.linuxtv.org/mchehab/experimental.git/commit/?h=softirq_fixup&id=6bcc57ea8a84e9d5fed9f5ebf13d63fd28ef181c The .config file used to build the Kernel is at: https://pastebin.com/wpZghann With upstream Kernel 4.15-rc7 - with Linus patch applied[2], I kept the record + 2 VLC clients running for about one hour. It got just one continuity error: Jan 12 20:06:26 rpi3 tvheadend[226]: TS: DVB-S Network/12090H/TV Senado: H264 @ #1601 Continuity counter error (total 1) Jan 12 20:06:26 rpi3 tvheadend[226]: TS: DVB-S Network/12090H/TV Senado: MPEG2AUDIO @ #1602 Continuity counter error (total 1) [2] The test Kernel is this one: https://git.linuxtv.org/mchehab/experimental.git/log/?h=softirq_fixup It is hard to tell if this one continuity error is due to some Kernel issue, or if it is simply due to some PES packet with bad CRC that got discarded, but it seems a normal condition to me. Thanks, Mauro
On Fri, 2018-01-12 at 19:13 -0200, Mauro Carvalho Chehab wrote: > > > The .config file used to build the Kernel is at: > https://pastebin.com/wpZghann > Hi Mauro Any chance you can try CONFIG_HZ_1000=y, CONFIG_HZ=1000 ? Thanks.
Em Fri, 12 Jan 2018 13:48:46 -0800 Eric Dumazet <eric.dumazet@gmail.com> escreveu: > On Fri, 2018-01-12 at 19:13 -0200, Mauro Carvalho Chehab wrote: > > > > > > The .config file used to build the Kernel is at: > > https://pastebin.com/wpZghann > > > > Hi Mauro > > Any chance you can try CONFIG_HZ_1000=y, CONFIG_HZ=1000 ? I can do such test to satisfy your curiosity, but that doesn't sound the right fix. See, almost all TV and set top boxes(STB) run Linux nowadays and usually come with ARM cpus designed to "just do their job" (e. g. CPUs with low clocks). There, power consumption is a must. This bug very likely affect those devices, once migrated to Kernel 4.9+. Changing from NO_HZ to HZ=1000 on TV/STB will for sure have bad side effects on those types of devices, increasing power consumption. Not saying that this will be environmentally very bad, as the number of just TV unit sales is at the order of 230 million units per year[1]. [1] https://www.statista.com/statistics/461316/global-tv-unit-sales/ Thanks, Mauro
Em Sat, 13 Jan 2018 07:09:20 -0200 Mauro Carvalho Chehab <mchehab@s-opensource.com> escreveu: > Em Fri, 12 Jan 2018 13:48:46 -0800 > Eric Dumazet <eric.dumazet@gmail.com> escreveu: > > > On Fri, 2018-01-12 at 19:13 -0200, Mauro Carvalho Chehab wrote: > > > > > > > > > The .config file used to build the Kernel is at: > > > https://pastebin.com/wpZghann > > > > > > > Hi Mauro > > > > Any chance you can try CONFIG_HZ_1000=y, CONFIG_HZ=1000 ? It actually made it a lot worse! without Linus patch (or reverting the softirq patch), on a 4 minutes of capture, it got all those errors: Jan 13 10:41:41 rpi3 tvheadend[226]: TS: DVB-S Network/12130H/NBR: H264 @ #1911 Continuity counter error (total 1) Jan 13 10:41:42 rpi3 tvheadend[226]: TS: DVB-S Network/12130H/NBR: MPEG2AUDIO @ #1912 Continuity counter error (total 1) Jan 13 10:42:14 rpi3 tvheadend[226]: TS: DVB-S Network/12130H/NBR: H264 @ #1911 Continuity counter error (total 3) Jan 13 10:42:47 rpi3 tvheadend[226]: TS: DVB-S Network/12130H/NBR: H264 @ #1911 Continuity counter error (total 4) Jan 13 10:42:58 rpi3 tvheadend[226]: TS: DVB-S Network/12130H/NBR: H264 @ #1911 Continuity counter error (total 5) Jan 13 10:42:58 rpi3 tvheadend[226]: TS: DVB-S Network/12130H/NBR: MPEG2AUDIO @ #1912 Continuity counter error (total 2) Jan 13 10:43:34 rpi3 tvheadend[226]: TS: DVB-S Network/12130H/NBR: H264 @ #1911 Continuity counter error (total 9) Jan 13 10:43:37 rpi3 tvheadend[226]: TS: DVB-S Network/12130H/NBR: MPEG2AUDIO @ #1912 Continuity counter error (total 5) Jan 13 10:44:00 rpi3 tvheadend[226]: TS: DVB-S Network/12130H/NBR: H264 @ #1911 Continuity counter error (total 12) Jan 13 10:44:29 rpi3 tvheadend[226]: TS: DVB-S Network/12130H/NBR: H264 @ #1911 Continuity counter error (total 13) Thanks, Mauro
diff --git a/drivers/media/usb/dvb-usb-v2/dvbsky.c b/drivers/media/usb/dvb-usb-v2/dvbsky.c index 131b6c08e199..d3f5ffc54b25 100644 --- a/drivers/media/usb/dvb-usb-v2/dvbsky.c +++ b/drivers/media/usb/dvb-usb-v2/dvbsky.c @@ -740,7 +740,7 @@ static struct dvb_usb_device_properties dvbsky_s960_props = { .num_adapters = 1, .adapter = { { - .stream = DVB_USB_STREAM_BULK(0x82, 8, 4096), + .stream = DVB_USB_STREAM_BULK(0x82, 65, 4096), } } };