Message ID | 20170718060909.5280-1-airlied@redhat.com (mailing list archive) |
---|---|
State | New, archived |
On Tue, Jul 18, 2017 at 04:09:09PM +1000, Dave Airlie wrote:
> This patch allows the user to disable write combined mapping
> of the efifb framebuffer console using an nowc option.
>
> A customer noticed major slowdowns while logging to the console
> with write combining enabled, on other tasks running on the same
> CPU. (10x or greater slow down on all other cores on the same CPU
> as is doing the logging).
>
> I reproduced this on a machine with dual CPUs.
> Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (6 core)
>
> I wrote a test that just mmaps the pci bar and writes to it in
> a loop, while this was running in the background one a single
> core with (taskset -c 1), building a kernel up to init/version.o
> (taskset -c 8) went from 13s to 133s or so. I've yet to explain
> why this occurs or what is going wrong I haven't managed to find
> a perf command that in any way gives insight into this.
>
> 11,885,070,715 instructions # 1.39 insns per cycle
> vs
> 12,082,592,342 instructions # 0.13 insns per cycle
>
> is the only thing I've spotted of interest, I've tried at least:
> dTLB-stores,dTLB-store-misses,L1-dcache-stores,LLC-store,LLC-store-misses,LLC-load-misses,LLC-loads,\mem-loads,mem-stores,iTLB-loads,iTLB-load-misses,cache-references,cache-misses
>
> For now it seems at least a good idea to allow a user to disable write
> combining if they see this until we can figure it out.

Well, that's kind of amazing, given 3c004b4f7eab239e switched us /to/
using ioremap_wc() for the exact same reason.  I'm not against letting
the user force one way or the other if it helps, though it sure would be
nice to know why.

Anyway,

Acked-By: Peter Jones <pjones@redhat.com>

Bartlomiej, do you want to handle this in your devel tree?
On Tue, Jul 18, 2017 at 7:34 AM, Peter Jones <pjones@redhat.com> wrote:
>
> Well, that's kind of amazing, given 3c004b4f7eab239e switched us /to/
> using ioremap_wc() for the exact same reason. I'm not against letting
> the user force one way or the other if it helps, though it sure would be
> nice to know why.

It's kind of amazing for another reason too: how is ioremap_wc() _possibly_ slower than ioremap_nocache() (which is what plain ioremap() is)?

The difference is literally _PAGE_CACHE_MODE_WC vs _PAGE_CACHE_MODE_UC_MINUS. Both of them should be uncached, but WC should allow much better write behavior. It should also allow much better system behavior.

This really sounds like a band-aid patch that just hides some other issue entirely. Maybe we screw up the cache modes for some PAT mode setup?

Or maybe it really is something where there is one global write queue per die (not per CPU), and having that write queue "active" doing combining will slow down every core due to some crazy synchronization issue?

x86 people, look at what Dave Airlie did, I'll just repeat it because it sounds so crazy:

> A customer noticed major slowdowns while logging to the console
> with write combining enabled, on other tasks running on the same
> CPU. (10x or greater slow down on all other cores on the same CPU
> as is doing the logging).
>
> I reproduced this on a machine with dual CPUs.
> Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (6 core)
>
> I wrote a test that just mmaps the pci bar and writes to it in
> a loop, while this was running in the background one a single
> core with (taskset -c 1), building a kernel up to init/version.o
> (taskset -c 8) went from 13s to 133s or so. I've yet to explain
> why this occurs or what is going wrong I haven't managed to find
> a perf command that in any way gives insight into this.

So basically the UC vs WC thing seems to slow down somebody *else* (in this case a kernel compile) on another core entirely, by a factor of 10x. Maybe the WC writer itself is much faster, but _others_ are slowed down enormously.

Whaa? That just seems incredible.

Dave - while your test sounds very simple, can you package it up some way so that somebody inside of Intel can just run it on one of their machines?

The patch itself (to allow people to *not* do WC that is supposed to be so much better but clearly doesn't seem to be) looks fine to me, but it would be really good to get intel to look at this.

Linus
On 19 July 2017 at 05:57, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Tue, Jul 18, 2017 at 7:34 AM, Peter Jones <pjones@redhat.com> wrote:
>>
>> Well, that's kind of amazing, given 3c004b4f7eab239e switched us /to/
>> using ioremap_wc() for the exact same reason. I'm not against letting
>> the user force one way or the other if it helps, though it sure would be
>> nice to know why.
>
> It's kind of amazing for another reason too: how is ioremap_wc()
> _possibly_ slower than ioremap_nocache() (which is what plain
> ioremap() is)?

In normal operation the console is faster with _wc. It's the side effects on other cores that is the problem.

> Or maybe it really is something where there is one global write queue
> per die (not per CPU), and having that write queue "active" doing
> combining will slow down every core due to some crazy synchronization
> issue?
>
> x86 people, look at what Dave Airlie did, I'll just repeat it because
> it sounds so crazy:
>
>> A customer noticed major slowdowns while logging to the console
>> with write combining enabled, on other tasks running on the same
>> CPU. (10x or greater slow down on all other cores on the same CPU
>> as is doing the logging).
>>
>> I reproduced this on a machine with dual CPUs.
>> Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (6 core)
>>
>> I wrote a test that just mmaps the pci bar and writes to it in
>> a loop, while this was running in the background one a single
>> core with (taskset -c 1), building a kernel up to init/version.o
>> (taskset -c 8) went from 13s to 133s or so. I've yet to explain
>> why this occurs or what is going wrong I haven't managed to find
>> a perf command that in any way gives insight into this.
>
> So basically the UC vs WC thing seems to slow down somebody *else* (in
> this case a kernel compile) on another core entirely, by a factor of
> 10x. Maybe the WC writer itself is much faster, but _others_ are
> slowed down enormously.
>
> Whaa? That just seems incredible.

Yes, I've been staring at this for a while now trying to narrow it down. I've been a bit slow on testing it on a wider range of Intel CPUs; I've only really managed to play on that particular machine.

I've attached two test files. Compile both of them (I just used "make write_resource burn-cycles").

On my test CPU, cores 1/8 are on the same die.

time taskset -c 1 ./burn-cycles
takes about 6 seconds

taskset -c 8 ./write_resource wc
taskset -c 1 ./burn-cycles
takes about 1 minute.

Now I've noticed write_resource wc or not wc doesn't seem to make a difference, so I think it matters that efifb has used _wc for the memory area already and set PAT on it for wc, and we always get wc on that BAR.

From the other person seeing it:
"I done a similar test some time ago, the result was the same.
I ran some benchmarks, and it seems that when data set fits in L1 cache there is no significant performance degradation."

Dave.
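The two attachments themselves are not reproduced in the archive. Purely as an illustration of the kind of test being described (mmap a PCI BAR through its sysfs resource file and stream writes to it), a minimal write_resource-style program might look roughly like the sketch below; the device path, BAR size and option handling are assumptions, not Dave's actual code:

/*
 * write_resource.c - hypothetical sketch, not the actual attachment.
 * Maps BAR0 of the VGA device via sysfs and writes to it in a loop.
 * Passing "wc" uses resource0_wc, which gives a write-combined mapping
 * (only present for prefetchable BARs on arches that support WC mmap);
 * anything else uses the uncached resource0 file.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define BAR_SIZE (16 * 1024 * 1024)	/* 16M, matching the G200EH BAR0 */

int main(int argc, char **argv)
{
	const char *path = (argc > 1 && !strcmp(argv[1], "wc")) ?
		"/sys/bus/pci/devices/0000:01:00.1/resource0_wc" :
		"/sys/bus/pci/devices/0000:01:00.1/resource0";
	volatile uint32_t *bar;
	size_t i;
	int fd = open(path, O_RDWR);

	if (fd < 0) {
		perror(path);
		return 1;
	}
	bar = mmap(NULL, BAR_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (bar == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	for (;;)	/* stream writes to the BAR forever */
		for (i = 0; i < BAR_SIZE / sizeof(*bar); i++)
			bar[i] = i;
	return 0;
}

A burn-cycles counterpart can be any CPU- and cache-bound loop (for instance repeatedly summing over a few-megabyte array); the comparison above then amounts to running "taskset -c 8 ./write_resource wc" in the background and timing "taskset -c 1 ./burn-cycles".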
On 19 July 2017 at 06:44, Dave Airlie <airlied@gmail.com> wrote: > On 19 July 2017 at 05:57, Linus Torvalds <torvalds@linux-foundation.org> wrote: >> On Tue, Jul 18, 2017 at 7:34 AM, Peter Jones <pjones@redhat.com> wrote: >>> >>> Well, that's kind of amazing, given 3c004b4f7eab239e switched us /to/ >>> using ioremap_wc() for the exact same reason. I'm not against letting >>> the user force one way or the other if it helps, though it sure would be >>> nice to know why. >> >> It's kind of amazing for another reason too: how is ioremap_wc() >> _possibly_ slower than ioremap_nocache() (which is what plain >> ioremap() is)? > > In normal operation the console is faster with _wc. It's the side effects > on other cores that is the problem. > >> Or maybe it really is something where there is one global write queue >> per die (not per CPU), and having that write queue "active" doing >> combining will slow down every core due to some crazy synchronization >> issue? >> >> x86 people, look at what Dave Airlie did, I'll just repeat it because >> it sounds so crazy: >> >>> A customer noticed major slowdowns while logging to the console >>> with write combining enabled, on other tasks running on the same >>> CPU. (10x or greater slow down on all other cores on the same CPU >>> as is doing the logging). >>> >>> I reproduced this on a machine with dual CPUs. >>> Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (6 core) >>> >>> I wrote a test that just mmaps the pci bar and writes to it in >>> a loop, while this was running in the background one a single >>> core with (taskset -c 1), building a kernel up to init/version.o >>> (taskset -c 8) went from 13s to 133s or so. I've yet to explain >>> why this occurs or what is going wrong I haven't managed to find >>> a perf command that in any way gives insight into this. >> >> So basically the UC vs WC thing seems to slow down somebody *else* (in >> this case a kernel compile) on another core entirely, by a factor of >> 10x. Maybe the WC writer itself is much faster, but _others_ are >> slowed down enormously. >> >> Whaa? That just seems incredible. > > Yes I've been staring at this for a while now trying to narrow it down, I've > been a bit slow on testing it on a wider range of Intel CPUs, I've > only really managed > to play on that particular machine, > > I've attached two test files. compile both of them (I just used make > write_resource burn-cycles). > > On my test CPU core 1/8 are on same die. > > time taskset -c 1 ./burn-cycles > takes about 6 seconds > > taskset -c 8 ./write_resource wc > taskset -c 1 ./burn-cycles > takes about 1 minute. > > Now I've noticed write_resource wc or not wc doesn't seem to make a > difference, so > I think it matters that efifb has used _wc for the memory area already > and set PAT on it for wc, > and we always get wc on that BAR. > > From the other person seeing it: > "I done a similar test some time ago, the result was the same. > I ran some benchmarks, and it seems that when data set fits in L1 > cache there is no significant performance degradation." Oh and just FYI, the machine I've tested this on has an mgag200 server graphics card backing the framebuffer, but with just efifb loaded. Dave. -- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jul 18, 2017 at 2:21 PM, Dave Airlie <airlied@gmail.com> wrote:
>
> Oh and just FYI, the machine I've tested this on has an mgag200 server
> graphics card backing the framebuffer, but with just efifb loaded.

Yeah, it looks like it needs special hardware - and particularly the kind of garbage hardware that people only have on servers.

Why do server people continually do absolute sh*t hardware? It's crap, crap, crap across the board outside the CPU. Nasty and bad hacky stuff that nobody else would touch with a ten-foot pole, and the "serious enterprise" people lap it up like it was ambrosia.

It's not just "graphics is bad anyway since we don't care". It's all the things they ostensibly _do_ care about too, like the disk and the fabric infrastructure. Buggy nasty crud.

Anyway, rant over. I wonder if we could show this without special hardware by just mapping some region that doesn't even have hardware in it as WC. Do we even expose the PAT settings to user space, though, or do we always have to have some fake module to create the PAT stuff?

Linus
On 19 July 2017 at 08:22, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, Jul 18, 2017 at 2:21 PM, Dave Airlie <airlied@gmail.com> wrote: >> >> Oh and just FYI, the machine I've tested this on has an mgag200 server >> graphics card backing the framebuffer, but with just efifb loaded. > > Yeah, it looks like it needs special hardware - and particularly the > kind of garbage hardware that people only have on servers. > > Why do server people continually do absolute sh*t hardware? It's crap, > crap, crap across the board outside the CPU. Nasty and bad hacky stuff > that nobody else would touch with a ten-foot pole, and the "serious > enterprise" people lap it up like it was ambrosia. > > It's not just "graphics is bad anyway since we don't care". It's all > the things they ostensibly _do_ care about too, like the disk and the > fabric infrastructure. Buggy nasty crud. I've tried to reproduce now on: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz using some address space from 02:00.0 3D controller: NVIDIA Corporation GM108M [GeForce 940MX] (rev a2) And I don't see the issue. I'll try and track down some more efi compatible mga or other wierd server chips stuff if I can. > Anyway, rant over. I wonder if we could show this without special > hardware by just mapping some region that doesn't even have hardware > in it as WC. Do we even expose the PAT settings to user space, though, > or do we always have to have some fake module to create the PAT stuff? I do wonder wtf the hw could be doing that would cause this, but I've no idea how to tell what difference a write combined PCI transaction would have on the bus side of things, and what the device could generate that would cause such a horrible slowdown. Dave. -- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 19 July 2017 at 09:16, Dave Airlie <airlied@gmail.com> wrote: > On 19 July 2017 at 08:22, Linus Torvalds <torvalds@linux-foundation.org> wrote: >> On Tue, Jul 18, 2017 at 2:21 PM, Dave Airlie <airlied@gmail.com> wrote: >>> >>> Oh and just FYI, the machine I've tested this on has an mgag200 server >>> graphics card backing the framebuffer, but with just efifb loaded. >> >> Yeah, it looks like it needs special hardware - and particularly the >> kind of garbage hardware that people only have on servers. >> >> Why do server people continually do absolute sh*t hardware? It's crap, >> crap, crap across the board outside the CPU. Nasty and bad hacky stuff >> that nobody else would touch with a ten-foot pole, and the "serious >> enterprise" people lap it up like it was ambrosia. >> >> It's not just "graphics is bad anyway since we don't care". It's all >> the things they ostensibly _do_ care about too, like the disk and the >> fabric infrastructure. Buggy nasty crud. > > I've tried to reproduce now on: > Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz > using some address space from > 02:00.0 3D controller: NVIDIA Corporation GM108M [GeForce 940MX] (rev a2) > > And I don't see the issue. > > I'll try and track down some more efi compatible mga or other wierd server chips > stuff if I can. > >> Anyway, rant over. I wonder if we could show this without special >> hardware by just mapping some region that doesn't even have hardware >> in it as WC. Do we even expose the PAT settings to user space, though, >> or do we always have to have some fake module to create the PAT stuff? > > I do wonder wtf the hw could be doing that would cause this, but I've no idea > how to tell what difference a write combined PCI transaction would have on the > bus side of things, and what the device could generate that would cause such > a horrible slowdown. > > Dave. 01:00.1 VGA compatible controller: Matrox Electronics Systems Ltd. 
MGA G200EH (rev 01) (prog-if 00 [VGA controller])
        Subsystem: Hewlett-Packard Company iLO4
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin B routed to IRQ 255
        Region 0: Memory at 91000000 (32-bit, prefetchable) [size=16M]
        Region 1: Memory at 92a88000 (32-bit, non-prefetchable) [size=16K]
        Region 2: Memory at 92000000 (32-bit, non-prefetchable) [size=8M]
        [virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: [a8] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [b0] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [c0] Express (v2) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Latency L0 <4us, L1 <4us
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-

Is a full lspci -vvv for the VGA device in question.

Dave.
On 19 July 2017 at 09:16, Dave Airlie <airlied@gmail.com> wrote: > On 19 July 2017 at 09:16, Dave Airlie <airlied@gmail.com> wrote: >> On 19 July 2017 at 08:22, Linus Torvalds <torvalds@linux-foundation.org> wrote: >>> On Tue, Jul 18, 2017 at 2:21 PM, Dave Airlie <airlied@gmail.com> wrote: >>>> >>>> Oh and just FYI, the machine I've tested this on has an mgag200 server >>>> graphics card backing the framebuffer, but with just efifb loaded. >>> >>> Yeah, it looks like it needs special hardware - and particularly the >>> kind of garbage hardware that people only have on servers. >>> >>> Why do server people continually do absolute sh*t hardware? It's crap, >>> crap, crap across the board outside the CPU. Nasty and bad hacky stuff >>> that nobody else would touch with a ten-foot pole, and the "serious >>> enterprise" people lap it up like it was ambrosia. >>> >>> It's not just "graphics is bad anyway since we don't care". It's all >>> the things they ostensibly _do_ care about too, like the disk and the >>> fabric infrastructure. Buggy nasty crud. >> >> I've tried to reproduce now on: >> Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz >> using some address space from >> 02:00.0 3D controller: NVIDIA Corporation GM108M [GeForce 940MX] (rev a2) >> >> And I don't see the issue. >> >> I'll try and track down some more efi compatible mga or other wierd server chips >> stuff if I can. >> >>> Anyway, rant over. I wonder if we could show this without special >>> hardware by just mapping some region that doesn't even have hardware >>> in it as WC. Do we even expose the PAT settings to user space, though, >>> or do we always have to have some fake module to create the PAT stuff? >> >> I do wonder wtf the hw could be doing that would cause this, but I've no idea >> how to tell what difference a write combined PCI transaction would have on the >> bus side of things, and what the device could generate that would cause such >> a horrible slowdown. >> >> Dave. > More digging: Single CPU system: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz 01:00.1 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200EH Now I can't get efifb to load on this (due to it being remote and I've no idea how to make my install efi onto it), but booting with no framebuffer, and running the tests on the mga, show no slowdown on this. Now I'm starting to wonder if it's something that only happens on multi-socket systems. Dave. -- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jul 18, 2017 at 5:00 PM, Dave Airlie <airlied@gmail.com> wrote:
>
> More digging:
> Single CPU system:
> Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
> 01:00.1 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200EH
>
> Now I can't get efifb to load on this (due to it being remote and I've
> no idea how to make my install efi onto it), but booting with no
> framebuffer, and running the tests on the mga, show no slowdown on this.

Is it actually using write-combining memory without a frame buffer, though? I don't think it is. So the lack of slowdown might be just from that.

> Now I'm starting to wonder if it's something that only happens on
> multi-socket systems.

Hmm. I guess that's possible, of course.

[ Wild and crazy handwaving... ]

Without write combining, all the uncached writes will be fully serialized and there is no buffering in the chip write buffers. There will be at most one outstanding PCI transaction in the uncore write buffer.

In contrast, _with_ write combining, the write buffers in the uncore can fill up.

But why should that matter? Maybe memory ordering. When one of the cores (doesn't matter *which* core) wants to get a cacheline for exclusive use (ie it did a write to it), it will need to invalidate cachelines in other cores. However, the uncore now has all those PCI writes buffered, and the write ordering says that they should happen before the memory writes. So before it can give the core exclusive ownership of the new cacheline, it needs to wait for all those buffered writes to be pushed out, so that no other CPU can see the new write *before* the device saw the old writes.

But I'm not convinced this is any different in a multi-socket situation than it is in a single-socket one. The other cores on the same socket should not be able to see the writes out of order _either_.

And honestly, I think PCI write posting rules makes the above crazy handwaving completely bogus anyway. Writes _can_ be posted, so the memory ordering isn't actually that tight.

I dunno. I really think it would be good if somebody inside Intel would look at it..

Linus
On 19 July 2017 at 11:15, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, Jul 18, 2017 at 5:00 PM, Dave Airlie <airlied@gmail.com> wrote: >> >> More digging: >> Single CPU system: >> Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz >> 01:00.1 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200EH >> >> Now I can't get efifb to load on this (due to it being remote and I've >> no idea how to make >> my install efi onto it), but booting with no framebuffer, and running >> the tests on the mga, >> show no slowdown on this. > > Is it actually using write-combining memory without a frame buffer, > though? I don't think it is. So the lack of slowdown might be just > from that. > >> Now I'm starting to wonder if it's something that only happens on >> multi-socket systems. > > Hmm. I guess that's possible, of course. > > [ Wild and crazy handwaving... ] > > Without write combining, all the uncached writes will be fully > serialized and there is no buffering in the chip write buffers. There > will be at most one outstanding PCI transaction in the uncore write > buffer. > > In contrast, _with_ write combining, the write buffers in the uncore > can fill up. > > But why should that matter? Maybe memory ordering. When one of the > cores (doesn't matter *which* core) wants to get a cacheline for > exclusive use (ie it did a write to it), it will need to invalidate > cachelines in other cores. However, the uncore now has all those PCI > writes buffered, and the write ordering says that they should happen > before the memory writes. So before it can give the core exclusive > ownership of the new cacheline, it needs to wait for all those > buffered writes to be pushed out, so that no other CPU can see the new > write *before* the device saw the old writes. > > But I'm not convinced this is any different in a multi-socket > situation than it is in a single-socket one. The other cores on the > same socket should not be able to see the writes out of order > _either_. > > And honestly, I think PCI write posting rules makes the above crazy > handwaving completely bogus anyway. Writes _can_ be posted, so the > memory ordering isn't actually that tight. > > I dunno. I really think it would be good if somebody inside Intel > would look at it.. Yes hoping someone can give some insight. Scrap the multi-socket it's been seen on a single-socket, but not as drastic, 2x rather than 10x slowdowns. It's starting to seem like the commonality might be the Matrox G200EH which is part of the HP remote management iLO hardware, it might be that the RAM on the other side of the PCIE connection is causing some sort of wierd stalls or slowdowns. I'm not sure how best to validate that either. Dave. -- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jul 19, 2017 at 9:07 PM, Dave Airlie <airlied@gmail.com> wrote:
>
> Yes hoping someone can give some insight.
>
> Scrap the multi-socket it's been seen on a single-socket, but not as
> drastic, 2x rather than 10x slowdowns.
>
> It's starting to seem like the commonality might be the Matrox G200EH
> which is part of the HP remote management iLO hardware, it might be that
> the RAM on the other side of the PCIE connection is causing some sort of
> wierd stalls or slowdowns. I'm not sure how best to validate that either.

It shouldn't be that hard to hack up efifb to allocate some actual RAM as "framebuffer", unmap it from the direct map, and ioremap_wc() it as usual. Then you could see if PCIe is important for it.

WC streaming writes over PCIe end up doing 64 byte writes, right? Maybe the Matrox chip is just extremely slow handling 64b writes.
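For what it's worth, the RAM-backed experiment Andy describes could be sketched roughly as below. This is an illustration under assumptions, not a real patch: the function and constant names are made up, and since ioremap_wc() on x86 refuses pages that are marked as System RAM, the sketch flips the direct-map attribute with set_memory_wc() rather than unmapping and ioremapping.

/* Hypothetical sketch: back a "framebuffer" with ordinary RAM and make it
 * write-combining, so the efifb write path can be exercised with no PCI
 * device behind it.  Restoring the range to WB with set_memory_wb()
 * before freeing is omitted for brevity.
 */
#include <linux/gfp.h>
#include <linux/mm.h>
#include <asm/set_memory.h>

#define FAKE_FB_ORDER	10		/* 2^10 pages = 4 MiB with 4K pages */

static void *fake_fb;

static int fake_wc_fb_alloc(void)
{
	struct page *pages = alloc_pages(GFP_KERNEL, FAKE_FB_ORDER);

	if (!pages)
		return -ENOMEM;

	fake_fb = page_address(pages);
	/* switch the direct-map attribute for this range to write-combining */
	if (set_memory_wc((unsigned long)fake_fb, 1 << FAKE_FB_ORDER)) {
		__free_pages(pages, FAKE_FB_ORDER);
		fake_fb = NULL;
		return -EIO;
	}
	/* a hacked efifb could now point screen_base at fake_fb */
	return 0;
}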
On Wed, Jul 19, 2017 at 9:28 PM, Andy Lutomirski <luto@kernel.org> wrote:
>
> It shouldn't be that hard to hack up efifb to allocate some actual RAM
> as "framebuffer", unmap it from the direct map, and ioremap_wc() it as
> usual. Then you could see if PCIe is important for it.

The thing is, the "actual RAM" case is unlikely to show this issue.

RAM is special, even when you try to mark it WC or whatever. Yes, it might be slowed down by lack of caching, but the uncore still *knows* it is RAM. The accesses go to the memory controller, not the PCI side.

> WC streaming writes over PCIe end up doing 64 byte writes, right?
> Maybe the Matrox chip is just extremely slow handling 64b writes.

.. or maybe there is some unholy "management logic" thing that catches those writes, because this is server hardware, and server vendors invariably add "value add" (read; shit) to their hardware to justify the high price.

Like the Intel "management console" that was such a "security feature".

I think one of the points of those magic graphics cards is that you can export the frame buffer over the management network, so that you can still run the graphical Windows GUI management stuff. Because you wouldn't want to just ssh into it and run command line stuff.

So I wouldn't be surprised at all if the thing has a special back channel to the network chip with a queue of changes going over ethernet or something, and then when you stream things at high speeds to the GPU DRAM, you fill up the management bandwidth.

If it was actual framebuffer DRAM, I would expect it to be *happy* with streaming 64-bit writes. But some special "management interface ASIC" that tries to keep track of GPU framebuffer "damage" might be something else altogether.

Linus
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, Jul 18, 2017 at 2:21 PM, Dave Airlie <airlied@gmail.com> wrote:
> >
> > Oh and just FYI, the machine I've tested this on has an mgag200 server
> > graphics card backing the framebuffer, but with just efifb loaded.
>
> Yeah, it looks like it needs special hardware - and particularly the
> kind of garbage hardware that people only have on servers.
>
> Why do server people continually do absolute sh*t hardware? It's crap,
> crap, crap across the board outside the CPU. Nasty and bad hacky stuff
> that nobody else would touch with a ten-foot pole, and the "serious
> enterprise" people lap it up like it was ambrosia.
>
> It's not just "graphics is bad anyway since we don't care". It's all
> the things they ostensibly _do_ care about too, like the disk and the
> fabric infrastructure. Buggy nasty crud.

I believe it's crappy for similar reasons why almost all other large scale pieces of human technological infrastructure are crappy if you look deep under the hood: transportation and communication networks, banking systems, manufacturing, you name it.

The main reasons are:

 - Cost of a clean redesign is an order of magnitude higher than the next
   delta revision, once you have accumulated a few decades of legacy.

 - The path dependent evolutionary legacies become so ugly after time that
   most good people will run away from key elements - so there's not enough
   internal energy to redesign and implement a clean methodology from
   grounds up.

 - Even if there are enough good people, the benefits of a clean design are
   a long term benefit, constantly hindered by short term pricing.

 - For non-experts it's hard to tell a good, clean redesign from a flashy
   but fundamentally flawed redesign. Both are expensive and the latter can
   have disastrous outcomes.

 - These are high margin businesses, with customers captured by legacies,
   where you can pass down the costs to customers, which hides the true
   costs of crap. i.e. typical free market failure due to high complexity
   combined with (very) long price propagation latencies and opaqueness of
   pricing.

I believe the only place where you'll find overall beautiful server hardware as a rule and not as an exception is in satellite technology: when the unit price is in excess of $100m, expected life span is 10-20 years with no on-site maintenance, and it's all running in a fundamentally hostile environment, then clean and robust hardware design is forced at every step by physics.

Humanity is certainly able to design beautiful hardware, once all other options are exhausted.

Thanks,

	Ingo
On 20 July 2017 at 14:44, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Wed, Jul 19, 2017 at 9:28 PM, Andy Lutomirski <luto@kernel.org> wrote: >> >> It shouldn't be that hard to hack up efifb to allocate some actual RAM >> as "framebuffer", unmap it from the direct map, and ioremap_wc() it as >> usual. Then you could see if PCIe is important for it. > > The thing is, the "actual RAM" case is unlikely to show this issue. > > RAM is special, even when you try to mark it WC or whatever. Yes, it > might be slowed down by lack of caching, but the uncore still *knows* > it is RAM. The accesses go to the memory controller, not the PCI side. > >> WC streaming writes over PCIe end up doing 64 byte writes, right? >> Maybe the Matrox chip is just extremely slow handling 64b writes. > > .. or maybe there is some unholy "management logic" thing that catches > those writes, because this is server hardware, and server vendors > invariably add "value add" (read; shit) to their hardware to justify > the high price. > > Like the Intel "management console" that was such a "security feature". > > I think one of the points of those magic graphics cards is that you > can export the frame buffer over the management network, so that you > can still run the graphical Windows GUI management stuff. Because you > wouldn't want to just ssh into it and run command line stuff. > > So I wouldn't be surprised at all if the thing has a special back > channel to the network chip with a queue of changes going over > ethernet or something, and then when you stream things at high speeds > to the GPU DRAM, you fill up the management bandwidth. > > If it was actual framebuffer DRAM, I would expect it to be *happy* > with streaming 64-bit writes. But some special "management interface > ASIC" that tries to keep track of GPU framebuffer "damage" might be > something else altogether. > I think it's just some RAM on the management console device that is partitioned and exposed via the PCI BAR on the mga vga device. I expect it possibly can't handle lots of writes very well and sends something back that causes the stalls. I'm not even sure how to prove it. So I expect we should at least land this patch for now so people who do suffer from this can at least disable it for now, and if we can narrow it down to a pci id or subsys id for certain HP ilo devices, then we can add a blacklist. I wonder if anyone knows anyone from HPE ilo team. Dave. -- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 19 July 2017 at 00:34, Peter Jones <pjones@redhat.com> wrote: > On Tue, Jul 18, 2017 at 04:09:09PM +1000, Dave Airlie wrote: >> This patch allows the user to disable write combined mapping >> of the efifb framebuffer console using an nowc option. >> >> A customer noticed major slowdowns while logging to the console >> with write combining enabled, on other tasks running on the same >> CPU. (10x or greater slow down on all other cores on the same CPU >> as is doing the logging). >> >> I reproduced this on a machine with dual CPUs. >> Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (6 core) >> >> I wrote a test that just mmaps the pci bar and writes to it in >> a loop, while this was running in the background one a single >> core with (taskset -c 1), building a kernel up to init/version.o >> (taskset -c 8) went from 13s to 133s or so. I've yet to explain >> why this occurs or what is going wrong I haven't managed to find >> a perf command that in any way gives insight into this. >> >> 11,885,070,715 instructions # 1.39 insns per cycle >> vs >> 12,082,592,342 instructions # 0.13 insns per cycle >> >> is the only thing I've spotted of interest, I've tried at least: >> dTLB-stores,dTLB-store-misses,L1-dcache-stores,LLC-store,LLC-store-misses,LLC-load-misses,LLC-loads,\mem-loads,mem-stores,iTLB-loads,iTLB-load-misses,cache-references,cache-misses >> >> For now it seems at least a good idea to allow a user to disable write >> combining if they see this until we can figure it out. > > Well, that's kind of amazing, given 3c004b4f7eab239e switched us /to/ > using ioremap_wc() for the exact same reason. I'm not against letting > the user force one way or the other if it helps, though it sure would be > nice to know why. > > Anyway, > > Acked-By: Peter Jones <pjones@redhat.com> > > Bartlomiej, do you want to handle this in your devel tree? I'm happy to stick this in a drm-fixes pull with this ack. Dave. -- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tuesday, July 25, 2017 02:00:00 PM Dave Airlie wrote: > On 19 July 2017 at 00:34, Peter Jones <pjones@redhat.com> wrote: > > On Tue, Jul 18, 2017 at 04:09:09PM +1000, Dave Airlie wrote: > >> This patch allows the user to disable write combined mapping > >> of the efifb framebuffer console using an nowc option. > >> > >> A customer noticed major slowdowns while logging to the console > >> with write combining enabled, on other tasks running on the same > >> CPU. (10x or greater slow down on all other cores on the same CPU > >> as is doing the logging). > >> > >> I reproduced this on a machine with dual CPUs. > >> Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (6 core) > >> > >> I wrote a test that just mmaps the pci bar and writes to it in > >> a loop, while this was running in the background one a single > >> core with (taskset -c 1), building a kernel up to init/version.o > >> (taskset -c 8) went from 13s to 133s or so. I've yet to explain > >> why this occurs or what is going wrong I haven't managed to find > >> a perf command that in any way gives insight into this. > >> > >> 11,885,070,715 instructions # 1.39 insns per cycle > >> vs > >> 12,082,592,342 instructions # 0.13 insns per cycle > >> > >> is the only thing I've spotted of interest, I've tried at least: > >> dTLB-stores,dTLB-store-misses,L1-dcache-stores,LLC-store,LLC-store-misses,LLC-load-misses,LLC-loads,\mem-loads,mem-stores,iTLB-loads,iTLB-load-misses,cache-references,cache-misses > >> > >> For now it seems at least a good idea to allow a user to disable write > >> combining if they see this until we can figure it out. > > > > Well, that's kind of amazing, given 3c004b4f7eab239e switched us /to/ > > using ioremap_wc() for the exact same reason. I'm not against letting > > the user force one way or the other if it helps, though it sure would be > > nice to know why. > > > > Anyway, > > > > Acked-By: Peter Jones <pjones@redhat.com> > > > > Bartlomiej, do you want to handle this in your devel tree? > > I'm happy to stick this in a drm-fixes pull with this ack. I'll put it into fbdev fixes for 4.13 with other fbdev patches. Best regards, -- Bartlomiej Zolnierkiewicz Samsung R&D Institute Poland Samsung Electronics -- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tuesday, July 25, 2017 10:56:15 AM Bartlomiej Zolnierkiewicz wrote: > On Tuesday, July 25, 2017 02:00:00 PM Dave Airlie wrote: > > On 19 July 2017 at 00:34, Peter Jones <pjones@redhat.com> wrote: > > > On Tue, Jul 18, 2017 at 04:09:09PM +1000, Dave Airlie wrote: > > >> This patch allows the user to disable write combined mapping > > >> of the efifb framebuffer console using an nowc option. > > >> > > >> A customer noticed major slowdowns while logging to the console > > >> with write combining enabled, on other tasks running on the same > > >> CPU. (10x or greater slow down on all other cores on the same CPU > > >> as is doing the logging). > > >> > > >> I reproduced this on a machine with dual CPUs. > > >> Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (6 core) > > >> > > >> I wrote a test that just mmaps the pci bar and writes to it in > > >> a loop, while this was running in the background one a single > > >> core with (taskset -c 1), building a kernel up to init/version.o > > >> (taskset -c 8) went from 13s to 133s or so. I've yet to explain > > >> why this occurs or what is going wrong I haven't managed to find > > >> a perf command that in any way gives insight into this. > > >> > > >> 11,885,070,715 instructions # 1.39 insns per cycle > > >> vs > > >> 12,082,592,342 instructions # 0.13 insns per cycle > > >> > > >> is the only thing I've spotted of interest, I've tried at least: > > >> dTLB-stores,dTLB-store-misses,L1-dcache-stores,LLC-store,LLC-store-misses,LLC-load-misses,LLC-loads,\mem-loads,mem-stores,iTLB-loads,iTLB-load-misses,cache-references,cache-misses > > >> > > >> For now it seems at least a good idea to allow a user to disable write > > >> combining if they see this until we can figure it out. > > > > > > Well, that's kind of amazing, given 3c004b4f7eab239e switched us /to/ > > > using ioremap_wc() for the exact same reason. I'm not against letting > > > the user force one way or the other if it helps, though it sure would be > > > nice to know why. > > > > > > Anyway, > > > > > > Acked-By: Peter Jones <pjones@redhat.com> > > > > > > Bartlomiej, do you want to handle this in your devel tree? > > > > I'm happy to stick this in a drm-fixes pull with this ack. > > I'll put it into fbdev fixes for 4.13 with other fbdev patches. Patch queued for 4.13, thanks. Best regards, -- Bartlomiej Zolnierkiewicz Samsung R&D Institute Poland Samsung Electronics -- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 07/18/17 13:44, Dave Airlie wrote:
>
> In normal operation the console is faster with _wc. It's the side effects
> on other cores that is the problem.
>

I'm guessing leaving these as UC- rate-limits them so it doesn't interfere with the I/O operations on the other cores...

	-hpa
diff --git a/Documentation/fb/efifb.txt b/Documentation/fb/efifb.txt
index a59916c..1a85c1b 100644
--- a/Documentation/fb/efifb.txt
+++ b/Documentation/fb/efifb.txt
@@ -27,5 +27,11 @@ You have to add the following kernel parameters in your elilo.conf:
 	Macbook Pro 17", iMac 20" :
 		video=efifb:i20
 
+Accepted options:
+
+nowc	Don't map the framebuffer write combined. This can be used
+	to workaround side-effects and slowdowns on other CPU cores
+	when large amounts of console data are written.
+
 --
 Edgar Hucek <gimli@dark-green.com>
diff --git a/drivers/video/fbdev/efifb.c b/drivers/video/fbdev/efifb.c
index b827a81..a568fe0 100644
--- a/drivers/video/fbdev/efifb.c
+++ b/drivers/video/fbdev/efifb.c
@@ -17,6 +17,7 @@
 #include <asm/efi.h>
 
 static bool request_mem_succeeded = false;
+static bool nowc = false;
 
 static struct fb_var_screeninfo efifb_defined = {
 	.activate		= FB_ACTIVATE_NOW,
@@ -99,6 +100,8 @@ static int efifb_setup(char *options)
 			screen_info.lfb_height = simple_strtoul(this_opt+7, NULL, 0);
 		else if (!strncmp(this_opt, "width:", 6))
 			screen_info.lfb_width = simple_strtoul(this_opt+6, NULL, 0);
+		else if (!strcmp(this_opt, "nowc"))
+			nowc = true;
 	}
 }
 
@@ -255,7 +258,10 @@ static int efifb_probe(struct platform_device *dev)
 	info->apertures->ranges[0].base = efifb_fix.smem_start;
 	info->apertures->ranges[0].size = size_remap;
 
-	info->screen_base = ioremap_wc(efifb_fix.smem_start, efifb_fix.smem_len);
+	if (nowc)
+		info->screen_base = ioremap(efifb_fix.smem_start, efifb_fix.smem_len);
+	else
+		info->screen_base = ioremap_wc(efifb_fix.smem_start, efifb_fix.smem_len);
 	if (!info->screen_base) {
 		pr_err("efifb: abort, cannot ioremap video memory 0x%x @ 0x%lx\n",
 		       efifb_fix.smem_len, efifb_fix.smem_start);
This patch allows the user to disable write combined mapping of the efifb framebuffer console using a nowc option.

A customer noticed major slowdowns while logging to the console with write combining enabled, on other tasks running on the same CPU (10x or greater slowdown on all other cores on the same CPU as is doing the logging).

I reproduced this on a machine with dual CPUs:
Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (6 core)

I wrote a test that just mmaps the pci bar and writes to it in a loop. While this was running in the background on a single core (taskset -c 1), building a kernel up to init/version.o (taskset -c 8) went from 13s to 133s or so. I've yet to explain why this occurs or what is going wrong; I haven't managed to find a perf command that in any way gives insight into this.

11,885,070,715 instructions # 1.39 insns per cycle
vs
12,082,592,342 instructions # 0.13 insns per cycle

is the only thing I've spotted of interest. I've tried at least:
dTLB-stores,dTLB-store-misses,L1-dcache-stores,LLC-store,LLC-store-misses,LLC-load-misses,LLC-loads,\mem-loads,mem-stores,iTLB-loads,iTLB-load-misses,cache-references,cache-misses

For now it seems at least a good idea to allow a user to disable write combining if they see this, until we can figure it out.

Note also that most users get a real framebuffer driver loaded once kms kicks in; it just happens that on these machines the kernel didn't support the gpu specific driver.

Signed-off-by: Dave Airlie <airlied@redhat.com>
---
 Documentation/fb/efifb.txt  | 6 ++++++
 drivers/video/fbdev/efifb.c | 8 +++++++-
 2 files changed, 13 insertions(+), 1 deletion(-)
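With a kernel carrying this patch, the new option is passed on the kernel command line in the same form as the existing efifb options listed in Documentation/fb/efifb.txt, for example:

    video=efifb:nowc

which makes efifb fall back to a plain ioremap() (uncached) mapping of the framebuffer instead of the default ioremap_wc() one.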