Message ID | 20180317113655.GA30572@ls3530.fritz.box (mailing list archive) |
---|---|
State | Superseded, archived |
Headers | show |
Hi Helge, just a nit on PCI terminology... On Sat, Mar 17, 2018 at 12:36 PM, Helge Deller <deller@gmx.de> wrote: > * Carlo Pisani <carlojpisani@gmail.com>: >> I have created and applied the following patch >> testing the kernel with heavy I/O seems now stable >> >> my C3600 is still under testing, moving chunks of 500Mbyte and compiling gcc-v6 >> >> >> http://93.55.217.0//wonderland/chunk_of/user/ivelegacy/happa-dev/hppa2_0001_HPMC_fix_my_v1.patch > > --- drivers/parisc/lba_pci.c 2018-01-28 22:20:33.000000000 +0100 > +++ drivers/parisc/lba_pci.c 2018-03-15 12:26:44.839894952 +0100 > @@ -1405,7 +1405,7 @@ > > /* Set HF mode as the default (vs. -1 mode). */ > stat = READ_REG32(d->hba.base_addr + LBA_STAT_CTL); > - WRITE_REG32(stat | HF_ENABLE, d->hba.base_addr + LBA_STAT_CTL); > + WRITE_REG32(stat & ~HF_ENABLE, d->hba.base_addr + LBA_STAT_CTL); > > /* > ** Writing a zero to STAT_CTL.rf (bit 0) will clear reset signal > > That's the patch from Kyle: > https://www.spinics.net/lists/linux-parisc/msg01027.html > > which comes out of this mail thread: > https://www.spinics.net/lists/linux-parisc/msg01024.html > specifically with those notes: > https://www.spinics.net/lists/linux-parisc/msg01026.html > > Citing here: > "bus timeout" usually means we tried to read an address that doesn't > respond. that is, nothing on the bus accepted the transaction for it, > so it timed out and HPMC'd the box. HF= Hard Fail on PCI "Master Abort". "Master Abort" means the MMIO transaction timed out - usually due to the device not responding to an MMIO read. We normally want HF to be enabled as you noted below in order to get state of the CPU and which IO device it was trying to access. It's possible the "~0L" returned in SoftFail mode is being handled by the driver OR the particular read that fails just doesn't matter. Would have to see dmesg output if the driver ever complains about invalid MMIO read data (~0L). cheers, grant > what you really need is the IIR, and the address it tried to access > (both the kernel vaddr which will be in the register, and the "system > requester address" from the hpmc dump which will be the physical address > mapped. > > not sure why the hpmc handler is getting skipped, that's a little weird. > > you can try hacking elroy to set softfail mode on that bus, which will > result in a timeout on the pci bus to return -1 (like what x86 and most > other architectures do) rather than hang the box, but it really likely > means a driver bug. > > > > So, you change LBA to return -1 instead of faulting via HPMC which is > of course one work-around to avoid the HPMC. > > But could you try to check the driver instead? > > You run this SATA controller: > 01:05.0 RAID bus controller: VIA Technologies, Inc. VT6421 IDE RAID > > Can you maybe try to localize where in the drivers/ata/sata_via.c driver > triggers the HPMC ? > > > --- arch/parisc/kernel/hpmc.S 2018-01-28 22:20:33.000000000 +0100 > +++ arch/parisc/kernel/hpmc.S 2018-03-15 14:13:46.611969815 +0100 > @@ -308,4 +290,5 @@ > .align 4 > .export os_hpmc_size > os_hpmc_size: > - .word .os_hpmc_end-.os_hpmc > + /* .word .os_hpmc_end-.os_hpmc */ > + .word (.os_hpmc_end - .os_hpmc) * 4 /* sizeof(u32) */ > > This one seems wrong. > I think you just didn't hit a HPMC with your first patch, and as such > this patch has no influence... > > Helge > -- To unsubscribe from this list: send the line "unsubscribe linux-parisc" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[adding lists back to CC since this is a public discussion] On Sat, Mar 17, 2018 at 5:36 PM, Carlo Pisani <carlojpisani@gmail.com> wrote: >> Would have to see dmesg output if the driver ever complains about >> invalid MMIO read data (~0L). > > I repeat AGAIN > > I have tested different sATA controllers > - via6421 > - SIL24 > - Adaptex 2410 Ah sorry - I missed that. > none of them has never complained in dmesg ( i have the console > redirected the serial port ) Thanks for confirming Carlo! Having worked on mv7042 and SIL3124 (IIRC) driver support for almost 4 years, I can tell you most SATA drivers suck. Helge is most likely correct that this is due to a SATA driver bug. Since Intel x86 (32 or 64bit) systems ONLY support "SoftFail" mode, many driver issues are never exposed until something gets corrupted or HW totally wedges. > and all of them showed the same behavior under heavy I/O While the symptom (HPMC) looks the same, the details are likely different for each card. It's been over 10 years but ISTR that "SER PIM" command when entered at boot prompt will dump those details. Ah, yes, that's correct: https://parisc.wiki.kernel.org/index.php/How_to_report_a_parisc-linux_kernel_problem#HPMC BTW, HPMCs can also be due to devices DMAing to an invalid address since the IOMMU is "strict" and requires all DMA be "mapped". IIRC, these DMA failures are "imprecise events" meaning SER PIM is the only way to determine what the target DMA address was. This is non-trivial stuff to debug since it requires pretty deep understanding of the SATA controller operations and how the driver works. > the C3600 machine stops to work, and I see errors in the LCD > without a line from the kernel on the serial console Yes, HPMC due to IO errors (including DMA failures and PCI Master Abort) will lock up all other IO including the serial console. SER PIM is the only output that will lead to diagnosing the underlying driver bug. But in SoftFail mode, I was hoping one of the drivers might complain about "invalid value" - but that is likely way too optimistic on my part. The only drivers that look for ~0L on MMIO read are the ones that support PCI hotplug (e.g. PCMCIA devices and some higher end PCI IO devices which target data center/server market). cheers, grant ps. I haven't looked at mvsata or sil3124 SATA controller code since 2010. But if you could find a MV7042 PCI (or PCI-X?) SATA controller, I have more faith in that driver/HW combination partly because I tested it pretty thoroughly at the time. -- To unsubscribe from this list: send the line "unsubscribe linux-parisc" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Grant, On 2018-03-17 12:12 PM, Grant Grundler wrote: > "Master Abort" means the MMIO > transaction timed out - usually due to the device not responding to an > MMIO read. In lba_pci.c and sba_iommu.c, it says "BE WARNED: register writes are posted" and need to be followed by a read. It seems there are a some routines in these modules that have writes that aren't followed by a read. One is lba_wr_cfg(). Another might be the macro LBA_CFG_RESTORE(). Are these okay? It seems probable that the problem that Carlo is having is a conflict between devices. Dave
Hi John! On Sat, Mar 17, 2018 at 6:47 PM, John David Anglin <dave.anglin@bell.net> wrote: > Hi Grant, > > On 2018-03-17 12:12 PM, Grant Grundler wrote: >> >> "Master Abort" means the MMIO >> transaction timed out - usually due to the device not responding to an >> MMIO read. > > In lba_pci.c and sba_iommu.c, it says "BE WARNED: register writes are > posted" and need to be followed by a read. It seems there are a some > routines in these modules that have writes that aren't followed by a read. > One is lba_wr_cfg(). Another might be the macro > LBA_CFG_RESTORE(). Are these okay? I looked through the two examples you point out and I *think* both are ok. lba_wr_cfg() issues an mmio write and immediately after calls LBA_CFG_MASTER_ABORT_CHECK() which performs an MMIO read from the same base address. The LBA_CFG_RESTORE() is "lazy" - the next MMIO read will flush those three writes and (I believe) any following MMIO writes will still be issued in order. Typically, the problem with posted MMIO writes is DMA or other events don't start until the MMIO write is "seen" by the device. This is important when specific timing between MMIO transactions is required OR some magic (e.g. device reset, updates Frame Buffer, etc) happens. > It seems probable that the problem that Carlo is having is a conflict > between devices. Hrm. I don't know. I haven't yet looked at the latest dump that Carlo helpfully provided as I'm still traveling. Why do you suspect this? I'm skeptical about "conflict between devices" (due to lba_wr_cfg()) for two reasons: 1) configuration space accesses are usually not part of normal IO device transaction processing. 2) I've nearly always found that PCI Master Aborts (on MMIO reads) are usually just a symptom of something else going wrong and not the root cause. Typically, the issues I recall running into are around the drivers hitting a corner case where the device is still performing DMA to an address that gets unmapped by the driver. This will wedge the IOMMU (sba) and then following MMIO reads will generate an HPMC. The hard part is to determine what the corner case is based on a DMA address (as reported in SER PIM output). It requires deeper understanding of the DMA programming for the given SATA controller (driver directing HW what to do), how transaction completions are reported (SATA controller HW) and handled (driver operation). In the past, I've sorted several of these issues out for tg3 and tulip NIC drivers and I can with confidence say that some issues still remain in the tulip driver shutdown path. But I gave up on trying to fix those and lost interest later. cheers, grant -- To unsubscribe from this list: send the line "unsubscribe linux-parisc" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[adding public lists back to CC since I can't sort this by myself any more] On Sat, Mar 17, 2018 at 6:45 PM, Carlo Pisani <carlojpisani@gmail.com> wrote: >> While the symptom (HPMC) looks the same, the details are likely >> different for each card. It's been over 10 years but ISTR that "SER >> PIM" command when entered at boot prompt will dump those details. > > this is an old log > http://93.55.217.0/wonderland/chunk_of/user/ivelegacy/hppa-dev/ser-prim-2018-03-12--sata-crash Awesome - thank you! Can you tell me which SATA controller was installed when this crash was taken? > my C3600 is now busy, compiling 4.16.*, a couple of branches with two > different configurations, so I can't force a crash to have a fresh log > at the moment, but I will with several SATA card! No worries - start with one. :) > what about SAS kernel modules? are them better than SATA? In general, I expect they are better supported but I never worked with SAS controllers. > and what about SCSI (adaptec 29320, PCI-X)? Sorry: I don't have any experience with adaptec - only the LSI SCSI controllers, Marvel 7042 SATA controller, and Silicon Image 3124 (IIRC). The SI3124 was a PoS. MV7042 worked well and was reliable "at scale". > I am waiting for an array of disks, I can test the SCSI card under > heavy I/O to see > - performance > - stability > > it will be delivered here for the end of the next week Ok! Sounds like fun! :D cheers, grant -- To unsubscribe from this list: send the line "unsubscribe linux-parisc" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Mar 19, 2018 at 8:37 PM, Grant Grundler <grantgrundler@gmail.com> wrote: ... >> this is an old log >> http://93.55.217.0/wonderland/chunk_of/user/ivelegacy/hppa-dev/ser-prim-2018-03-12--sata-crash BTW, I should point out one needs to give "CLEARPIM" command at PDC command line in order to record a new HPMC. So once you reproduce the crash, collect HPMC dump with "SER PIM", then "CLEAR PIM" again. :) cheers, grant -- To unsubscribe from this list: send the line "unsubscribe linux-parisc" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
This is info about PCI devices that Carlo sent me. It would help if we lnew where 0x4046b840 was in the kernel. From the IIR, 0xf3c001c, it seems the timeout occurred on a byte operation. Dave On 2018-03-17 2:58 PM, Carlo Pisani wrote: >> Probably, it would be useful to have verbose output from lspci posted to >> list as well. > > c3600 ~ $ lspci -v > 00:0c.0 Ethernet controller: Digital Equipment Corporation DECchip > 21142/43 (rev 41) > Subsystem: Hewlett-Packard Company DECchip 21142/43 > Flags: bus master, medium devsel, latency 255, IRQ 17 > I/O ports at 1000 [size=128] > Memory at fffffffff4008000 (32-bit, non-prefetchable) [size=1K] > Expansion ROM at fffffffff4040000 [disabled] [size=256K] > Kernel driver in use: tulip > lspci: Unable to load libkmod resources: error -12 > > 00:0d.0 Multimedia audio controller: Analog Devices Device 1889 > Subsystem: Analog Devices Device 1889 > Flags: bus master, medium devsel, latency 255, IRQ 18 > Memory at fffffffff400c000 (32-bit, prefetchable) [size=512] > Memory at fffffffff400b000 (32-bit, prefetchable) [size=16] > Memory at fffffffff400a000 (32-bit, prefetchable) [size=16] > Memory at fffffffff4009000 (32-bit, prefetchable) [size=16] > Capabilities: [dc] Power Management version 1 > Kernel driver in use: snd_ad1889 > > 00:0e.0 IDE interface: National Semiconductor Corporation 87415/87560 > IDE (rev 03) (prog-if 8f [Master SecP SecO PriP PriO]) > Flags: bus master, medium devsel, latency 255, IRQ 7 > I/O ports at 0f00 [size=8] > I/O ports at 0e00 [size=4] > I/O ports at 0d00 [size=8] > I/O ports at 0b00 [size=4] > I/O ports at 0a00 [size=16] > Kernel driver in use: NS87415_IDE > > 00:0e.1 Bridge: National Semiconductor Corporation 87560 Legacy I/O (rev 01) > Flags: bus master, medium devsel, latency 255, IRQ 19 > Kernel driver in use: SuperIO > > 00:0e.2 USB controller: National Semiconductor Corporation USB > Controller (rev 02) (prog-if 10 [OHCI]) > Flags: bus master, medium devsel, latency 240, IRQ 1 > Memory at fffffffff4007000 (32-bit, non-prefetchable) [size=4K] > Memory at fffffffff4006000 (32-bit, non-prefetchable) [size=4K] > Kernel driver in use: ohci-pci > > 00:0f.0 SCSI storage controller: LSI Logic / Symbios Logic 53C896/897 (rev 07) > Subsystem: LSI Logic / Symbios Logic LSI53C896/7 PCI to Dual > Channel Ultra2 SCSI Multifunction Controller > Flags: bus master, medium devsel, latency 255, IRQ 20 > I/O ports at 0900 [size=256] > Memory at fffffffff4005000 (64-bit, non-prefetchable) [size=1K] > Memory at fffffffff4002000 (64-bit, non-prefetchable) [size=8K] > Capabilities: [40] Power Management version 2 > Kernel driver in use: sym53c8xx > > 00:0f.1 SCSI storage controller: LSI Logic / Symbios Logic 53C896/897 (rev 07) > Subsystem: LSI Logic / Symbios Logic LSI53C896/7 PCI to Dual > Channel Ultra2 SCSI Multifunction Controller > Flags: bus master, medium devsel, latency 255, IRQ 20 > I/O ports at 0800 [size=256] > Memory at fffffffff4004000 (64-bit, non-prefetchable) [size=1K] > Memory at fffffffff4000000 (64-bit, non-prefetchable) [size=8K] > Capabilities: [40] Power Management version 2 > Kernel driver in use: sym53c8xx > > 01:05.0 RAID bus controller: VIA Technologies, Inc. VT6421 IDE/SATA > Controller (rev 50) > Subsystem: VIA Technologies, Inc. VT6421 IDE/SATA Controller > Flags: bus master, medium devsel, latency 240, IRQ 21 > I/O ports at 12500 [size=16] > I/O ports at 12400 [size=16] > I/O ports at 12300 [size=16] > I/O ports at 12200 [size=16] > I/O ports at 12100 [size=32] > I/O ports at 12000 [size=256] > Expansion ROM at fffffffff4800000 [disabled] [size=64K] > Capabilities: [e0] Power Management version 2 > Kernel driver in use: sata_via > The requesting device was 10/1/5/0.
--- drivers/parisc/lba_pci.c 2018-01-28 22:20:33.000000000 +0100 +++ drivers/parisc/lba_pci.c 2018-03-15 12:26:44.839894952 +0100 @@ -1405,7 +1405,7 @@ /* Set HF mode as the default (vs. -1 mode). */ stat = READ_REG32(d->hba.base_addr + LBA_STAT_CTL); - WRITE_REG32(stat | HF_ENABLE, d->hba.base_addr + LBA_STAT_CTL); + WRITE_REG32(stat & ~HF_ENABLE, d->hba.base_addr + LBA_STAT_CTL); /* ** Writing a zero to STAT_CTL.rf (bit 0) will clear reset signal That's the patch from Kyle: https://www.spinics.net/lists/linux-parisc/msg01027.html which comes out of this mail thread: https://www.spinics.net/lists/linux-parisc/msg01024.html specifically with those notes: https://www.spinics.net/lists/linux-parisc/msg01026.html Citing here: "bus timeout" usually means we tried to read an address that doesn't respond. that is, nothing on the bus accepted the transaction for it, so it timed out and HPMC'd the box. what you really need is the IIR, and the address it tried to access (both the kernel vaddr which will be in the register, and the "system requester address" from the hpmc dump which will be the physical address mapped. not sure why the hpmc handler is getting skipped, that's a little weird. you can try hacking elroy to set softfail mode on that bus, which will result in a timeout on the pci bus to return -1 (like what x86 and most other architectures do) rather than hang the box, but it really likely means a driver bug. So, you change LBA to return -1 instead of faulting via HPMC which is of course one work-around to avoid the HPMC. But could you try to check the driver instead? You run this SATA controller: 01:05.0 RAID bus controller: VIA Technologies, Inc. VT6421 IDE RAID Can you maybe try to localize where in the drivers/ata/sata_via.c driver triggers the HPMC ? --- arch/parisc/kernel/hpmc.S 2018-01-28 22:20:33.000000000 +0100 +++ arch/parisc/kernel/hpmc.S 2018-03-15 14:13:46.611969815 +0100 @@ -308,4 +290,5 @@ .align 4 .export os_hpmc_size os_hpmc_size: - .word .os_hpmc_end-.os_hpmc + /* .word .os_hpmc_end-.os_hpmc */ + .word (.os_hpmc_end - .os_hpmc) * 4 /* sizeof(u32) */