Message ID | 20191120034451.30102-1-Zhiqiang.Hou@nxp.com (mailing list archive)
---|---
Series | PCI: Recode Mobiveil driver and add PCIe Gen4 driver for NXP Layerscape SoCs
On Wed, Nov 20, 2019 at 03:45:17AM +0000, Z.q. Hou wrote: > From: Hou Zhiqiang <Zhiqiang.Hou@nxp.com> > > This patch set is to recode the Mobiveil driver and add > PCIe support for NXP Layerscape series SoCs integrated > Mobiveil's PCIe Gen4 controller. How many PCIe cards have been tested to work/don't work with this? I need: PCI: mobiveil: ls_pcie_g4: fix SError when accessing config space PCI: mobiveil: ls_pcie_g4: add Workaround for A-011451 PCI: mobiveil: ls_pcie_g4: add Workaround for A-011577 to successfully boot with a Mellanox card plugged in with a previous revision of these patches.
Hi Russell, > -----Original Message----- > From: Russell King - ARM Linux admin <linux@armlinux.org.uk> > Sent: 20 November 2019 17:57 > To: Z.q. Hou <zhiqiang.hou@nxp.com> > Cc: linux-pci@vger.kernel.org; linux-arm-kernel@lists.infradead.org; > devicetree@vger.kernel.org; linux-kernel@vger.kernel.org; > bhelgaas@google.com; robh+dt@kernel.org; arnd@arndb.de; > mark.rutland@arm.com; l.subrahmanya@mobiveil.co.in; > shawnguo@kernel.org; m.karthikeyan@mobiveil.co.in; Leo Li > <leoyang.li@nxp.com>; lorenzo.pieralisi@arm.com; > catalin.marinas@arm.com; will.deacon@arm.com; > andrew.murray@arm.com; M.h. Lian <minghuan.lian@nxp.com>; Xiaowei > Bao <xiaowei.bao@nxp.com>; Mingkai Hu <mingkai.hu@nxp.com> > Subject: Re: [PATCHv9 00/12] PCI: Recode Mobiveil driver and add PCIe Gen4 > driver for NXP Layerscape SoCs > > On Wed, Nov 20, 2019 at 03:45:17AM +0000, Z.q. Hou wrote: > > From: Hou Zhiqiang <Zhiqiang.Hou@nxp.com> > > > > This patch set is to recode the Mobiveil driver and add PCIe support > > for NXP Layerscape series SoCs integrated Mobiveil's PCIe Gen4 > > controller. > > How many PCIe cards have been tested to work/don't work with this? > > I need: > > PCI: mobiveil: ls_pcie_g4: fix SError when accessing config space > PCI: mobiveil: ls_pcie_g4: add Workaround for A-011451 > PCI: mobiveil: ls_pcie_g4: add Workaround for A-011577 > > to successfully boot with a Mellanox card plugged in with a previous revision > of these patches. > Yes, we need to apply these internally maintained NXP workarounds on top of this series. I only tested an Intel e1000e NIC with this patch set + these 3 workarounds. Thanks, Zhiqiang > -- > RMK's Patch system: > https://www.armlinux.org.uk/developer/patches/ > FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps > up According to speedtest.net: 11.9Mbps down 500kbps up
Hi! On Tue, Nov 19, 2019 at 7:45 PM Z.q. Hou <zhiqiang.hou@nxp.com> wrote: > > From: Hou Zhiqiang <Zhiqiang.Hou@nxp.com> > > This patch set is to recode the Mobiveil driver and add > PCIe support for NXP Layerscape series SoCs integrated > Mobiveil's PCIe Gen4 controller. Can we get a respin for this on top of the 5.5 merge window material? Given that it's a bunch of refactorings, many of them don't apply on top of the material that was merged. I'd love to see these go in sooner rather than later so I can start getting -next running on LX2160A here. -Olof
Hi Lorenzo, The v9 patches have addressed the comments from Andrew, and it has been quiet for about 1 month, can you help to apply them? Thanks, Zhiqiang > -----Original Message----- > From: Olof Johansson <olof@lixom.net> > Sent: 14 December 2019 2:37 > To: Z.q. Hou <zhiqiang.hou@nxp.com>; bhelgaas@google.com > Cc: linux-pci@vger.kernel.org; linux-arm-kernel@lists.infradead.org; > devicetree@vger.kernel.org; linux-kernel@vger.kernel.org; > robh+dt@kernel.org; arnd@arndb.de; mark.rutland@arm.com; > l.subrahmanya@mobiveil.co.in; shawnguo@kernel.org; > m.karthikeyan@mobiveil.co.in; Leo Li <leoyang.li@nxp.com>; > lorenzo.pieralisi@arm.com; catalin.marinas@arm.com; > will.deacon@arm.com; andrew.murray@arm.com; Mingkai Hu > <mingkai.hu@nxp.com>; M.h. Lian <minghuan.lian@nxp.com>; Xiaowei Bao > <xiaowei.bao@nxp.com> > Subject: Re: [PATCHv9 00/12] PCI: Recode Mobiveil driver and add PCIe Gen4 > driver for NXP Layerscape SoCs > > Hi! > > On Tue, Nov 19, 2019 at 7:45 PM Z.q. Hou <zhiqiang.hou@nxp.com> wrote: > > > > From: Hou Zhiqiang <Zhiqiang.Hou@nxp.com> > > > > This patch set is to recode the Mobiveil driver and add PCIe support > > for NXP Layerscape series SoCs integrated Mobiveil's PCIe Gen4 > > controller. > > Can we get a respin for this on top of the 5.5 merge window material? > Given that it's a bunch of refactorings, many of them don't apply on top of > the material that was merged. > > I'd love to see these go in sooner rather than later so I can start getting -next > running on LX2160A here. > > > -Olof
On Tue, Dec 17, 2019 at 02:50:15AM +0000, Z.q. Hou wrote: > Hi Lorenzo, > > The v9 patches have addressed the comments from Andrew, and it has > been quiet for about 1 month, can you help to apply them? We shall have a look beginning of next week, sorry for the delay in getting back to you. Lorenzo > Thanks, > Zhiqiang > > > -----Original Message----- > > From: Olof Johansson <olof@lixom.net> > > Sent: 14 December 2019 2:37 > > To: Z.q. Hou <zhiqiang.hou@nxp.com>; bhelgaas@google.com > > Cc: linux-pci@vger.kernel.org; linux-arm-kernel@lists.infradead.org; > > devicetree@vger.kernel.org; linux-kernel@vger.kernel.org; > > robh+dt@kernel.org; arnd@arndb.de; mark.rutland@arm.com; > > l.subrahmanya@mobiveil.co.in; shawnguo@kernel.org; > > m.karthikeyan@mobiveil.co.in; Leo Li <leoyang.li@nxp.com>; > > lorenzo.pieralisi@arm.com; catalin.marinas@arm.com; > > will.deacon@arm.com; andrew.murray@arm.com; Mingkai Hu > > <mingkai.hu@nxp.com>; M.h. Lian <minghuan.lian@nxp.com>; Xiaowei Bao > > <xiaowei.bao@nxp.com> > > Subject: Re: [PATCHv9 00/12] PCI: Recode Mobiveil driver and add PCIe Gen4 > > driver for NXP Layerscape SoCs > > > > Hi! > > > > On Tue, Nov 19, 2019 at 7:45 PM Z.q. Hou <zhiqiang.hou@nxp.com> wrote: > > > > > > From: Hou Zhiqiang <Zhiqiang.Hou@nxp.com> > > > > > > This patch set is to recode the Mobiveil driver and add PCIe support > > > for NXP Layerscape series SoCs integrated Mobiveil's PCIe Gen4 > > > controller. > > > > Can we get a respin for this on top of the 5.5 merge window material? > > Given that it's a bunch of refactorings, many of them don't apply on top of > > the material that was merged. > > > > I'd love to see these go in sooner rather than later so I can start getting -next > > running on LX2160A here. > > > > > > -Olof
On Fri, Jan 10, 2020 at 7:33 AM Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> wrote: > > On Tue, Dec 17, 2019 at 02:50:15AM +0000, Z.q. Hou wrote: > > Hi Lorenzo, > > > > The v9 patches have addressed the comments from Andrew, and it has > > been dried about 1 month, can you help to apply them? > > We shall have a look beginning of next week, sorry for the delay > in getting back to you. Note that the patch set no longer applies since the refactorings conflict with new development by others. Zhiqiang, can you rebase and post a new version of the patch set? -Olof
Hi Olof, Thanks a lot for your comments! And sorry for my delayed response! > -----Original Message----- > From: Olof Johansson <olof@lixom.net> > Sent: 11 January 2020 1:06 > To: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> > Cc: Z.q. Hou <zhiqiang.hou@nxp.com>; bhelgaas@google.com; > linux-pci@vger.kernel.org; linux-arm-kernel@lists.infradead.org; > devicetree@vger.kernel.org; linux-kernel@vger.kernel.org; > robh+dt@kernel.org; arnd@arndb.de; mark.rutland@arm.com; > l.subrahmanya@mobiveil.co.in; shawnguo@kernel.org; > m.karthikeyan@mobiveil.co.in; Leo Li <leoyang.li@nxp.com>; > catalin.marinas@arm.com; will.deacon@arm.com; andrew.murray@arm.com; > Mingkai Hu <mingkai.hu@nxp.com>; M.h. Lian <minghuan.lian@nxp.com>; > Xiaowei Bao <xiaowei.bao@nxp.com> > Subject: Re: [PATCHv9 00/12] PCI: Recode Mobiveil driver and add PCIe Gen4 > driver for NXP Layerscape SoCs > > On Fri, Jan 10, 2020 at 7:33 AM Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> > wrote: > > > > On Tue, Dec 17, 2019 at 02:50:15AM +0000, Z.q. Hou wrote: > > > Hi Lorenzo, > > > > > > The v9 patches have addressed the comments from Andrew, and it has > > > been quiet for about 1 month, can you help to apply them? > > > > We shall have a look beginning of next week, sorry for the delay in > > getting back to you. > > Note that the patch set no longer applies since the refactorings conflict with > new development by others. > > Zhiqiang, can you rebase and post a new version of the patch set? Yes, I will rebase the patches to the latest code base. Thanks, Zhiqiang > > -Olof
On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@nxp.com> wrote: > > Hi Olof, > > Thanks a lot for your comments! > And sorry for my delayed response! Actually, they apply with only minor conflicts on top of current -next. Bjorn, any chance we can get you to pick these up pretty soon? They enable full use of a promising ARM developer system, the SolidRun HoneyComb, and would be quite valuable for me and others to be able to use with mainline or -next without any additional patches applied -- which this patchset achieves. I know there are pending revisions based on feedback. I'll leave it up to you and others to determine if that can be done with incremental patches on top, or if it should be fixed before the initial patchset is applied. But all in all, it's holding up adoption by me and surely others of a very interesting platform -- I'm looking to replace my aging MacchiatoBin with one of these and would need PCIe/NVMe to work before I do. Thanks! -Olof
On Mon, Feb 10, 2020 at 04:12:30PM +0100, Olof Johansson wrote: > On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@nxp.com> wrote: > > > > Hi Olof, > > > > Thanks a lot for your comments! > > And sorry for my delay respond! > > Actually, they apply with only minor conflicts on top of current -next. > > Bjorn, any chance we can get you to pick these up pretty soon? They > enable full use of a promising ARM developer system, the SolidRun > HoneyComb, and would be quite valuable for me and others to be able to > use with mainline or -next without any additional patches applied -- > which this patchset achieves. > > I know there are pending revisions based on feedback. I'll leave it up > to you and others to determine if that can be done with incremental > patches on top, or if it should be fixed before the initial patchset > is applied. But all in all, it's holding up adaption by me and surely > others of a very interesting platform -- I'm looking to replace my > aging MacchiatoBin with one of these and would need PCIe/NVMe to work > before I do. If you're going to be using NVMe, make sure you use a power-fail safe version; I've already had one instance where ext4 failed to mount because of a corrupted journal using an XPG SX8200 after the Honeycomb Serror'd, and then I powered it down after a few hours before later booting it back up. EXT4-fs (nvme0n1p2): INFO: recovery required on readonly filesystem EXT4-fs (nvme0n1p2): write access will be enabled during recovery JBD2: journal transaction 80849 on nvme0n1p2-8 is corrupt. EXT4-fs (nvme0n1p2): error loading journal
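For anyone comparing drives on this point: whether a consumer NVMe relies on a volatile write cache is visible from user space. A minimal sketch with nvme-cli (the device name is illustrative; feature 06h is the NVMe Volatile Write Cache feature):

  nvme id-ctrl /dev/nvme0 | grep -i vwc    # does the controller report a volatile write cache?
  nvme get-feature /dev/nvme0 -f 0x06      # feature 06h: is that cache currently enabled?
  nvme flush /dev/nvme0 -n 0xffffffff      # commit cached data to media (broadcast NSID, if supported)

A clean shutdown flushes and quiesces the drive through exactly this path; the corruption scenario described above is the one where power disappears before any of that happens.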
On Mon, Feb 10, 2020 at 4:23 PM Russell King - ARM Linux admin <linux@armlinux.org.uk> wrote: > > On Mon, Feb 10, 2020 at 04:12:30PM +0100, Olof Johansson wrote: > > On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@nxp.com> wrote: > > > > > > Hi Olof, > > > > > > Thanks a lot for your comments! > > > And sorry for my delay respond! > > > > Actually, they apply with only minor conflicts on top of current -next. > > > > Bjorn, any chance we can get you to pick these up pretty soon? They > > enable full use of a promising ARM developer system, the SolidRun > > HoneyComb, and would be quite valuable for me and others to be able to > > use with mainline or -next without any additional patches applied -- > > which this patchset achieves. > > > > I know there are pending revisions based on feedback. I'll leave it up > > to you and others to determine if that can be done with incremental > > patches on top, or if it should be fixed before the initial patchset > > is applied. But all in all, it's holding up adaption by me and surely > > others of a very interesting platform -- I'm looking to replace my > > aging MacchiatoBin with one of these and would need PCIe/NVMe to work > > before I do. > > If you're going to be using NVMe, make sure you use a power-fail safe > version; I've already had one instance where ext4 failed to mount > because of a corrupted journal using an XPG SX8200 after the Honeycomb > Serror'd, and then I powered it down after a few hours before later > booting it back up. > > EXT4-fs (nvme0n1p2): INFO: recovery required on readonly filesystem > EXT4-fs (nvme0n1p2): write access will be enabled during recovery > JBD2: journal transaction 80849 on nvme0n1p2-8 is corrupt. > EXT4-fs (nvme0n1p2): error loading journal Hmm, using btrfs on mine, not sure if the exposure is similar or not. Do you know if the SErr was due to a known issue and/or if it's something that's fixed in production silicon? (I still can't enable SMMU since across a warm reboot it fails *completely*, with nothing coming up and working. NXP folks, you listening? :) -Olof
On Mon, Feb 10, 2020 at 04:12:30PM +0100, Olof Johansson wrote: > On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@nxp.com> wrote: > > > > Hi Olof, > > > > Thanks a lot for your comments! > > And sorry for my delay respond! > > Actually, they apply with only minor conflicts on top of current -next. > > Bjorn, any chance we can get you to pick these up pretty soon? They > enable full use of a promising ARM developer system, the SolidRun > HoneyComb, and would be quite valuable for me and others to be able to > use with mainline or -next without any additional patches applied -- > which this patchset achieves. > > I know there are pending revisions based on feedback. I'll leave it up > to you and others to determine if that can be done with incremental > patches on top, or if it should be fixed before the initial patchset > is applied. But all in all, it's holding up adaption by me and surely > others of a very interesting platform -- I'm looking to replace my > aging MacchiatoBin with one of these and would need PCIe/NVMe to work > before I do. We should be able to merge them for v5.7, I don't know when they will land in -next. Thanks, Lorenzo
On Mon, Feb 10, 2020 at 04:28:23PM +0100, Olof Johansson wrote: > On Mon, Feb 10, 2020 at 4:23 PM Russell King - ARM Linux admin > <linux@armlinux.org.uk> wrote: > > > > On Mon, Feb 10, 2020 at 04:12:30PM +0100, Olof Johansson wrote: > > > On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@nxp.com> wrote: > > > > > > > > Hi Olof, > > > > > > > > Thanks a lot for your comments! > > > > And sorry for my delay respond! > > > > > > Actually, they apply with only minor conflicts on top of current -next. > > > > > > Bjorn, any chance we can get you to pick these up pretty soon? They > > > enable full use of a promising ARM developer system, the SolidRun > > > HoneyComb, and would be quite valuable for me and others to be able to > > > use with mainline or -next without any additional patches applied -- > > > which this patchset achieves. > > > > > > I know there are pending revisions based on feedback. I'll leave it up > > > to you and others to determine if that can be done with incremental > > > patches on top, or if it should be fixed before the initial patchset > > > is applied. But all in all, it's holding up adaption by me and surely > > > others of a very interesting platform -- I'm looking to replace my > > > aging MacchiatoBin with one of these and would need PCIe/NVMe to work > > > before I do. > > > > If you're going to be using NVMe, make sure you use a power-fail safe > > version; I've already had one instance where ext4 failed to mount > > because of a corrupted journal using an XPG SX8200 after the Honeycomb > > Serror'd, and then I powered it down after a few hours before later > > booting it back up. > > > > EXT4-fs (nvme0n1p2): INFO: recovery required on readonly filesystem > > EXT4-fs (nvme0n1p2): write access will be enabled during recovery > > JBD2: journal transaction 80849 on nvme0n1p2-8 is corrupt. > > EXT4-fs (nvme0n1p2): error loading journal > > Hmm, using btrfs on mine, not sure if the exposure is similar or not. As I understand the problem, it isn't a filesystem issue. It's a data integrity issue with the NVMe over power fail, how they cache the data, and ultimately write it to the nand flash. Have a read of: https://www.kingston.com/en/solutions/servers-data-centers/ssd-power-loss-protection As NVMe and SSD are basically the same underlying technology (the host interface is different) and the issues I've heard, and now experienced with my NVMe, I think the above is a good pointer to the problems of flash mass storage. As I understand it, the problem occurs when the mapping table has not been written back to flash, power is lost without the Standby Immediate command being sent, and there is no way for the firmware to quickly save the table. On subsequent power up, the firmware has to reconstruct the mapping table, and depending on how that is done, incorrect (old?) data may be returned for some blocks. That can happen to any blocks on the drive, which means any data can be at risk from a power loss event, whether that is a power failure or after a crash. > Do you know if the SErr was due to a known issue and/or if it's > something that's fixed in production silicon? The SError is triggered by something on the PCIe side of things; if I leave the Mellanox PCIe card out, then I don't get them. The errata patches I have merged into my tree help a bit, turning the code from being unable to boot without a SError with the card plugged in, to being able to boot and last a while - but the SErrors still eventually come, maybe taking a few days... 
and that's without the Mellanox ethernet interface being up. > (I still can't enable SMMU since across a warm reboot it fails > *completely*, with nothing coming up and working. NXP folks, you > listening? :) Is it just a warm reboot? I thought I saw SMMU activity on a cold boot as well, implying that there were devices active that Linux did not know about.
On Mon, Feb 10, 2020 at 04:15:53PM +0000, Russell King - ARM Linux admin wrote: > On Mon, Feb 10, 2020 at 04:28:23PM +0100, Olof Johansson wrote: > > On Mon, Feb 10, 2020 at 4:23 PM Russell King - ARM Linux admin > > <linux@armlinux.org.uk> wrote: > > > > > > On Mon, Feb 10, 2020 at 04:12:30PM +0100, Olof Johansson wrote: > > > > On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@nxp.com> wrote: > > > > > > > > > > Hi Olof, > > > > > > > > > > Thanks a lot for your comments! > > > > > And sorry for my delay respond! > > > > > > > > Actually, they apply with only minor conflicts on top of current -next. > > > > > > > > Bjorn, any chance we can get you to pick these up pretty soon? They > > > > enable full use of a promising ARM developer system, the SolidRun > > > > HoneyComb, and would be quite valuable for me and others to be able to > > > > use with mainline or -next without any additional patches applied -- > > > > which this patchset achieves. > > > > > > > > I know there are pending revisions based on feedback. I'll leave it up > > > > to you and others to determine if that can be done with incremental > > > > patches on top, or if it should be fixed before the initial patchset > > > > is applied. But all in all, it's holding up adaption by me and surely > > > > others of a very interesting platform -- I'm looking to replace my > > > > aging MacchiatoBin with one of these and would need PCIe/NVMe to work > > > > before I do. > > > > > > If you're going to be using NVMe, make sure you use a power-fail safe > > > version; I've already had one instance where ext4 failed to mount > > > because of a corrupted journal using an XPG SX8200 after the Honeycomb > > > Serror'd, and then I powered it down after a few hours before later > > > booting it back up. > > > > > > EXT4-fs (nvme0n1p2): INFO: recovery required on readonly filesystem > > > EXT4-fs (nvme0n1p2): write access will be enabled during recovery > > > JBD2: journal transaction 80849 on nvme0n1p2-8 is corrupt. > > > EXT4-fs (nvme0n1p2): error loading journal > > > > Hmm, using btrfs on mine, not sure if the exposure is similar or not. > > As I understand the problem, it isn't a filesystem issue. It's a data > integrity issue with the NVMe over power fail, how they cache the data, > and ultimately write it to the nand flash. > > Have a read of: > > https://www.kingston.com/en/solutions/servers-data-centers/ssd-power-loss-protection This was the link I was actually looking for: http://industrial.adata.com/en/technology/92 but there's also: http://industrial.adata.com/en/technology/26 ADATA make the XPG SX8200: NVME Identify Controller: vid : 0x1cc1 ssvid : 0x1cc1 mn : ADATA SX8200PNP fr : R0906I
[cc:ing honeycomb-users, didn't think of that earlier] On Mon, Feb 10, 2020 at 5:16 PM Russell King - ARM Linux admin <linux@armlinux.org.uk> wrote: > > On Mon, Feb 10, 2020 at 04:28:23PM +0100, Olof Johansson wrote: > > On Mon, Feb 10, 2020 at 4:23 PM Russell King - ARM Linux admin > > <linux@armlinux.org.uk> wrote: > > > > > > On Mon, Feb 10, 2020 at 04:12:30PM +0100, Olof Johansson wrote: > > > > On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@nxp.com> wrote: > > > > > > > > > > Hi Olof, > > > > > > > > > > Thanks a lot for your comments! > > > > > And sorry for my delayed response! > > > > > > > > Actually, they apply with only minor conflicts on top of current -next. > > > > > > > > Bjorn, any chance we can get you to pick these up pretty soon? They > > > > enable full use of a promising ARM developer system, the SolidRun > > > > HoneyComb, and would be quite valuable for me and others to be able to > > > > use with mainline or -next without any additional patches applied -- > > > > which this patchset achieves. > > > > > > > > I know there are pending revisions based on feedback. I'll leave it up > > > > to you and others to determine if that can be done with incremental > > > > patches on top, or if it should be fixed before the initial patchset > > > > is applied. But all in all, it's holding up adoption by me and surely > > > > others of a very interesting platform -- I'm looking to replace my > > > > aging MacchiatoBin with one of these and would need PCIe/NVMe to work > > > > before I do. > > > > > > If you're going to be using NVMe, make sure you use a power-fail safe > > > version; I've already had one instance where ext4 failed to mount > > > because of a corrupted journal using an XPG SX8200 after the Honeycomb > > > Serror'd, and then I powered it down after a few hours before later > > > booting it back up. > > > > > > EXT4-fs (nvme0n1p2): INFO: recovery required on readonly filesystem > > > EXT4-fs (nvme0n1p2): write access will be enabled during recovery > > > JBD2: journal transaction 80849 on nvme0n1p2-8 is corrupt. > > > EXT4-fs (nvme0n1p2): error loading journal > > > > Hmm, using btrfs on mine, not sure if the exposure is similar or not. > > As I understand the problem, it isn't a filesystem issue. It's a data > integrity issue with the NVMe over power fail, how they cache the data, > and ultimately write it to the nand flash. > > Have a read of: > > https://www.kingston.com/en/solutions/servers-data-centers/ssd-power-loss-protection > > As NVMe and SSD are basically the same underlying technology (the host > interface is different) and the issues I've heard, and now experienced > with my NVMe, I think the above is a good pointer to the problems of > flash mass storage. > > As I understand it, the problem occurs when the mapping table has not > been written back to flash, power is lost without the Standby Immediate > command being sent, and there is no way for the firmware to quickly > save the table. On subsequent power up, the firmware has to > reconstruct the mapping table, and depending on how that is done, > incorrect (old?) data may be returned for some blocks. > > That can happen to any blocks on the drive, which means any data can > be at risk from a power loss event, whether that is a power failure > or after a crash. Makes me wonder whether there's some board-level power/reset sequencing issue, or if there's a problem with one card going down disabling others. 
I haven't read the specs enough to know what's expected behavior but I've seen similar issues on other platforms so take it with a grain of salt. > > Do you know if the SErr was due to a known issue and/or if it's > > something that's fixed in production silicon? > > The SError is triggered by something on the PCIe side of things; if I > leave the Mellanox PCIe card out, then I don't get them. The errata > patches I have merged into my tree help a bit, turning the code from > being unable to boot without a SError with the card plugged in, to > being able to boot and last a while - but the SErrors still eventually > come, maybe taking a few days... and that's without the Mellanox > ethernet interface being up. > > > (I still can't enable SMMU since across a warm reboot it fails > > *completely*, with nothing coming up and working. NXP folks, you > > listening? :) > > Is it just a warm reboot? I thought I saw SMMU activity on a cold > boot as well, implying that there were devices active that Linux > did not know about. Yeah, 100% reproducible on warm reboot -- every single time. Not on cold boot though (100% success rate as far as I remember). I boot with kernel on NVMe on PCIe, native 1GbE for networking. u-boot from SD card. This is with the SolidRun u-boot from GitHub. -Olof
On Mon, Feb 10, 2020 at 9:32 AM Olof Johansson <olof@lixom.net> wrote: > > On Mon, Feb 10, 2020 at 4:23 PM Russell King - ARM Linux admin > <linux@armlinux.org.uk> wrote: > > > > On Mon, Feb 10, 2020 at 04:12:30PM +0100, Olof Johansson wrote: > > > On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@nxp.com> wrote: > > > > > > > > Hi Olof, > > > > > > > > Thanks a lot for your comments! > > > > And sorry for my delay respond! > > > > > > Actually, they apply with only minor conflicts on top of current -next. > > > > > > Bjorn, any chance we can get you to pick these up pretty soon? They > > > enable full use of a promising ARM developer system, the SolidRun > > > HoneyComb, and would be quite valuable for me and others to be able to > > > use with mainline or -next without any additional patches applied -- > > > which this patchset achieves. > > > > > > I know there are pending revisions based on feedback. I'll leave it up > > > to you and others to determine if that can be done with incremental > > > patches on top, or if it should be fixed before the initial patchset > > > is applied. But all in all, it's holding up adaption by me and surely > > > others of a very interesting platform -- I'm looking to replace my > > > aging MacchiatoBin with one of these and would need PCIe/NVMe to work > > > before I do. > > > > If you're going to be using NVMe, make sure you use a power-fail safe > > version; I've already had one instance where ext4 failed to mount > > because of a corrupted journal using an XPG SX8200 after the Honeycomb > > Serror'd, and then I powered it down after a few hours before later > > booting it back up. > > > > EXT4-fs (nvme0n1p2): INFO: recovery required on readonly filesystem > > EXT4-fs (nvme0n1p2): write access will be enabled during recovery > > JBD2: journal transaction 80849 on nvme0n1p2-8 is corrupt. > > EXT4-fs (nvme0n1p2): error loading journal > > Hmm, using btrfs on mine, not sure if the exposure is similar or not. > > Do you know if the SErr was due to a known issue and/or if it's > something that's fixed in production silicon? > > (I still can't enable SMMU since across a warm reboot it fails > *completely*, with nothing coming up and working. NXP folks, you > listening? :) This is a known issue about DPAA2 MC bus not working well with SMMU based IO mapping. Adding Laurentiu to the chain who has been looking into this issue. Regards, Leo
On Mon, Feb 10, 2020 at 12:41 PM Li Yang <leoyang.li@nxp.com> wrote: > > On Mon, Feb 10, 2020 at 9:32 AM Olof Johansson <olof@lixom.net> wrote: > > > > On Mon, Feb 10, 2020 at 4:23 PM Russell King - ARM Linux admin > > <linux@armlinux.org.uk> wrote: > > > > > > On Mon, Feb 10, 2020 at 04:12:30PM +0100, Olof Johansson wrote: > > > > On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@nxp.com> wrote: > > > > > > > > > > Hi Olof, > > > > > > > > > > Thanks a lot for your comments! > > > > > And sorry for my delayed response! > > > > > > > > Actually, they apply with only minor conflicts on top of current -next. > > > > > > > > Bjorn, any chance we can get you to pick these up pretty soon? They > > > > enable full use of a promising ARM developer system, the SolidRun > > > > HoneyComb, and would be quite valuable for me and others to be able to > > > > use with mainline or -next without any additional patches applied -- > > > > which this patchset achieves. > > > > > > > > I know there are pending revisions based on feedback. I'll leave it up > > > > to you and others to determine if that can be done with incremental > > > > patches on top, or if it should be fixed before the initial patchset > > > > is applied. But all in all, it's holding up adoption by me and surely > > > > others of a very interesting platform -- I'm looking to replace my > > > > aging MacchiatoBin with one of these and would need PCIe/NVMe to work > > > > before I do. > > > > > > If you're going to be using NVMe, make sure you use a power-fail safe > > > version; I've already had one instance where ext4 failed to mount > > > because of a corrupted journal using an XPG SX8200 after the Honeycomb > > > Serror'd, and then I powered it down after a few hours before later > > > booting it back up. > > > > > > EXT4-fs (nvme0n1p2): INFO: recovery required on readonly filesystem > > > EXT4-fs (nvme0n1p2): write access will be enabled during recovery > > > JBD2: journal transaction 80849 on nvme0n1p2-8 is corrupt. > > > EXT4-fs (nvme0n1p2): error loading journal > > > > Hmm, using btrfs on mine, not sure if the exposure is similar or not. > > > > Do you know if the SErr was due to a known issue and/or if it's > > something that's fixed in production silicon? > > > > (I still can't enable SMMU since across a warm reboot it fails > > *completely*, with nothing coming up and working. NXP folks, you > > listening? :) > > This is a known issue about DPAA2 MC bus not working well with SMMU > based IO mapping. Adding Laurentiu to the chain who has been looking > into this issue. Forgot to mention that you can work around the issue by setting CONFIG_ARM_SMMU_DISABLE_BYPASS_BY_DEFAULT=n or adding "arm-smmu.disable_bypass=0" to boot parameters. Regards, Leo
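Spelled out, the two forms of that workaround look like this; the Kconfig symbol and module parameter come from Leo's message, while the U-Boot commands are only an illustrative way of getting the parameter onto the kernel command line (environment layouts differ between boards):

  # Build time: allow SMMU bypass for streams with no attached domain
  CONFIG_ARM_SMMU_DISABLE_BYPASS_BY_DEFAULT=n

  # Boot time, e.g. from the U-Boot prompt:
  setenv bootargs "${bootargs} arm-smmu.disable_bypass=0"
  saveenv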
On 10.02.2020 20:41, Li Yang wrote: > On Mon, Feb 10, 2020 at 9:32 AM Olof Johansson <olof@lixom.net> wrote: >> >> On Mon, Feb 10, 2020 at 4:23 PM Russell King - ARM Linux admin >> <linux@armlinux.org.uk> wrote: >>> >>> On Mon, Feb 10, 2020 at 04:12:30PM +0100, Olof Johansson wrote: >>>> On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@nxp.com> wrote: >>>>> >>>>> Hi Olof, >>>>> >>>>> Thanks a lot for your comments! >>>>> And sorry for my delayed response! >>>> >>>> Actually, they apply with only minor conflicts on top of current -next. >>>> >>>> Bjorn, any chance we can get you to pick these up pretty soon? They >>>> enable full use of a promising ARM developer system, the SolidRun >>>> HoneyComb, and would be quite valuable for me and others to be able to >>>> use with mainline or -next without any additional patches applied -- >>>> which this patchset achieves. >>>> >>>> I know there are pending revisions based on feedback. I'll leave it up >>>> to you and others to determine if that can be done with incremental >>>> patches on top, or if it should be fixed before the initial patchset >>>> is applied. But all in all, it's holding up adoption by me and surely >>>> others of a very interesting platform -- I'm looking to replace my >>>> aging MacchiatoBin with one of these and would need PCIe/NVMe to work >>>> before I do. >>> >>> If you're going to be using NVMe, make sure you use a power-fail safe >>> version; I've already had one instance where ext4 failed to mount >>> because of a corrupted journal using an XPG SX8200 after the Honeycomb >>> Serror'd, and then I powered it down after a few hours before later >>> booting it back up. >>> >>> EXT4-fs (nvme0n1p2): INFO: recovery required on readonly filesystem >>> EXT4-fs (nvme0n1p2): write access will be enabled during recovery >>> JBD2: journal transaction 80849 on nvme0n1p2-8 is corrupt. >>> EXT4-fs (nvme0n1p2): error loading journal >> >> Hmm, using btrfs on mine, not sure if the exposure is similar or not. >> >> Do you know if the SErr was due to a known issue and/or if it's >> something that's fixed in production silicon? >> >> (I still can't enable SMMU since across a warm reboot it fails >> *completely*, with nothing coming up and working. NXP folks, you >> listening? :) > > This is a known issue about DPAA2 MC bus not working well with SMMU > based IO mapping. Adding Laurentiu to the chain who has been looking > into this issue. Yes, I'm closely following the issue. I actually have a workaround (attached) but haven't submitted it, as it will probably raise a lot of eyebrows. In the mean time I'm following some discussions [1][2][3] on the iommu list which seem to try to tackle what appears to be a similar issue but with framebuffers. My hope is that we will be able to leverage whatever turns out. In the mean time, can you try the workaround Leo suggested? [1] https://patchwork.kernel.org/patch/11327667/ [2] https://patchwork.kernel.org/patch/10967729/ [3] https://patchwork.kernel.org/cover/11279577/ --- Best Regards, Laurentiu
On 2020-02-11 12:13 pm, Laurentiu Tudor wrote: [...] >> This is a known issue about DPAA2 MC bus not working well with SMMU >> based IO mapping. Adding Laurentiu to the chain who has been looking >> into this issue. > > Yes, I'm closely following the issue. I actually have a workaround > (attached) but haven't submitted as it will probably raise a lot of > eyebrows. In the mean time I'm following some discussions [1][2][3] on > the iommu list which seem to try to tackle what appears to be a similar > issue but with framebuffers. My hope is that we will be able to leverage > whatever turns out. Indeed it's more general than framebuffers - in fact there was a specific requirement from the IORT side to accommodate network/storage controllers with in-memory firmware/configuration data/whatever set up by the bootloader that want to be handed off 'live' to Linux because the overhead of stopping and restarting them is impractical. Thus this DPAA2 setup is very much within scope of the desired solution, so please feel free to join in (particularly on the DT parts) :) As for right now, note that your patch would only be a partial mitigation to slightly reduce the fault window but not remove it entirely. To be robust the SMMU driver *has* to know about live streams before the first arm_smmu_reset() - hence the need for generic firmware bindings - so doing anything from the MC driver is already too late (and indeed the current iommu_request_dm_for_dev() mechanism is itself a microcosm of the same problem). > In the mean time, can you try the workaround Leo suggested? Agreed, I'd imagine the command-line option is probably the best choice for these platforms, since it's likely to be easier to set that by default in the bootloader than faff with rebuilding generic kernel configs. Robin. > [1] https://patchwork.kernel.org/patch/11327667/ > [2] https://patchwork.kernel.org/patch/10967729/ > [3] https://patchwork.kernel.org/cover/11279577/ > > --- > Best Regards, Laurentiu
On 11.02.2020 15:04, Robin Murphy wrote: > On 2020-02-11 12:13 pm, Laurentiu Tudor wrote: > [...] >>> This is a known issue about DPAA2 MC bus not working well with SMMU >>> based IO mapping. Adding Laurentiu to the chain who has been looking >>> into this issue. >> >> Yes, I'm closely following the issue. I actually have a workaround >> (attached) but haven't submitted as it will probably raise a lot of >> eyebrows. In the mean time I'm following some discussions [1][2][3] on >> the iommu list which seem to try to tackle what appears to be a >> similar issue but with framebuffers. My hope is that we will be able >> to leverage whatever turns out. > > Indeed it's more general than framebuffers - in fact there was a > specific requirement from the IORT side to accommodate network/storage > controllers with in-memory firmware/configuration data/whatever set up > by the bootloader that want to be handed off 'live' to Linux because the > overhead of stopping and restarting them is impractical. Thus this DPAA2 > setup is very much within scope of the desired solution, so please feel > free to join in (particularly on the DT parts) :) Will sure do. Seems that the 2nd approach (the one with list of compatibles in arm-smmu) fits really well with our scenario. Will this be the way to go forward? > As for right now, note that your patch would only be a partial > mitigation to slightly reduce the fault window but not remove it > entirely. To be robust the SMMU driver *has* to know about live streams > before the first arm_smmu_reset() - hence the need for generic firmware > bindings - so doing anything from the MC driver is already too late (and > indeed the current iommu_request_dm_for_dev() mechanism is itself a > microcosm of the same problem). I think you might have missed in the patch that it pauses the firmware at early boot, in its driver init and it resumes it only after iommu_request_dm_for_dev() has completed. :) --- Best Regards, Laurentiu
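For readers following along, the ordering Laurentiu describes has roughly this shape. This is a sketch only: mc_pause_dma()/mc_resume_dma() are hypothetical stand-ins for the MC control calls in the attached (unposted) patch, while iommu_request_dm_for_dev() was the real kernel API of this era (later removed); the "initcall trickery" is what makes the pause land before the SMMU driver's reset.

  #include <linux/init.h>
  #include <linux/iommu.h>

  /* Hypothetical helpers standing in for the attached patch's MC calls. */
  void mc_pause_dma(void);
  void mc_resume_dma(void);

  /* Early initcall: quiesce the MC's DMA before the SMMU driver probes
   * and resets the SMMU, so no live stream hits a blocking SMMU. */
  static int __init mc_quiesce_early(void)
  {
          mc_pause_dma();
          return 0;
  }
  arch_initcall(mc_quiesce_early);

  /* Later, from the MC bus probe path: */
  static int mc_setup_iommu(struct device *dev)
  {
          /* Ask the IOMMU core for an identity (direct) mapping first... */
          int ret = iommu_request_dm_for_dev(dev);

          if (ret)
                  return ret;

          /* ...and only then let the MC firmware issue DMA again. */
          mc_resume_dma();
          return 0;
  }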
On Tue, Feb 11, 2020 at 5:04 AM Robin Murphy <robin.murphy@arm.com> wrote: > > On 2020-02-11 12:13 pm, Laurentiu Tudor wrote: > [...] > >> This is a known issue about DPAA2 MC bus not working well with SMMU > >> based IO mapping. Adding Laurentiu to the chain who has been looking > >> into this issue. > > > > Yes, I'm closely following the issue. I actually have a workaround > > (attached) but haven't submitted as it will probably raise a lot of > > eyebrows. In the mean time I'm following some discussions [1][2][3] on > > the iommu list which seem to try to tackle what appears to be a similar > > issue but with framebuffers. My hope is that we will be able to leverage > > whatever turns out. > > Indeed it's more general than framebuffers - in fact there was a > specific requirement from the IORT side to accommodate network/storage > controllers with in-memory firmware/configuration data/whatever set up > by the bootloader that want to be handed off 'live' to Linux because the > overhead of stopping and restarting them is impractical. Thus this DPAA2 > setup is very much within scope of the desired solution, so please feel > free to join in (particularly on the DT parts) :) That's a real problem that needs a solution, but that's not what's happening here, since cold boots work fine. Isn't it a whole lot more likely that something isn't reset/reinitialized properly in u-boot, such that there is lingering state in the setup, causing this? > As for right now, note that your patch would only be a partial > mitigation to slightly reduce the fault window but not remove it > entirely. To be robust the SMMU driver *has* to know about live streams > before the first arm_smmu_reset() - hence the need for generic firmware > bindings - so doing anything from the MC driver is already too late (and > indeed the current iommu_request_dm_for_dev() mechanism is itself a > microcosm of the same problem). This is more likely a live stream that's left behind from the previous kernel (there are some error messages about being unable to detach domains, but the errors make it hard to tell what driver didn't unbind enough). *BUT*, even with that bug, the system should reboot reliably and come up clean. So, something isn't clearing up the state *on boot*. > > In the mean time, can you try the workaround Leo suggested? > > Agreed, I'd imagine the command-line option is probably the best choice > for these platforms, since it's likely to be easier to set that by > default in the bootloader than faff with rebuilding generic kernel configs. For the generic user, definitely. I'll give it a go later this week when I have a bit more spare time with the device physically present. -Olof
On 11/02/2020 1:55 pm, Laurentiu Tudor wrote: > > > On 11.02.2020 15:04, Robin Murphy wrote: >> On 2020-02-11 12:13 pm, Laurentiu Tudor wrote: >> [...] >>>> This is a known issue about DPAA2 MC bus not working well with SMMU >>>> based IO mapping. Adding Laurentiu to the chain who has been looking >>>> into this issue. >>> >>> Yes, I'm closely following the issue. I actually have a workaround >>> (attached) but haven't submitted as it will probably raise a lot of >>> eyebrows. In the mean time I'm following some discussions [1][2][3] >>> on the iommu list which seem to try to tackle what appears to be a >>> similar issue but with framebuffers. My hope is that we will be able >>> to leverage whatever turns out. >> >> Indeed it's more general than framebuffers - in fact there was a >> specific requirement from the IORT side to accommodate network/storage >> controllers with in-memory firmware/configuration data/whatever set up >> by the bootloader that want to be handed off 'live' to Linux because >> the overhead of stopping and restarting them is impractical. Thus this >> DPAA2 setup is very much within scope of the desired solution, so >> please feel free to join in (particularly on the DT parts) :) > > Will sure do. Seems that the 2nd approach (the one with list of > compatibles in arm-smmu) fits really well with our scenario. Will this > be the way to go forward? I'm hoping that Thierry's proposal can be made to work out, since it's closer to how the ACPI version should work, which means we would be able to do a lot more in shared common code rather than baking magic knowledge and duplicated functionality into individual IOMMU drivers. >> As for right now, note that your patch would only be a partial >> mitigation to slightly reduce the fault window but not remove it >> entirely. To be robust the SMMU driver *has* to know about live >> streams before the first arm_smmu_reset() - hence the need for generic >> firmware bindings - so doing anything from the MC driver is already >> too late (and indeed the current iommu_request_dm_for_dev() mechanism >> is itself a microcosm of the same problem). > > I think you might have missed in the patch that it pauses the firmware > at early boot, in its driver init and it resumes it only after > iommu_request_dm_for_dev() has completed. :) Ah, from the context I missed that that was non-modular and relying on initcall trickery... fair enough, in that case I'll downgrade my "it's insufficient" to "it's ugly and somewhat fragile" :P Thanks, Robin.
On 11.02.2020 16:48, Olof Johansson wrote: > On Tue, Feb 11, 2020 at 5:04 AM Robin Murphy <robin.murphy@arm.com> wrote: >> >> On 2020-02-11 12:13 pm, Laurentiu Tudor wrote: >> [...] >>>> This is a known issue about DPAA2 MC bus not working well with SMMU >>>> based IO mapping. Adding Laurentiu to the chain who has been looking >>>> into this issue. >>> >>> Yes, I'm closely following the issue. I actually have a workaround >>> (attached) but haven't submitted as it will probably raise a lot of >>> eyebrows. In the mean time I'm following some discussions [1][2][3] on >>> the iommu list which seem to try to tackle what appears to be a similar >>> issue but with framebuffers. My hope is that we will be able to leverage >>> whatever turns out. >> >> Indeed it's more general than framebuffers - in fact there was a >> specific requirement from the IORT side to accommodate network/storage >> controllers with in-memory firmware/configuration data/whatever set up >> by the bootloader that want to be handed off 'live' to Linux because the >> overhead of stopping and restarting them is impractical. Thus this DPAA2 >> setup is very much within scope of the desired solution, so please feel >> free to join in (particularly on the DT parts) :) > > That's a real problem that nees a solution, but that's not what's > happening here, since cold boots works fine. > > Isn't it a whole lot more likely that something isn't > reset/reinitialized properly in u-boot, such that there is lingering > state in the setup, causing this? Ok, so this is completely something else. I don't think our u-boots are designed to run in ways other than coming from hard reset. >> As for right now, note that your patch would only be a partial >> mitigation to slightly reduce the fault window but not remove it >> entirely. To be robust the SMMU driver *has* to know about live streams >> before the first arm_smmu_reset() - hence the need for generic firmware >> bindings - so doing anything from the MC driver is already too late (and >> indeed the current iommu_request_dm_for_dev() mechanism is itself a >> microcosm of the same problem). > > This is more likely a live stream that's left behind from the previous > kernel (there are some error messages about being unable to detach > domains, but the errors make it hard to tell what driver didn't unbind > enough). I also noticed those messages. Perhaps our PCI driver doesn't do all the required cleanup. > *BUT*, even with that bug, the system should reboot reliably and come > up clean. So, something isn't clearing up the state *on boot*. We do test some kexec based "soft-reset" scenarios, didn't hit your issue but instead we hit this: https://lkml.org/lkml/2018/9/21/1066 Can you please provide some more info on your scenario? --- Best Regards, Laurentiu
On Mon, Feb 10, 2020 at 03:22:57PM +0000, Russell King - ARM Linux admin wrote: > On Mon, Feb 10, 2020 at 04:12:30PM +0100, Olof Johansson wrote: > > On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@nxp.com> wrote: > > > > > > Hi Olof, > > > > > > Thanks a lot for your comments! > > > And sorry for my delay respond! > > > > Actually, they apply with only minor conflicts on top of current -next. > > > > Bjorn, any chance we can get you to pick these up pretty soon? They > > enable full use of a promising ARM developer system, the SolidRun > > HoneyComb, and would be quite valuable for me and others to be able to > > use with mainline or -next without any additional patches applied -- > > which this patchset achieves. > > > > I know there are pending revisions based on feedback. I'll leave it up > > to you and others to determine if that can be done with incremental > > patches on top, or if it should be fixed before the initial patchset > > is applied. But all in all, it's holding up adaption by me and surely > > others of a very interesting platform -- I'm looking to replace my > > aging MacchiatoBin with one of these and would need PCIe/NVMe to work > > before I do. > > If you're going to be using NVMe, make sure you use a power-fail safe > version; I've already had one instance where ext4 failed to mount > because of a corrupted journal using an XPG SX8200 after the Honeycomb > Serror'd, and then I powered it down after a few hours before later > booting it back up. > > EXT4-fs (nvme0n1p2): INFO: recovery required on readonly filesystem > EXT4-fs (nvme0n1p2): write access will be enabled during recovery > JBD2: journal transaction 80849 on nvme0n1p2-8 is corrupt. > EXT4-fs (nvme0n1p2): error loading journal ... and last night, I just got more ext4fs errors on the NVMe, without any unclean power cycles: [73729.556544] EXT4-fs error (device nvme0n1p2): ext4_lookup:1700: inode #917524: comm rm: iget: checksum invalid [73729.565354] Aborting journal on device nvme0n1p2-8. [73729.568995] EXT4-fs (nvme0n1p2): Remounting filesystem read-only [73729.569077] EXT4-fs error (device nvme0n1p2): ext4_journal_check_start:61: Detected aborted journal [73729.573741] EXT4-fs error (device nvme0n1p2): ext4_lookup:1700: inode #917524: comm rm: iget: checksum invalid [73729.593330] EXT4-fs error (device nvme0n1p2): ext4_lookup:1700: inode #917524: comm mv: iget: checksum invalid The affected file is /var/backups/dpkg.status.6.gz It was cleanly shut down and powered off on the 22nd February, booted yesterday morning followed by another reboot a few minutes later. What worries me is the fact that corruption has happened - and if that happens to a file rather than an inode, it will likely go unnoticed for a considerably longer time. I think I'm getting to the point of deciding NVMe or the LX2160A to be just too unreliable for serious use. I hadn't noticed any issues when using the rootfs on the eMMC, so it suggests either the NVMe is unreliable, or there's a problem with PCIe on this platform (which we kind of know about with Jon's GPU rendering issues.)
On Sat, Feb 29, 2020 at 09:55:50AM +0000, Russell King - ARM Linux admin wrote: > On Mon, Feb 10, 2020 at 03:22:57PM +0000, Russell King - ARM Linux admin wrote: > > On Mon, Feb 10, 2020 at 04:12:30PM +0100, Olof Johansson wrote: > > > On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@nxp.com> wrote: > > > > > > > > Hi Olof, > > > > > > > > Thanks a lot for your comments! > > > > And sorry for my delay respond! > > > > > > Actually, they apply with only minor conflicts on top of current -next. > > > > > > Bjorn, any chance we can get you to pick these up pretty soon? They > > > enable full use of a promising ARM developer system, the SolidRun > > > HoneyComb, and would be quite valuable for me and others to be able to > > > use with mainline or -next without any additional patches applied -- > > > which this patchset achieves. > > > > > > I know there are pending revisions based on feedback. I'll leave it up > > > to you and others to determine if that can be done with incremental > > > patches on top, or if it should be fixed before the initial patchset > > > is applied. But all in all, it's holding up adaption by me and surely > > > others of a very interesting platform -- I'm looking to replace my > > > aging MacchiatoBin with one of these and would need PCIe/NVMe to work > > > before I do. > > > > If you're going to be using NVMe, make sure you use a power-fail safe > > version; I've already had one instance where ext4 failed to mount > > because of a corrupted journal using an XPG SX8200 after the Honeycomb > > Serror'd, and then I powered it down after a few hours before later > > booting it back up. > > > > EXT4-fs (nvme0n1p2): INFO: recovery required on readonly filesystem > > EXT4-fs (nvme0n1p2): write access will be enabled during recovery > > JBD2: journal transaction 80849 on nvme0n1p2-8 is corrupt. > > EXT4-fs (nvme0n1p2): error loading journal > > ... and last night, I just got more ext4fs errors on the NVMe, without > any unclean power cycles: > > [73729.556544] EXT4-fs error (device nvme0n1p2): ext4_lookup:1700: inode #917524: comm rm: iget: checksum invalid > [73729.565354] Aborting journal on device nvme0n1p2-8. > [73729.568995] EXT4-fs (nvme0n1p2): Remounting filesystem read-only > [73729.569077] EXT4-fs error (device nvme0n1p2): ext4_journal_check_start:61: Detected aborted journal > [73729.573741] EXT4-fs error (device nvme0n1p2): ext4_lookup:1700: inode #917524: comm rm: iget: checksum invalid > [73729.593330] EXT4-fs error (device nvme0n1p2): ext4_lookup:1700: inode #917524: comm mv: iget: checksum invalid > > The affected file is /var/backups/dpkg.status.6.gz > > It was cleanly shut down and powered off on the 22nd February, booted > yesterday morning followed by another reboot a few minutes later. > > What worries me is the fact that corruption has happened - and if that > happens to a file rather than an inode, it will likely go unnoticed > for a considerably longer time. > > I think I'm getting to the point of deciding NVMe or the LX2160A to be > just too unreliable for serious use. I hadn't noticed any issues when > using the rootfs on the eMMC, so it suggests either the NVMe is > unreliable, or there's a problem with PCIe on this platform (which we > kind of know about with Jon's GPU rendering issues.) Adding Ted and Andreas... 
Here's the debugfs -n "id" output for dpkg.status.5.gz (which is fine, and probably a similar size):

debugfs: id <917527>
0000  a481 0000 30ff 0300 bd8e 475e bd77 4f5e  ....0.....G^.wO^
0020  29ca 345e 0000 0000 0000 0100 0002 0000  ).4^............
0040  0000 0800 0100 0000 0af3 0100 0400 0000  ................
0060  0000 0000 0000 0000 4000 0000 8087 3800  ........@.....8.
0100  0000 0000 0000 0000 0000 0000 0000 0000  ................
*
0140  0000 0000 c40b 4c0a 0000 0000 0000 0000  ......L.........
0160  0000 0000 0000 0000 0000 0000 3884 0000  ............8...
0200  2000 95f2 44b8 bdc9 a4d2 9883 c861 dc92   ...D........a..
0220  bd31 4a5e ecc5 260c 0000 0000 0000 0000  .1J^..&.........
0240  0000 0000 0000 0000 0000 0000 0000 0000  ................
*

and for the affected inode:

debugfs: id <917524>
0000  a481 0000 30ff 0300 3d3d 465e bd77 4f5e  ....0...==F^.wO^
0020  29ca 345e 0000 0000 0000 0100 0002 0000  ).4^............
0040  0000 0800 0100 0000 0af3 0100 0400 0000  ................
0060  0000 0000 0000 0000 4000 0000 c088 3800  ........@.....8.
0100  0000 0000 0000 0000 0000 0000 0000 0000  ................
*
0140  0000 0000 5fc4 cfb4 0000 0000 0000 0000  ...._...........
0160  0000 0000 0000 0000 0000 0000 af23 0000  .............#..
0200  2000 1cc3 ac95 c9c8 a4d2 9883 583e addf   ..........X>..
0220  3de0 485e b04d 7151 0000 0000 0000 0000  =.H^.MqQ........
0240  0000 0000 0000 0000 0000 0000 0000 0000  ................
*

and "stat" output:

debugfs: stat <917527>
Inode: 917527   Type: regular   Mode: 0644   Flags: 0x80000
Generation: 172755908   Version: 0x00000000:00000001
User: 0   Group: 0   Project: 0   Size: 261936
File ACL: 0
Links: 1   Blockcount: 512
Fragment: Address: 0   Number: 0   Size: 0
 ctime: 0x5e4f77bd:c9bdb844 -- Fri Feb 21 06:25:01 2020
 atime: 0x5e478ebd:92dc61c8 -- Sat Feb 15 06:25:01 2020
 mtime: 0x5e34ca29:8398d2a4 -- Sat Feb  1 00:45:29 2020
crtime: 0x5e4a31bd:0c26c5ec -- Mon Feb 17 06:25:01 2020
Size of extra inode fields: 32
Inode checksum: 0xf2958438
EXTENTS:
(0-63):3704704-3704767

debugfs: stat <917524>
Inode: 917524   Type: regular   Mode: 0644   Flags: 0x80000
Generation: 3033515103   Version: 0x00000000:00000001
User: 0   Group: 0   Project: 0   Size: 261936
File ACL: 0
Links: 1   Blockcount: 512
Fragment: Address: 0   Number: 0   Size: 0
 ctime: 0x5e4f77bd:c8c995ac -- Fri Feb 21 06:25:01 2020
 atime: 0x5e463d3d:dfad3e58 -- Fri Feb 14 06:25:01 2020
 mtime: 0x5e34ca29:8398d2a4 -- Sat Feb  1 00:45:29 2020
crtime: 0x5e48e03d:51714db0 -- Sun Feb 16 06:25:01 2020
Size of extra inode fields: 32
Inode checksum: 0xc31c23af
EXTENTS:
(0-63):3705024-3705087

When using sif (set_inode_info) to re-set the UID to 0 on this (to provoke the checksum to be updated):

debugfs: id <917524>
0000  a481 0000 30ff 0300 3d3d 465e bd77 4f5e  ....0...==F^.wO^
0020  29ca 345e 0000 0000 0000 0100 0002 0000  ).4^............
0040  0000 0800 0100 0000 0af3 0100 0400 0000  ................
0060  0000 0000 0000 0000 4000 0000 c088 3800  ........@.....8.
0100  0000 0000 0000 0000 0000 0000 0000 0000  ................
*
0140  0000 0000 5fc4 cfb4 0000 0000 0000 0000  ...._...........
0160  0000 0000 0000 0000 0000 0000 b61f 0000  ................
                                    ^^^^
0200  2000 aa15 ac95 c9c8 a4d2 9883 583e addf   ..........X>..
            ^^^^
0220  3de0 485e b04d 7151 0000 0000 0000 0000  =.H^.MqQ........
0240  0000 0000 0000 0000 0000 0000 0000 0000  ................
*

The values with "^^^^" are the checksum, which are the only values that have changed here - the checksum is now 0x15aa1fb6 rather than 0xc31c23af.
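For reference, those two words sit exactly where ext4's on-disk inode splits its checksum; the field names below are from fs/ext4/ext4.h of this era, and note that the dump offsets above are octal, so rows 0160 and 0200 start at bytes 0x70 and 0x80:

  0x7c  __le16 l_i_checksum_lo   /* "b61f" little-endian -> 0x1fb6 */
  0x82  __le16 i_checksum_hi     /* "aa15" little-endian -> 0x15aa */

Concatenated hi:lo they give the 0x15aa1fb6 (previously 0xc31c23af) that debugfs reports, so the raw dump and the stat output are consistent with each other.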
With that changed, running e2fsck -n on the filesystem results in a pass:

root@cex7:~# e2fsck -n /dev/nvme0n1p2
e2fsck 1.44.5 (15-Dec-2018)
Warning: skipping journal recovery because doing a read-only filesystem check.
/dev/nvme0n1p2 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/nvme0n1p2: 121163/2097152 files (0.1% non-contiguous), 1349227/8388608 blocks

and the file now appears to be intact (being a gzip file, gzip verifies that the contents are now as it expects.)

So, it looks like the _only_ issue is that the checksum on the inode became invalid, which seems to suggest that it *isn't* an NVMe or PCIe issue.

I wonder whether the journal would contain anything useful, but I don't know how to use debugfs to find that out - while I can dump the journal, I'd need to know which block contains the inode, and then work out where in the journal that block was going to be written. If that would help, let me know ASAP as I'll hold off rebooting the platform for a while (which means the filesystem will remain as-is - and yes, I have the debugfs file for e2undo to put stuff back.) Maybe it's possible to pull the block number out of the e2undo file?

tune2fs says:

Checksum type: crc32c
Checksum: 0x682f91b9

I guess this is what is used to checksum the inodes? If so, it's using the kernel's crc32c-generic driver (according to /proc/crypto). Could it be a race condition, or some problem that's specific to the ARM64 kernel that's provoking this corruption?
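Since the question is whether crc32c itself could be miscomputed on arm64, one cross-check is a known-good software implementation. A minimal sketch in plain C of the bitwise Castagnoli CRC; note that ext4 additionally seeds the kernel's crc32c with the filesystem UUID, inode number and generation, so this demonstrates the primitive rather than being a drop-in inode-checksum verifier:

  #include <stdint.h>
  #include <stddef.h>
  #include <stdio.h>

  /* Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78 --
   * the same math as the kernel's crc32c-generic, just slow and simple. */
  static uint32_t crc32c(uint32_t crc, const unsigned char *p, size_t len)
  {
          while (len--) {
                  crc ^= *p++;
                  for (int k = 0; k < 8; k++)
                          crc = (crc >> 1) ^ (0x82F63B78 & -(crc & 1));
          }
          return crc;
  }

  int main(void)
  {
          /* Standard check: "123456789" with init ~0 and a final inversion
           * must come out as 0xE3069283 on every architecture. */
          uint32_t c = ~crc32c(~0u, (const unsigned char *)"123456789", 9);
          printf("%08x\n", c);    /* expect e3069283 */
          return 0;
  }

Comparing such a reference against the kernel's result for the same raw inode bytes (checksum fields zeroed) would at least localize the problem to either the CRC code or the data handed to it.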
On Sat, Feb 29, 2020 at 11:04:56AM +0000, Russell King - ARM Linux admin wrote:
> On Sat, Feb 29, 2020 at 09:55:50AM +0000, Russell King - ARM Linux admin wrote:
> > On Mon, Feb 10, 2020 at 03:22:57PM +0000, Russell King - ARM Linux admin wrote:
> > > On Mon, Feb 10, 2020 at 04:12:30PM +0100, Olof Johansson wrote:
> > > > On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@nxp.com> wrote:
> > > > >
> > > > > Hi Olof,
> > > > >
> > > > > Thanks a lot for your comments!
> > > > > And sorry for my delay respond!
> > > >
> > > > Actually, they apply with only minor conflicts on top of current -next.
> > > >
> > > > Bjorn, any chance we can get you to pick these up pretty soon? They
> > > > enable full use of a promising ARM developer system, the SolidRun
> > > > HoneyComb, and would be quite valuable for me and others to be able to
> > > > use with mainline or -next without any additional patches applied --
> > > > which this patchset achieves.
> > > >
> > > > I know there are pending revisions based on feedback. I'll leave it up
> > > > to you and others to determine if that can be done with incremental
> > > > patches on top, or if it should be fixed before the initial patchset
> > > > is applied. But all in all, it's holding up adaption by me and surely
> > > > others of a very interesting platform -- I'm looking to replace my
> > > > aging MacchiatoBin with one of these and would need PCIe/NVMe to work
> > > > before I do.
> > >
> > > If you're going to be using NVMe, make sure you use a power-fail safe
> > > version; I've already had one instance where ext4 failed to mount
> > > because of a corrupted journal using an XPG SX8200 after the Honeycomb
> > > Serror'd, and then I powered it down after a few hours before later
> > > booting it back up.
> > >
> > > EXT4-fs (nvme0n1p2): INFO: recovery required on readonly filesystem
> > > EXT4-fs (nvme0n1p2): write access will be enabled during recovery
> > > JBD2: journal transaction 80849 on nvme0n1p2-8 is corrupt.
> > > EXT4-fs (nvme0n1p2): error loading journal
> >
> > ... and last night, I just got more ext4fs errors on the NVMe, without
> > any unclean power cycles:
> >
> > [73729.556544] EXT4-fs error (device nvme0n1p2): ext4_lookup:1700: inode #917524: comm rm: iget: checksum invalid
> > [73729.565354] Aborting journal on device nvme0n1p2-8.
> > [73729.568995] EXT4-fs (nvme0n1p2): Remounting filesystem read-only
> > [73729.569077] EXT4-fs error (device nvme0n1p2): ext4_journal_check_start:61: Detected aborted journal
> > [73729.573741] EXT4-fs error (device nvme0n1p2): ext4_lookup:1700: inode #917524: comm rm: iget: checksum invalid
> > [73729.593330] EXT4-fs error (device nvme0n1p2): ext4_lookup:1700: inode #917524: comm mv: iget: checksum invalid
> >
> > The affected file is /var/backups/dpkg.status.6.gz
> >
> > It was cleanly shut down and powered off on the 22nd February, booted
> > yesterday morning followed by another reboot a few minutes later.
> >
> > What worries me is the fact that corruption has happened - and if that
> > happens to a file rather than an inode, it will likely go unnoticed
> > for a considerably longer time.
> >
> > I think I'm getting to the point of deciding NVMe or the LX2160A to be
> > just too unreliable for serious use.  I hadn't noticed any issues when
> > using the rootfs on the eMMC, so it suggests either the NVMe is
> > unreliable, or there's a problem with PCIe on this platform (which we
> > kind of know about with Jon's GPU rendering issues.)
>
> Adding Ted and Andreas...
>
> Here's the debugfs -n "id" output for dpkg.status.5.gz (which is fine,
> and probably a similar size):
>
> debugfs: id <917527>
> 0000 a481 0000 30ff 0300 bd8e 475e bd77 4f5e ....0.....G^.wO^
> 0020 29ca 345e 0000 0000 0000 0100 0002 0000 ).4^............
> 0040 0000 0800 0100 0000 0af3 0100 0400 0000 ................
> 0060 0000 0000 0000 0000 4000 0000 8087 3800 ........@.....8.
> 0100 0000 0000 0000 0000 0000 0000 0000 0000 ................
> *
> 0140 0000 0000 c40b 4c0a 0000 0000 0000 0000 ......L.........
> 0160 0000 0000 0000 0000 0000 0000 3884 0000 ............8...
> 0200 2000 95f2 44b8 bdc9 a4d2 9883 c861 dc92  ...D........a..
> 0220 bd31 4a5e ecc5 260c 0000 0000 0000 0000 .1J^..&.........
> 0240 0000 0000 0000 0000 0000 0000 0000 0000 ................
> *
>
> and for the affected inode:
> debugfs: id <917524>
> 0000 a481 0000 30ff 0300 3d3d 465e bd77 4f5e ....0...==F^.wO^
> 0020 29ca 345e 0000 0000 0000 0100 0002 0000 ).4^............
> 0040 0000 0800 0100 0000 0af3 0100 0400 0000 ................
> 0060 0000 0000 0000 0000 4000 0000 c088 3800 ........@.....8.
> 0100 0000 0000 0000 0000 0000 0000 0000 0000 ................
> *
> 0140 0000 0000 5fc4 cfb4 0000 0000 0000 0000 ...._...........
> 0160 0000 0000 0000 0000 0000 0000 af23 0000 .............#..
> 0200 2000 1cc3 ac95 c9c8 a4d2 9883 583e addf  ...........X>..
> 0220 3de0 485e b04d 7151 0000 0000 0000 0000 =.H^.MqQ........
> 0240 0000 0000 0000 0000 0000 0000 0000 0000 ................
> *
>
> and "stat" output:
> debugfs: stat <917527>
> Inode: 917527   Type: regular    Mode:  0644   Flags: 0x80000
> Generation: 172755908    Version: 0x00000000:00000001
> User:     0   Group:     0   Project:     0   Size: 261936
> File ACL: 0
> Links: 1   Blockcount: 512
> Fragment:  Address: 0    Number: 0    Size: 0
>  ctime: 0x5e4f77bd:c9bdb844 -- Fri Feb 21 06:25:01 2020
>  atime: 0x5e478ebd:92dc61c8 -- Sat Feb 15 06:25:01 2020
>  mtime: 0x5e34ca29:8398d2a4 -- Sat Feb  1 00:45:29 2020
> crtime: 0x5e4a31bd:0c26c5ec -- Mon Feb 17 06:25:01 2020
> Size of extra inode fields: 32
> Inode checksum: 0xf2958438
> EXTENTS:
> (0-63):3704704-3704767
> debugfs: stat <917524>
> Inode: 917524   Type: regular    Mode:  0644   Flags: 0x80000
> Generation: 3033515103    Version: 0x00000000:00000001
> User:     0   Group:     0   Project:     0   Size: 261936
> File ACL: 0
> Links: 1   Blockcount: 512
> Fragment:  Address: 0    Number: 0    Size: 0
>  ctime: 0x5e4f77bd:c8c995ac -- Fri Feb 21 06:25:01 2020
>  atime: 0x5e463d3d:dfad3e58 -- Fri Feb 14 06:25:01 2020
>  mtime: 0x5e34ca29:8398d2a4 -- Sat Feb  1 00:45:29 2020
> crtime: 0x5e48e03d:51714db0 -- Sun Feb 16 06:25:01 2020
> Size of extra inode fields: 32
> Inode checksum: 0xc31c23af
> EXTENTS:
> (0-63):3705024-3705087
>
> When using sif (set_inode_info) to re-set the UID to 0 on this (so
> provoke the checksum to be updated):
>
> debugfs: id <917524>
> 0000 a481 0000 30ff 0300 3d3d 465e bd77 4f5e ....0...==F^.wO^
> 0020 29ca 345e 0000 0000 0000 0100 0002 0000 ).4^............
> 0040 0000 0800 0100 0000 0af3 0100 0400 0000 ................
> 0060 0000 0000 0000 0000 4000 0000 c088 3800 ........@.....8.
> 0100 0000 0000 0000 0000 0000 0000 0000 0000 ................
> *
> 0140 0000 0000 5fc4 cfb4 0000 0000 0000 0000 ...._...........
> 0160 0000 0000 0000 0000 0000 0000 b61f 0000 ................
>                                    ^^^^
> 0200 2000 aa15 ac95 c9c8 a4d2 9883 583e addf  ...........X>..
>      ^^^^
> 0220 3de0 485e b04d 7151 0000 0000 0000 0000 =.H^.MqQ........
> 0240 0000 0000 0000 0000 0000 0000 0000 0000 ................
> *
>
> The values with "^^^^" are the checksum, which are the only values
> that have changed here - the checksum is now 0x15aa1fb6 rather than
> 0xc31c23af.
>
> With that changed, running e2fsck -n on the filesystem results in a
> pass:
>
> root@cex7:~# e2fsck -n /dev/nvme0n1p2
> e2fsck 1.44.5 (15-Dec-2018)
> Warning: skipping journal recovery because doing a read-only filesystem check.
> /dev/nvme0n1p2 contains a file system with errors, check forced.
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> /dev/nvme0n1p2: 121163/2097152 files (0.1% non-contiguous), 1349227/8388608 blocks
>
> and the file now appears to be intact (being a gzip file, gzip verifies
> that the contents are now as it expects.)
>
> So, it looks like the _only_ issue is that the checksum on the inode
> became invalid, which seems to suggest that it *isn't* a NVMe nor PCIe
> issue.
>
> I wonder whether the journal would contain anything useful, but I don't
> know how to use debugfs to find that out - while I can dump the journal,
> I'd need to know which block contains the inode, and then work out where
> in the journal that block was going to be written.  If that would help,
> let me know ASAP as I'll hold off rebooting the platform for a while
> (which means the filesystem will remain as-is - and yes, I have the
> debugfs file for e2undo to put stuff back.)  Maybe it's possible to pull
> the block number out of the e2undo file?

Okay, the inode was stored in block 3670049, and the journal appears
to contain no entries for that block.

> tune2fs says:
>
> Checksum type:            crc32c
> Checksum:                 0x682f91b9
>
> I guess this is what is used to checksum the inodes?  If so, it's using
> the kernel's crc32c-generic driver (according to /proc/crypto).
>
> Could it be a race condition, or some problem that's specific to the
> ARM64 kernel that's provoking this corruption?

Something else occurs to me:

root@cex7:~# ls -li --time=ctime --full-time /var/backups/dpkg.status*
917622 -rw-r--r-- 1 root root 999052 2020-02-29 06:25:01.852231277 +0000 /var/backups/dpkg.status
917583 -rw-r--r-- 1 root root 999052 2020-02-21 06:25:01.958160960 +0000 /var/backups/dpkg.status.0
917520 -rw-r--r-- 1 root root 261936 2020-02-21 06:25:01.954161050 +0000 /var/backups/dpkg.status.1.gz
917531 -rw-r--r-- 1 root root 261936 2020-02-21 06:25:01.854163293 +0000 /var/backups/dpkg.status.2.gz
917532 -rw-r--r-- 1 root root 261936 2020-02-21 06:25:01.850163383 +0000 /var/backups/dpkg.status.3.gz
917509 -rw-r--r-- 1 root root 261936 2020-02-21 06:25:01.850163383 +0000 /var/backups/dpkg.status.4.gz
917527 -rw-r--r-- 1 root root 261936 2020-02-21 06:25:01.846163473 +0000 /var/backups/dpkg.status.5.gz
917524 -rw-r--r-- 1 root root 261936 2020-02-21 06:25:01.842163563 +0000 /var/backups/dpkg.status.6.gz

So the last time that the kernel changed inode 917524 was on the 21st
of February, probably when it was last renamed by logrotate, like
several other files stored in the same inode block.  Yet, _only_ the
checksum for 917524 was corrupted, the rest were fine.
I would guess that logrotate behaves as follows:
- remove /var/backups/dpkg.status.6.gz
- rename /var/backups/dpkg.status.5.gz to /var/backups/dpkg.status.6.gz
- repeat for other dpkg.status.*.gz files
- gzip /var/backups/dpkg.status.0 to /var/backups/dpkg.status.1.gz
- rename /var/backups/dpkg.status to /var/backups/dpkg.status.0
- create new /var/backups/dpkg.status

Looking at the inode block in the e2undo file, inode 917524 is at
offset 0x300 into the block, which means the first inode in the block
is 917521 and the last is 917536, so several of the dpkg.status.*
files are stored in this inode block.

That would've meant that the inode for /var/backups/dpkg.status.6.gz
would have been updated just before the inode for
/var/backups/dpkg.status.5.gz.  I wonder if the inode block was
written out somehow out of order, with the ctime for
/var/backups/dpkg.status.6.gz having been updated but not the checksum
as a result of the later changes - maybe as a result of having
executed on a different CPU?  That would suggest a weakness in the
ARM64 locking implementation, coherency issues, or interconnect issues.
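The offset arithmetic above can be sanity-checked with a throwaway C
snippet, assuming 256-byte inodes in 4k blocks (16 per block) and
ext4's 1-based inode numbering - both consistent with the dumps above:

#include <stdio.h>

int main(void)
{
	unsigned int inum = 917524, isize = 256, bsize = 4096;
	unsigned int per_block = bsize / isize;		/* 16 */
	unsigned int idx = (inum - 1) % per_block;	/* 3 */

	printf("offset into block: 0x%x\n", idx * isize);	 /* 0x300 */
	printf("first inode in block: %u\n", inum - idx);	 /* 917521 */
	printf("last inode in block: %u\n",
	       inum - idx + per_block - 1);			 /* 917536 */
	return 0;
}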
On Sat, Feb 29, 2020 at 12:08:28PM +0000, Russell King - ARM Linux admin wrote:
> On Sat, Feb 29, 2020 at 11:04:56AM +0000, Russell King - ARM Linux admin wrote:
> > On Sat, Feb 29, 2020 at 09:55:50AM +0000, Russell King - ARM Linux admin wrote:
> > > On Mon, Feb 10, 2020 at 03:22:57PM +0000, Russell King - ARM Linux admin wrote:
> > > > On Mon, Feb 10, 2020 at 04:12:30PM +0100, Olof Johansson wrote:
> > > > > On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@nxp.com> wrote:
> > > > > >
> > > > > > Hi Olof,
> > > > > >
> > > > > > Thanks a lot for your comments!
> > > > > > And sorry for my delay respond!
> > > > >
> > > > > Actually, they apply with only minor conflicts on top of current -next.
> > > > >
> > > > > Bjorn, any chance we can get you to pick these up pretty soon? They
> > > > > enable full use of a promising ARM developer system, the SolidRun
> > > > > HoneyComb, and would be quite valuable for me and others to be able to
> > > > > use with mainline or -next without any additional patches applied --
> > > > > which this patchset achieves.
> > > > >
> > > > > I know there are pending revisions based on feedback. I'll leave it up
> > > > > to you and others to determine if that can be done with incremental
> > > > > patches on top, or if it should be fixed before the initial patchset
> > > > > is applied. But all in all, it's holding up adaption by me and surely
> > > > > others of a very interesting platform -- I'm looking to replace my
> > > > > aging MacchiatoBin with one of these and would need PCIe/NVMe to work
> > > > > before I do.
> > > >
> > > > If you're going to be using NVMe, make sure you use a power-fail safe
> > > > version; I've already had one instance where ext4 failed to mount
> > > > because of a corrupted journal using an XPG SX8200 after the Honeycomb
> > > > Serror'd, and then I powered it down after a few hours before later
> > > > booting it back up.
> > > >
> > > > EXT4-fs (nvme0n1p2): INFO: recovery required on readonly filesystem
> > > > EXT4-fs (nvme0n1p2): write access will be enabled during recovery
> > > > JBD2: journal transaction 80849 on nvme0n1p2-8 is corrupt.
> > > > EXT4-fs (nvme0n1p2): error loading journal
> > >
> > > ... and last night, I just got more ext4fs errors on the NVMe, without
> > > any unclean power cycles:
> > >
> > > [73729.556544] EXT4-fs error (device nvme0n1p2): ext4_lookup:1700: inode #917524: comm rm: iget: checksum invalid
> > > [73729.565354] Aborting journal on device nvme0n1p2-8.
> > > [73729.568995] EXT4-fs (nvme0n1p2): Remounting filesystem read-only
> > > [73729.569077] EXT4-fs error (device nvme0n1p2): ext4_journal_check_start:61: Detected aborted journal
> > > [73729.573741] EXT4-fs error (device nvme0n1p2): ext4_lookup:1700: inode #917524: comm rm: iget: checksum invalid
> > > [73729.593330] EXT4-fs error (device nvme0n1p2): ext4_lookup:1700: inode #917524: comm mv: iget: checksum invalid
> > >
> > > The affected file is /var/backups/dpkg.status.6.gz
> > >
> > > It was cleanly shut down and powered off on the 22nd February, booted
> > > yesterday morning followed by another reboot a few minutes later.
> > >
> > > What worries me is the fact that corruption has happened - and if that
> > > happens to a file rather than an inode, it will likely go unnoticed
> > > for a considerably longer time.
> > >
> > > I think I'm getting to the point of deciding NVMe or the LX2160A to be
> > > just too unreliable for serious use.
> > > I hadn't noticed any issues when
> > > using the rootfs on the eMMC, so it suggests either the NVMe is
> > > unreliable, or there's a problem with PCIe on this platform (which we
> > > kind of know about with Jon's GPU rendering issues.)
> >
> > Adding Ted and Andreas...
> >
> > Here's the debugfs -n "id" output for dpkg.status.5.gz (which is fine,
> > and probably a similar size):
> >
> > debugfs: id <917527>
> > 0000 a481 0000 30ff 0300 bd8e 475e bd77 4f5e ....0.....G^.wO^
> > 0020 29ca 345e 0000 0000 0000 0100 0002 0000 ).4^............
> > 0040 0000 0800 0100 0000 0af3 0100 0400 0000 ................
> > 0060 0000 0000 0000 0000 4000 0000 8087 3800 ........@.....8.
> > 0100 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > *
> > 0140 0000 0000 c40b 4c0a 0000 0000 0000 0000 ......L.........
> > 0160 0000 0000 0000 0000 0000 0000 3884 0000 ............8...
> > 0200 2000 95f2 44b8 bdc9 a4d2 9883 c861 dc92  ...D........a..
> > 0220 bd31 4a5e ecc5 260c 0000 0000 0000 0000 .1J^..&.........
> > 0240 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > *
> >
> > and for the affected inode:
> > debugfs: id <917524>
> > 0000 a481 0000 30ff 0300 3d3d 465e bd77 4f5e ....0...==F^.wO^
> > 0020 29ca 345e 0000 0000 0000 0100 0002 0000 ).4^............
> > 0040 0000 0800 0100 0000 0af3 0100 0400 0000 ................
> > 0060 0000 0000 0000 0000 4000 0000 c088 3800 ........@.....8.
> > 0100 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > *
> > 0140 0000 0000 5fc4 cfb4 0000 0000 0000 0000 ...._...........
> > 0160 0000 0000 0000 0000 0000 0000 af23 0000 .............#..
> > 0200 2000 1cc3 ac95 c9c8 a4d2 9883 583e addf  ...........X>..
> > 0220 3de0 485e b04d 7151 0000 0000 0000 0000 =.H^.MqQ........
> > 0240 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > *
> >
> > and "stat" output:
> > debugfs: stat <917527>
> > Inode: 917527   Type: regular    Mode:  0644   Flags: 0x80000
> > Generation: 172755908    Version: 0x00000000:00000001
> > User:     0   Group:     0   Project:     0   Size: 261936
> > File ACL: 0
> > Links: 1   Blockcount: 512
> > Fragment:  Address: 0    Number: 0    Size: 0
> >  ctime: 0x5e4f77bd:c9bdb844 -- Fri Feb 21 06:25:01 2020
> >  atime: 0x5e478ebd:92dc61c8 -- Sat Feb 15 06:25:01 2020
> >  mtime: 0x5e34ca29:8398d2a4 -- Sat Feb  1 00:45:29 2020
> > crtime: 0x5e4a31bd:0c26c5ec -- Mon Feb 17 06:25:01 2020
> > Size of extra inode fields: 32
> > Inode checksum: 0xf2958438
> > EXTENTS:
> > (0-63):3704704-3704767
> > debugfs: stat <917524>
> > Inode: 917524   Type: regular    Mode:  0644   Flags: 0x80000
> > Generation: 3033515103    Version: 0x00000000:00000001
> > User:     0   Group:     0   Project:     0   Size: 261936
> > File ACL: 0
> > Links: 1   Blockcount: 512
> > Fragment:  Address: 0    Number: 0    Size: 0
> >  ctime: 0x5e4f77bd:c8c995ac -- Fri Feb 21 06:25:01 2020
> >  atime: 0x5e463d3d:dfad3e58 -- Fri Feb 14 06:25:01 2020
> >  mtime: 0x5e34ca29:8398d2a4 -- Sat Feb  1 00:45:29 2020
> > crtime: 0x5e48e03d:51714db0 -- Sun Feb 16 06:25:01 2020
> > Size of extra inode fields: 32
> > Inode checksum: 0xc31c23af
> > EXTENTS:
> > (0-63):3705024-3705087
> >
> > When using sif (set_inode_info) to re-set the UID to 0 on this (so
> > provoke the checksum to be updated):
> >
> > debugfs: id <917524>
> > 0000 a481 0000 30ff 0300 3d3d 465e bd77 4f5e ....0...==F^.wO^
> > 0020 29ca 345e 0000 0000 0000 0100 0002 0000 ).4^............
> > 0040 0000 0800 0100 0000 0af3 0100 0400 0000 ................
> > 0060 0000 0000 0000 0000 4000 0000 c088 3800 ........@.....8.
> > 0100 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > *
> > 0140 0000 0000 5fc4 cfb4 0000 0000 0000 0000 ...._...........
> > 0160 0000 0000 0000 0000 0000 0000 b61f 0000 ................
> >                                    ^^^^
> > 0200 2000 aa15 ac95 c9c8 a4d2 9883 583e addf  ...........X>..
> >      ^^^^
> > 0220 3de0 485e b04d 7151 0000 0000 0000 0000 =.H^.MqQ........
> > 0240 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > *
> >
> > The values with "^^^^" are the checksum, which are the only values
> > that have changed here - the checksum is now 0x15aa1fb6 rather than
> > 0xc31c23af.
> >
> > With that changed, running e2fsck -n on the filesystem results in a
> > pass:
> >
> > root@cex7:~# e2fsck -n /dev/nvme0n1p2
> > e2fsck 1.44.5 (15-Dec-2018)
> > Warning: skipping journal recovery because doing a read-only filesystem check.
> > /dev/nvme0n1p2 contains a file system with errors, check forced.
> > Pass 1: Checking inodes, blocks, and sizes
> > Pass 2: Checking directory structure
> > Pass 3: Checking directory connectivity
> > Pass 4: Checking reference counts
> > Pass 5: Checking group summary information
> > /dev/nvme0n1p2: 121163/2097152 files (0.1% non-contiguous), 1349227/8388608 blocks
> >
> > and the file now appears to be intact (being a gzip file, gzip verifies
> > that the contents are now as it expects.)
> >
> > So, it looks like the _only_ issue is that the checksum on the inode
> > became invalid, which seems to suggest that it *isn't* a NVMe nor PCIe
> > issue.
> >
> > I wonder whether the journal would contain anything useful, but I don't
> > know how to use debugfs to find that out - while I can dump the journal,
> > I'd need to know which block contains the inode, and then work out where
> > in the journal that block was going to be written.  If that would help,
> > let me know ASAP as I'll hold off rebooting the platform for a while
> > (which means the filesystem will remain as-is - and yes, I have the
> > debugfs file for e2undo to put stuff back.)  Maybe it's possible to pull
> > the block number out of the e2undo file?
>
> Okay, the inode was stored in block 3670049, and the journal appears
> to contain no entries for that block.
>
> > tune2fs says:
> >
> > Checksum type:            crc32c
> > Checksum:                 0x682f91b9
> >
> > I guess this is what is used to checksum the inodes?  If so, it's using
> > the kernel's crc32c-generic driver (according to /proc/crypto).
> >
> > Could it be a race condition, or some problem that's specific to the
> > ARM64 kernel that's provoking this corruption?
>
> Something else occurs to me:
>
> root@cex7:~# ls -li --time=ctime --full-time /var/backups/dpkg.status*
> 917622 -rw-r--r-- 1 root root 999052 2020-02-29 06:25:01.852231277 +0000 /var/backups/dpkg.status
> 917583 -rw-r--r-- 1 root root 999052 2020-02-21 06:25:01.958160960 +0000 /var/backups/dpkg.status.0
> 917520 -rw-r--r-- 1 root root 261936 2020-02-21 06:25:01.954161050 +0000 /var/backups/dpkg.status.1.gz
> 917531 -rw-r--r-- 1 root root 261936 2020-02-21 06:25:01.854163293 +0000 /var/backups/dpkg.status.2.gz
> 917532 -rw-r--r-- 1 root root 261936 2020-02-21 06:25:01.850163383 +0000 /var/backups/dpkg.status.3.gz
> 917509 -rw-r--r-- 1 root root 261936 2020-02-21 06:25:01.850163383 +0000 /var/backups/dpkg.status.4.gz
> 917527 -rw-r--r-- 1 root root 261936 2020-02-21 06:25:01.846163473 +0000 /var/backups/dpkg.status.5.gz
> 917524 -rw-r--r-- 1 root root 261936 2020-02-21 06:25:01.842163563 +0000 /var/backups/dpkg.status.6.gz
>
> So the last time that the kernel changed inode 917524 was on the 21st
> of February, probably when it was last renamed by logrotate, like
> several other files stored in the same inode block.  Yet, _only_ the
> checksum for 917524 was corrupted, the rest were fine.
>
> I would guess that logrotate behaves as follows:
> - remove /var/backups/dpkg.status.6.gz
> - rename /var/backups/dpkg.status.5.gz to /var/backups/dpkg.status.6.gz
> - repeat for other dpkg.status.*.gz files
> - gzip /var/backups/dpkg.status.0 to /var/backups/dpkg.status.1.gz
> - rename /var/backups/dpkg.status to /var/backups/dpkg.status.0
> - create new /var/backups/dpkg.status
>
> Looking at the inode block in the e2undo file, inode 917524 is at
> offset 0x300 into the block, which means the first inode in the block
> is 917521 and the last is 917536, so several of the dpkg.status.*
> files are stored in this inode block.
>
> That would've meant that the inode for /var/backups/dpkg.status.6.gz
> would have been updated just before the inode for
> /var/backups/dpkg.status.5.gz.  I wonder if the inode block was
> written out somehow out of order, with the ctime for
> /var/backups/dpkg.status.6.gz having been updated but not the checksum
> as a result of the later changes - maybe as a result of having
> executed on a different CPU?  That would suggest a weakness in the
> ARM64 locking implementation, coherency issues, or interconnect issues.
Looking at the errata configuration, I have:

# ARM errata workarounds via the alternatives framework
#
CONFIG_ARM64_WORKAROUND_CLEAN_CACHE=y
CONFIG_ARM64_ERRATUM_826319=y
CONFIG_ARM64_ERRATUM_827319=y
CONFIG_ARM64_ERRATUM_824069=y
CONFIG_ARM64_ERRATUM_819472=y
CONFIG_ARM64_ERRATUM_832075=y
CONFIG_ARM64_ERRATUM_834220=y
CONFIG_ARM64_ERRATUM_845719=y
CONFIG_ARM64_ERRATUM_843419=y
CONFIG_ARM64_ERRATUM_1024718=y
CONFIG_ARM64_ERRATUM_1418040=y
CONFIG_ARM64_ERRATUM_1165522=y
CONFIG_ARM64_ERRATUM_1286807=y
CONFIG_ARM64_ERRATUM_1319367=y
CONFIG_ARM64_ERRATUM_1463225=y
# CONFIG_ARM64_ERRATUM_1542419 is not set
# CONFIG_CAVIUM_ERRATUM_22375 is not set
# CONFIG_CAVIUM_ERRATUM_23154 is not set
# CONFIG_CAVIUM_ERRATUM_27456 is not set
# CONFIG_CAVIUM_ERRATUM_30115 is not set
# CONFIG_CAVIUM_TX2_ERRATUM_219 is not set
CONFIG_QCOM_FALKOR_ERRATUM_1003=y
CONFIG_ARM64_WORKAROUND_REPEAT_TLBI=y
CONFIG_QCOM_FALKOR_ERRATUM_1009=y
CONFIG_QCOM_QDF2400_ERRATUM_0065=y
# CONFIG_SOCIONEXT_SYNQUACER_PREITS is not set
# CONFIG_HISILICON_ERRATUM_161600802 is not set
CONFIG_QCOM_FALKOR_ERRATUM_E1041=y
# CONFIG_FUJITSU_ERRATUM_010001 is not set
# end of ARM errata workarounds via the alternatives framework

...

CONFIG_FSL_ERRATUM_A008585=y
CONFIG_HISILICON_ERRATUM_161010101=y
CONFIG_ARM64_ERRATUM_858921=y

so I don't think it's a missing errata kconfig setting, unless there's
an erratum that isn't in v5.5 that's necessary.
On Sat, Feb 29, 2020 at 11:04:56AM +0000, Russell King - ARM Linux admin wrote:
>
> Could it be a race condition, or some problem that's specific to the
> ARM64 kernel that's provoking this corruption?

Since I got brought in mid-way through this discussion, can someone
summarize the vital details of the bughunt?  What kernel version is
involved, and is this a regression?  If so, what's the last version of
the kernel where you didn't have a problem on this hardware?

Can you trigger this failure reliably?

Unfortunately, while I'm regularly running xfstests on x86_64 on a
Google Compute Engine VM, I'm not doing any runs on arm64.  I can
certainly build an arm64 one.

There's a test-appliance designed to be run on ARM64 here[1].

[1] https://kernel.org/pub/linux/kernel/people/tytso/kvm-xfstests/xfstests-amd64.tar.xz

which is a Debian chroot, designed to be run via android-xfstests[2], but
if you unpack it, it should be possible to enter the chroot and
trigger the xfstests run manually on any arm64 system.

[2] https://thunk.org/android-xfstests

Does anyone know if kernel CI is running xfstests regularly?

Cheers,

					- Ted
On Sat, Feb 29, 2020 at 10:19:07AM -0500, Theodore Y. Ts'o wrote:
> On Sat, Feb 29, 2020 at 11:04:56AM +0000, Russell King - ARM Linux admin wrote:
> > Could it be a race condition, or some problem that's specific to the
> > ARM64 kernel that's provoking this corruption?
>
> Since I got brought in mid-way through this discussion, can someone
> summarize the vital details of the bughunt?  What kernel version is
> involved, and is this a regression?  If so, what's the last version of
> the kernel where you didn't have a problem on this hardware?

It's a new platform, I've run most 5.x kernels on it, but only recently
have I had a NVMe.  Currently running a 5.5 based kernel (for which I
have to patch in support for the platform), and I've no idea if it is a
regression or not.

> Can you trigger this failure reliably?

No - the very first time I ended up with a corrupted ext4 fs was on the
8th February, and at that time it was put down to the NVMe not being
power-off safe: the machine had crashed sometime overnight, resulting
in a section of my network going offline (due to a pause frame storm).
So, I powered it down from crashed state - and from what people tell
me, NVMe _may_ keep blocks unwritten to safe media for a considerable
time.

I never bothered to investigate it because the explanation seemed
reasonable, and manually running e2fsck fixed the filesystem.

The system was then booted back into using the NVMe rootfs, and
continued to do so without apparent issue until the 21st Feb, when I
cleanly shut it down, and powered it off.  During the time it was
running, it likely saw many reboots of the 5.5 kernel.

I powered it back on yesterday morning, and this morning it found the
fs corruption while trying to do a logrotate.

As I say in my last email, I suspect it isn't an ext4 bug, but either a
locking implementation issue, coherency issue, or interconnect issue.
The 4k block with the affected inode looks perfectly reasonable with
the only exception that the checksum is incorrect for that one inode -
and other inodes stored in the same 4k block were modified afterwards.
It suggests to me that the writes to update the two 16-bit words
containing the checksum were somehow lost for this particular inode.

> Unfortunately, while I'm regularly running xfstests on x86_64 on a
> Google Compute Engine VM, I'm not doing any runs on arm64.  I can
> certainly build an arm64 one.
>
> There's a test-appliance designed to be run on ARM64 here[1].
>
> [1] https://kernel.org/pub/linux/kernel/people/tytso/kvm-xfstests/xfstests-amd64.tar.xz

The filename seems to say "amd64" not "arm64" ?

> which is a Debian chroot, designed to be run via android-xfstests[2], but
> if you unpack it, it should be possible to enter the chroot and
> trigger the xfstests run manually on any arm64 system.
>
> [2] https://thunk.org/android-xfstests
>
> Does anyone know if kernel CI is running xfstests regularly?

I don't know...
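On the "two 16-bit words" mentioned above: these are the split halves
of the inode's crc32c.  Per my reading of the ext4 disk layout - an
assumption, not verified against fs/ext4 - l_i_checksum_lo sits at
offset 0x7c of the 256-byte inode and i_checksum_hi at 0x82, which
matches the b61f/aa15 bytes marked "^^^^" in the dumps earlier in the
thread.  A tiny sketch joining them back together:

#include <assert.h>
#include <stdint.h>

/* The checksum is stored as two little-endian 16-bit halves: lo at
 * inode offset 0x7c, hi at 0x82 (offsets are assumptions based on the
 * disk layout docs). */
static uint32_t join_csum(uint16_t lo, uint16_t hi)
{
	return ((uint32_t)hi << 16) | lo;
}

int main(void)
{
	/* Bytes from the corrected dump: b6 1f at 0x7c, aa 15 at 0x82,
	 * which join to the reported checksum 0x15aa1fb6. */
	assert(join_csum(0x1fb6, 0x15aa) == 0x15aa1fb6);
	return 0;
}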
On Sat, Feb 29, 2020 at 05:03:28PM +0000, Russell King - ARM Linux admin wrote:
> > There's a test-appliance designed to be run on ARM64 here[1].
> >
> > [1] https://kernel.org/pub/linux/kernel/people/tytso/kvm-xfstests/xfstests-amd64.tar.xz
>
> The filename seems to say "amd64" not "arm64" ?

Sorry, I cut and pasted the wrong link: s/amd64/arm64/

If there are arm64-specific locking issues, we can probably flush them
out if we could figure out some way of running some of the stress tests
in xfstests.

I don't know a whole lot about arm-64 architectures; would running
xfstests on, say, an Amazon AWS arm-based VM be representative of your
new architecture?  Or are there a lot of sub-architecture differences
in the arm-64 world?

					- Ted
On Sat, Feb 29, 2020 at 11:04:56AM +0000, Russell King - ARM Linux admin wrote:
> Adding Ted and Andreas...
>
> Here's the debugfs -n "id" output for dpkg.status.5.gz (which is fine,
> and probably a similar size):
>
> debugfs: id <917527>
> 0000 a481 0000 30ff 0300 bd8e 475e bd77 4f5e ....0.....G^.wO^
> 0020 29ca 345e 0000 0000 0000 0100 0002 0000 ).4^............
> 0040 0000 0800 0100 0000 0af3 0100 0400 0000 ................
> 0060 0000 0000 0000 0000 4000 0000 8087 3800 ........@.....8.
> 0100 0000 0000 0000 0000 0000 0000 0000 0000 ................
> *
> 0140 0000 0000 c40b 4c0a 0000 0000 0000 0000 ......L.........
> 0160 0000 0000 0000 0000 0000 0000 3884 0000 ............8...
> 0200 2000 95f2 44b8 bdc9 a4d2 9883 c861 dc92  ...D........a..
> 0220 bd31 4a5e ecc5 260c 0000 0000 0000 0000 .1J^..&.........
> 0240 0000 0000 0000 0000 0000 0000 0000 0000 ................
> *
>
> and for the affected inode:
> debugfs: id <917524>
> 0000 a481 0000 30ff 0300 3d3d 465e bd77 4f5e ....0...==F^.wO^
> 0020 29ca 345e 0000 0000 0000 0100 0002 0000 ).4^............
> 0040 0000 0800 0100 0000 0af3 0100 0400 0000 ................
> 0060 0000 0000 0000 0000 4000 0000 c088 3800 ........@.....8.
> 0100 0000 0000 0000 0000 0000 0000 0000 0000 ................
> *
> 0140 0000 0000 5fc4 cfb4 0000 0000 0000 0000 ...._...........
> 0160 0000 0000 0000 0000 0000 0000 af23 0000 .............#..
> 0200 2000 1cc3 ac95 c9c8 a4d2 9883 583e addf  ...........X>..
> 0220 3de0 485e b04d 7151 0000 0000 0000 0000 =.H^.MqQ........
> 0240 0000 0000 0000 0000 0000 0000 0000 0000 ................
> *
>
> and "stat" output:
> debugfs: stat <917527>
> Inode: 917527   Type: regular    Mode:  0644   Flags: 0x80000
> Generation: 172755908    Version: 0x00000000:00000001
> User:     0   Group:     0   Project:     0   Size: 261936
> File ACL: 0
> Links: 1   Blockcount: 512
> Fragment:  Address: 0    Number: 0    Size: 0
>  ctime: 0x5e4f77bd:c9bdb844 -- Fri Feb 21 06:25:01 2020
>  atime: 0x5e478ebd:92dc61c8 -- Sat Feb 15 06:25:01 2020
>  mtime: 0x5e34ca29:8398d2a4 -- Sat Feb  1 00:45:29 2020
> crtime: 0x5e4a31bd:0c26c5ec -- Mon Feb 17 06:25:01 2020
> Size of extra inode fields: 32
> Inode checksum: 0xf2958438
> EXTENTS:
> (0-63):3704704-3704767
> debugfs: stat <917524>
> Inode: 917524   Type: regular    Mode:  0644   Flags: 0x80000
> Generation: 3033515103    Version: 0x00000000:00000001
> User:     0   Group:     0   Project:     0   Size: 261936
> File ACL: 0
> Links: 1   Blockcount: 512
> Fragment:  Address: 0    Number: 0    Size: 0
>  ctime: 0x5e4f77bd:c8c995ac -- Fri Feb 21 06:25:01 2020
>  atime: 0x5e463d3d:dfad3e58 -- Fri Feb 14 06:25:01 2020
>  mtime: 0x5e34ca29:8398d2a4 -- Sat Feb  1 00:45:29 2020
> crtime: 0x5e48e03d:51714db0 -- Sun Feb 16 06:25:01 2020
> Size of extra inode fields: 32
> Inode checksum: 0xc31c23af
> EXTENTS:
> (0-63):3705024-3705087
>
> When using sif (set_inode_info) to re-set the UID to 0 on this (so
> provoke the checksum to be updated):
>
> debugfs: id <917524>
> 0000 a481 0000 30ff 0300 3d3d 465e bd77 4f5e ....0...==F^.wO^
> 0020 29ca 345e 0000 0000 0000 0100 0002 0000 ).4^............
> 0040 0000 0800 0100 0000 0af3 0100 0400 0000 ................
> 0060 0000 0000 0000 0000 4000 0000 c088 3800 ........@.....8.
> 0100 0000 0000 0000 0000 0000 0000 0000 0000 ................
> *
> 0140 0000 0000 5fc4 cfb4 0000 0000 0000 0000 ...._...........
> 0160 0000 0000 0000 0000 0000 0000 b61f 0000 ................
>                                    ^^^^
> 0200 2000 aa15 ac95 c9c8 a4d2 9883 583e addf  ...........X>..
>      ^^^^
> 0220 3de0 485e b04d 7151 0000 0000 0000 0000 =.H^.MqQ........
> 0240 0000 0000 0000 0000 0000 0000 0000 0000 ................
> *
>
> The values with "^^^^" are the checksum, which are the only values
> that have changed here - the checksum is now 0x15aa1fb6 rather than
> 0xc31c23af.
>
> With that changed, running e2fsck -n on the filesystem results in a
> pass:
>
> root@cex7:~# e2fsck -n /dev/nvme0n1p2
> e2fsck 1.44.5 (15-Dec-2018)
> Warning: skipping journal recovery because doing a read-only filesystem check.
> /dev/nvme0n1p2 contains a file system with errors, check forced.
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> /dev/nvme0n1p2: 121163/2097152 files (0.1% non-contiguous), 1349227/8388608 blocks
>
> and the file now appears to be intact (being a gzip file, gzip verifies
> that the contents are now as it expects.)
>
> So, it looks like the _only_ issue is that the checksum on the inode
> became invalid, which seems to suggest that it *isn't* a NVMe nor PCIe
> issue.
>
> I wonder whether the journal would contain anything useful, but I don't
> know how to use debugfs to find that out - while I can dump the journal,
> I'd need to know which block contains the inode, and then work out where
> in the journal that block was going to be written.  If that would help,
> let me know ASAP as I'll hold off rebooting the platform for a while
> (which means the filesystem will remain as-is - and yes, I have the
> debugfs file for e2undo to put stuff back.)  Maybe it's possible to pull
> the block number out of the e2undo file?
>
> tune2fs says:
>
> Checksum type:            crc32c
> Checksum:                 0x682f91b9
>
> I guess this is what is used to checksum the inodes?  If so, it's using
> the kernel's crc32c-generic driver (according to /proc/crypto).
>
> Could it be a race condition, or some problem that's specific to the
> ARM64 kernel that's provoking this corruption?

Hi,

The corruption has returned this evening:

[25094.614718] EXT4-fs error (device nvme0n1p2): ext4_lookup:1707: inode #271688: comm mandb: iget: checksum invalid
[25094.623781] Aborting journal on device nvme0n1p2-8.
[25094.627419] EXT4-fs (nvme0n1p2): Remounting filesystem read-only
[25094.628206] EXT4-fs error (device nvme0n1p2): ext4_journal_check_start:83: Detected aborted journal

root@cex7:[~]:<506> debugfs /dev/nvme0n1p2
debugfs 1.44.5 (15-Dec-2018)
debugfs: id <271688>
0000 a481 0000 f108 0000 2518 fd5d 2518 fd5d ........%..]%..]
0020 9f49 715c 0000 0000 0000 0100 0800 0000 .Iq\............
0040 0000 0800 0100 0000 0af3 0100 0400 0000 ................
0060 0000 0000 0000 0000 0100 0000 ed19 1100 ................
0100 0000 0000 0000 0000 0000 0000 0000 0000 ................
*
0140 0000 0000 b42f 4f06 0000 0000 0000 0000 ...../O.........
0160 0000 0000 0000 0000 0000 0000 c9cf 0000 ................
0200 2000 8d83 086d bebf 0000 0000 086d bebf  ....m.......m..
0220 2518 fd5d 086d bebf 0000 0000 0000 0000 %..].m..........
0240 0000 0000 0000 0000 0000 0000 0000 0000 ................
*
debugfs: stat <271688>
Inode: 271688   Type: regular    Mode:  0644   Flags: 0x80000
Generation: 105852852    Version: 0x00000000:00000001
User:     0   Group:     0   Project:     0   Size: 2289
File ACL: 0
Links: 1   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x5dfd1825:bfbe6d08 -- Fri Dec 20 18:51:17 2019
 atime: 0x5dfd1825:bfbe6d08 -- Fri Dec 20 18:51:17 2019
 mtime: 0x5c71499f:00000000 -- Sat Feb 23 13:24:47 2019
crtime: 0x5dfd1825:bfbe6d08 -- Fri Dec 20 18:51:17 2019
Size of extra inode fields: 32
Inode checksum: 0x838dcfc9
EXTENTS:
(0):1120749
debugfs:
root@cex7:[~]:<509> e2fsck -n /dev/nvme0n1p2
e2fsck 1.44.5 (15-Dec-2018)
Warning: skipping journal recovery because doing a read-only filesystem check.
/dev/nvme0n1p2 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/nvme0n1p2: 147476/2097152 files (0.1% non-contiguous), 1542719/8388608 blocks

This time, the machine has not been powered down for a very long time,
although I've booted 5.7 (plus the additional patches including several
workarounds in the PCIe driver so my Mellanox card works) on it earlier
today.  I did notice that debian decided to run a fsck on the filesystem
at reboot, which is a little weird as it's ext4, and found nothing wrong.

Hmm, I just tried:

root@cex7:[~]:<514> hdparm -f /dev/nvme0n1p2
root@cex7:[~]:<515> hdparm -f /dev/nvme0n1
root@cex7:[~]:<517> e2fsck -n /dev/nvme0n1p2
e2fsck 1.44.5 (15-Dec-2018)
Warning: skipping journal recovery because doing a read-only filesystem check.
/dev/nvme0n1p2 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Entry 'mainlog.2.gz' in /var/log/exim4 (917613) has deleted/unused inode 922603.  Clear? no

Entry 'mainlog.2.gz' in /var/log/exim4 (917613) has an incorrect filetype (was 1, should be 0).
Fix? no

Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Unattached inode 920748
Connect to /lost+found? no

Pass 5: Checking group summary information
Block bitmap differences:  +(9259--9280) -3703011 -3703044 -3703053 +3736187 -3827722 -3830272 +3906363 +3911697 +3911699 +3911701 +3911703 +3913228
Fix? no

Free blocks count wrong for group #113 (12615, counted=12606).
Fix? no

Free blocks count wrong (6845889, counted=6845880).
Fix? no

Inode bitmap differences: Group 112 inode bitmap does not match checksum.
IGNORED.
Block bitmap differences: Group 113 block bitmap does not match checksum.
IGNORED.

/dev/nvme0n1p2: ********** WARNING: Filesystem still has errors **********

/dev/nvme0n1p2: 147476/2097152 files (0.1% non-contiguous), 1542719/8388608 blocks

which looks less good, and is likely to be e2fsck reading off the media
rather than using what was in the kernel cache.  However, still nothing
for the offending inode, whose raw data remains unchanged from what
I've quoted above from debugfs.

It /seems/ to be pointing at the data on the media changing, possibly
buggy firmware on the nvme (ADATA SX8200PNP) drive?  Or maybe
undiscovered bugs in the Mobiveil PCIe hardware corrupting transfers
to the nvme?

The problem is, this is rather undebuggable as it happens so rarely.
:(

I'm becoming very discouraged from touching nvme ever again by this,
as this is my first and only experience of that technology.
I'm considering getting some conventional SATA HDDs and junking nvme on the basis of it being an unreliable technology.
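For context on the hdparm -f step above: my understanding - an
assumption about hdparm's behaviour, not taken from its source - is
that it amounts to an fsync() plus the BLKFLSBUF ioctl on the device,
which drops the device's cached blocks so that the following e2fsck -n
reads from the media rather than the page cache.  Roughly:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mount.h>	/* BLKFLSBUF */
#include <unistd.h>

/* Needs CAP_SYS_ADMIN, like hdparm itself. */
int main(int argc, char **argv)
{
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <blockdev>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	fsync(fd);			/* flush dirty buffers for the device */
	if (ioctl(fd, BLKFLSBUF, 0))	/* then invalidate its cached blocks */
		perror("BLKFLSBUF");
	close(fd);
	return 0;
}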
On Sat, Jun 06, 2020 at 12:53:43AM +0100, Russell King - ARM Linux admin wrote:
> On Sat, Feb 29, 2020 at 11:04:56AM +0000, Russell King - ARM Linux admin wrote:
> > Adding Ted and Andreas...
> >
> > Here's the debugfs -n "id" output for dpkg.status.5.gz (which is fine,
> > and probably a similar size):
> >
> > debugfs: id <917527>
> > 0000 a481 0000 30ff 0300 bd8e 475e bd77 4f5e ....0.....G^.wO^
> > 0020 29ca 345e 0000 0000 0000 0100 0002 0000 ).4^............
> > 0040 0000 0800 0100 0000 0af3 0100 0400 0000 ................
> > 0060 0000 0000 0000 0000 4000 0000 8087 3800 ........@.....8.
> > 0100 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > *
> > 0140 0000 0000 c40b 4c0a 0000 0000 0000 0000 ......L.........
> > 0160 0000 0000 0000 0000 0000 0000 3884 0000 ............8...
> > 0200 2000 95f2 44b8 bdc9 a4d2 9883 c861 dc92  ...D........a..
> > 0220 bd31 4a5e ecc5 260c 0000 0000 0000 0000 .1J^..&.........
> > 0240 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > *
> >
> > and for the affected inode:
> > debugfs: id <917524>
> > 0000 a481 0000 30ff 0300 3d3d 465e bd77 4f5e ....0...==F^.wO^
> > 0020 29ca 345e 0000 0000 0000 0100 0002 0000 ).4^............
> > 0040 0000 0800 0100 0000 0af3 0100 0400 0000 ................
> > 0060 0000 0000 0000 0000 4000 0000 c088 3800 ........@.....8.
> > 0100 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > *
> > 0140 0000 0000 5fc4 cfb4 0000 0000 0000 0000 ...._...........
> > 0160 0000 0000 0000 0000 0000 0000 af23 0000 .............#..
> > 0200 2000 1cc3 ac95 c9c8 a4d2 9883 583e addf  ...........X>..
> > 0220 3de0 485e b04d 7151 0000 0000 0000 0000 =.H^.MqQ........
> > 0240 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > *
> >
> > and "stat" output:
> > debugfs: stat <917527>
> > Inode: 917527   Type: regular    Mode:  0644   Flags: 0x80000
> > Generation: 172755908    Version: 0x00000000:00000001
> > User:     0   Group:     0   Project:     0   Size: 261936
> > File ACL: 0
> > Links: 1   Blockcount: 512
> > Fragment:  Address: 0    Number: 0    Size: 0
> >  ctime: 0x5e4f77bd:c9bdb844 -- Fri Feb 21 06:25:01 2020
> >  atime: 0x5e478ebd:92dc61c8 -- Sat Feb 15 06:25:01 2020
> >  mtime: 0x5e34ca29:8398d2a4 -- Sat Feb  1 00:45:29 2020
> > crtime: 0x5e4a31bd:0c26c5ec -- Mon Feb 17 06:25:01 2020
> > Size of extra inode fields: 32
> > Inode checksum: 0xf2958438
> > EXTENTS:
> > (0-63):3704704-3704767
> > debugfs: stat <917524>
> > Inode: 917524   Type: regular    Mode:  0644   Flags: 0x80000
> > Generation: 3033515103    Version: 0x00000000:00000001
> > User:     0   Group:     0   Project:     0   Size: 261936
> > File ACL: 0
> > Links: 1   Blockcount: 512
> > Fragment:  Address: 0    Number: 0    Size: 0
> >  ctime: 0x5e4f77bd:c8c995ac -- Fri Feb 21 06:25:01 2020
> >  atime: 0x5e463d3d:dfad3e58 -- Fri Feb 14 06:25:01 2020
> >  mtime: 0x5e34ca29:8398d2a4 -- Sat Feb  1 00:45:29 2020
> > crtime: 0x5e48e03d:51714db0 -- Sun Feb 16 06:25:01 2020
> > Size of extra inode fields: 32
> > Inode checksum: 0xc31c23af
> > EXTENTS:
> > (0-63):3705024-3705087
> >
> > When using sif (set_inode_info) to re-set the UID to 0 on this (so
> > provoke the checksum to be updated):
> >
> > debugfs: id <917524>
> > 0000 a481 0000 30ff 0300 3d3d 465e bd77 4f5e ....0...==F^.wO^
> > 0020 29ca 345e 0000 0000 0000 0100 0002 0000 ).4^............
> > 0040 0000 0800 0100 0000 0af3 0100 0400 0000 ................
> > 0060 0000 0000 0000 0000 4000 0000 c088 3800 ........@.....8.
> > 0100 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > *
> > 0140 0000 0000 5fc4 cfb4 0000 0000 0000 0000 ...._...........
> > 0160 0000 0000 0000 0000 0000 0000 b61f 0000 ................
> >                                    ^^^^
> > 0200 2000 aa15 ac95 c9c8 a4d2 9883 583e addf  ...........X>..
> >      ^^^^
> > 0220 3de0 485e b04d 7151 0000 0000 0000 0000 =.H^.MqQ........
> > 0240 0000 0000 0000 0000 0000 0000 0000 0000 ................
> > *
> >
> > The values with "^^^^" are the checksum, which are the only values
> > that have changed here - the checksum is now 0x15aa1fb6 rather than
> > 0xc31c23af.
> >
> > With that changed, running e2fsck -n on the filesystem results in a
> > pass:
> >
> > root@cex7:~# e2fsck -n /dev/nvme0n1p2
> > e2fsck 1.44.5 (15-Dec-2018)
> > Warning: skipping journal recovery because doing a read-only filesystem check.
> > /dev/nvme0n1p2 contains a file system with errors, check forced.
> > Pass 1: Checking inodes, blocks, and sizes
> > Pass 2: Checking directory structure
> > Pass 3: Checking directory connectivity
> > Pass 4: Checking reference counts
> > Pass 5: Checking group summary information
> > /dev/nvme0n1p2: 121163/2097152 files (0.1% non-contiguous), 1349227/8388608 blocks
> >
> > and the file now appears to be intact (being a gzip file, gzip verifies
> > that the contents are now as it expects.)
> >
> > So, it looks like the _only_ issue is that the checksum on the inode
> > became invalid, which seems to suggest that it *isn't* a NVMe nor PCIe
> > issue.
> >
> > I wonder whether the journal would contain anything useful, but I don't
> > know how to use debugfs to find that out - while I can dump the journal,
> > I'd need to know which block contains the inode, and then work out where
> > in the journal that block was going to be written.  If that would help,
> > let me know ASAP as I'll hold off rebooting the platform for a while
> > (which means the filesystem will remain as-is - and yes, I have the
> > debugfs file for e2undo to put stuff back.)  Maybe it's possible to pull
> > the block number out of the e2undo file?
> >
> > tune2fs says:
> >
> > Checksum type:            crc32c
> > Checksum:                 0x682f91b9
> >
> > I guess this is what is used to checksum the inodes?  If so, it's using
> > the kernel's crc32c-generic driver (according to /proc/crypto).
> >
> > Could it be a race condition, or some problem that's specific to the
> > ARM64 kernel that's provoking this corruption?
>
> Hi,
>
> The corruption has returned this evening:
>
> [25094.614718] EXT4-fs error (device nvme0n1p2): ext4_lookup:1707: inode #271688: comm mandb: iget: checksum invalid
> [25094.623781] Aborting journal on device nvme0n1p2-8.
> [25094.627419] EXT4-fs (nvme0n1p2): Remounting filesystem read-only
> [25094.628206] EXT4-fs error (device nvme0n1p2):
> ext4_journal_check_start:83: Detected aborted journal
> root@cex7:[~]:<506> debugfs /dev/nvme0n1p2
> debugfs 1.44.5 (15-Dec-2018)
> debugfs: id <271688>
> 0000 a481 0000 f108 0000 2518 fd5d 2518 fd5d ........%..]%..]
> 0020 9f49 715c 0000 0000 0000 0100 0800 0000 .Iq\............
> 0040 0000 0800 0100 0000 0af3 0100 0400 0000 ................
> 0060 0000 0000 0000 0000 0100 0000 ed19 1100 ................
> 0100 0000 0000 0000 0000 0000 0000 0000 0000 ................
> *
> 0140 0000 0000 b42f 4f06 0000 0000 0000 0000 ...../O.........
> 0160 0000 0000 0000 0000 0000 0000 c9cf 0000 ................
> 0200 2000 8d83 086d bebf 0000 0000 086d bebf  ....m.......m..
> 0220 2518 fd5d 086d bebf 0000 0000 0000 0000 %..].m..........
> 0240 0000 0000 0000 0000 0000 0000 0000 0000 ................
> *
>
> debugfs: stat <271688>
> Inode: 271688   Type: regular    Mode:  0644   Flags: 0x80000
> Generation: 105852852    Version: 0x00000000:00000001
> User:     0   Group:     0   Project:     0   Size: 2289
> File ACL: 0
> Links: 1   Blockcount: 8
> Fragment:  Address: 0    Number: 0    Size: 0
>  ctime: 0x5dfd1825:bfbe6d08 -- Fri Dec 20 18:51:17 2019
>  atime: 0x5dfd1825:bfbe6d08 -- Fri Dec 20 18:51:17 2019
>  mtime: 0x5c71499f:00000000 -- Sat Feb 23 13:24:47 2019
> crtime: 0x5dfd1825:bfbe6d08 -- Fri Dec 20 18:51:17 2019
> Size of extra inode fields: 32
> Inode checksum: 0x838dcfc9
> EXTENTS:
> (0):1120749
> debugfs:
> root@cex7:[~]:<509> e2fsck -n /dev/nvme0n1p2
> e2fsck 1.44.5 (15-Dec-2018)
> Warning: skipping journal recovery because doing a read-only filesystem check.
> /dev/nvme0n1p2 contains a file system with errors, check forced.
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> /dev/nvme0n1p2: 147476/2097152 files (0.1% non-contiguous), 1542719/8388608 blocks
>
> This time, the machine has not been powered down for a very long time,
> although I've booted 5.7 (plus the additional patches including several
> workarounds in the PCIe driver so my Mellanox card works) on it earlier
> today.  I did notice that debian decided to run a fsck on the filesystem
> at reboot, which is a little weird as it's ext4, and found nothing wrong.
>
> Hmm, I just tried:
>
> root@cex7:[~]:<514> hdparm -f /dev/nvme0n1p2
> root@cex7:[~]:<515> hdparm -f /dev/nvme0n1
> root@cex7:[~]:<517> e2fsck -n /dev/nvme0n1p2
> e2fsck 1.44.5 (15-Dec-2018)
> Warning: skipping journal recovery because doing a read-only filesystem
> check.
> /dev/nvme0n1p2 contains a file system with errors, check forced.
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Entry 'mainlog.2.gz' in /var/log/exim4 (917613) has deleted/unused inode 922603.  Clear? no
>
> Entry 'mainlog.2.gz' in /var/log/exim4 (917613) has an incorrect filetype (was 1, should be 0).
> Fix? no
>
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Unattached inode 920748
> Connect to /lost+found? no
>
> Pass 5: Checking group summary information
> Block bitmap differences:  +(9259--9280) -3703011 -3703044 -3703053 +3736187 -3827722 -3830272 +3906363 +3911697 +3911699 +3911701 +3911703 +3913228
> Fix? no
>
> Free blocks count wrong for group #113 (12615, counted=12606).
> Fix? no
>
> Free blocks count wrong (6845889, counted=6845880).
> Fix? no
>
> Inode bitmap differences: Group 112 inode bitmap does not match checksum.
> IGNORED.
> Block bitmap differences: Group 113 block bitmap does not match checksum.
> IGNORED.
>
> /dev/nvme0n1p2: ********** WARNING: Filesystem still has errors **********
>
> /dev/nvme0n1p2: 147476/2097152 files (0.1% non-contiguous), 1542719/8388608 blocks
>
> which looks less good, and is likely to be e2fsck reading off the media
> rather than using what was in the kernel cache.  However, still nothing
> for the offending inode, whose raw data remains unchanged from what
> I've quoted above from debugfs.
>
> It /seems/ to be pointing at the data on the media changing, possibly
> buggy firmware on the nvme (ADATA SX8200PNP) drive?  Or maybe
> undiscovered bugs in the Mobiveil PCIe hardware corrupting transfers
> to the nvme?
>
> The problem is, this is rather undebuggable as it happens so rarely.
> :(
>
> I'm becoming very discouraged from touching nvme ever again by this,
> as this is my first and only experience of that technology.  I'm
> considering getting some conventional SATA HDDs and junking nvme on
> the basis of it being an unreliable technology.

Okay, now I'm confused.

I haven't rebooted the platform (I was just about to) but because of
the issues I've had in the past with the filesystem not being
mountable, I thought I ought to run e2fsck on the now read-only root
filesystem before rebooting to ensure that it is consistent.

root@cex7:[~]:<587> e2fsck -n /dev/nvme0n1p2
e2fsck 1.44.5 (15-Dec-2018)
/dev/nvme0n1p2 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (6845930, counted=6845886).
Fix? no

Free inodes count wrong (1949681, counted=1949673).
Fix? no

/dev/nvme0n1p2: 147471/2097152 files (0.1% non-contiguous), 1542678/8388608 blocks

but but but, the filesystem is still mounted read-only, so how can it
have changed?
From: Hou Zhiqiang <Zhiqiang.Hou@nxp.com>

This patch set is to recode the Mobiveil driver and add
PCIe support for NXP Layerscape series SoCs integrated
Mobiveil's PCIe Gen4 controller.

Hou Zhiqiang (12):
  PCI: mobiveil: Re-abstract the private structure
  PCI: mobiveil: Move the host initialization into a routine
  PCI: mobiveil: Collect the interrupt related operations into a routine
  PCI: mobiveil: Modularize the Mobiveil PCIe Host Bridge IP driver
  PCI: mobiveil: Add callback function for interrupt initialization
  PCI: mobiveil: Add callback function for link up check
  PCI: mobiveil: Make mobiveil_host_init() can be used to re-init host
  PCI: mobiveil: Add 8-bit and 16-bit CSR register accessors
  dt-bindings: PCI: Add NXP Layerscape SoCs PCIe Gen4 controller
  PCI: mobiveil: Add PCIe Gen4 RC driver for NXP Layerscape SoCs
  arm64: dts: lx2160a: Add PCIe controller DT nodes
  arm64: defconfig: Enable CONFIG_PCIE_LAYERSCAPE_GEN4

 .../bindings/pci/layerscape-pcie-gen4.txt     |  52 ++
 MAINTAINERS                                   |  10 +-
 .../arm64/boot/dts/freescale/fsl-lx2160a.dtsi | 163 ++++++
 arch/arm64/configs/defconfig                  |   1 +
 drivers/pci/controller/Kconfig                |  11 +-
 drivers/pci/controller/Makefile               |   2 +-
 drivers/pci/controller/mobiveil/Kconfig       |  34 ++
 drivers/pci/controller/mobiveil/Makefile      |   5 +
 .../mobiveil/pcie-layerscape-gen4.c           | 274 +++++++++
 .../pcie-mobiveil-host.c}                     | 544 ++++--------------
 .../controller/mobiveil/pcie-mobiveil-plat.c  |  60 ++
 .../pci/controller/mobiveil/pcie-mobiveil.c   | 230 ++++++++
 .../pci/controller/mobiveil/pcie-mobiveil.h   | 226 ++++++++
 13 files changed, 1157 insertions(+), 455 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/pci/layerscape-pcie-gen4.txt
 create mode 100644 drivers/pci/controller/mobiveil/Kconfig
 create mode 100644 drivers/pci/controller/mobiveil/Makefile
 create mode 100644 drivers/pci/controller/mobiveil/pcie-layerscape-gen4.c
 rename drivers/pci/controller/{pcie-mobiveil.c => mobiveil/pcie-mobiveil-host.c} (54%)
 create mode 100644 drivers/pci/controller/mobiveil/pcie-mobiveil-plat.c
 create mode 100644 drivers/pci/controller/mobiveil/pcie-mobiveil.c
 create mode 100644 drivers/pci/controller/mobiveil/pcie-mobiveil.h