[PATCHv9,00/12] PCI: Recode Mobiveil driver and add PCIe Gen4 driver for NXP Layerscape SoCs

Message ID 20191120034451.30102-1-Zhiqiang.Hou@nxp.com (mailing list archive)

Message

Z.Q. Hou Nov. 20, 2019, 3:45 a.m. UTC
From: Hou Zhiqiang <Zhiqiang.Hou@nxp.com>

This patch set recodes the Mobiveil driver and adds
PCIe support for NXP Layerscape series SoCs that integrate
Mobiveil's PCIe Gen4 controller.

Hou Zhiqiang (12):
  PCI: mobiveil: Re-abstract the private structure
  PCI: mobiveil: Move the host initialization into a routine
  PCI: mobiveil: Collect the interrupt related operations into a routine
  PCI: mobiveil: Modularize the Mobiveil PCIe Host Bridge IP driver
  PCI: mobiveil: Add callback function for interrupt initialization
  PCI: mobiveil: Add callback function for link up check
  PCI: mobiveil: Make mobiveil_host_init() can be used to re-init host
  PCI: mobiveil: Add 8-bit and 16-bit CSR register accessors
  dt-bindings: PCI: Add NXP Layerscape SoCs PCIe Gen4 controller
  PCI: mobiveil: Add PCIe Gen4 RC driver for NXP Layerscape SoCs
  arm64: dts: lx2160a: Add PCIe controller DT nodes
  arm64: defconfig: Enable CONFIG_PCIE_LAYERSCAPE_GEN4

 .../bindings/pci/layerscape-pcie-gen4.txt     |  52 ++
 MAINTAINERS                                   |  10 +-
 .../arm64/boot/dts/freescale/fsl-lx2160a.dtsi | 163 ++++++
 arch/arm64/configs/defconfig                  |   1 +
 drivers/pci/controller/Kconfig                |  11 +-
 drivers/pci/controller/Makefile               |   2 +-
 drivers/pci/controller/mobiveil/Kconfig       |  34 ++
 drivers/pci/controller/mobiveil/Makefile      |   5 +
 .../mobiveil/pcie-layerscape-gen4.c           | 274 +++++++++
 .../pcie-mobiveil-host.c}                     | 544 ++++--------------
 .../controller/mobiveil/pcie-mobiveil-plat.c  |  60 ++
 .../pci/controller/mobiveil/pcie-mobiveil.c   | 230 ++++++++
 .../pci/controller/mobiveil/pcie-mobiveil.h   | 226 ++++++++
 13 files changed, 1157 insertions(+), 455 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/pci/layerscape-pcie-gen4.txt
 create mode 100644 drivers/pci/controller/mobiveil/Kconfig
 create mode 100644 drivers/pci/controller/mobiveil/Makefile
 create mode 100644 drivers/pci/controller/mobiveil/pcie-layerscape-gen4.c
 rename drivers/pci/controller/{pcie-mobiveil.c => mobiveil/pcie-mobiveil-host.c} (54%)
 create mode 100644 drivers/pci/controller/mobiveil/pcie-mobiveil-plat.c
 create mode 100644 drivers/pci/controller/mobiveil/pcie-mobiveil.c
 create mode 100644 drivers/pci/controller/mobiveil/pcie-mobiveil.h

Comments

Russell King (Oracle) Nov. 20, 2019, 9:57 a.m. UTC | #1
On Wed, Nov 20, 2019 at 03:45:17AM +0000, Z.q. Hou wrote:
> From: Hou Zhiqiang <Zhiqiang.Hou@nxp.com>
> 
> This patch set is to recode the Mobiveil driver and add
> PCIe support for NXP Layerscape series SoCs integrated
> Mobiveil's PCIe Gen4 controller.

How many PCIe cards have been tested to work/don't work with this?

I need:

PCI: mobiveil: ls_pcie_g4: fix SError when accessing config space
PCI: mobiveil: ls_pcie_g4: add Workaround for A-011451
PCI: mobiveil: ls_pcie_g4: add Workaround for A-011577

to successfully boot with a Mellanox card plugged in with a previous
revision of these patches.
Z.Q. Hou Nov. 20, 2019, 10:30 a.m. UTC | #2
Hi Russell,

> -----Original Message-----
> From: Russell King - ARM Linux admin <linux@armlinux.org.uk>
> Sent: November 20, 2019 17:57
> To: Z.q. Hou <zhiqiang.hou@nxp.com>
> Cc: linux-pci@vger.kernel.org; linux-arm-kernel@lists.infradead.org;
> devicetree@vger.kernel.org; linux-kernel@vger.kernel.org;
> bhelgaas@google.com; robh+dt@kernel.org; arnd@arndb.de;
> mark.rutland@arm.com; l.subrahmanya@mobiveil.co.in;
> shawnguo@kernel.org; m.karthikeyan@mobiveil.co.in; Leo Li
> <leoyang.li@nxp.com>; lorenzo.pieralisi@arm.com;
> catalin.marinas@arm.com; will.deacon@arm.com;
> andrew.murray@arm.com; M.h. Lian <minghuan.lian@nxp.com>; Xiaowei
> Bao <xiaowei.bao@nxp.com>; Mingkai Hu <mingkai.hu@nxp.com>
> Subject: Re: [PATCHv9 00/12] PCI: Recode Mobiveil driver and add PCIe Gen4
> driver for NXP Layerscape SoCs
> 
> On Wed, Nov 20, 2019 at 03:45:17AM +0000, Z.q. Hou wrote:
> > From: Hou Zhiqiang <Zhiqiang.Hou@nxp.com>
> >
> > This patch set is to recode the Mobiveil driver and add PCIe support
> > for NXP Layerscape series SoCs integrated Mobiveil's PCIe Gen4
> > controller.
> 
> How many PCIe cards have been tested to work/don't work with this?
> 
> I need:
> 
> PCI: mobiveil: ls_pcie_g4: fix SError when accessing config space
> PCI: mobiveil: ls_pcie_g4: add Workaround for A-011451
> PCI: mobiveil: ls_pcie_g4: add Workaround for A-011577
> 
> to successfully boot with a Mellanox card plugged in with a previous revision
> of these patches.
>

Yes, these NXP-internally maintained workarounds need to be applied on top
of this series. I have only tested an Intel e1000e NIC with this patch set
plus these 3 workarounds.

Thanks,
Zhiqiang
 
> --
> RMK's Patch system:
> https://www.armlinux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
> According to speedtest.net: 11.9Mbps down 500kbps up
Olof Johansson Dec. 13, 2019, 6:37 p.m. UTC | #3
Hi!

On Tue, Nov 19, 2019 at 7:45 PM Z.q. Hou <zhiqiang.hou@nxp.com> wrote:
>
> From: Hou Zhiqiang <Zhiqiang.Hou@nxp.com>
>
> This patch set is to recode the Mobiveil driver and add
> PCIe support for NXP Layerscape series SoCs integrated
> Mobiveil's PCIe Gen4 controller.

Can we get a respin for this on top of the 5.5 merge window material?
Given that it's a bunch of refactorings, many of them don't apply on
top of the material that was merged.

I'd love to see these go in sooner rather than later so I can start
getting -next running on lx2160a here.


-Olof
Z.Q. Hou Dec. 17, 2019, 2:50 a.m. UTC | #4
Hi Lorenzo,

The v9 patches address the comments from Andrew, and they have been pending
for about 1 month. Could you help to apply them?

Thanks,
Zhiqiang

> -----Original Message-----
> From: Olof Johansson <olof@lixom.net>
> Sent: December 14, 2019 2:37
> To: Z.q. Hou <zhiqiang.hou@nxp.com>; bhelgaas@google.com
> Cc: linux-pci@vger.kernel.org; linux-arm-kernel@lists.infradead.org;
> devicetree@vger.kernel.org; linux-kernel@vger.kernel.org;
> robh+dt@kernel.org; arnd@arndb.de; mark.rutland@arm.com;
> l.subrahmanya@mobiveil.co.in; shawnguo@kernel.org;
> m.karthikeyan@mobiveil.co.in; Leo Li <leoyang.li@nxp.com>;
> lorenzo.pieralisi@arm.com; catalin.marinas@arm.com;
> will.deacon@arm.com; andrew.murray@arm.com; Mingkai Hu
> <mingkai.hu@nxp.com>; M.h. Lian <minghuan.lian@nxp.com>; Xiaowei Bao
> <xiaowei.bao@nxp.com>
> Subject: Re: [PATCHv9 00/12] PCI: Recode Mobiveil driver and add PCIe Gen4
> driver for NXP Layerscape SoCs
> 
> Hi!
> 
> On Tue, Nov 19, 2019 at 7:45 PM Z.q. Hou <zhiqiang.hou@nxp.com> wrote:
> >
> > From: Hou Zhiqiang <Zhiqiang.Hou@nxp.com>
> >
> > This patch set is to recode the Mobiveil driver and add PCIe support
> > for NXP Layerscape series SoCs integrated Mobiveil's PCIe Gen4
> > controller.
> 
> Can we get a respin for this on top of the 5.5 merge window material?
> Given that it's a bunch of refactorings, many of them don't apply on top of
> the material that was merged.
> 
> I'd love to see these go in sooner rather than later so I can start getting -next
> running on ls2160a here.
> 
> 
> -Olof
Lorenzo Pieralisi Jan. 10, 2020, 3:33 p.m. UTC | #5
On Tue, Dec 17, 2019 at 02:50:15AM +0000, Z.q. Hou wrote:
> Hi Lorenzo,
> 
> The v9 patches have addressed the comments from Andrew, and it has
> been dried about 1 month, can you help to apply them?

We shall have a look beginning of next week, sorry for the delay
in getting back to you.

Lorenzo

> Thanks,
> Zhiqiang
> 
> > -----Original Message-----
> > From: Olof Johansson <olof@lixom.net>
> > Sent: December 14, 2019 2:37
> > To: Z.q. Hou <zhiqiang.hou@nxp.com>; bhelgaas@google.com
> > Cc: linux-pci@vger.kernel.org; linux-arm-kernel@lists.infradead.org;
> > devicetree@vger.kernel.org; linux-kernel@vger.kernel.org;
> > robh+dt@kernel.org; arnd@arndb.de; mark.rutland@arm.com;
> > l.subrahmanya@mobiveil.co.in; shawnguo@kernel.org;
> > m.karthikeyan@mobiveil.co.in; Leo Li <leoyang.li@nxp.com>;
> > lorenzo.pieralisi@arm.com; catalin.marinas@arm.com;
> > will.deacon@arm.com; andrew.murray@arm.com; Mingkai Hu
> > <mingkai.hu@nxp.com>; M.h. Lian <minghuan.lian@nxp.com>; Xiaowei Bao
> > <xiaowei.bao@nxp.com>
> > Subject: Re: [PATCHv9 00/12] PCI: Recode Mobiveil driver and add PCIe Gen4
> > driver for NXP Layerscape SoCs
> > 
> > Hi!
> > 
> > On Tue, Nov 19, 2019 at 7:45 PM Z.q. Hou <zhiqiang.hou@nxp.com> wrote:
> > >
> > > From: Hou Zhiqiang <Zhiqiang.Hou@nxp.com>
> > >
> > > This patch set is to recode the Mobiveil driver and add PCIe support
> > > for NXP Layerscape series SoCs integrated Mobiveil's PCIe Gen4
> > > controller.
> > 
> > Can we get a respin for this on top of the 5.5 merge window material?
> > Given that it's a bunch of refactorings, many of them don't apply on top of
> > the material that was merged.
> > 
> > I'd love to see these go in sooner rather than later so I can start getting -next
> > running on ls2160a here.
> > 
> > 
> > -Olof
Olof Johansson Jan. 10, 2020, 5:05 p.m. UTC | #6
On Fri, Jan 10, 2020 at 7:33 AM Lorenzo Pieralisi
<lorenzo.pieralisi@arm.com> wrote:
>
> On Tue, Dec 17, 2019 at 02:50:15AM +0000, Z.q. Hou wrote:
> > Hi Lorenzo,
> >
> > The v9 patches have addressed the comments from Andrew, and it has
> > been dried about 1 month, can you help to apply them?
>
> We shall have a look beginning of next week, sorry for the delay
> in getting back to you.

Note that the patch set no longer applies since the refactorings
conflict with new development by others.

Zhiqiang, can you rebase and post a new version of the patch set?


-Olof
Z.Q. Hou Feb. 6, 2020, 10:57 a.m. UTC | #7
Hi Olof,

Thanks a lot for your comments!
And sorry for my delayed response!

> -----Original Message-----
> From: Olof Johansson <olof@lixom.net>
> Sent: January 11, 2020 1:06
> To: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> Cc: Z.q. Hou <zhiqiang.hou@nxp.com>; bhelgaas@google.com;
> linux-pci@vger.kernel.org; linux-arm-kernel@lists.infradead.org;
> devicetree@vger.kernel.org; linux-kernel@vger.kernel.org;
> robh+dt@kernel.org; arnd@arndb.de; mark.rutland@arm.com;
> l.subrahmanya@mobiveil.co.in; shawnguo@kernel.org;
> m.karthikeyan@mobiveil.co.in; Leo Li <leoyang.li@nxp.com>;
> catalin.marinas@arm.com; will.deacon@arm.com; andrew.murray@arm.com;
> Mingkai Hu <mingkai.hu@nxp.com>; M.h. Lian <minghuan.lian@nxp.com>;
> Xiaowei Bao <xiaowei.bao@nxp.com>
> Subject: Re: [PATCHv9 00/12] PCI: Recode Mobiveil driver and add PCIe Gen4
> driver for NXP Layerscape SoCs
> 
> On Fri, Jan 10, 2020 at 7:33 AM Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> wrote:
> >
> > On Tue, Dec 17, 2019 at 02:50:15AM +0000, Z.q. Hou wrote:
> > > Hi Lorenzo,
> > >
> > > The v9 patches have addressed the comments from Andrew, and it has
> > > been dried about 1 month, can you help to apply them?
> >
> > We shall have a look beginning of next week, sorry for the delay in
> > getting back to you.
> 
> Note that the patch set no longer applies since the refactorings conflict with
> new development by others.
> 
> Zhiqiang, can you rebase and post a new version of the patch set?

Yes, I will rebase the patches to the latest code base.

Thanks,
Zhiqiang

 
> 
> -Olof
Olof Johansson Feb. 10, 2020, 3:12 p.m. UTC | #8
On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@nxp.com> wrote:
>
> Hi Olof,
>
> Thanks a lot for your comments!
> And sorry for my delay respond!

Actually, they apply with only minor conflicts on top of current -next.

Bjorn, any chance we can get you to pick these up pretty soon? They
enable full use of a promising ARM developer system, the SolidRun
HoneyComb, and would be quite valuable for me and others to be able to
use with mainline or -next without any additional patches applied --
which this patchset achieves.

I know there are pending revisions based on feedback. I'll leave it up
to you and others to determine if that can be done with incremental
patches on top, or if it should be fixed before the initial patchset
is applied. But all in all, it's holding up adoption by me and surely
others of a very interesting platform -- I'm looking to replace my
aging MacchiatoBin with one of these and would need PCIe/NVMe to work
before I do.


Thanks!


-Olof
Russell King (Oracle) Feb. 10, 2020, 3:22 p.m. UTC | #9
On Mon, Feb 10, 2020 at 04:12:30PM +0100, Olof Johansson wrote:
> On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@nxp.com> wrote:
> >
> > Hi Olof,
> >
> > Thanks a lot for your comments!
> > And sorry for my delay respond!
> 
> Actually, they apply with only minor conflicts on top of current -next.
> 
> Bjorn, any chance we can get you to pick these up pretty soon? They
> enable full use of a promising ARM developer system, the SolidRun
> HoneyComb, and would be quite valuable for me and others to be able to
> use with mainline or -next without any additional patches applied --
> which this patchset achieves.
> 
> I know there are pending revisions based on feedback. I'll leave it up
> to you and others to determine if that can be done with incremental
> patches on top, or if it should be fixed before the initial patchset
> is applied. But all in all, it's holding up adaption by me and surely
> others of a very interesting platform -- I'm looking to replace my
> aging MacchiatoBin with one of these and would need PCIe/NVMe to work
> before I do.

If you're going to be using NVMe, make sure you use a power-fail safe
version; I've already had one instance where ext4 failed to mount
because of a corrupted journal using an XPG SX8200 after the Honeycomb
SError'd, and then I powered it down after a few hours before later
booting it back up.

EXT4-fs (nvme0n1p2): INFO: recovery required on readonly filesystem
EXT4-fs (nvme0n1p2): write access will be enabled during recovery
JBD2: journal transaction 80849 on nvme0n1p2-8 is corrupt.
EXT4-fs (nvme0n1p2): error loading journal
Olof Johansson Feb. 10, 2020, 3:28 p.m. UTC | #10
On Mon, Feb 10, 2020 at 4:23 PM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
>
> On Mon, Feb 10, 2020 at 04:12:30PM +0100, Olof Johansson wrote:
> > On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@nxp.com> wrote:
> > >
> > > Hi Olof,
> > >
> > > Thanks a lot for your comments!
> > > And sorry for my delay respond!
> >
> > Actually, they apply with only minor conflicts on top of current -next.
> >
> > Bjorn, any chance we can get you to pick these up pretty soon? They
> > enable full use of a promising ARM developer system, the SolidRun
> > HoneyComb, and would be quite valuable for me and others to be able to
> > use with mainline or -next without any additional patches applied --
> > which this patchset achieves.
> >
> > I know there are pending revisions based on feedback. I'll leave it up
> > to you and others to determine if that can be done with incremental
> > patches on top, or if it should be fixed before the initial patchset
> > is applied. But all in all, it's holding up adaption by me and surely
> > others of a very interesting platform -- I'm looking to replace my
> > aging MacchiatoBin with one of these and would need PCIe/NVMe to work
> > before I do.
>
> If you're going to be using NVMe, make sure you use a power-fail safe
> version; I've already had one instance where ext4 failed to mount
> because of a corrupted journal using an XPG SX8200 after the Honeycomb
> Serror'd, and then I powered it down after a few hours before later
> booting it back up.
>
> EXT4-fs (nvme0n1p2): INFO: recovery required on readonly filesystem
> EXT4-fs (nvme0n1p2): write access will be enabled during recovery
> JBD2: journal transaction 80849 on nvme0n1p2-8 is corrupt.
> EXT4-fs (nvme0n1p2): error loading journal

Hmm, using btrfs on mine, not sure if the exposure is similar or not.

Do you know if the SErr was due to a known issue and/or if it's
something that's fixed in production silicon?

(I still can't enable SMMU since across a warm reboot it fails
*completely*, with nothing coming up and working. NXP folks, you
listening? :)


-Olof
Lorenzo Pieralisi Feb. 10, 2020, 3:33 p.m. UTC | #11
On Mon, Feb 10, 2020 at 04:12:30PM +0100, Olof Johansson wrote:
> On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@nxp.com> wrote:
> >
> > Hi Olof,
> >
> > Thanks a lot for your comments!
> > And sorry for my delay respond!
> 
> Actually, they apply with only minor conflicts on top of current -next.
> 
> Bjorn, any chance we can get you to pick these up pretty soon? They
> enable full use of a promising ARM developer system, the SolidRun
> HoneyComb, and would be quite valuable for me and others to be able to
> use with mainline or -next without any additional patches applied --
> which this patchset achieves.
> 
> I know there are pending revisions based on feedback. I'll leave it up
> to you and others to determine if that can be done with incremental
> patches on top, or if it should be fixed before the initial patchset
> is applied. But all in all, it's holding up adaption by me and surely
> others of a very interesting platform -- I'm looking to replace my
> aging MacchiatoBin with one of these and would need PCIe/NVMe to work
> before I do.

We should be able to merge them for v5.7, I don't know when they
will land in -next.

Thanks,
Lorenzo
Russell King (Oracle) Feb. 10, 2020, 4:15 p.m. UTC | #12
On Mon, Feb 10, 2020 at 04:28:23PM +0100, Olof Johansson wrote:
> On Mon, Feb 10, 2020 at 4:23 PM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> >
> > On Mon, Feb 10, 2020 at 04:12:30PM +0100, Olof Johansson wrote:
> > > On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@nxp.com> wrote:
> > > >
> > > > Hi Olof,
> > > >
> > > > Thanks a lot for your comments!
> > > > And sorry for my delay respond!
> > >
> > > Actually, they apply with only minor conflicts on top of current -next.
> > >
> > > Bjorn, any chance we can get you to pick these up pretty soon? They
> > > enable full use of a promising ARM developer system, the SolidRun
> > > HoneyComb, and would be quite valuable for me and others to be able to
> > > use with mainline or -next without any additional patches applied --
> > > which this patchset achieves.
> > >
> > > I know there are pending revisions based on feedback. I'll leave it up
> > > to you and others to determine if that can be done with incremental
> > > patches on top, or if it should be fixed before the initial patchset
> > > is applied. But all in all, it's holding up adaption by me and surely
> > > others of a very interesting platform -- I'm looking to replace my
> > > aging MacchiatoBin with one of these and would need PCIe/NVMe to work
> > > before I do.
> >
> > If you're going to be using NVMe, make sure you use a power-fail safe
> > version; I've already had one instance where ext4 failed to mount
> > because of a corrupted journal using an XPG SX8200 after the Honeycomb
> > Serror'd, and then I powered it down after a few hours before later
> > booting it back up.
> >
> > EXT4-fs (nvme0n1p2): INFO: recovery required on readonly filesystem
> > EXT4-fs (nvme0n1p2): write access will be enabled during recovery
> > JBD2: journal transaction 80849 on nvme0n1p2-8 is corrupt.
> > EXT4-fs (nvme0n1p2): error loading journal
> 
> Hmm, using btrfs on mine, not sure if the exposure is similar or not.

As I understand the problem, it isn't a filesystem issue.  It's a data
integrity issue with NVMe drives over power failure: how they cache the
data, and ultimately write it to the NAND flash.

Have a read of:

https://www.kingston.com/en/solutions/servers-data-centers/ssd-power-loss-protection

As NVMe and SSD are basically the same underlying technology (the host
interface is different), and given the issues I've heard about and now
experienced with my NVMe, I think the above is a good pointer to the
problems of flash mass storage.

As I understand it, the problem occurs when the mapping table has not
been written back to flash, power is lost without the Standby Immediate
command being sent, and there is no way for the firmware to quickly
save the table.  On subsequent power up, the firmware has to
reconstruct the mapping table, and depending on how that is done,
incorrect (old?) data may be returned for some blocks.

That can happen to any blocks on the drive, which means any data can
be at risk from a power loss event, whether that is a power failure
or after a crash.

> Do you know if the SErr was due to a known issue and/or if it's
> something that's fixed in production silicon?

The SError is triggered by something on the PCIe side of things; if I
leave the Mellanox PCIe card out, then I don't get them.  The errata
patches I have merged into my tree help a bit, turning the code from
being unable to boot without a SError with the card plugged in, to
being able to boot and last a while - but the SErrors still eventually
come, maybe taking a few days... and that's without the Mellanox
ethernet interface being up.

> (I still can't enable SMMU since across a warm reboot it fails
> *completely*, with nothing coming up and working. NXP folks, you
> listening? :)

Is it just a warm reboot?  I thought I saw SMMU activity on a cold
boot as well, implying that there were devices active that Linux
did not know about.
Russell King (Oracle) Feb. 10, 2020, 5:20 p.m. UTC | #13
On Mon, Feb 10, 2020 at 04:15:53PM +0000, Russell King - ARM Linux admin wrote:
> On Mon, Feb 10, 2020 at 04:28:23PM +0100, Olof Johansson wrote:
> > On Mon, Feb 10, 2020 at 4:23 PM Russell King - ARM Linux admin
> > <linux@armlinux.org.uk> wrote:
> > >
> > > On Mon, Feb 10, 2020 at 04:12:30PM +0100, Olof Johansson wrote:
> > > > On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@nxp.com> wrote:
> > > > >
> > > > > Hi Olof,
> > > > >
> > > > > Thanks a lot for your comments!
> > > > > And sorry for my delay respond!
> > > >
> > > > Actually, they apply with only minor conflicts on top of current -next.
> > > >
> > > > Bjorn, any chance we can get you to pick these up pretty soon? They
> > > > enable full use of a promising ARM developer system, the SolidRun
> > > > HoneyComb, and would be quite valuable for me and others to be able to
> > > > use with mainline or -next without any additional patches applied --
> > > > which this patchset achieves.
> > > >
> > > > I know there are pending revisions based on feedback. I'll leave it up
> > > > to you and others to determine if that can be done with incremental
> > > > patches on top, or if it should be fixed before the initial patchset
> > > > is applied. But all in all, it's holding up adaption by me and surely
> > > > others of a very interesting platform -- I'm looking to replace my
> > > > aging MacchiatoBin with one of these and would need PCIe/NVMe to work
> > > > before I do.
> > >
> > > If you're going to be using NVMe, make sure you use a power-fail safe
> > > version; I've already had one instance where ext4 failed to mount
> > > because of a corrupted journal using an XPG SX8200 after the Honeycomb
> > > Serror'd, and then I powered it down after a few hours before later
> > > booting it back up.
> > >
> > > EXT4-fs (nvme0n1p2): INFO: recovery required on readonly filesystem
> > > EXT4-fs (nvme0n1p2): write access will be enabled during recovery
> > > JBD2: journal transaction 80849 on nvme0n1p2-8 is corrupt.
> > > EXT4-fs (nvme0n1p2): error loading journal
> > 
> > Hmm, using btrfs on mine, not sure if the exposure is similar or not.
> 
> As I understand the problem, it isn't a filesystem issue.  It's a data
> integrity issue with the NVMe over power fail, how they cache the data,
> and ultimately write it to the nand flash.
> 
> Have a read of:
> 
> https://www.kingston.com/en/solutions/servers-data-centers/ssd-power-loss-protection

This was the link I was actually looking for:

http://industrial.adata.com/en/technology/92

but there's also:

http://industrial.adata.com/en/technology/26

ADATA make the XPG SX8200:

NVME Identify Controller:
vid       : 0x1cc1
ssvid     : 0x1cc1
mn        : ADATA SX8200PNP
fr        : R0906I
Olof Johansson Feb. 10, 2020, 6:33 p.m. UTC | #14
[cc:ing honeycomb-users, didn't think of that earlier]

On Mon, Feb 10, 2020 at 5:16 PM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
>
> On Mon, Feb 10, 2020 at 04:28:23PM +0100, Olof Johansson wrote:
> > On Mon, Feb 10, 2020 at 4:23 PM Russell King - ARM Linux admin
> > <linux@armlinux.org.uk> wrote:
> > >
> > > On Mon, Feb 10, 2020 at 04:12:30PM +0100, Olof Johansson wrote:
> > > > On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@nxp.com> wrote:
> > > > >
> > > > > Hi Olof,
> > > > >
> > > > > Thanks a lot for your comments!
> > > > > And sorry for my delay respond!
> > > >
> > > > Actually, they apply with only minor conflicts on top of current -next.
> > > >
> > > > Bjorn, any chance we can get you to pick these up pretty soon? They
> > > > enable full use of a promising ARM developer system, the SolidRun
> > > > HoneyComb, and would be quite valuable for me and others to be able to
> > > > use with mainline or -next without any additional patches applied --
> > > > which this patchset achieves.
> > > >
> > > > I know there are pending revisions based on feedback. I'll leave it up
> > > > to you and others to determine if that can be done with incremental
> > > > patches on top, or if it should be fixed before the initial patchset
> > > > is applied. But all in all, it's holding up adaption by me and surely
> > > > others of a very interesting platform -- I'm looking to replace my
> > > > aging MacchiatoBin with one of these and would need PCIe/NVMe to work
> > > > before I do.
> > >
> > > If you're going to be using NVMe, make sure you use a power-fail safe
> > > version; I've already had one instance where ext4 failed to mount
> > > because of a corrupted journal using an XPG SX8200 after the Honeycomb
> > > Serror'd, and then I powered it down after a few hours before later
> > > booting it back up.
> > >
> > > EXT4-fs (nvme0n1p2): INFO: recovery required on readonly filesystem
> > > EXT4-fs (nvme0n1p2): write access will be enabled during recovery
> > > JBD2: journal transaction 80849 on nvme0n1p2-8 is corrupt.
> > > EXT4-fs (nvme0n1p2): error loading journal
> >
> > Hmm, using btrfs on mine, not sure if the exposure is similar or not.
>
> As I understand the problem, it isn't a filesystem issue.  It's a data
> integrity issue with the NVMe over power fail, how they cache the data,
> and ultimately write it to the nand flash.
>
> Have a read of:
>
> https://www.kingston.com/en/solutions/servers-data-centers/ssd-power-loss-protection
>
> As NVMe and SSD are basically the same underlying technology (the host
> interface is different) and the issues I've heard, and now experienced
> with my NVMe, I think the above is a good pointer to the problems of
> flash mass storage.
>
> As I understand it, the problem occurs when the mapping table has not
> been written back to flash, power is lost without the Standby Immediate
> command being sent, and there is no way for the firmware to quickly
> save the table.  On subsequent power up, the firmware has to
> reconstruct the mapping table, and depending on how that is done,
> incorrect (old?) data may be returned for some blocks.
>
> That can happen to any blocks on the drive, which means any data can
> be at risk from a power loss event, whether that is a power failure
> or after a crash.

Makes me wonder whether there's some board-level power/reset sequencing
issue, or a problem with one card going down disabling
others. I haven't read the specs enough to know what the expected
behavior is, but I've seen similar issues on other platforms, so take it
with a grain of salt.

> > Do you know if the SErr was due to a known issue and/or if it's
> > something that's fixed in production silicon?
>
> The SError is triggered by something on the PCIe side of things; if I
> leave the Mellanox PCIe card out, then I don't get them.  The errata
> patches I have merged into my tree help a bit, turning the code from
> being unable to boot without a SError with the card plugged in, to
> being able to boot and last a while - but the SErrors still eventually
> come, maybe taking a few days... and that's without the Mellanox
> ethernet interface being up.
>
> > (I still can't enable SMMU since across a warm reboot it fails
> > *completely*, with nothing coming up and working. NXP folks, you
> > listening? :)
>
> Is it just a warm reboot?  I thought I saw SMMU activity on a cold
> boot as well, implying that there were devices active that Linux
> did not know about.

Yeah, 100% reproducible on warm reboot -- every single time. Not on
cold boot though (100% success rate as far as I remember). I boot with
kernel on NVMe on PCIe, native 1GbE for networking. u-boot from SD
card.

This is with the SolidRun u-boot from GitHub.


-Olof
Leo Li Feb. 10, 2020, 6:41 p.m. UTC | #15
On Mon, Feb 10, 2020 at 9:32 AM Olof Johansson <olof@lixom.net> wrote:
>
> On Mon, Feb 10, 2020 at 4:23 PM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> >
> > On Mon, Feb 10, 2020 at 04:12:30PM +0100, Olof Johansson wrote:
> > > On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@nxp.com> wrote:
> > > >
> > > > Hi Olof,
> > > >
> > > > Thanks a lot for your comments!
> > > > And sorry for my delay respond!
> > >
> > > Actually, they apply with only minor conflicts on top of current -next.
> > >
> > > Bjorn, any chance we can get you to pick these up pretty soon? They
> > > enable full use of a promising ARM developer system, the SolidRun
> > > HoneyComb, and would be quite valuable for me and others to be able to
> > > use with mainline or -next without any additional patches applied --
> > > which this patchset achieves.
> > >
> > > I know there are pending revisions based on feedback. I'll leave it up
> > > to you and others to determine if that can be done with incremental
> > > patches on top, or if it should be fixed before the initial patchset
> > > is applied. But all in all, it's holding up adaption by me and surely
> > > others of a very interesting platform -- I'm looking to replace my
> > > aging MacchiatoBin with one of these and would need PCIe/NVMe to work
> > > before I do.
> >
> > If you're going to be using NVMe, make sure you use a power-fail safe
> > version; I've already had one instance where ext4 failed to mount
> > because of a corrupted journal using an XPG SX8200 after the Honeycomb
> > Serror'd, and then I powered it down after a few hours before later
> > booting it back up.
> >
> > EXT4-fs (nvme0n1p2): INFO: recovery required on readonly filesystem
> > EXT4-fs (nvme0n1p2): write access will be enabled during recovery
> > JBD2: journal transaction 80849 on nvme0n1p2-8 is corrupt.
> > EXT4-fs (nvme0n1p2): error loading journal
>
> Hmm, using btrfs on mine, not sure if the exposure is similar or not.
>
> Do you know if the SErr was due to a known issue and/or if it's
> something that's fixed in production silicon?
>
> (I still can't enable SMMU since across a warm reboot it fails
> *completely*, with nothing coming up and working. NXP folks, you
> listening? :)

This is a known issue with the DPAA2 MC bus not working well with
SMMU-based IO mapping.  Adding Laurentiu, who has been looking into this
issue, to the thread.

Regards,
Leo
Leo Li Feb. 10, 2020, 7:48 p.m. UTC | #16
On Mon, Feb 10, 2020 at 12:41 PM Li Yang <leoyang.li@nxp.com> wrote:
>
> On Mon, Feb 10, 2020 at 9:32 AM Olof Johansson <olof@lixom.net> wrote:
> >
> > On Mon, Feb 10, 2020 at 4:23 PM Russell King - ARM Linux admin
> > <linux@armlinux.org.uk> wrote:
> > >
> > > On Mon, Feb 10, 2020 at 04:12:30PM +0100, Olof Johansson wrote:
> > > > On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@nxp.com> wrote:
> > > > >
> > > > > Hi Olof,
> > > > >
> > > > > Thanks a lot for your comments!
> > > > > And sorry for my delay respond!
> > > >
> > > > Actually, they apply with only minor conflicts on top of current -next.
> > > >
> > > > Bjorn, any chance we can get you to pick these up pretty soon? They
> > > > enable full use of a promising ARM developer system, the SolidRun
> > > > HoneyComb, and would be quite valuable for me and others to be able to
> > > > use with mainline or -next without any additional patches applied --
> > > > which this patchset achieves.
> > > >
> > > > I know there are pending revisions based on feedback. I'll leave it up
> > > > to you and others to determine if that can be done with incremental
> > > > patches on top, or if it should be fixed before the initial patchset
> > > > is applied. But all in all, it's holding up adaption by me and surely
> > > > others of a very interesting platform -- I'm looking to replace my
> > > > aging MacchiatoBin with one of these and would need PCIe/NVMe to work
> > > > before I do.
> > >
> > > If you're going to be using NVMe, make sure you use a power-fail safe
> > > version; I've already had one instance where ext4 failed to mount
> > > because of a corrupted journal using an XPG SX8200 after the Honeycomb
> > > Serror'd, and then I powered it down after a few hours before later
> > > booting it back up.
> > >
> > > EXT4-fs (nvme0n1p2): INFO: recovery required on readonly filesystem
> > > EXT4-fs (nvme0n1p2): write access will be enabled during recovery
> > > JBD2: journal transaction 80849 on nvme0n1p2-8 is corrupt.
> > > EXT4-fs (nvme0n1p2): error loading journal
> >
> > Hmm, using btrfs on mine, not sure if the exposure is similar or not.
> >
> > Do you know if the SErr was due to a known issue and/or if it's
> > something that's fixed in production silicon?
> >
> > (I still can't enable SMMU since across a warm reboot it fails
> > *completely*, with nothing coming up and working. NXP folks, you
> > listening? :)
>
> This is a known issue about DPAA2 MC bus not working well with SMMU
> based IO mapping.  Adding Laurentiu to the chain who has been looking
> into this issue.

Forgot to mention that you can work around the issue by setting
CONFIG_ARM_SMMU_DISABLE_BYPASS_BY_DEFAULT=n or adding
"arm-smmu.disable_bypass=0" to the boot parameters.

Regards,
Leo
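
To confirm that the boot-parameter workaround above actually reached the
kernel, the command line can be inspected after boot. The following is only
a minimal sketch: the helper name smmu_bypass_allowed() is hypothetical, it
assumes the parameter is spelled as in Leo's mail (treating "-" and "_" in
the name as equivalent, as the kernel does), and it recognises only the
"0"/"n" style values rather than every spelling the kernel accepts.

```python
def smmu_bypass_allowed(cmdline: str):
    """Hypothetical helper: report the arm-smmu.disable_bypass setting.

    Returns True if the last occurrence disables the "disable bypass"
    behaviour (value 0/n), False if it is set to anything else, and
    None if the parameter does not appear at all.
    """
    result = None
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if not sep:
            continue  # flag-style parameter without a value
        # '-' and '_' are interchangeable in kernel parameter names
        if key.replace("-", "_") != "arm_smmu.disable_bypass":
            continue
        # Later occurrences override earlier ones
        result = value[:1] in ("0", "n", "N")
    return result


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            print(smmu_bypass_allowed(f.read()))
    except OSError:
        print("/proc/cmdline is not available on this system")
```

A result of None means the parameter was not passed, so the kernel falls
back to whatever CONFIG_ARM_SMMU_DISABLE_BYPASS_BY_DEFAULT selected.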
Laurentiu Tudor Feb. 11, 2020, 12:13 p.m. UTC | #17
On 10.02.2020 20:41, Li Yang wrote:
> On Mon, Feb 10, 2020 at 9:32 AM Olof Johansson <olof@lixom.net> wrote:
>>
>> On Mon, Feb 10, 2020 at 4:23 PM Russell King - ARM Linux admin
>> <linux@armlinux.org.uk> wrote:
>>>
>>> On Mon, Feb 10, 2020 at 04:12:30PM +0100, Olof Johansson wrote:
>>>> On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@nxp.com> wrote:
>>>>>
>>>>> Hi Olof,
>>>>>
>>>>> Thanks a lot for your comments!
>>>>> And sorry for my delay respond!
>>>>
>>>> Actually, they apply with only minor conflicts on top of current -next.
>>>>
>>>> Bjorn, any chance we can get you to pick these up pretty soon? They
>>>> enable full use of a promising ARM developer system, the SolidRun
>>>> HoneyComb, and would be quite valuable for me and others to be able to
>>>> use with mainline or -next without any additional patches applied --
>>>> which this patchset achieves.
>>>>
>>>> I know there are pending revisions based on feedback. I'll leave it up
>>>> to you and others to determine if that can be done with incremental
>>>> patches on top, or if it should be fixed before the initial patchset
>>>> is applied. But all in all, it's holding up adaption by me and surely
>>>> others of a very interesting platform -- I'm looking to replace my
>>>> aging MacchiatoBin with one of these and would need PCIe/NVMe to work
>>>> before I do.
>>>
>>> If you're going to be using NVMe, make sure you use a power-fail safe
>>> version; I've already had one instance where ext4 failed to mount
>>> because of a corrupted journal using an XPG SX8200 after the Honeycomb
>>> Serror'd, and then I powered it down after a few hours before later
>>> booting it back up.
>>>
>>> EXT4-fs (nvme0n1p2): INFO: recovery required on readonly filesystem
>>> EXT4-fs (nvme0n1p2): write access will be enabled during recovery
>>> JBD2: journal transaction 80849 on nvme0n1p2-8 is corrupt.
>>> EXT4-fs (nvme0n1p2): error loading journal
>>
>> Hmm, using btrfs on mine, not sure if the exposure is similar or not.
>>
>> Do you know if the SErr was due to a known issue and/or if it's
>> something that's fixed in production silicon?
>>
>> (I still can't enable SMMU since across a warm reboot it fails
>> *completely*, with nothing coming up and working. NXP folks, you
>> listening? :)
> 
> This is a known issue about DPAA2 MC bus not working well with SMMU
> based IO mapping.  Adding Laurentiu to the chain who has been looking
> into this issue.

Yes, I'm closely following the issue. I actually have a workaround
(attached) but haven't submitted it, as it will probably raise a lot of
eyebrows. In the meantime I'm following some discussions [1][2][3] on
the iommu list which seem to be tackling what appears to be a similar
issue but with framebuffers. My hope is that we will be able to leverage
whatever comes out of them.
In the meantime, can you try the workaround Leo suggested?

[1] https://patchwork.kernel.org/patch/11327667/
[2] https://patchwork.kernel.org/patch/10967729/
[3] https://patchwork.kernel.org/cover/11279577/

---
Best Regards, Laurentiu
Robin Murphy Feb. 11, 2020, 1:04 p.m. UTC | #18
On 2020-02-11 12:13 pm, Laurentiu Tudor wrote:
[...]
>> This is a known issue about DPAA2 MC bus not working well with SMMU
>> based IO mapping.  Adding Laurentiu to the chain who has been looking
>> into this issue.
> 
> Yes, I'm closely following the issue. I actually have a workaround 
> (attached) but haven't submitted as it will probably raise a lot of 
> eyebrows. In the mean time I'm following some discussions [1][2][3] on 
> the iommu list which seem to try to tackle what appears to be a similar 
> issue but with framebuffers. My hope is that we will be able to leverage 
> whatever turns out.

Indeed it's more general than framebuffers - in fact there was a 
specific requirement from the IORT side to accommodate network/storage 
controllers with in-memory firmware/configuration data/whatever set up 
by the bootloader that want to be handed off 'live' to Linux because the 
overhead of stopping and restarting them is impractical. Thus this DPAA2 
setup is very much within scope of the desired solution, so please feel 
free to join in (particularly on the DT parts) :)

As for right now, note that your patch would only be a partial 
mitigation to slightly reduce the fault window but not remove it 
entirely. To be robust the SMMU driver *has* to know about live streams 
before the first arm_smmu_reset() - hence the need for generic firmware 
bindings - so doing anything from the MC driver is already too late (and 
indeed the current iommu_request_dm_for_dev() mechanism is itself a 
microcosm of the same problem).

> In the mean time, can you try the workaround Leo suggested?

Agreed, I'd imagine the command-line option is probably the best choice 
for these platforms, since it's likely to be easier to set that by 
default in the bootloader than faff with rebuilding generic kernel configs.

Robin.

> [1] https://patchwork.kernel.org/patch/11327667/
> [2] https://patchwork.kernel.org/patch/10967729/
> [3] https://patchwork.kernel.org/cover/11279577/
> 
> ---
> Best Regards, Laurentiu
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>
Laurentiu Tudor Feb. 11, 2020, 1:55 p.m. UTC | #19
On 11.02.2020 15:04, Robin Murphy wrote:
> On 2020-02-11 12:13 pm, Laurentiu Tudor wrote:
> [...]
>>> This is a known issue about DPAA2 MC bus not working well with SMMU
>>> based IO mapping.  Adding Laurentiu to the chain who has been looking
>>> into this issue.
>>
>> Yes, I'm closely following the issue. I actually have a workaround 
>> (attached) but haven't submitted as it will probably raise a lot of 
>> eyebrows. In the mean time I'm following some discussions [1][2][3] on 
>> the iommu list which seem to try to tackle what appears to be a 
>> similar issue but with framebuffers. My hope is that we will be able 
>> to leverage whatever turns out.
> 
> Indeed it's more general than framebuffers - in fact there was a 
> specific requirement from the IORT side to accommodate network/storage 
> controllers with in-memory firmware/configuration data/whatever set up 
> by the bootloader that want to be handed off 'live' to Linux because the 
> overhead of stopping and restarting them is impractical. Thus this DPAA2 
> setup is very much within scope of the desired solution, so please feel 
> free to join in (particularly on the DT parts) :)

Will do. It seems that the 2nd approach (the one with the list of
compatibles in arm-smmu) fits really well with our scenario. Will this
be the way forward?

> As for right now, note that your patch would only be a partial 
> mitigation to slightly reduce the fault window but not remove it 
> entirely. To be robust the SMMU driver *has* to know about live streams 
> before the first arm_smmu_reset() - hence the need for generic firmware 
> bindings - so doing anything from the MC driver is already too late (and 
> indeed the current iommu_request_dm_for_dev() mechanism is itself a 
> microcosm of the same problem).

I think you might have missed in the patch that it pauses the firmware 
at early boot, in its driver init and it resumes it only after 
iommu_request_dm_for_dev() has completed. :)

---
Best Regards, Laurentiu
Olof Johansson Feb. 11, 2020, 2:48 p.m. UTC | #20
On Tue, Feb 11, 2020 at 5:04 AM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 2020-02-11 12:13 pm, Laurentiu Tudor wrote:
> [...]
> >> This is a known issue about DPAA2 MC bus not working well with SMMU
> >> based IO mapping.  Adding Laurentiu to the chain who has been looking
> >> into this issue.
> >
> > Yes, I'm closely following the issue. I actually have a workaround
> > (attached) but haven't submitted as it will probably raise a lot of
> > eyebrows. In the mean time I'm following some discussions [1][2][3] on
> > the iommu list which seem to try to tackle what appears to be a similar
> > issue but with framebuffers. My hope is that we will be able to leverage
> > whatever turns out.
>
> Indeed it's more general than framebuffers - in fact there was a
> specific requirement from the IORT side to accommodate network/storage
> controllers with in-memory firmware/configuration data/whatever set up
> by the bootloader that want to be handed off 'live' to Linux because the
> overhead of stopping and restarting them is impractical. Thus this DPAA2
> setup is very much within scope of the desired solution, so please feel
> free to join in (particularly on the DT parts) :)

That's a real problem that needs a solution, but that's not what's
happening here, since cold boots work fine.

Isn't it a whole lot more likely that something isn't
reset/reinitialized properly in u-boot, such that there is lingering
state in the setup, causing this?

> As for right now, note that your patch would only be a partial
> mitigation to slightly reduce the fault window but not remove it
> entirely. To be robust the SMMU driver *has* to know about live streams
> before the first arm_smmu_reset() - hence the need for generic firmware
> bindings - so doing anything from the MC driver is already too late (and
> indeed the current iommu_request_dm_for_dev() mechanism is itself a
> microcosm of the same problem).

This is more likely a live stream that's left behind from the previous
kernel (there are some error messages about being unable to detach
domains, but the errors make it hard to tell what driver didn't unbind
enough).

*BUT*, even with that bug, the system should reboot reliably and come
up clean. So, something isn't clearing up the state *on boot*.

> > In the mean time, can you try the workaround Leo suggested?
>
> Agreed, I'd imagine the command-line option is probably the best choice
> for these platforms, since it's likely to be easier to set that by
> default in the bootloader than faff with rebuilding generic kernel configs.

For the generic user, definitely. I'll give it a go later this week
when I have a bit more spare time with the device physically present.


-Olof
Robin Murphy Feb. 11, 2020, 2:51 p.m. UTC | #21
On 11/02/2020 1:55 pm, Laurentiu Tudor wrote:
> 
> 
> On 11.02.2020 15:04, Robin Murphy wrote:
>> On 2020-02-11 12:13 pm, Laurentiu Tudor wrote:
>> [...]
>>>> This is a known issue about DPAA2 MC bus not working well with SMMU
>>>> based IO mapping.  Adding Laurentiu to the chain who has been looking
>>>> into this issue.
>>>
>>> Yes, I'm closely following the issue. I actually have a workaround 
>>> (attached) but haven't submitted as it will probably raise a lot of 
>>> eyebrows. In the mean time I'm following some discussions [1][2][3] 
>>> on the iommu list which seem to try to tackle what appears to be a 
>>> similar issue but with framebuffers. My hope is that we will be able 
>>> to leverage whatever turns out.
>>
>> Indeed it's more general than framebuffers - in fact there was a 
>> specific requirement from the IORT side to accommodate network/storage 
>> controllers with in-memory firmware/configuration data/whatever set up 
>> by the bootloader that want to be handed off 'live' to Linux because 
>> the overhead of stopping and restarting them is impractical. Thus this 
>> DPAA2 setup is very much within scope of the desired solution, so 
>> please feel free to join in (particularly on the DT parts) :)
> 
> Will sure do. Seems that the 2nd approach (the one with list of 
> compatibles in arm-smmu) fits really well with our scenario. Will this 
> be the way to go forward?

I'm hoping that Thierry's proposal can be made to work out, since it's 
closer to how the ACPI version should work, which means we would be able 
to do a lot more in shared common code rather than baking magic 
knowledge and duplicated functionality into individual IOMMU drivers.

>> As for right now, note that your patch would only be a partial 
>> mitigation to slightly reduce the fault window but not remove it 
>> entirely. To be robust the SMMU driver *has* to know about live 
>> streams before the first arm_smmu_reset() - hence the need for generic 
>> firmware bindings - so doing anything from the MC driver is already 
>> too late (and indeed the current iommu_request_dm_for_dev() mechanism 
>> is itself a microcosm of the same problem).
> 
> I think you might have missed in the patch that it pauses the firmware 
> at early boot, in its driver init and it resumes it only after 
> iommu_request_dm_for_dev() has completed. :)

Ah, from the context I missed that that was non-modular and relying on 
initcall trickery... fair enough, in that case I'll downgrade my "it's 
insufficient" to "it's ugly and somewhat fragile" :P

Thanks,
Robin.
Laurentiu Tudor Feb. 11, 2020, 3:14 p.m. UTC | #22
On 11.02.2020 16:48, Olof Johansson wrote:
> On Tue, Feb 11, 2020 at 5:04 AM Robin Murphy <robin.murphy@arm.com> wrote:
>>
>> On 2020-02-11 12:13 pm, Laurentiu Tudor wrote:
>> [...]
>>>> This is a known issue about DPAA2 MC bus not working well with SMMU
>>>> based IO mapping.  Adding Laurentiu to the chain who has been looking
>>>> into this issue.
>>>
>>> Yes, I'm closely following the issue. I actually have a workaround
>>> (attached) but haven't submitted as it will probably raise a lot of
>>> eyebrows. In the mean time I'm following some discussions [1][2][3] on
>>> the iommu list which seem to try to tackle what appears to be a similar
>>> issue but with framebuffers. My hope is that we will be able to leverage
>>> whatever turns out.
>>
>> Indeed it's more general than framebuffers - in fact there was a
>> specific requirement from the IORT side to accommodate network/storage
>> controllers with in-memory firmware/configuration data/whatever set up
>> by the bootloader that want to be handed off 'live' to Linux because the
>> overhead of stopping and restarting them is impractical. Thus this DPAA2
>> setup is very much within scope of the desired solution, so please feel
>> free to join in (particularly on the DT parts) :)
> 
> That's a real problem that nees a solution, but that's not what's
> happening here, since cold boots works fine.
> 
> Isn't it a whole lot more likely that something isn't
> reset/reinitialized properly in u-boot, such that there is lingering
> state in the setup, causing this?

Ok, so this is completely something else. I don't think our u-boots are 
designed to run in ways other than coming from hard reset.

>> As for right now, note that your patch would only be a partial
>> mitigation to slightly reduce the fault window but not remove it
>> entirely. To be robust the SMMU driver *has* to know about live streams
>> before the first arm_smmu_reset() - hence the need for generic firmware
>> bindings - so doing anything from the MC driver is already too late (and
>> indeed the current iommu_request_dm_for_dev() mechanism is itself a
>> microcosm of the same problem).
> 
> This is more likely a live stream that's left behind from the previous
> kernel (there are some error messages about being unable to detach
> domains, but the errors make it hard to tell what driver didn't unbind
> enough).

I also noticed those messages. Perhaps our PCI driver doesn't do all the 
required cleanup.

> *BUT*, even with that bug, the system should reboot reliably and come
> up clean. So, something isn't clearing up the state *on boot*.

We do test some kexec-based "soft-reset" scenarios; we didn't hit your
issue, but instead we hit this:

https://lkml.org/lkml/2018/9/21/1066

Can you please provide some more info on your scenario?

---
Best Regards, Laurentiu
Russell King (Oracle) Feb. 29, 2020, 9:55 a.m. UTC | #23
On Mon, Feb 10, 2020 at 03:22:57PM +0000, Russell King - ARM Linux admin wrote:
> On Mon, Feb 10, 2020 at 04:12:30PM +0100, Olof Johansson wrote:
> > On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@nxp.com> wrote:
> > >
> > > Hi Olof,
> > >
> > > Thanks a lot for your comments!
> > > And sorry for my delay respond!
> > 
> > Actually, they apply with only minor conflicts on top of current -next.
> > 
> > Bjorn, any chance we can get you to pick these up pretty soon? They
> > enable full use of a promising ARM developer system, the SolidRun
> > HoneyComb, and would be quite valuable for me and others to be able to
> > use with mainline or -next without any additional patches applied --
> > which this patchset achieves.
> > 
> > I know there are pending revisions based on feedback. I'll leave it up
> > to you and others to determine if that can be done with incremental
> > patches on top, or if it should be fixed before the initial patchset
> > is applied. But all in all, it's holding up adaption by me and surely
> > others of a very interesting platform -- I'm looking to replace my
> > aging MacchiatoBin with one of these and would need PCIe/NVMe to work
> > before I do.
> 
> If you're going to be using NVMe, make sure you use a power-fail safe
> version; I've already had one instance where ext4 failed to mount
> because of a corrupted journal using an XPG SX8200 after the Honeycomb
> Serror'd, and then I powered it down after a few hours before later
> booting it back up.
> 
> EXT4-fs (nvme0n1p2): INFO: recovery required on readonly filesystem
> EXT4-fs (nvme0n1p2): write access will be enabled during recovery
> JBD2: journal transaction 80849 on nvme0n1p2-8 is corrupt.
> EXT4-fs (nvme0n1p2): error loading journal

... and last night, I just got more ext4fs errors on the NVMe, without
any unclean power cycles:

[73729.556544] EXT4-fs error (device nvme0n1p2): ext4_lookup:1700: inode #917524: comm rm: iget: checksum invalid
[73729.565354] Aborting journal on device nvme0n1p2-8.
[73729.568995] EXT4-fs (nvme0n1p2): Remounting filesystem read-only
[73729.569077] EXT4-fs error (device nvme0n1p2): ext4_journal_check_start:61: Detected aborted journal
[73729.573741] EXT4-fs error (device nvme0n1p2): ext4_lookup:1700: inode #917524: comm rm: iget: checksum invalid
[73729.593330] EXT4-fs error (device nvme0n1p2): ext4_lookup:1700: inode #917524: comm mv: iget: checksum invalid

The affected file is /var/backups/dpkg.status.6.gz

It was cleanly shut down and powered off on the 22nd February, booted
yesterday morning followed by another reboot a few minutes later.

What worries me is the fact that corruption has happened - and if that
happens to a file rather than an inode, it will likely go unnoticed
for a considerably longer time.

I think I'm getting to the point of deciding NVMe or the LX2160A to be
just too unreliable for serious use.  I hadn't noticed any issues when
using the rootfs on the eMMC, so it suggests either the NVMe is
unreliable, or there's a problem with PCIe on this platform (which we
kind of know about with Jon's GPU rendering issues.)
Russell King (Oracle) Feb. 29, 2020, 11:04 a.m. UTC | #24
On Sat, Feb 29, 2020 at 09:55:50AM +0000, Russell King - ARM Linux admin wrote:
> On Mon, Feb 10, 2020 at 03:22:57PM +0000, Russell King - ARM Linux admin wrote:
> > On Mon, Feb 10, 2020 at 04:12:30PM +0100, Olof Johansson wrote:
> > > On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@nxp.com> wrote:
> > > >
> > > > Hi Olof,
> > > >
> > > > Thanks a lot for your comments!
> > > > And sorry for my delay respond!
> > > 
> > > Actually, they apply with only minor conflicts on top of current -next.
> > > 
> > > Bjorn, any chance we can get you to pick these up pretty soon? They
> > > enable full use of a promising ARM developer system, the SolidRun
> > > HoneyComb, and would be quite valuable for me and others to be able to
> > > use with mainline or -next without any additional patches applied --
> > > which this patchset achieves.
> > > 
> > > I know there are pending revisions based on feedback. I'll leave it up
> > > to you and others to determine if that can be done with incremental
> > > patches on top, or if it should be fixed before the initial patchset
> > > is applied. But all in all, it's holding up adaption by me and surely
> > > others of a very interesting platform -- I'm looking to replace my
> > > aging MacchiatoBin with one of these and would need PCIe/NVMe to work
> > > before I do.
> > 
> > If you're going to be using NVMe, make sure you use a power-fail safe
> > version; I've already had one instance where ext4 failed to mount
> > because of a corrupted journal using an XPG SX8200 after the Honeycomb
> > Serror'd, and then I powered it down after a few hours before later
> > booting it back up.
> > 
> > EXT4-fs (nvme0n1p2): INFO: recovery required on readonly filesystem
> > EXT4-fs (nvme0n1p2): write access will be enabled during recovery
> > JBD2: journal transaction 80849 on nvme0n1p2-8 is corrupt.
> > EXT4-fs (nvme0n1p2): error loading journal
> 
> ... and last night, I just got more ext4fs errors on the NVMe, without
> any unclean power cycles:
> 
> [73729.556544] EXT4-fs error (device nvme0n1p2): ext4_lookup:1700: inode #917524: comm rm: iget: checksum invalid
> [73729.565354] Aborting journal on device nvme0n1p2-8.
> [73729.568995] EXT4-fs (nvme0n1p2): Remounting filesystem read-only
> [73729.569077] EXT4-fs error (device nvme0n1p2): ext4_journal_check_start:61: Detected aborted journal
> [73729.573741] EXT4-fs error (device nvme0n1p2): ext4_lookup:1700: inode #917524: comm rm: iget: checksum invalid
> [73729.593330] EXT4-fs error (device nvme0n1p2): ext4_lookup:1700: inode #917524: comm mv: iget: checksum invalid
> 
> The affected file is /var/backups/dpkg.status.6.gz
> 
> It was cleanly shut down and powered off on the 22nd February, booted
> yesterday morning followed by another reboot a few minutes later.
> 
> What worries me is the fact that corruption has happened - and if that
> happens to a file rather than an inode, it will likely go unnoticed
> for a considerably longer time.
> 
> I think I'm getting to the point of deciding NVMe or the LX2160A to be
> just too unreliable for serious use.  I hadn't noticed any issues when
> using the rootfs on the eMMC, so it suggests either the NVMe is
> unreliable, or there's a problem with PCIe on this platform (which we
> kind of know about with Jon's GPU rendering issues.)

Adding Ted and Andreas...

Here's the debugfs -n "id" output for dpkg.status.5.gz (which is fine,
and probably a similar size):

debugfs:  id <917527>
0000  a481 0000 30ff 0300 bd8e 475e bd77 4f5e  ....0.....G^.wO^
0020  29ca 345e 0000 0000 0000 0100 0002 0000  ).4^............
0040  0000 0800 0100 0000 0af3 0100 0400 0000  ................
0060  0000 0000 0000 0000 4000 0000 8087 3800  ........@.....8.
0100  0000 0000 0000 0000 0000 0000 0000 0000  ................
*
0140  0000 0000 c40b 4c0a 0000 0000 0000 0000  ......L.........
0160  0000 0000 0000 0000 0000 0000 3884 0000  ............8...
0200  2000 95f2 44b8 bdc9 a4d2 9883 c861 dc92   ...D........a..
0220  bd31 4a5e ecc5 260c 0000 0000 0000 0000  .1J^..&.........
0240  0000 0000 0000 0000 0000 0000 0000 0000  ................
*

and for the affected inode:
debugfs:  id <917524>
0000  a481 0000 30ff 0300 3d3d 465e bd77 4f5e  ....0...==F^.wO^
0020  29ca 345e 0000 0000 0000 0100 0002 0000  ).4^............
0040  0000 0800 0100 0000 0af3 0100 0400 0000  ................
0060  0000 0000 0000 0000 4000 0000 c088 3800  ........@.....8.
0100  0000 0000 0000 0000 0000 0000 0000 0000  ................
*
0140  0000 0000 5fc4 cfb4 0000 0000 0000 0000  ...._...........
0160  0000 0000 0000 0000 0000 0000 af23 0000  .............#..
0200  2000 1cc3 ac95 c9c8 a4d2 9883 583e addf   ...........X>..
0220  3de0 485e b04d 7151 0000 0000 0000 0000  =.H^.MqQ........
0240  0000 0000 0000 0000 0000 0000 0000 0000  ................
*

and "stat" output:
debugfs:  stat <917527>
Inode: 917527   Type: regular    Mode:  0644   Flags: 0x80000
Generation: 172755908    Version: 0x00000000:00000001
User:     0   Group:     0   Project:     0   Size: 261936
File ACL: 0
Links: 1   Blockcount: 512
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x5e4f77bd:c9bdb844 -- Fri Feb 21 06:25:01 2020
 atime: 0x5e478ebd:92dc61c8 -- Sat Feb 15 06:25:01 2020
 mtime: 0x5e34ca29:8398d2a4 -- Sat Feb  1 00:45:29 2020
crtime: 0x5e4a31bd:0c26c5ec -- Mon Feb 17 06:25:01 2020
Size of extra inode fields: 32
Inode checksum: 0xf2958438
EXTENTS:
(0-63):3704704-3704767
debugfs:  stat <917524>
Inode: 917524   Type: regular    Mode:  0644   Flags: 0x80000
Generation: 3033515103    Version: 0x00000000:00000001
User:     0   Group:     0   Project:     0   Size: 261936
File ACL: 0
Links: 1   Blockcount: 512
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x5e4f77bd:c8c995ac -- Fri Feb 21 06:25:01 2020
 atime: 0x5e463d3d:dfad3e58 -- Fri Feb 14 06:25:01 2020
 mtime: 0x5e34ca29:8398d2a4 -- Sat Feb  1 00:45:29 2020
crtime: 0x5e48e03d:51714db0 -- Sun Feb 16 06:25:01 2020
Size of extra inode fields: 32
Inode checksum: 0xc31c23af
EXTENTS:
(0-63):3705024-3705087

When using sif (set_inode_field) to re-set the UID to 0 on this (so as
to provoke the checksum to be updated):

debugfs:  id <917524>
0000  a481 0000 30ff 0300 3d3d 465e bd77 4f5e  ....0...==F^.wO^
0020  29ca 345e 0000 0000 0000 0100 0002 0000  ).4^............
0040  0000 0800 0100 0000 0af3 0100 0400 0000  ................
0060  0000 0000 0000 0000 4000 0000 c088 3800  ........@.....8.
0100  0000 0000 0000 0000 0000 0000 0000 0000  ................
*
0140  0000 0000 5fc4 cfb4 0000 0000 0000 0000  ...._...........
0160  0000 0000 0000 0000 0000 0000 b61f 0000  ................
                                    ^^^^
0200  2000 aa15 ac95 c9c8 a4d2 9883 583e addf   ...........X>..
           ^^^^
0220  3de0 485e b04d 7151 0000 0000 0000 0000  =.H^.MqQ........
0240  0000 0000 0000 0000 0000 0000 0000 0000  ................
*

The values with "^^^^" are the checksum, which are the only values
that have changed here - the checksum is now 0x15aa1fb6 rather than
0xc31c23af.
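
For anyone following along, the way the two "^^^^" words combine into a
single 32-bit value can be sketched as below. This assumes the usual ext4
on-disk layout, with the low 16 bits of the checksum at byte offset 0x7C
and the high 16 bits at 0x82, both little-endian, and that the leftmost
column of the dump is an octal offset; the helper names are made up for
illustration.

```python
import struct


def row_bytes(line: str):
    """Split one dump row into (byte offset, raw bytes).

    The first column is an octal offset; the next eight columns are
    two-byte groups printed in on-disk byte order.
    """
    fields = line.split()
    offset = int(fields[0], 8)
    data = bytes.fromhex("".join(fields[1:9]))
    return offset, data


def inode_checksum(rows):
    """Rebuild the 32-bit inode checksum from the rows that hold it."""
    buf = bytearray(256)
    for line in rows:
        offset, data = row_bytes(line)
        buf[offset:offset + len(data)] = data
    (lo,) = struct.unpack_from("<H", buf, 0x7C)  # low 16 bits
    (hi,) = struct.unpack_from("<H", buf, 0x82)  # high 16 bits
    return (hi << 16) | lo


# Rows 0160 and 0200 of inode 917524, before and after the sif update
before = [
    "0160  0000 0000 0000 0000 0000 0000 af23 0000",
    "0200  2000 1cc3 ac95 c9c8 a4d2 9883 583e addf",
]
after = [
    "0160  0000 0000 0000 0000 0000 0000 b61f 0000",
    "0200  2000 aa15 ac95 c9c8 a4d2 9883 583e addf",
]

print(hex(inode_checksum(before)))  # 0xc31c23af
print(hex(inode_checksum(after)))   # 0x15aa1fb6
```

The same decoding applied to the healthy inode 917527 gives 0xf2958438,
which agrees with the "Inode checksum" line in its stat output.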

With that changed, running e2fsck -n on the filesystem results in a
pass:

root@cex7:~# e2fsck -n /dev/nvme0n1p2
e2fsck 1.44.5 (15-Dec-2018)
Warning: skipping journal recovery because doing a read-only filesystem check.
/dev/nvme0n1p2 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/nvme0n1p2: 121163/2097152 files (0.1% non-contiguous), 1349227/8388608 blocks

and the file now appears to be intact (being a gzip file, gzip verifies
that the contents are now as it expects.)

So, it looks like the _only_ issue is that the checksum on the inode
became invalid, which seems to suggest that it *isn't* an NVMe or PCIe
issue.

I wonder whether the journal would contain anything useful, but I don't
know how to use debugfs to find that out - while I can dump the journal,
I'd need to know which block contains the inode, and then work out where
in the journal that block was going to be written.  If that would help,
let me know ASAP as I'll hold off rebooting the platform for a while
(which means the filesystem will remain as-is - and yes, I have the
debugfs file for e2undo to put stuff back.)  Maybe it's possible to pull
the block number out of the e2undo file?

tune2fs says:

Checksum type:            crc32c
Checksum:                 0x682f91b9

I guess this is what is used to checksum the inodes?  If so, it's using
the kernel's crc32c-generic driver (according to /proc/crypto).
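
As a reference point for checking a CRC32C implementation in isolation, a
bit-at-a-time version of the standard CRC-32C (Castagnoli) algorithm can
be sketched as follows. This is only the generic algorithm with its usual
initial and final inversion; it does not reproduce an ext4 inode checksum,
since how the filesystem seeds the CRC and which fields it covers is not
shown in this thread.

```python
def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-32C using the reflected Castagnoli polynomial."""
    crc ^= 0xFFFFFFFF  # initial inversion
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right one bit; apply the polynomial if a 1 fell out
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF  # final inversion


# Standard check value for CRC-32C over the ASCII string "123456789"
print(hex(crc32c(b"123456789")))  # 0xe3069283
```

Comparing a driver's output against this check value on known input is one
way to rule the CRC routine itself in or out.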

Could it be a race condition, or some problem that's specific to the
ARM64 kernel that's provoking this corruption?
Russell King (Oracle) Feb. 29, 2020, 12:08 p.m. UTC | #25
On Sat, Feb 29, 2020 at 11:04:56AM +0000, Russell King - ARM Linux admin wrote:
> On Sat, Feb 29, 2020 at 09:55:50AM +0000, Russell King - ARM Linux admin wrote:
> > On Mon, Feb 10, 2020 at 03:22:57PM +0000, Russell King - ARM Linux admin wrote:
> > > On Mon, Feb 10, 2020 at 04:12:30PM +0100, Olof Johansson wrote:
> > > > On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@nxp.com> wrote:
> > > > >
> > > > > Hi Olof,
> > > > >
> > > > > Thanks a lot for your comments!
> > > > > And sorry for my delay respond!
> > > > 
> > > > Actually, they apply with only minor conflicts on top of current -next.
> > > > 
> > > > Bjorn, any chance we can get you to pick these up pretty soon? They
> > > > enable full use of a promising ARM developer system, the SolidRun
> > > > HoneyComb, and would be quite valuable for me and others to be able to
> > > > use with mainline or -next without any additional patches applied --
> > > > which this patchset achieves.
> > > > 
> > > > I know there are pending revisions based on feedback. I'll leave it up
> > > > to you and others to determine if that can be done with incremental
> > > > patches on top, or if it should be fixed before the initial patchset
> > > > is applied. But all in all, it's holding up adaption by me and surely
> > > > others of a very interesting platform -- I'm looking to replace my
> > > > aging MacchiatoBin with one of these and would need PCIe/NVMe to work
> > > > before I do.
> > > 
> > > If you're going to be using NVMe, make sure you use a power-fail safe
> > > version; I've already had one instance where ext4 failed to mount
> > > because of a corrupted journal using an XPG SX8200 after the Honeycomb
> > > Serror'd, and then I powered it down after a few hours before later
> > > booting it back up.
> > > 
> > > EXT4-fs (nvme0n1p2): INFO: recovery required on readonly filesystem
> > > EXT4-fs (nvme0n1p2): write access will be enabled during recovery
> > > JBD2: journal transaction 80849 on nvme0n1p2-8 is corrupt.
> > > EXT4-fs (nvme0n1p2): error loading journal
> > 
> > ... and last night, I just got more ext4fs errors on the NVMe, without
> > any unclean power cycles:
> > 
> > [73729.556544] EXT4-fs error (device nvme0n1p2): ext4_lookup:1700: inode #917524: comm rm: iget: checksum invalid
> > [73729.565354] Aborting journal on device nvme0n1p2-8.
> > [73729.568995] EXT4-fs (nvme0n1p2): Remounting filesystem read-only
> > [73729.569077] EXT4-fs error (device nvme0n1p2): ext4_journal_check_start:61: Detected aborted journal
> > [73729.573741] EXT4-fs error (device nvme0n1p2): ext4_lookup:1700: inode #917524: comm rm: iget: checksum invalid
> > [73729.593330] EXT4-fs error (device nvme0n1p2): ext4_lookup:1700: inode #917524: comm mv: iget: checksum invalid
> > 
> > The affected file is /var/backups/dpkg.status.6.gz
> > 
> > It was cleanly shut down and powered off on the 22nd February, booted
> > yesterday morning followed by another reboot a few minutes later.
> > 
> > What worries me is the fact that corruption has happened - and if that
> > happens to a file rather than an inode, it will likely go unnoticed
> > for a considerably longer time.
> > 
> > I think I'm getting to the point of deciding NVMe or the LX2160A to be
> > just too unreliable for serious use.  I hadn't noticed any issues when
> > using the rootfs on the eMMC, so it suggests either the NVMe is
> > unreliable, or there's a problem with PCIe on this platform (which we
> > kind of know about with Jon's GPU rendering issues.)
> 
> Adding Ted and Andreas...
> 
> Here's the debugfs -n "id" output for dpkg.status.5.gz (which is fine,
> and probably a similar size):
> 
> debugfs:  id <917527>
> 0000  a481 0000 30ff 0300 bd8e 475e bd77 4f5e  ....0.....G^.wO^
> 0020  29ca 345e 0000 0000 0000 0100 0002 0000  ).4^............
> 0040  0000 0800 0100 0000 0af3 0100 0400 0000  ................
> 0060  0000 0000 0000 0000 4000 0000 8087 3800  ........@.....8.
> 0100  0000 0000 0000 0000 0000 0000 0000 0000  ................
> *
> 0140  0000 0000 c40b 4c0a 0000 0000 0000 0000  ......L.........
> 0160  0000 0000 0000 0000 0000 0000 3884 0000  ............8...
> 0200  2000 95f2 44b8 bdc9 a4d2 9883 c861 dc92   ...D........a..
> 0220  bd31 4a5e ecc5 260c 0000 0000 0000 0000  .1J^..&.........
> 0240  0000 0000 0000 0000 0000 0000 0000 0000  ................
> *
> 
> and for the affected inode:
> debugfs:  id <917524>
> 0000  a481 0000 30ff 0300 3d3d 465e bd77 4f5e  ....0...==F^.wO^
> 0020  29ca 345e 0000 0000 0000 0100 0002 0000  ).4^............
> 0040  0000 0800 0100 0000 0af3 0100 0400 0000  ................
> 0060  0000 0000 0000 0000 4000 0000 c088 3800  ........@.....8.
> 0100  0000 0000 0000 0000 0000 0000 0000 0000  ................
> *
> 0140  0000 0000 5fc4 cfb4 0000 0000 0000 0000  ...._...........
> 0160  0000 0000 0000 0000 0000 0000 af23 0000  .............#..
> 0200  2000 1cc3 ac95 c9c8 a4d2 9883 583e addf   ...........X>..
> 0220  3de0 485e b04d 7151 0000 0000 0000 0000  =.H^.MqQ........
> 0240  0000 0000 0000 0000 0000 0000 0000 0000  ................
> *
> 
> and "stat" output:
> debugfs:  stat <917527>
> Inode: 917527   Type: regular    Mode:  0644   Flags: 0x80000
> Generation: 172755908    Version: 0x00000000:00000001
> User:     0   Group:     0   Project:     0   Size: 261936
> File ACL: 0
> Links: 1   Blockcount: 512
> Fragment:  Address: 0    Number: 0    Size: 0
>  ctime: 0x5e4f77bd:c9bdb844 -- Fri Feb 21 06:25:01 2020
>  atime: 0x5e478ebd:92dc61c8 -- Sat Feb 15 06:25:01 2020
>  mtime: 0x5e34ca29:8398d2a4 -- Sat Feb  1 00:45:29 2020
> crtime: 0x5e4a31bd:0c26c5ec -- Mon Feb 17 06:25:01 2020
> Size of extra inode fields: 32
> Inode checksum: 0xf2958438
> EXTENTS:
> (0-63):3704704-3704767
> debugfs:  stat <917524>
> Inode: 917524   Type: regular    Mode:  0644   Flags: 0x80000
> Generation: 3033515103    Version: 0x00000000:00000001
> User:     0   Group:     0   Project:     0   Size: 261936
> File ACL: 0
> Links: 1   Blockcount: 512
> Fragment:  Address: 0    Number: 0    Size: 0
>  ctime: 0x5e4f77bd:c8c995ac -- Fri Feb 21 06:25:01 2020
>  atime: 0x5e463d3d:dfad3e58 -- Fri Feb 14 06:25:01 2020
>  mtime: 0x5e34ca29:8398d2a4 -- Sat Feb  1 00:45:29 2020
> crtime: 0x5e48e03d:51714db0 -- Sun Feb 16 06:25:01 2020
> Size of extra inode fields: 32
> Inode checksum: 0xc31c23af
> EXTENTS:
> (0-63):3705024-3705087
> 
> When using sif (set_inode_info) to re-set the UID to 0 on this (so
> provoke the checksum to be updated):
> 
> debugfs:  id <917524>
> 0000  a481 0000 30ff 0300 3d3d 465e bd77 4f5e  ....0...==F^.wO^
> 0020  29ca 345e 0000 0000 0000 0100 0002 0000  ).4^............
> 0040  0000 0800 0100 0000 0af3 0100 0400 0000  ................
> 0060  0000 0000 0000 0000 4000 0000 c088 3800  ........@.....8.
> 0100  0000 0000 0000 0000 0000 0000 0000 0000  ................
> *
> 0140  0000 0000 5fc4 cfb4 0000 0000 0000 0000  ...._...........
> 0160  0000 0000 0000 0000 0000 0000 b61f 0000  ................
>                                     ^^^^
> 0200  2000 aa15 ac95 c9c8 a4d2 9883 583e addf   ...........X>..
>            ^^^^
> 0220  3de0 485e b04d 7151 0000 0000 0000 0000  =.H^.MqQ........
> 0240  0000 0000 0000 0000 0000 0000 0000 0000  ................
> *
> 
> The values with "^^^^" are the checksum, which are the only values
> that have changed here - the checksum is now 0x15aa1fb6 rather than
> 0xc31c23af.
> 
> With that changed, running e2fsck -n on the filesystem results in a
> pass:
> 
> root@cex7:~# e2fsck -n /dev/nvme0n1p2
> e2fsck 1.44.5 (15-Dec-2018)
> Warning: skipping journal recovery because doing a read-only filesystem check.
> /dev/nvme0n1p2 contains a file system with errors, check forced.
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> /dev/nvme0n1p2: 121163/2097152 files (0.1% non-contiguous), 1349227/8388608 blocks
> 
> and the file now appears to be intact (being a gzip file, gzip verifies
> that the contents are now as it expects.)
> 
> So, it looks like the _only_ issue is that the checksum on the inode
> became invalid, which seems to suggest that it *isn't* a NVMe nor PCIe
> issue.
> 
> I wonder whether the journal would contain anything useful, but I don't
> know how to use debugfs to find that out - while I can dump the journal,
> I'd need to know which block contains the inode, and then work out where
> in the journal that block was going to be written.  If that would help,
> let me know ASAP as I'll hold off rebooting the platform for a while
> (which means the filesystem will remain as-is - and yes, I have the
> debugfs file for e2undo to put stuff back.)  Maybe it's possible to pull
> the block number out of the e2undo file?

Okay, the inode was stored in block 3670049, and the journal appears
to contain no entries for that block.

> tune2fs says:
> 
> Checksum type:            crc32c
> Checksum:                 0x682f91b9
> 
> I guess this is what is used to checksum the inodes?  If so, it's using
> the kernel's crc32c-generic driver (according to /proc/crypto).
> 
> Could it be a race condition, or some problem that's specific to the
> ARM64 kernel that's provoking this corruption?

Something else occurs to me:

root@cex7:~# ls -li --time=ctime --full-time /var/backups/dpkg.status*
917622 -rw-r--r-- 1 root root 999052 2020-02-29 06:25:01.852231277 +0000 /var/backups/dpkg.status
917583 -rw-r--r-- 1 root root 999052 2020-02-21 06:25:01.958160960 +0000 /var/backups/dpkg.status.0
917520 -rw-r--r-- 1 root root 261936 2020-02-21 06:25:01.954161050 +0000 /var/backups/dpkg.status.1.gz
917531 -rw-r--r-- 1 root root 261936 2020-02-21 06:25:01.854163293 +0000 /var/backups/dpkg.status.2.gz
917532 -rw-r--r-- 1 root root 261936 2020-02-21 06:25:01.850163383 +0000 /var/backups/dpkg.status.3.gz
917509 -rw-r--r-- 1 root root 261936 2020-02-21 06:25:01.850163383 +0000 /var/backups/dpkg.status.4.gz
917527 -rw-r--r-- 1 root root 261936 2020-02-21 06:25:01.846163473 +0000 /var/backups/dpkg.status.5.gz
917524 -rw-r--r-- 1 root root 261936 2020-02-21 06:25:01.842163563 +0000 /var/backups/dpkg.status.6.gz

So the last time that the kernel changed inode 917524 was on the 21st
of February, probably when it was last renamed by logrotate, as were
several other files whose inodes are stored in the same inode block.
Yet _only_ the checksum for 917524 was corrupted; the rest were fine.

I would guess that logrotate behaves as follows:
- remove /var/backups/dpkg.status.6.gz
- rename /var/backups/dpkg.status.5.gz to /var/backups/dpkg.status.6.gz
- repeat for other dpkg.status.*.gz files
- gzip /var/backups/dpkg.status.0 to /var/backups/dpkg.status.1.gz
- rename /var/backups/dpkg.status to /var/backups/dpkg.status.0
- create new /var/backups/dpkg.status

Looking at the inode block in the e2undo file, inode 917524 is at
offset 0x300 into the block, which means the first inode in the
block is 917521 and the last is 917536, which means we have several
of the dpkg.status.* files that are stored in this inode block.
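
The arithmetic behind that range can be sketched as follows, assuming
256-byte inodes in a 4k block (which is what 16 inodes per block
implies; I haven't read either value from the superblock here).  The
function name is made up for illustration.

```python
BLOCK_SIZE = 4096   # assumed filesystem block size
INODE_SIZE = 256    # assumed on-disk inode size


def inode_block_range(inode: int, offset_in_block: int) -> tuple[int, int]:
    """Return (first, last) inode numbers stored in the same inode block."""
    slot = offset_in_block // INODE_SIZE          # 0x300 // 256 == 3
    first = inode - slot
    last = first + BLOCK_SIZE // INODE_SIZE - 1   # 16 inodes per block
    return first, last


print(inode_block_range(917524, 0x300))
```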

That would've meant that the inode for /var/backups/dpkg.status.6.gz
would have been updated just before the inode for
/var/backups/dpkg.status.5.gz.  I wonder if the inode block was
written out somehow out of order, with the ctime for
/var/backups/dpkg.status.6.gz having been updated but not the checksum
as a result of the later changes - maybe as a result of having
executed on a different CPU?  That would suggest a weakness in the
ARM64 locking implementation, coherency issues, or interconnect issues.
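
To make the ordering explicit, here is a small sketch that takes the
ctimes from the ls output above, keeps only the inodes that fall in
the 917521-917536 inode block, and sorts them by ctime.  The values
are copied by hand from the listing; the helper name is made up.

```python
# (inode, nanoseconds past 2020-02-21 06:25:01, name), from the listing.
# dpkg.status itself is left out as its ctime is on a different day.
LISTING = [
    (917583, 958160960, "dpkg.status.0"),
    (917520, 954161050, "dpkg.status.1.gz"),
    (917531, 854163293, "dpkg.status.2.gz"),
    (917532, 850163383, "dpkg.status.3.gz"),
    (917509, 850163383, "dpkg.status.4.gz"),
    (917527, 846163473, "dpkg.status.5.gz"),
    (917524, 842163563, "dpkg.status.6.gz"),
]
FIRST_IN_BLOCK, LAST_IN_BLOCK = 917521, 917536


def updates_in_block(listing):
    """Entries whose inode lives in the shared block, oldest ctime first."""
    shared = [e for e in listing if FIRST_IN_BLOCK <= e[0] <= LAST_IN_BLOCK]
    return sorted(shared, key=lambda e: e[1])


for inode, nsec, name in updates_in_block(LISTING):
    print(f"{inode}  06:25:01.{nsec:09d}  {name}")
```

Inode 917524 comes out first, with three later updates to inodes in
the same block.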
Russell King (Oracle) Feb. 29, 2020, 1:32 p.m. UTC | #26
On Sat, Feb 29, 2020 at 12:08:28PM +0000, Russell King - ARM Linux admin wrote:
> On Sat, Feb 29, 2020 at 11:04:56AM +0000, Russell King - ARM Linux admin wrote:
> > On Sat, Feb 29, 2020 at 09:55:50AM +0000, Russell King - ARM Linux admin wrote:
> > > On Mon, Feb 10, 2020 at 03:22:57PM +0000, Russell King - ARM Linux admin wrote:
> > > > On Mon, Feb 10, 2020 at 04:12:30PM +0100, Olof Johansson wrote:
> > > > > On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@nxp.com> wrote:
> > > > > >
> > > > > > Hi Olof,
> > > > > >
> > > > > > Thanks a lot for your comments!
> > > > > > And sorry for my delay respond!
> > > > > 
> > > > > Actually, they apply with only minor conflicts on top of current -next.
> > > > > 
> > > > > Bjorn, any chance we can get you to pick these up pretty soon? They
> > > > > enable full use of a promising ARM developer system, the SolidRun
> > > > > HoneyComb, and would be quite valuable for me and others to be able to
> > > > > use with mainline or -next without any additional patches applied --
> > > > > which this patchset achieves.
> > > > > 
> > > > > I know there are pending revisions based on feedback. I'll leave it up
> > > > > to you and others to determine if that can be done with incremental
> > > > > patches on top, or if it should be fixed before the initial patchset
> > > > > is applied. But all in all, it's holding up adaption by me and surely
> > > > > others of a very interesting platform -- I'm looking to replace my
> > > > > aging MacchiatoBin with one of these and would need PCIe/NVMe to work
> > > > > before I do.
> > > > 
> > > > If you're going to be using NVMe, make sure you use a power-fail safe
> > > > version; I've already had one instance where ext4 failed to mount
> > > > because of a corrupted journal using an XPG SX8200 after the Honeycomb
> > > > Serror'd, and then I powered it down after a few hours before later
> > > > booting it back up.
> > > > 
> > > > EXT4-fs (nvme0n1p2): INFO: recovery required on readonly filesystem
> > > > EXT4-fs (nvme0n1p2): write access will be enabled during recovery
> > > > JBD2: journal transaction 80849 on nvme0n1p2-8 is corrupt.
> > > > EXT4-fs (nvme0n1p2): error loading journal
> > > 
> > > ... and last night, I just got more ext4fs errors on the NVMe, without
> > > any unclean power cycles:
> > > 
> > > [73729.556544] EXT4-fs error (device nvme0n1p2): ext4_lookup:1700: inode #917524: comm rm: iget: checksum invalid
> > > [73729.565354] Aborting journal on device nvme0n1p2-8.
> > > [73729.568995] EXT4-fs (nvme0n1p2): Remounting filesystem read-only
> > > [73729.569077] EXT4-fs error (device nvme0n1p2): ext4_journal_check_start:61: Detected aborted journal
> > > [73729.573741] EXT4-fs error (device nvme0n1p2): ext4_lookup:1700: inode #917524: comm rm: iget: checksum invalid
> > > [73729.593330] EXT4-fs error (device nvme0n1p2): ext4_lookup:1700: inode #917524: comm mv: iget: checksum invalid
> > > 
> > > The affected file is /var/backups/dpkg.status.6.gz
> > > 
> > > It was cleanly shut down and powered off on the 22nd February, booted
> > > yesterday morning followed by another reboot a few minutes later.
> > > 
> > > What worries me is the fact that corruption has happened - and if that
> > > happens to a file rather than an inode, it will likely go unnoticed
> > > for a considerably longer time.
> > > 
> > > I think I'm getting to the point of deciding NVMe or the LX2160A to be
> > > just too unreliable for serious use.  I hadn't noticed any issues when
> > > using the rootfs on the eMMC, so it suggests either the NVMe is
> > > unreliable, or there's a problem with PCIe on this platform (which we
> > > kind of know about with Jon's GPU rendering issues.)
> > 
> > Adding Ted and Andreas...
> > 
> > Here's the debugfs -n "id" output for dpkg.status.5.gz (which is fine,
> > and probably a similar size):
> > 
> > debugfs:  id <917527>
> > 0000  a481 0000 30ff 0300 bd8e 475e bd77 4f5e  ....0.....G^.wO^
> > 0020  29ca 345e 0000 0000 0000 0100 0002 0000  ).4^............
> > 0040  0000 0800 0100 0000 0af3 0100 0400 0000  ................
> > 0060  0000 0000 0000 0000 4000 0000 8087 3800  ........@.....8.
> > 0100  0000 0000 0000 0000 0000 0000 0000 0000  ................
> > *
> > 0140  0000 0000 c40b 4c0a 0000 0000 0000 0000  ......L.........
> > 0160  0000 0000 0000 0000 0000 0000 3884 0000  ............8...
> > 0200  2000 95f2 44b8 bdc9 a4d2 9883 c861 dc92   ...D........a..
> > 0220  bd31 4a5e ecc5 260c 0000 0000 0000 0000  .1J^..&.........
> > 0240  0000 0000 0000 0000 0000 0000 0000 0000  ................
> > *
> > 
> > and for the affected inode:
> > debugfs:  id <917524>
> > 0000  a481 0000 30ff 0300 3d3d 465e bd77 4f5e  ....0...==F^.wO^
> > 0020  29ca 345e 0000 0000 0000 0100 0002 0000  ).4^............
> > 0040  0000 0800 0100 0000 0af3 0100 0400 0000  ................
> > 0060  0000 0000 0000 0000 4000 0000 c088 3800  ........@.....8.
> > 0100  0000 0000 0000 0000 0000 0000 0000 0000  ................
> > *
> > 0140  0000 0000 5fc4 cfb4 0000 0000 0000 0000  ...._...........
> > 0160  0000 0000 0000 0000 0000 0000 af23 0000  .............#..
> > 0200  2000 1cc3 ac95 c9c8 a4d2 9883 583e addf   ...........X>..
> > 0220  3de0 485e b04d 7151 0000 0000 0000 0000  =.H^.MqQ........
> > 0240  0000 0000 0000 0000 0000 0000 0000 0000  ................
> > *
> > 
> > and "stat" output:
> > debugfs:  stat <917527>
> > Inode: 917527   Type: regular    Mode:  0644   Flags: 0x80000
> > Generation: 172755908    Version: 0x00000000:00000001
> > User:     0   Group:     0   Project:     0   Size: 261936
> > File ACL: 0
> > Links: 1   Blockcount: 512
> > Fragment:  Address: 0    Number: 0    Size: 0
> >  ctime: 0x5e4f77bd:c9bdb844 -- Fri Feb 21 06:25:01 2020
> >  atime: 0x5e478ebd:92dc61c8 -- Sat Feb 15 06:25:01 2020
> >  mtime: 0x5e34ca29:8398d2a4 -- Sat Feb  1 00:45:29 2020
> > crtime: 0x5e4a31bd:0c26c5ec -- Mon Feb 17 06:25:01 2020
> > Size of extra inode fields: 32
> > Inode checksum: 0xf2958438
> > EXTENTS:
> > (0-63):3704704-3704767
> > debugfs:  stat <917524>
> > Inode: 917524   Type: regular    Mode:  0644   Flags: 0x80000
> > Generation: 3033515103    Version: 0x00000000:00000001
> > User:     0   Group:     0   Project:     0   Size: 261936
> > File ACL: 0
> > Links: 1   Blockcount: 512
> > Fragment:  Address: 0    Number: 0    Size: 0
> >  ctime: 0x5e4f77bd:c8c995ac -- Fri Feb 21 06:25:01 2020
> >  atime: 0x5e463d3d:dfad3e58 -- Fri Feb 14 06:25:01 2020
> >  mtime: 0x5e34ca29:8398d2a4 -- Sat Feb  1 00:45:29 2020
> > crtime: 0x5e48e03d:51714db0 -- Sun Feb 16 06:25:01 2020
> > Size of extra inode fields: 32
> > Inode checksum: 0xc31c23af
> > EXTENTS:
> > (0-63):3705024-3705087
> > 
> > When using sif (set_inode_info) to re-set the UID to 0 on this (so
> > provoke the checksum to be updated):
> > 
> > debugfs:  id <917524>
> > 0000  a481 0000 30ff 0300 3d3d 465e bd77 4f5e  ....0...==F^.wO^
> > 0020  29ca 345e 0000 0000 0000 0100 0002 0000  ).4^............
> > 0040  0000 0800 0100 0000 0af3 0100 0400 0000  ................
> > 0060  0000 0000 0000 0000 4000 0000 c088 3800  ........@.....8.
> > 0100  0000 0000 0000 0000 0000 0000 0000 0000  ................
> > *
> > 0140  0000 0000 5fc4 cfb4 0000 0000 0000 0000  ...._...........
> > 0160  0000 0000 0000 0000 0000 0000 b61f 0000  ................
> >                                     ^^^^
> > 0200  2000 aa15 ac95 c9c8 a4d2 9883 583e addf   ...........X>..
> >            ^^^^
> > 0220  3de0 485e b04d 7151 0000 0000 0000 0000  =.H^.MqQ........
> > 0240  0000 0000 0000 0000 0000 0000 0000 0000  ................
> > *
> > 
> > The values with "^^^^" are the checksum, which are the only values
> > that have changed here - the checksum is now 0x15aa1fb6 rather than
> > 0xc31c23af.
> > 
> > With that changed, running e2fsck -n on the filesystem results in a
> > pass:
> > 
> > root@cex7:~# e2fsck -n /dev/nvme0n1p2
> > e2fsck 1.44.5 (15-Dec-2018)
> > Warning: skipping journal recovery because doing a read-only filesystem check.
> > /dev/nvme0n1p2 contains a file system with errors, check forced.
> > Pass 1: Checking inodes, blocks, and sizes
> > Pass 2: Checking directory structure
> > Pass 3: Checking directory connectivity
> > Pass 4: Checking reference counts
> > Pass 5: Checking group summary information
> > /dev/nvme0n1p2: 121163/2097152 files (0.1% non-contiguous), 1349227/8388608 blocks
> > 
> > and the file now appears to be intact (being a gzip file, gzip verifies
> > that the contents are now as it expects.)
> > 
> > So, it looks like the _only_ issue is that the checksum on the inode
> > became invalid, which seems to suggest that it *isn't* a NVMe nor PCIe
> > issue.
> > 
> > I wonder whether the journal would contain anything useful, but I don't
> > know how to use debugfs to find that out - while I can dump the journal,
> > I'd need to know which block contains the inode, and then work out where
> > in the journal that block was going to be written.  If that would help,
> > let me know ASAP as I'll hold off rebooting the platform for a while
> > (which means the filesystem will remain as-is - and yes, I have the
> > debugfs file for e2undo to put stuff back.)  Maybe it's possible to pull
> > the block number out of the e2undo file?
> 
> Okay, the inode was stored in block 3670049, and the journal appears
> to contains no entries for that block.
> 
> > tune2fs says:
> > 
> > Checksum type:            crc32c
> > Checksum:                 0x682f91b9
> > 
> > I guess this is what is used to checksum the inodes?  If so, it's using
> > the kernel's crc32c-generic driver (according to /proc/crypto).
> > 
> > Could it be a race condition, or some problem that's specific to the
> > ARM64 kernel that's provoking this corruption?
> 
> Something else occurs to me:
> 
> root@cex7:~# ls -li --time=ctime --full-time /var/backups/dpkg.status*
> 917622 -rw-r--r-- 1 root root 999052 2020-02-29 06:25:01.852231277 +0000 /var/backups/dpkg.status
> 917583 -rw-r--r-- 1 root root 999052 2020-02-21 06:25:01.958160960 +0000 /var/backups/dpkg.status.0
> 917520 -rw-r--r-- 1 root root 261936 2020-02-21 06:25:01.954161050 +0000 /var/backups/dpkg.status.1.gz
> 917531 -rw-r--r-- 1 root root 261936 2020-02-21 06:25:01.854163293 +0000 /var/backups/dpkg.status.2.gz
> 917532 -rw-r--r-- 1 root root 261936 2020-02-21 06:25:01.850163383 +0000 /var/backups/dpkg.status.3.gz
> 917509 -rw-r--r-- 1 root root 261936 2020-02-21 06:25:01.850163383 +0000 /var/backups/dpkg.status.4.gz
> 917527 -rw-r--r-- 1 root root 261936 2020-02-21 06:25:01.846163473 +0000 /var/backups/dpkg.status.5.gz
> 917524 -rw-r--r-- 1 root root 261936 2020-02-21 06:25:01.842163563 +0000 /var/backups/dpkg.status.6.gz
> 
> So the last time that the kernel changed inode 917524 was on the 21th
> of February, probably when it was last renamed by logrotate, and like
> several other files stored in the same inode block.  Yet, _only_ the
> checksum for 917524 was corrupted, the rest were fine.
> 
> I would guess that logrotate behaves as follows:
> - remove /var/backups/dpkg.status.6.gz
> - rename /var/backups/dpkg.status.5.gz to /var/backups/dpkg.status.6.gz
> - repeat for other dpkg.status.*.gz files
> - gzip /var/backups/dpkg.status.0 to /var/backups/dpkg.status.1.gz
> - rename /var/backups/dpkg.status to /var/backups/dpkg.status.0
> - create new /var/backups/dpkg.status
> 
> Looking at the inode block in the e2undo file, inode 917524 is at
> offset 0x300 into the block, which means the first inode in the
> block is 917521 and the last is 917536, which means we have several
> of the dpkg.status.* files that are stored in this inode block.
> 
> That would've meant that the inode for /var/backups/dpkg.status.6.gz
> would have been updated just before the inode for
> /var/backups/dpkg.status.5.gz.  I wonder if the inode block was
> written out somehow out of order, with the ctime for
> /var/backups/dpkg.status.6.gz having been updated but not the checksum
> as a result of the later changes - maybe as a result of having
> executed on a different CPU?  That would suggest a weakness in the
> ARM64 locking implementation, coherency issues, or interconnect issues.

Looking at the errata configuration, I have:

# ARM errata workarounds via the alternatives framework
#
CONFIG_ARM64_WORKAROUND_CLEAN_CACHE=y
CONFIG_ARM64_ERRATUM_826319=y
CONFIG_ARM64_ERRATUM_827319=y
CONFIG_ARM64_ERRATUM_824069=y
CONFIG_ARM64_ERRATUM_819472=y
CONFIG_ARM64_ERRATUM_832075=y
CONFIG_ARM64_ERRATUM_834220=y
CONFIG_ARM64_ERRATUM_845719=y
CONFIG_ARM64_ERRATUM_843419=y
CONFIG_ARM64_ERRATUM_1024718=y
CONFIG_ARM64_ERRATUM_1418040=y
CONFIG_ARM64_ERRATUM_1165522=y
CONFIG_ARM64_ERRATUM_1286807=y
CONFIG_ARM64_ERRATUM_1319367=y
CONFIG_ARM64_ERRATUM_1463225=y
# CONFIG_ARM64_ERRATUM_1542419 is not set
# CONFIG_CAVIUM_ERRATUM_22375 is not set
# CONFIG_CAVIUM_ERRATUM_23154 is not set
# CONFIG_CAVIUM_ERRATUM_27456 is not set
# CONFIG_CAVIUM_ERRATUM_30115 is not set
# CONFIG_CAVIUM_TX2_ERRATUM_219 is not set
CONFIG_QCOM_FALKOR_ERRATUM_1003=y
CONFIG_ARM64_WORKAROUND_REPEAT_TLBI=y
CONFIG_QCOM_FALKOR_ERRATUM_1009=y
CONFIG_QCOM_QDF2400_ERRATUM_0065=y
# CONFIG_SOCIONEXT_SYNQUACER_PREITS is not set
# CONFIG_HISILICON_ERRATUM_161600802 is not set
CONFIG_QCOM_FALKOR_ERRATUM_E1041=y
# CONFIG_FUJITSU_ERRATUM_010001 is not set
# end of ARM errata workarounds via the alternatives framework
...
CONFIG_FSL_ERRATUM_A008585=y
CONFIG_HISILICON_ERRATUM_161010101=y
CONFIG_ARM64_ERRATUM_858921=y

so I don't think it's a missing erratum Kconfig setting, unless there's
a necessary erratum workaround that isn't in v5.5.
Theodore Ts'o Feb. 29, 2020, 3:19 p.m. UTC | #27
On Sat, Feb 29, 2020 at 11:04:56AM +0000, Russell King - ARM Linux admin wrote:
> 
> Could it be a race condition, or some problem that's specific to the
> ARM64 kernel that's provoking this corruption?

Since I got brought in mid-way through this discussion, can someone
summarize the vital details of the bughunt?  What kernel version is
involved, and is this a regression?  If so, what's the last version of
the kernel where you didn't have a problem on this hardware?

Can you trigger this failure reliably?

Unfortunately, while I'm regularly running xfstests on x86_64 on a
Google Compute Engine VM, I'm not doing any runs on arm64.  I can
certainly build an arm-64.

There's a test-appliance designed to be run on ARM64 here[1].

[1] https://kernel.org/pub/linux/kernel/people/tytso/kvm-xfstests/xfstests-amd64.tar.xz

which is a Debian chroot, designed to be run via android-xfstests[2], but
if you unpack it, it should be possible to enter the chroot and
trigger the xfstests run manually on any arm64 system.

[2] https://thunk.org/android-xfstests

Does anyone know if kernel CI is running xfstests regularly?

Cheers,

							- Ted
Russell King (Oracle) Feb. 29, 2020, 5:03 p.m. UTC | #28
On Sat, Feb 29, 2020 at 10:19:07AM -0500, Theodore Y. Ts'o wrote:
> On Sat, Feb 29, 2020 at 11:04:56AM +0000, Russell King - ARM Linux admin wrote:
> > Could it be a race condition, or some problem that's specific to the
> > ARM64 kernel that's provoking this corruption?
> 
> Since I got brought in mid-way through this discussion, can someone
> summarize the vital details of the bughunt?  What kernel version is
> involved, and is this a regression?  If so, what's the last version of
> the kernel where you didn't have a problem on this hardware?

It's a new platform; I've run most 5.x kernels on it, but only recently
have I had an NVMe.  I'm currently running a 5.5-based kernel (for which
I have to patch in support for the platform), and I've no idea if it is
a regression or not.

> Can you trigger this failure reliably?

No - the very first time I ended up with a corrupted ext4 fs was on the
8th February, and at that time it was put down to the NVMe not being
power-off safe: the machine had crashed sometime overnight, resulting
in a section of my network going offline (due to a pause frame storm).
So, I powered it down from crashed state - and from what people tell me,
NVMe _may_ keep blocks unwritten to safe media for a considerable time.

I never bothered to investigate it because the explanation seemed
reasonable, and manually running e2fsck fixed the filesystem.

The system was then booted back into using the NVMe rootfs, and
continued to do so without apparent issue until the 21st Feb, when I
cleanly shut it down, and powered it off.  During the time it was
running, it likely saw many reboots of the 5.5 kernel.

I powered it back on yesterday morning, and this morning it found the
fs corruption while trying to do a logrotate.

As I say in my last email, I suspect it isn't an ext4 bug, but either
a locking implementation issue, coherency issue, or interconnect issue.
The 4k block with the affected inode looks perfectly reasonable with
the only exception that the checksum is incorrect for that one inode -
and other inodes stored in the same 4k block were modified afterwards.
It suggests to me that the writes to update the two 16-bit words
containing the checksum were somehow lost for this particular inode.
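
For reference, this is how the 32-bit value is put together from those
two 16-bit words.  The sketch below assumes the debugfs "id" offsets are
octal and that the low half sits at byte 0x7c and the high half at byte
0x82 of the inode, which is my reading of the ext4 on-disk layout; the
helper name is made up.

```python
import struct

LO_OFFSET = 0x7C   # assumed: low 16 bits of the inode checksum
HI_OFFSET = 0x82   # assumed: high 16 bits of the inode checksum


def inode_checksum(row_0160: bytes, row_0200: bytes) -> int:
    """Combine the two little-endian halves from dump rows 0160 and 0200."""
    (lo,) = struct.unpack_from("<H", row_0160, LO_OFFSET - 0o160)
    (hi,) = struct.unpack_from("<H", row_0200, HI_OFFSET - 0o200)
    return (hi << 16) | lo


# Rows copied from the dumps of inode 917524, before and after "sif".
before = inode_checksum(
    bytes.fromhex("0000 0000 0000 0000 0000 0000 af23 0000"),
    bytes.fromhex("2000 1cc3 ac95 c9c8 a4d2 9883 583e addf"),
)
after = inode_checksum(
    bytes.fromhex("0000 0000 0000 0000 0000 0000 b61f 0000"),
    bytes.fromhex("2000 aa15 ac95 c9c8 a4d2 9883 583e addf"),
)
print(hex(before), hex(after))
```

Both results agree with the "Inode checksum" values debugfs reported,
so the two stores are 6 bytes apart in the same inode.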

> Unfortunately, while I'm regularly running xfstests on x86_64 on a
> Google Compute Engine VM, I'm not doing any runs on arm64.  I can
> certainly build an arm-64.
> 
> There's a test-appliance designed to be run on ARM64 here[1].
> 
> [1] https://kernel.org/pub/linux/kernel/people/tytso/kvm-xfstests/xfstests-amd64.tar.xz

The filename seems to say "amd64" not "arm64" ?

> which is a Debian chroot, designed to be run via android-xfstests[2], but
> if you unpack it, it should be possible to enter the chroot and
> trigger the xfstests run manually on any arm64 system.
> 
> [2] https://thunk.org/android-xfstests
> 
> Does anyone know if kernel CI is running xfstests regularly?

I don't know...
Theodore Ts'o Feb. 29, 2020, 6:03 p.m. UTC | #29
On Sat, Feb 29, 2020 at 05:03:28PM +0000, Russell King - ARM Linux admin wrote:
> > There's a test-appliance designed to be run on ARM64 here[1].
> > 
> > [1] https://kernel.org/pub/linux/kernel/people/tytso/kvm-xfstests/xfstests-amd64.tar.xz
> 
> The filename seems to say "amd64" not "arm64" ?

Sorry, I cut and pasted the wrong link: s/amd64/arm64/

If there are arm64-specific locking issues, we can probably flush them
out if we could figure out some way of running some of the stress
tests in xfstests.  I don't know a whole lot about arm-64
architectures; would running xfstests on, say, an Amazon AWS arm-based
VM be representative of your new architecture?  Or are there a lot of
sub-architecture differences in the arm-64 world?

						- Ted
Russell King (Oracle) June 5, 2020, 11:53 p.m. UTC | #30
On Sat, Feb 29, 2020 at 11:04:56AM +0000, Russell King - ARM Linux admin wrote:
> Adding Ted and Andreas...
> 
> Here's the debugfs -n "id" output for dpkg.status.5.gz (which is fine,
> and probably a similar size):
> 
> debugfs:  id <917527>
> 0000  a481 0000 30ff 0300 bd8e 475e bd77 4f5e  ....0.....G^.wO^
> 0020  29ca 345e 0000 0000 0000 0100 0002 0000  ).4^............
> 0040  0000 0800 0100 0000 0af3 0100 0400 0000  ................
> 0060  0000 0000 0000 0000 4000 0000 8087 3800  ........@.....8.
> 0100  0000 0000 0000 0000 0000 0000 0000 0000  ................
> *
> 0140  0000 0000 c40b 4c0a 0000 0000 0000 0000  ......L.........
> 0160  0000 0000 0000 0000 0000 0000 3884 0000  ............8...
> 0200  2000 95f2 44b8 bdc9 a4d2 9883 c861 dc92   ...D........a..
> 0220  bd31 4a5e ecc5 260c 0000 0000 0000 0000  .1J^..&.........
> 0240  0000 0000 0000 0000 0000 0000 0000 0000  ................
> *
> 
> and for the affected inode:
> debugfs:  id <917524>
> 0000  a481 0000 30ff 0300 3d3d 465e bd77 4f5e  ....0...==F^.wO^
> 0020  29ca 345e 0000 0000 0000 0100 0002 0000  ).4^............
> 0040  0000 0800 0100 0000 0af3 0100 0400 0000  ................
> 0060  0000 0000 0000 0000 4000 0000 c088 3800  ........@.....8.
> 0100  0000 0000 0000 0000 0000 0000 0000 0000  ................
> *
> 0140  0000 0000 5fc4 cfb4 0000 0000 0000 0000  ...._...........
> 0160  0000 0000 0000 0000 0000 0000 af23 0000  .............#..
> 0200  2000 1cc3 ac95 c9c8 a4d2 9883 583e addf   ...........X>..
> 0220  3de0 485e b04d 7151 0000 0000 0000 0000  =.H^.MqQ........
> 0240  0000 0000 0000 0000 0000 0000 0000 0000  ................
> *
> 
> and "stat" output:
> debugfs:  stat <917527>
> Inode: 917527   Type: regular    Mode:  0644   Flags: 0x80000
> Generation: 172755908    Version: 0x00000000:00000001
> User:     0   Group:     0   Project:     0   Size: 261936
> File ACL: 0
> Links: 1   Blockcount: 512
> Fragment:  Address: 0    Number: 0    Size: 0
>  ctime: 0x5e4f77bd:c9bdb844 -- Fri Feb 21 06:25:01 2020
>  atime: 0x5e478ebd:92dc61c8 -- Sat Feb 15 06:25:01 2020
>  mtime: 0x5e34ca29:8398d2a4 -- Sat Feb  1 00:45:29 2020
> crtime: 0x5e4a31bd:0c26c5ec -- Mon Feb 17 06:25:01 2020
> Size of extra inode fields: 32
> Inode checksum: 0xf2958438
> EXTENTS:
> (0-63):3704704-3704767
> debugfs:  stat <917524>
> Inode: 917524   Type: regular    Mode:  0644   Flags: 0x80000
> Generation: 3033515103    Version: 0x00000000:00000001
> User:     0   Group:     0   Project:     0   Size: 261936
> File ACL: 0
> Links: 1   Blockcount: 512
> Fragment:  Address: 0    Number: 0    Size: 0
>  ctime: 0x5e4f77bd:c8c995ac -- Fri Feb 21 06:25:01 2020
>  atime: 0x5e463d3d:dfad3e58 -- Fri Feb 14 06:25:01 2020
>  mtime: 0x5e34ca29:8398d2a4 -- Sat Feb  1 00:45:29 2020
> crtime: 0x5e48e03d:51714db0 -- Sun Feb 16 06:25:01 2020
> Size of extra inode fields: 32
> Inode checksum: 0xc31c23af
> EXTENTS:
> (0-63):3705024-3705087
> 
> When using sif (set_inode_info) to re-set the UID to 0 on this (so
> provoke the checksum to be updated):
> 
> debugfs:  id <917524>
> 0000  a481 0000 30ff 0300 3d3d 465e bd77 4f5e  ....0...==F^.wO^
> 0020  29ca 345e 0000 0000 0000 0100 0002 0000  ).4^............
> 0040  0000 0800 0100 0000 0af3 0100 0400 0000  ................
> 0060  0000 0000 0000 0000 4000 0000 c088 3800  ........@.....8.
> 0100  0000 0000 0000 0000 0000 0000 0000 0000  ................
> *
> 0140  0000 0000 5fc4 cfb4 0000 0000 0000 0000  ...._...........
> 0160  0000 0000 0000 0000 0000 0000 b61f 0000  ................
>                                     ^^^^
> 0200  2000 aa15 ac95 c9c8 a4d2 9883 583e addf   ...........X>..
>            ^^^^
> 0220  3de0 485e b04d 7151 0000 0000 0000 0000  =.H^.MqQ........
> 0240  0000 0000 0000 0000 0000 0000 0000 0000  ................
> *
> 
> The values with "^^^^" are the checksum, which are the only values
> that have changed here - the checksum is now 0x15aa1fb6 rather than
> 0xc31c23af.
> 
> With that changed, running e2fsck -n on the filesystem results in a
> pass:
> 
> root@cex7:~# e2fsck -n /dev/nvme0n1p2
> e2fsck 1.44.5 (15-Dec-2018)
> Warning: skipping journal recovery because doing a read-only filesystem check.
> /dev/nvme0n1p2 contains a file system with errors, check forced.
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> /dev/nvme0n1p2: 121163/2097152 files (0.1% non-contiguous), 1349227/8388608 blocks
> 
> and the file now appears to be intact (being a gzip file, gzip verifies
> that the contents are now as it expects.)
> 
> So, it looks like the _only_ issue is that the checksum on the inode
> became invalid, which seems to suggest that it *isn't* an NVMe or PCIe
> issue.
> 
> I wonder whether the journal would contain anything useful, but I don't
> know how to use debugfs to find that out - while I can dump the journal,
> I'd need to know which block contains the inode, and then work out where
> in the journal that block was going to be written.  If that would help,
> let me know ASAP as I'll hold off rebooting the platform for a while
> (which means the filesystem will remain as-is - and yes, I have the
> debugfs file for e2undo to put stuff back.)  Maybe it's possible to pull
> the block number out of the e2undo file?
> 
> tune2fs says:
> 
> Checksum type:            crc32c
> Checksum:                 0x682f91b9
> 
> I guess this is what is used to checksum the inodes?  If so, it's using
> the kernel's crc32c-generic driver (according to /proc/crypto).
> 
> Could it be a race condition, or some problem that's specific to the
> ARM64 kernel that's provoking this corruption?
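
For readers unfamiliar with the "crc32c" checksum type that tune2fs
reports: it is the CRC-32C (Castagnoli) polynomial. The following is a
minimal, slow bitwise sketch of the plain CRC-32C function only. It is
not the kernel's crc32c-generic code, and ext4 mixes additional
per-filesystem and per-inode inputs into its metadata checksums, which
this sketch does not attempt to reproduce. The function name and the
sample buffer are illustrative assumptions.

```python
def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.

    Passing a previous result as `crc` continues the checksum over
    further data, so a buffer can be processed in pieces.
    """
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right one bit; apply the polynomial if a 1 fell out.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


# Standard check value for CRC-32C over the ASCII digits "123456789".
check = crc32c(b"123456789")

# A single flipped bit anywhere in the covered data changes the result,
# which is why a stale or damaged checksum field is reported even when
# every other field of the inode still looks sane.
sample = bytearray(b"hypothetical inode contents")
good = crc32c(bytes(sample))
sample[3] ^= 0x01
bad = crc32c(bytes(sample))

print(f"check={check:#010x} good={good:#010x} bad={bad:#010x}")
```

The converse also holds: if the data is intact and only the stored
checksum is wrong, recomputing it (as the sif step above does) is enough
to make the check pass again.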

Hi,

The corruption has returned this evening:

[25094.614718] EXT4-fs error (device nvme0n1p2): ext4_lookup:1707: inode #271688: comm mandb: iget: checksum invalid
[25094.623781] Aborting journal on device nvme0n1p2-8.
[25094.627419] EXT4-fs (nvme0n1p2): Remounting filesystem read-only
[25094.628206] EXT4-fs error (device nvme0n1p2): ext4_journal_check_start:83: Detected aborted journal
root@cex7:[~]:<506> debugfs /dev/nvme0n1p2
debugfs 1.44.5 (15-Dec-2018)
debugfs:  id <271688>
0000  a481 0000 f108 0000 2518 fd5d 2518 fd5d  ........%..]%..]
0020  9f49 715c 0000 0000 0000 0100 0800 0000  .Iq\............
0040  0000 0800 0100 0000 0af3 0100 0400 0000  ................
0060  0000 0000 0000 0000 0100 0000 ed19 1100  ................
0100  0000 0000 0000 0000 0000 0000 0000 0000  ................
*
0140  0000 0000 b42f 4f06 0000 0000 0000 0000  ...../O.........
0160  0000 0000 0000 0000 0000 0000 c9cf 0000  ................
0200  2000 8d83 086d bebf 0000 0000 086d bebf   ....m.......m..
0220  2518 fd5d 086d bebf 0000 0000 0000 0000  %..].m..........
0240  0000 0000 0000 0000 0000 0000 0000 0000  ................
*

debugfs:  stat <271688>
Inode: 271688   Type: regular    Mode:  0644   Flags: 0x80000
Generation: 105852852    Version: 0x00000000:00000001
User:     0   Group:     0   Project:     0   Size: 2289
File ACL: 0
Links: 1   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x5dfd1825:bfbe6d08 -- Fri Dec 20 18:51:17 2019
 atime: 0x5dfd1825:bfbe6d08 -- Fri Dec 20 18:51:17 2019
 mtime: 0x5c71499f:00000000 -- Sat Feb 23 13:24:47 2019
crtime: 0x5dfd1825:bfbe6d08 -- Fri Dec 20 18:51:17 2019
Size of extra inode fields: 32
Inode checksum: 0x838dcfc9
EXTENTS:
(0):1120749
debugfs:
root@cex7:[~]:<509> e2fsck -n /dev/nvme0n1p2
e2fsck 1.44.5 (15-Dec-2018)
Warning: skipping journal recovery because doing a read-only filesystem check.
/dev/nvme0n1p2 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/nvme0n1p2: 147476/2097152 files (0.1% non-contiguous), 1542719/8388608 blocks

This time, the machine has not been powered down for a very long time,
although I've booted 5.7 (plus the additional patches including several
workarounds in the PCIe driver so my Mellanox card works) on it earlier
today. I did notice that Debian decided to run a fsck on the filesystem
at reboot, which is a little weird as it's ext4, and found nothing wrong.

Hmm, I just tried:

root@cex7:[~]:<514> hdparm -f /dev/nvme0n1p2
root@cex7:[~]:<515> hdparm -f /dev/nvme0n1
root@cex7:[~]:<517> e2fsck -n /dev/nvme0n1p2
e2fsck 1.44.5 (15-Dec-2018)
Warning: skipping journal recovery because doing a read-only filesystem check.
/dev/nvme0n1p2 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Entry 'mainlog.2.gz' in /var/log/exim4 (917613) has deleted/unused inode 922603.  Clear? no

Entry 'mainlog.2.gz' in /var/log/exim4 (917613) has an incorrect filetype (was 1, should be 0).
Fix? no

Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Unattached inode 920748
Connect to /lost+found? no

Pass 5: Checking group summary information
Block bitmap differences:  +(9259--9280) -3703011 -3703044 -3703053 +3736187 -3827722 -3830272 +3906363 +3911697 +3911699 +3911701 +3911703 +3913228
Fix? no

Free blocks count wrong for group #113 (12615, counted=12606).
Fix? no

Free blocks count wrong (6845889, counted=6845880).
Fix? no

Inode bitmap differences: Group 112 inode bitmap does not match checksum.
IGNORED.
Block bitmap differences: Group 113 block bitmap does not match checksum.
IGNORED.

/dev/nvme0n1p2: ********** WARNING: Filesystem still has errors **********

/dev/nvme0n1p2: 147476/2097152 files (0.1% non-contiguous), 1542719/8388608 blocks

which looks less good, and is likely to be e2fsck reading off the media
rather than using what was in the kernel cache.  However, still nothing
for the offending inode, whose raw data remains unchanged from what I've
quoted above from debugfs.
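
As a cross-check, the raw "id" dump for inode 271688 can be decoded by
hand and compared with the "stat" output. The following is a minimal
sketch, assuming the standard ext4 on-disk inode layout (little-endian
fields, checksum split into a low half at offset 0x7C and a high half at
0x82) and a 256-byte inode. The helper names are illustrative and are
not part of e2fsprogs. Note that debugfs prints the offset column in
octal.

```python
import struct

# Raw "id" dump of inode 271688 as printed by debugfs (ASCII column dropped).
DUMP = """\
0000  a481 0000 f108 0000 2518 fd5d 2518 fd5d
0020  9f49 715c 0000 0000 0000 0100 0800 0000
0040  0000 0800 0100 0000 0af3 0100 0400 0000
0060  0000 0000 0000 0000 0100 0000 ed19 1100
0100  0000 0000 0000 0000 0000 0000 0000 0000
*
0140  0000 0000 b42f 4f06 0000 0000 0000 0000
0160  0000 0000 0000 0000 0000 0000 c9cf 0000
0200  2000 8d83 086d bebf 0000 0000 086d bebf
0220  2518 fd5d 086d bebf 0000 0000 0000 0000
0240  0000 0000 0000 0000 0000 0000 0000 0000
*
"""


def parse_id_dump(text, inode_size=256):
    """Rebuild the inode bytes from a debugfs "id" dump.

    The first column is an octal offset. A "*" row stands for repeated
    rows; in this dump the repeated rows are all zero, so the zero-filled
    buffer already covers them. inode_size=256 is an assumption.
    """
    raw = bytearray(inode_size)
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "*":
            continue
        offset = int(fields[0], 8)
        row = bytes.fromhex("".join(fields[1:9]))
        raw[offset:offset + len(row)] = row
    return bytes(raw)


def decode(raw):
    """Pick out a few little-endian fields at their ext4 inode offsets."""
    def le16(off):
        return struct.unpack_from("<H", raw, off)[0]

    def le32(off):
        return struct.unpack_from("<I", raw, off)[0]

    return {
        "mode": le16(0x00),
        "size": le32(0x04),
        "ctime": le32(0x0C),
        "generation": le32(0x64),
        "extra_isize": le16(0x80),
        # High 16 bits live in the extra fields, low 16 bits in osd2.
        "checksum": (le16(0x82) << 16) | le16(0x7C),
    }


inode = decode(parse_id_dump(DUMP))
print(f"mode={inode['mode']:o} size={inode['size']} "
      f"generation={inode['generation']} checksum={inode['checksum']:#010x}")
```

For this dump the decoded values agree with the "stat" output above
(size 2289, generation 105852852, checksum 0x838dcfc9), so debugfs is at
least presenting the stored fields consistently. The sketch only reads
the stored checksum; it does not recompute it.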

It /seems/ to be pointing at the data on the media changing, possibly
buggy firmware on the nvme (ADATA SX8200PNP) drive, maybe? Or maybe
undiscovered bugs in the Mobiveil PCIe hardware corrupting transfers
to the nvme?

The problem is, this is rather undebuggable as it happens so rarely. :(

This is making me very reluctant to touch nvme ever again, as it is my
first and only experience of that technology.  I'm considering
getting some conventional SATA HDDs and junking nvme on the basis of
it being an unreliable technology.
Russell King (Oracle) June 6, 2020, 10:19 a.m. UTC | #31
On Sat, Jun 06, 2020 at 12:53:43AM +0100, Russell King - ARM Linux admin wrote:
> It /seems/ to be pointing at the data on the media changing, possibly
> buggy firmware on the nvme (ADATA SX8200PNP) drive, maybe? Or maybe
> undiscovered bugs in the Mobiveil PCIe hardware corrupting transfers
> to the nvme?
> 
> The problem is, this is rather undebuggable as it happens so rarely. :(
> 
> This is making me very reluctant to touch nvme ever again, as it is my
> first and only experience of that technology.  I'm considering
> getting some conventional SATA HDDs and junking nvme on the basis of
> it being an unreliable technology.

Okay, now I'm confused.  I haven't rebooted the platform (I was just
about to) but because of the issues I've had in the past with the
filesystem not being mountable, I thought I ought to run e2fsck on
the now read-only root filesystem before rebooting to ensure that it
is consistent.

root@cex7:[~]:<587> e2fsck -n /dev/nvme0n1p2
e2fsck 1.44.5 (15-Dec-2018)
/dev/nvme0n1p2 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (6845930, counted=6845886).
Fix? no

Free inodes count wrong (1949681, counted=1949673).
Fix? no

/dev/nvme0n1p2: 147471/2097152 files (0.1% non-contiguous), 1542678/8388608 blocks

but but but, the filesystem is still mounted read-only, so how can it
have changed?