Message ID | 20220817132521.3159388-1-heinrich.schuchardt@canonical.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | [1/1] riscv: dts: microchip: correct L2 cache interrupts | expand |
Hey Heinrich, Interesting CC list you got there! Surprised the mailmap didn't sort out Atish & Krzysztof's addresses, but I think I've fixed them up. I see Daire isn't there either so +CC him too. On 17/08/2022 14:25, Heinrich Schuchardt wrote: > EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe > > The "PolarFire SoC MSS Technical Reference Manual" documents the > following PLIC interrupts: > > 1 - L2 Cache Controller Signals when a metadata correction event occurs > 2 - L2 Cache Controller Signals when an uncorrectable metadata event occurs > 3 - L2 Cache Controller Signals when a data correction event occurs > 4 - L2 Cache Controller Signals when an uncorrectable data event occurs > > This differs from the SiFive FU540 which only has three L2 cache related > interrupts. > > The sequence in the device tree is defined by an enum: > > enum { > DIR_CORR = 0, > DATA_CORR, > DATA_UNCORR, > DIR_UNCORR, > }; Nit: more accurately by the dt-binding: interrupts: minItems: 3 items: - description: DirError interrupt - description: DataError interrupt - description: DataFail interrupt - description: DirFail interrupt I do find the names in the enum to be a bit more understandable however, and ditto for the descriptions in our TRM... Maybe I should put that on my todo list of cleanups :) > > So the correct sequence of the L2 cache interrupts is > > interrupts = <1>, <3>, <4>, <2>; This looks correct to me. You mentioned on IRC that what you were seeing was a wall of L2CACHE: DataFail @ 0x00000000.0807FFD8 From a quick look at the driver, what seems to be happening here is that at some point (possibly before Linux even comes into the picture) there is an uncorrectable data error. Because the ordering in the dt is wrong, we read the wrong register and so the interrupt is never actually cleared. With this patch applied, I see a single DataFail right as the interrupt gets registed & nothing after that. I am not really sure what value there is in enabling that driver though, mostly just seems like a debugging tool & from our pov we would see the HSS running in the monitor core as being responsible for handling the l2-cache errors. @Daire, maybe you have an opinion here? Patch LGTM, so I'll likely apply it in the next day or two, would just like to see what Daire has to say first. > > Fixes: e35b07a7df9b ("riscv: dts: microchip: mpfs: Group tuples in interrupt properties") BTW, it isn't really fixing this patch right? This is a dependency for backports to 5.15. Thanks for your patch, Conor. > Fixes: 0fa6107eca41 ("RISC-V: Initial DTS for Microchip ICICLE board") > Cc: Conor Dooley <conor.dooley@microchip.com> > Cc: stable@vger.kernel.org > Signed-off-by: Heinrich Schuchardt <heinrich.schuchardt@canonical.com> > --- > arch/riscv/boot/dts/microchip/mpfs.dtsi | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/riscv/boot/dts/microchip/mpfs.dtsi b/arch/riscv/boot/dts/microchip/mpfs.dtsi > index 496d3b7642bd..ec1de6344be9 100644 > --- a/arch/riscv/boot/dts/microchip/mpfs.dtsi > +++ b/arch/riscv/boot/dts/microchip/mpfs.dtsi > @@ -169,7 +169,7 @@ cctrllr: cache-controller@2010000 { > cache-size = <2097152>; > cache-unified; > interrupt-parent = <&plic>; > - interrupts = <1>, <2>, <3>; > + interrupts = <1>, <3>, <4>, <2>; > }; > > clint: clint@2000000 { > -- > 2.36.1 >
On Wed, 2022-08-17 at 18:04 +0000, Conor Dooley - M52691 wrote: > Hey Heinrich, > Interesting CC list you got there! Surprised the mailmap didn't sort > out Atish & Krzysztof's addresses, but I think I've fixed them up. > I see Daire isn't there either so +CC him too. > > On 17/08/2022 14:25, Heinrich Schuchardt wrote: > > EXTERNAL EMAIL: Do not click links or open attachments unless you > > know the content is safe > > > > The "PolarFire SoC MSS Technical Reference Manual" documents the > > following PLIC interrupts: > > > > 1 - L2 Cache Controller Signals when a metadata correction event > > occurs > > 2 - L2 Cache Controller Signals when an uncorrectable metadata > > event occurs > > 3 - L2 Cache Controller Signals when a data correction event occurs > > 4 - L2 Cache Controller Signals when an uncorrectable data event > > occurs > > > > This differs from the SiFive FU540 which only has three L2 cache > > related > > interrupts. > > > > The sequence in the device tree is defined by an enum: in drivers/soc/sifive/sifive_l2_cache.c > > > > enum { > > DIR_CORR = 0, > > DATA_CORR, > > DATA_UNCORR, > > DIR_UNCORR, > > }; > > Nit: more accurately by the dt-binding: > interrupts: > minItems: 3 > items: > - description: DirError interrupt > - description: DataError interrupt > - description: DataFail interrupt > - description: DirFail interrupt > > I do find the names in the enum to be a bit more understandable > however, > and ditto for the descriptions in our TRM... Maybe I should put that > on > my todo list of cleanups :) > > > > So the correct sequence of the L2 cache interrupts is > > > > interrupts = <1>, <3>, <4>, <2>; > > This looks correct to me. You mentioned on IRC that what you were > seeing > was a wall of > L2CACHE: DataFail @ 0x00000000.0807FFD8 > From a quick look at the driver, what seems to be happening here is > that > at some point (possibly before Linux even comes into the picture) > there > is an uncorrectable data error. Because the ordering in the dt is > wrong, > we read the wrong register and so the interrupt is never actually > cleared. With this patch applied, I see a single DataFail right as > the > interrupt gets registed & nothing after that. > > I am not really sure what value there is in enabling that driver > though, > mostly just seems like a debugging tool & from our pov we would see > the > HSS running in the monitor core as being responsible for handling the > l2-cache errors. > > @Daire, maybe you have an opinion here? Likewise. The new ordering of the interrupts to what the driver expects looks correct - as far as it goes. However, I'm not convinced enabling the SiFive l2 cache driver out of the box makes sense. Using l2 cache driver doesn't align terribly well with the current MPFS roadmap for mgt of ECC errors. > > Patch LGTM, so I'll likely apply it in the next day or two, would > just > like to see what Daire has to say first. If l2-cache controller is enabled, then interrupts should be connected as per TRM. I think this specific patch lgtm, ideally with a 'disabled' stanza and it's up to individual MPFS customers/boards to enable l2 cache controller if they want it. > > > Fixes: e35b07a7df9b ("riscv: dts: microchip: mpfs: Group tuples in > > interrupt properties") > > BTW, it isn't really fixing this patch right? This is a dependency > for > backports to 5.15. > > Thanks for your patch, > Conor. > > > Fixes: 0fa6107eca41 ("RISC-V: Initial DTS for Microchip ICICLE > > board") > > Cc: Conor Dooley <conor.dooley@microchip.com> > > Cc: stable@vger.kernel.org > > Signed-off-by: Heinrich Schuchardt < > > heinrich.schuchardt@canonical.com> > > --- > > arch/riscv/boot/dts/microchip/mpfs.dtsi | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/arch/riscv/boot/dts/microchip/mpfs.dtsi > > b/arch/riscv/boot/dts/microchip/mpfs.dtsi > > index 496d3b7642bd..ec1de6344be9 100644 > > --- a/arch/riscv/boot/dts/microchip/mpfs.dtsi > > +++ b/arch/riscv/boot/dts/microchip/mpfs.dtsi > > @@ -169,7 +169,7 @@ cctrllr: cache-controller@2010000 { > > cache-size = <2097152>; > > cache-unified; > > interrupt-parent = <&plic>; > > - interrupts = <1>, <2>, <3>; > > + interrupts = <1>, <3>, <4>, <2>; > > }; > > > > clint: clint@2000000 { > > -- > > 2.36.1 > >
On 8/18/22 09:03, Daire.McNamara@microchip.com wrote: > On Wed, 2022-08-17 at 18:04 +0000, Conor Dooley - M52691 wrote: >> Hey Heinrich, >> Interesting CC list you got there! Surprised the mailmap didn't sort >> out Atish & Krzysztof's addresses, but I think I've fixed them up. >> I see Daire isn't there either so +CC him too. >> >> On 17/08/2022 14:25, Heinrich Schuchardt wrote: >>> EXTERNAL EMAIL: Do not click links or open attachments unless you >>> know the content is safe >>> >>> The "PolarFire SoC MSS Technical Reference Manual" documents the >>> following PLIC interrupts: >>> >>> 1 - L2 Cache Controller Signals when a metadata correction event >>> occurs >>> 2 - L2 Cache Controller Signals when an uncorrectable metadata >>> event occurs >>> 3 - L2 Cache Controller Signals when a data correction event occurs >>> 4 - L2 Cache Controller Signals when an uncorrectable data event >>> occurs >>> >>> This differs from the SiFive FU540 which only has three L2 cache >>> related >>> interrupts. >>> >>> The sequence in the device tree is defined by an enum: > in drivers/soc/sifive/sifive_l2_cache.c >>> >>> enum { >>> DIR_CORR = 0, >>> DATA_CORR, >>> DATA_UNCORR, >>> DIR_UNCORR, >>> }; >> >> Nit: more accurately by the dt-binding: >> interrupts: >> minItems: 3 >> items: >> - description: DirError interrupt >> - description: DataError interrupt >> - description: DataFail interrupt >> - description: DirFail interrupt >> >> I do find the names in the enum to be a bit more understandable >> however, >> and ditto for the descriptions in our TRM... Maybe I should put that >> on >> my todo list of cleanups :) >> >> >>> So the correct sequence of the L2 cache interrupts is >>> >>> interrupts = <1>, <3>, <4>, <2>; >> >> This looks correct to me. You mentioned on IRC that what you were >> seeing >> was a wall of >> L2CACHE: DataFail @ 0x00000000.0807FFD8 >> From a quick look at the driver, what seems to be happening here is >> that >> at some point (possibly before Linux even comes into the picture) >> there >> is an uncorrectable data error. Because the ordering in the dt is >> wrong, >> we read the wrong register and so the interrupt is never actually >> cleared. With this patch applied, I see a single DataFail right as >> the >> interrupt gets registed & nothing after that. >> >> I am not really sure what value there is in enabling that driver >> though, >> mostly just seems like a debugging tool & from our pov we would see >> the >> HSS running in the monitor core as being responsible for handling the >> l2-cache errors. >> >> @Daire, maybe you have an opinion here? > Likewise. The new ordering of the interrupts to what the driver expects > looks correct - as far as it goes. However, I'm not convinced enabling > the SiFive l2 cache driver out of the box makes sense. Using l2 cache > driver doesn't align terribly well with the current MPFS roadmap for > mgt of ECC errors. >> >> Patch LGTM, so I'll likely apply it in the next day or two, would >> just >> like to see what Daire has to say first. > If l2-cache controller is enabled, then interrupts should be connected > as per TRM. I think this specific patch lgtm, ideally with a > 'disabled' stanza and it's up to individual MPFS customers/boards to > enable l2 cache controller if they want it. Disabling the device in the device-tree is orthogonal to fixing the interrupt sequence. I would suggest that you use a separate patch for adding status = "disabled";. Best regards Heinrich >> >>> Fixes: e35b07a7df9b ("riscv: dts: microchip: mpfs: Group tuples in >>> interrupt properties") >> >> BTW, it isn't really fixing this patch right? This is a dependency >> for >> backports to 5.15. >> >> Thanks for your patch, >> Conor. >> >>> Fixes: 0fa6107eca41 ("RISC-V: Initial DTS for Microchip ICICLE >>> board") >>> Cc: Conor Dooley <conor.dooley@microchip.com> >>> Cc: stable@vger.kernel.org >>> Signed-off-by: Heinrich Schuchardt < >>> heinrich.schuchardt@canonical.com> >>> --- >>> arch/riscv/boot/dts/microchip/mpfs.dtsi | 2 +- >>> 1 file changed, 1 insertion(+), 1 deletion(-) >>> >>> diff --git a/arch/riscv/boot/dts/microchip/mpfs.dtsi >>> b/arch/riscv/boot/dts/microchip/mpfs.dtsi >>> index 496d3b7642bd..ec1de6344be9 100644 >>> --- a/arch/riscv/boot/dts/microchip/mpfs.dtsi >>> +++ b/arch/riscv/boot/dts/microchip/mpfs.dtsi >>> @@ -169,7 +169,7 @@ cctrllr: cache-controller@2010000 { >>> cache-size = <2097152>; >>> cache-unified; >>> interrupt-parent = <&plic>; >>> - interrupts = <1>, <2>, <3>; >>> + interrupts = <1>, <3>, <4>, <2>; >>> }; >>> >>> clint: clint@2000000 { >>> -- >>> 2.36.1 >>>
On 18/08/2022 09:17, Heinrich Schuchardt wrote: > EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe > > On 8/18/22 09:03, Daire.McNamara@microchip.com wrote: >> On Wed, 2022-08-17 at 18:04 +0000, Conor Dooley - M52691 wrote: >>> Hey Heinrich, >>> Interesting CC list you got there! Surprised the mailmap didn't sort >>> out Atish & Krzysztof's addresses, but I think I've fixed them up. >>> I see Daire isn't there either so +CC him too. >>> >>> On 17/08/2022 14:25, Heinrich Schuchardt wrote: >>>> EXTERNAL EMAIL: Do not click links or open attachments unless you >>>> know the content is safe >>>> >>>> The "PolarFire SoC MSS Technical Reference Manual" documents the >>>> following PLIC interrupts: >>>> >>>> 1 - L2 Cache Controller Signals when a metadata correction event >>>> occurs >>>> 2 - L2 Cache Controller Signals when an uncorrectable metadata >>>> event occurs >>>> 3 - L2 Cache Controller Signals when a data correction event occurs >>>> 4 - L2 Cache Controller Signals when an uncorrectable data event >>>> occurs >>>> >>>> This differs from the SiFive FU540 which only has three L2 cache >>>> related >>>> interrupts. >>>> >>>> The sequence in the device tree is defined by an enum: >> in drivers/soc/sifive/sifive_l2_cache.c >>>> >>>> enum { >>>> DIR_CORR = 0, >>>> DATA_CORR, >>>> DATA_UNCORR, >>>> DIR_UNCORR, >>>> }; >>> >>> Nit: more accurately by the dt-binding: >>> interrupts: >>> minItems: 3 >>> items: >>> - description: DirError interrupt >>> - description: DataError interrupt >>> - description: DataFail interrupt >>> - description: DirFail interrupt >>> >>> I do find the names in the enum to be a bit more understandable >>> however, >>> and ditto for the descriptions in our TRM... Maybe I should put that >>> on >>> my todo list of cleanups :) >>> >>> >>>> So the correct sequence of the L2 cache interrupts is >>>> >>>> interrupts = <1>, <3>, <4>, <2>; >>> >>> This looks correct to me. You mentioned on IRC that what you were >>> seeing >>> was a wall of >>> L2CACHE: DataFail @ 0x00000000.0807FFD8 >>> From a quick look at the driver, what seems to be happening here is >>> that >>> at some point (possibly before Linux even comes into the picture) >>> there >>> is an uncorrectable data error. Because the ordering in the dt is >>> wrong, >>> we read the wrong register and so the interrupt is never actually >>> cleared. With this patch applied, I see a single DataFail right as >>> the >>> interrupt gets registed & nothing after that. >>> >>> I am not really sure what value there is in enabling that driver >>> though, >>> mostly just seems like a debugging tool & from our pov we would see >>> the >>> HSS running in the monitor core as being responsible for handling the >>> l2-cache errors. >>> >>> @Daire, maybe you have an opinion here? >> Likewise. The new ordering of the interrupts to what the driver expects >> looks correct - as far as it goes. However, I'm not convinced enabling >> the SiFive l2 cache driver out of the box makes sense. Using l2 cache >> driver doesn't align terribly well with the current MPFS roadmap for >> mgt of ECC errors. >>> >>> Patch LGTM, so I'll likely apply it in the next day or two, would >>> just >>> like to see what Daire has to say first. >> If l2-cache controller is enabled, then interrupts should be connected >> as per TRM. I think this specific patch lgtm, ideally with a >> 'disabled' stanza and it's up to individual MPFS customers/boards to >> enable l2 cache controller if they want it. > > Disabling the device in the device-tree is orthogonal to fixing the > interrupt sequence. I would suggest that you use a separate patch for > adding status = "disabled";. Aye, not wrong there. At least from me, it was an observation on the way you discovered that the bug existed. I'll apply this patch today so - thanks for fixing it! Conor. > > Best regards > > Heinrich > >>> >>>> Fixes: e35b07a7df9b ("riscv: dts: microchip: mpfs: Group tuples in >>>> interrupt properties") >>> >>> BTW, it isn't really fixing this patch right? This is a dependency >>> for >>> backports to 5.15. >>> >>> Thanks for your patch, >>> Conor. >>> >>>> Fixes: 0fa6107eca41 ("RISC-V: Initial DTS for Microchip ICICLE >>>> board") >>>> Cc: Conor Dooley <conor.dooley@microchip.com> >>>> Cc: stable@vger.kernel.org >>>> Signed-off-by: Heinrich Schuchardt < >>>> heinrich.schuchardt@canonical.com> >>>> --- >>>> arch/riscv/boot/dts/microchip/mpfs.dtsi | 2 +- >>>> 1 file changed, 1 insertion(+), 1 deletion(-) >>>> >>>> diff --git a/arch/riscv/boot/dts/microchip/mpfs.dtsi >>>> b/arch/riscv/boot/dts/microchip/mpfs.dtsi >>>> index 496d3b7642bd..ec1de6344be9 100644 >>>> --- a/arch/riscv/boot/dts/microchip/mpfs.dtsi >>>> +++ b/arch/riscv/boot/dts/microchip/mpfs.dtsi >>>> @@ -169,7 +169,7 @@ cctrllr: cache-controller@2010000 { >>>> cache-size = <2097152>; >>>> cache-unified; >>>> interrupt-parent = <&plic>; >>>> - interrupts = <1>, <2>, <3>; >>>> + interrupts = <1>, <3>, <4>, <2>; >>>> }; >>>> >>>> clint: clint@2000000 { >>>> -- >>>> 2.36.1 >>>> >
From: Conor Dooley <conor.dooley@microchip.com> On Wed, 17 Aug 2022 15:25:21 +0200, Heinrich Schuchardt wrote: > The "PolarFire SoC MSS Technical Reference Manual" documents the > following PLIC interrupts: > > 1 - L2 Cache Controller Signals when a metadata correction event occurs > 2 - L2 Cache Controller Signals when an uncorrectable metadata event occurs > 3 - L2 Cache Controller Signals when a data correction event occurs > 4 - L2 Cache Controller Signals when an uncorrectable data event occurs > > [...] Added the impact of the bug & applied to dt-fixes, thanks! [1/1] riscv: dts: microchip: correct L2 cache interrupts https://git.kernel.org/conor/c/34fc9cc3aebe8b9 Thanks, Conor.
diff --git a/arch/riscv/boot/dts/microchip/mpfs.dtsi b/arch/riscv/boot/dts/microchip/mpfs.dtsi index 496d3b7642bd..ec1de6344be9 100644 --- a/arch/riscv/boot/dts/microchip/mpfs.dtsi +++ b/arch/riscv/boot/dts/microchip/mpfs.dtsi @@ -169,7 +169,7 @@ cctrllr: cache-controller@2010000 { cache-size = <2097152>; cache-unified; interrupt-parent = <&plic>; - interrupts = <1>, <2>, <3>; + interrupts = <1>, <3>, <4>, <2>; }; clint: clint@2000000 {
The "PolarFire SoC MSS Technical Reference Manual" documents the following PLIC interrupts: 1 - L2 Cache Controller Signals when a metadata correction event occurs 2 - L2 Cache Controller Signals when an uncorrectable metadata event occurs 3 - L2 Cache Controller Signals when a data correction event occurs 4 - L2 Cache Controller Signals when an uncorrectable data event occurs This differs from the SiFive FU540 which only has three L2 cache related interrupts. The sequence in the device tree is defined by an enum: enum { DIR_CORR = 0, DATA_CORR, DATA_UNCORR, DIR_UNCORR, }; So the correct sequence of the L2 cache interrupts is interrupts = <1>, <3>, <4>, <2>; Fixes: e35b07a7df9b ("riscv: dts: microchip: mpfs: Group tuples in interrupt properties") Fixes: 0fa6107eca41 ("RISC-V: Initial DTS for Microchip ICICLE board") Cc: Conor Dooley <conor.dooley@microchip.com> Cc: stable@vger.kernel.org Signed-off-by: Heinrich Schuchardt <heinrich.schuchardt@canonical.com> --- arch/riscv/boot/dts/microchip/mpfs.dtsi | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)