Message ID | 20231002094526.1.Ie8f760213053e3d11592f892b30912dbac6b8b48@changeid (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [1/2] arm64: smp: Fix pseudo NMI issues w/ broken Mediatek FW | expand |
On Mon, Oct 02, 2023 at 09:45:29AM -0700, Douglas Anderson wrote: > Some mediatek devices have the property > "mediatek,broken-save-restore-fw" in their GIC. This means that, > although the hardware supports pseudo-NMI, the firmware has a bug > that blocks enabling it. When we're in this state, > system_uses_irq_prio_masking() will return true but we'll fail to > actually enable the IRQ in the GIC. > > Let's make the code handle this. We'll detect that we failed to > request an IPI as NMI and fallback to requesting it normally. Though > we expect that either all of our requests will fail or all will > succeed, it's just as cheap to keep a per-IPI bitmap and that keeps us > robust. > > Fixes: 331a1b3a836c ("arm64: smp: Add arch support for backtrace using pseudo-NMI") > Reported-by: Chen-Yu Tsai <wenst@chromium.org> > Closes: https://issuetracker.google.com/issues/197061987#comment68 > Signed-off-by: Douglas Anderson <dianders@chromium.org> > --- > > arch/arm64/kernel/smp.c | 19 ++++++++++++------- > 1 file changed, 12 insertions(+), 7 deletions(-) I'm not too keen on falling back here when we have no idea why the request failed. I'd prefer if we could check the `supports_pseudo_nmis` static key directly to account for the case of broken FW, e.g. as below. Mark. ---->8---- From 72fdec05c64a74f21871b44c7c760bbe07cac044 Mon Sep 17 00:00:00 2001 From: Mark Rutland <mark.rutland@arm.com> Date: Mon, 2 Oct 2023 18:00:36 +0100 Subject: [PATCH] arm64: smp: avoid NMI IPIs with broken MediaTek FW Some MediaTek devices have broken firmware which corrupts some GICR registers behind the back of the OS, and pseudo-NMIs cannot be used on these devices. For more details see commit: 44bd78dd2b8897f5 ("irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues") We did not take this problem into account in commit: 331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI") Since that commit arm64's SMP code will try to setup some IPIs as pseudo-NMIs, even on systems with broken FW. The GICv3 code will (rightly) reject attempts to request interrupts as pseudo-NMIs, resulting in boot-time failures. Avoid the problem by taking the broken FW into account when deciding to request IPIs as pseudo-NMIs. The GICv3 driver maintains a static_key named "supports_pseudo_nmis" which is false on systems with broken FW, and we can consult this within ipi_should_be_nmi(). Fixes: 331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI") Reported-by: Chen-Yu Tsai <wenst@chromium.org> Closes: https://issuetracker.google.com/issues/197061987#comment68 Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Douglas Anderson <dianders@chromium.org> Cc: Marc Zyngier <maz@kernel.org> --- arch/arm64/kernel/smp.c | 5 ++++- drivers/irqchip/irq-gic-v3.c | 2 +- 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c index 814d9aa93b21b..061c69160f90f 100644 --- a/arch/arm64/kernel/smp.c +++ b/arch/arm64/kernel/smp.c @@ -964,7 +964,10 @@ static void smp_cross_call(const struct cpumask *target, unsigned int ipinr) static bool ipi_should_be_nmi(enum ipi_msg_type ipi) { - if (!system_uses_irq_prio_masking()) + DECLARE_STATIC_KEY_FALSE(supports_pseudo_nmis); + + if (!system_uses_irq_prio_masking() || + !static_branch_likely(&supports_pseudo_nmis)) return false; switch (ipi) { diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c index 787ccc880b22d..737da1b9aabf2 100644 --- a/drivers/irqchip/irq-gic-v3.c +++ b/drivers/irqchip/irq-gic-v3.c @@ -106,7 +106,7 @@ static DEFINE_STATIC_KEY_TRUE(supports_deactivate_key); * - Figure 4-7 Secure read of the priority field for a Non-secure Group 1 * interrupt. */ -static DEFINE_STATIC_KEY_FALSE(supports_pseudo_nmis); +DEFINE_STATIC_KEY_FALSE(supports_pseudo_nmis); DEFINE_STATIC_KEY_FALSE(gic_nonsecure_priorities); EXPORT_SYMBOL(gic_nonsecure_priorities);
Hi, On Mon, Oct 2, 2023 at 10:24 AM Mark Rutland <mark.rutland@arm.com> wrote: > > On Mon, Oct 02, 2023 at 09:45:29AM -0700, Douglas Anderson wrote: > > Some mediatek devices have the property > > "mediatek,broken-save-restore-fw" in their GIC. This means that, > > although the hardware supports pseudo-NMI, the firmware has a bug > > that blocks enabling it. When we're in this state, > > system_uses_irq_prio_masking() will return true but we'll fail to > > actually enable the IRQ in the GIC. > > > > Let's make the code handle this. We'll detect that we failed to > > request an IPI as NMI and fallback to requesting it normally. Though > > we expect that either all of our requests will fail or all will > > succeed, it's just as cheap to keep a per-IPI bitmap and that keeps us > > robust. > > > > Fixes: 331a1b3a836c ("arm64: smp: Add arch support for backtrace using pseudo-NMI") > > Reported-by: Chen-Yu Tsai <wenst@chromium.org> > > Closes: https://issuetracker.google.com/issues/197061987#comment68 > > Signed-off-by: Douglas Anderson <dianders@chromium.org> > > --- > > > > arch/arm64/kernel/smp.c | 19 ++++++++++++------- > > 1 file changed, 12 insertions(+), 7 deletions(-) > > I'm not too keen on falling back here when we have no idea why the request failed. > > I'd prefer if we could check the `supports_pseudo_nmis` static key directly to > account for the case of broken FW, e.g. as below. > > Mark. > > ---->8---- > From 72fdec05c64a74f21871b44c7c760bbe07cac044 Mon Sep 17 00:00:00 2001 > From: Mark Rutland <mark.rutland@arm.com> > Date: Mon, 2 Oct 2023 18:00:36 +0100 > Subject: [PATCH] arm64: smp: avoid NMI IPIs with broken MediaTek FW > > Some MediaTek devices have broken firmware which corrupts some GICR > registers behind the back of the OS, and pseudo-NMIs cannot be used on > these devices. For more details see commit: > > 44bd78dd2b8897f5 ("irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues") > > We did not take this problem into account in commit: > > 331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI") > > Since that commit arm64's SMP code will try to setup some IPIs as > pseudo-NMIs, even on systems with broken FW. The GICv3 code will > (rightly) reject attempts to request interrupts as pseudo-NMIs, > resulting in boot-time failures. > > Avoid the problem by taking the broken FW into account when deciding to > request IPIs as pseudo-NMIs. The GICv3 driver maintains a static_key > named "supports_pseudo_nmis" which is false on systems with broken FW, > and we can consult this within ipi_should_be_nmi(). > > Fixes: 331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI") > Reported-by: Chen-Yu Tsai <wenst@chromium.org> > Closes: https://issuetracker.google.com/issues/197061987#comment68 > Signed-off-by: Mark Rutland <mark.rutland@arm.com> > Cc: Douglas Anderson <dianders@chromium.org> > Cc: Marc Zyngier <maz@kernel.org> > --- > arch/arm64/kernel/smp.c | 5 ++++- > drivers/irqchip/irq-gic-v3.c | 2 +- > 2 files changed, 5 insertions(+), 2 deletions(-) Sure, this is OK w/ me as long as folks don't mind accessing the global here, it's OK w/ me: Reviewed-by: Douglas Anderson <dianders@chromium.org> It seems to work for me, thus: Tested-by: Douglas Anderson <dianders@chromium.org> > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c > index 814d9aa93b21b..061c69160f90f 100644 > --- a/arch/arm64/kernel/smp.c > +++ b/arch/arm64/kernel/smp.c > @@ -964,7 +964,10 @@ static void smp_cross_call(const struct cpumask *target, unsigned int ipinr) > > static bool ipi_should_be_nmi(enum ipi_msg_type ipi) > { > - if (!system_uses_irq_prio_masking()) > + DECLARE_STATIC_KEY_FALSE(supports_pseudo_nmis); > + > + if (!system_uses_irq_prio_masking() || > + !static_branch_likely(&supports_pseudo_nmis)) One thought, actually, is whether we should actually change system_uses_irq_prio_masking() to return the correct value. What do you think? -Doug
On Mon, Oct 02, 2023 at 12:16:17PM -0700, Doug Anderson wrote: > Hi, > > On Mon, Oct 2, 2023 at 10:24 AM Mark Rutland <mark.rutland@arm.com> wrote: > > > > On Mon, Oct 02, 2023 at 09:45:29AM -0700, Douglas Anderson wrote: > > > Some mediatek devices have the property > > > "mediatek,broken-save-restore-fw" in their GIC. This means that, > > > although the hardware supports pseudo-NMI, the firmware has a bug > > > that blocks enabling it. When we're in this state, > > > system_uses_irq_prio_masking() will return true but we'll fail to > > > actually enable the IRQ in the GIC. > > > > > > Let's make the code handle this. We'll detect that we failed to > > > request an IPI as NMI and fallback to requesting it normally. Though > > > we expect that either all of our requests will fail or all will > > > succeed, it's just as cheap to keep a per-IPI bitmap and that keeps us > > > robust. > > > > > > Fixes: 331a1b3a836c ("arm64: smp: Add arch support for backtrace using pseudo-NMI") > > > Reported-by: Chen-Yu Tsai <wenst@chromium.org> > > > Closes: https://issuetracker.google.com/issues/197061987#comment68 > > > Signed-off-by: Douglas Anderson <dianders@chromium.org> > > > --- > > > > > > arch/arm64/kernel/smp.c | 19 ++++++++++++------- > > > 1 file changed, 12 insertions(+), 7 deletions(-) > > > > I'm not too keen on falling back here when we have no idea why the request failed. > > > > I'd prefer if we could check the `supports_pseudo_nmis` static key directly to > > account for the case of broken FW, e.g. as below. > > > > Mark. > > > > ---->8---- > > From 72fdec05c64a74f21871b44c7c760bbe07cac044 Mon Sep 17 00:00:00 2001 > > From: Mark Rutland <mark.rutland@arm.com> > > Date: Mon, 2 Oct 2023 18:00:36 +0100 > > Subject: [PATCH] arm64: smp: avoid NMI IPIs with broken MediaTek FW > > > > Some MediaTek devices have broken firmware which corrupts some GICR > > registers behind the back of the OS, and pseudo-NMIs cannot be used on > > these devices. For more details see commit: > > > > 44bd78dd2b8897f5 ("irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues") > > > > We did not take this problem into account in commit: > > > > 331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI") > > > > Since that commit arm64's SMP code will try to setup some IPIs as > > pseudo-NMIs, even on systems with broken FW. The GICv3 code will > > (rightly) reject attempts to request interrupts as pseudo-NMIs, > > resulting in boot-time failures. > > > > Avoid the problem by taking the broken FW into account when deciding to > > request IPIs as pseudo-NMIs. The GICv3 driver maintains a static_key > > named "supports_pseudo_nmis" which is false on systems with broken FW, > > and we can consult this within ipi_should_be_nmi(). > > > > Fixes: 331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI") > > Reported-by: Chen-Yu Tsai <wenst@chromium.org> > > Closes: https://issuetracker.google.com/issues/197061987#comment68 > > Signed-off-by: Mark Rutland <mark.rutland@arm.com> > > Cc: Douglas Anderson <dianders@chromium.org> > > Cc: Marc Zyngier <maz@kernel.org> > > --- > > arch/arm64/kernel/smp.c | 5 ++++- > > drivers/irqchip/irq-gic-v3.c | 2 +- > > 2 files changed, 5 insertions(+), 2 deletions(-) > > Sure, this is OK w/ me as long as folks don't mind accessing the > global here, it's OK w/ me: > > Reviewed-by: Douglas Anderson <dianders@chromium.org> > > It seems to work for me, thus: > > Tested-by: Douglas Anderson <dianders@chromium.org> > > > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c > > index 814d9aa93b21b..061c69160f90f 100644 > > --- a/arch/arm64/kernel/smp.c > > +++ b/arch/arm64/kernel/smp.c > > @@ -964,7 +964,10 @@ static void smp_cross_call(const struct cpumask *target, unsigned int ipinr) > > > > static bool ipi_should_be_nmi(enum ipi_msg_type ipi) > > { > > - if (!system_uses_irq_prio_masking()) > > + DECLARE_STATIC_KEY_FALSE(supports_pseudo_nmis); > > + > > + if (!system_uses_irq_prio_masking() || > > + !static_branch_likely(&supports_pseudo_nmis)) > > One thought, actually, is whether we should actually change > system_uses_irq_prio_masking() to return the correct value. What do > you think? I don't think we should add this to system_uses_irq_prio_masking(); that's used by the low-level flags manipulation code that gets inlined all over the place, and that code will work regarldess of whether we actually use NMI priorities. If we want to avoid using PMR masking *at all* on these platforms, we'd need to detect that within can_use_gic_priorities() or early_enable_pseudo_nmi(). Thanks, Mark.
Hi, On Tue, Oct 3, 2023 at 5:29 AM Mark Rutland <mark.rutland@arm.com> wrote: > > On Mon, Oct 02, 2023 at 12:16:17PM -0700, Doug Anderson wrote: > > Hi, > > > > On Mon, Oct 2, 2023 at 10:24 AM Mark Rutland <mark.rutland@arm.com> wrote: > > > > > > On Mon, Oct 02, 2023 at 09:45:29AM -0700, Douglas Anderson wrote: > > > > Some mediatek devices have the property > > > > "mediatek,broken-save-restore-fw" in their GIC. This means that, > > > > although the hardware supports pseudo-NMI, the firmware has a bug > > > > that blocks enabling it. When we're in this state, > > > > system_uses_irq_prio_masking() will return true but we'll fail to > > > > actually enable the IRQ in the GIC. > > > > > > > > Let's make the code handle this. We'll detect that we failed to > > > > request an IPI as NMI and fallback to requesting it normally. Though > > > > we expect that either all of our requests will fail or all will > > > > succeed, it's just as cheap to keep a per-IPI bitmap and that keeps us > > > > robust. > > > > > > > > Fixes: 331a1b3a836c ("arm64: smp: Add arch support for backtrace using pseudo-NMI") > > > > Reported-by: Chen-Yu Tsai <wenst@chromium.org> > > > > Closes: https://issuetracker.google.com/issues/197061987#comment68 > > > > Signed-off-by: Douglas Anderson <dianders@chromium.org> > > > > --- > > > > > > > > arch/arm64/kernel/smp.c | 19 ++++++++++++------- > > > > 1 file changed, 12 insertions(+), 7 deletions(-) > > > > > > I'm not too keen on falling back here when we have no idea why the request failed. > > > > > > I'd prefer if we could check the `supports_pseudo_nmis` static key directly to > > > account for the case of broken FW, e.g. as below. > > > > > > Mark. > > > > > > ---->8---- > > > From 72fdec05c64a74f21871b44c7c760bbe07cac044 Mon Sep 17 00:00:00 2001 > > > From: Mark Rutland <mark.rutland@arm.com> > > > Date: Mon, 2 Oct 2023 18:00:36 +0100 > > > Subject: [PATCH] arm64: smp: avoid NMI IPIs with broken MediaTek FW > > > > > > Some MediaTek devices have broken firmware which corrupts some GICR > > > registers behind the back of the OS, and pseudo-NMIs cannot be used on > > > these devices. For more details see commit: > > > > > > 44bd78dd2b8897f5 ("irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues") > > > > > > We did not take this problem into account in commit: > > > > > > 331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI") > > > > > > Since that commit arm64's SMP code will try to setup some IPIs as > > > pseudo-NMIs, even on systems with broken FW. The GICv3 code will > > > (rightly) reject attempts to request interrupts as pseudo-NMIs, > > > resulting in boot-time failures. > > > > > > Avoid the problem by taking the broken FW into account when deciding to > > > request IPIs as pseudo-NMIs. The GICv3 driver maintains a static_key > > > named "supports_pseudo_nmis" which is false on systems with broken FW, > > > and we can consult this within ipi_should_be_nmi(). > > > > > > Fixes: 331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI") > > > Reported-by: Chen-Yu Tsai <wenst@chromium.org> > > > Closes: https://issuetracker.google.com/issues/197061987#comment68 > > > Signed-off-by: Mark Rutland <mark.rutland@arm.com> > > > Cc: Douglas Anderson <dianders@chromium.org> > > > Cc: Marc Zyngier <maz@kernel.org> > > > --- > > > arch/arm64/kernel/smp.c | 5 ++++- > > > drivers/irqchip/irq-gic-v3.c | 2 +- > > > 2 files changed, 5 insertions(+), 2 deletions(-) > > > > Sure, this is OK w/ me as long as folks don't mind accessing the > > global here, it's OK w/ me: > > > > Reviewed-by: Douglas Anderson <dianders@chromium.org> > > > > It seems to work for me, thus: > > > > Tested-by: Douglas Anderson <dianders@chromium.org> > > > > > > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c > > > index 814d9aa93b21b..061c69160f90f 100644 > > > --- a/arch/arm64/kernel/smp.c > > > +++ b/arch/arm64/kernel/smp.c > > > @@ -964,7 +964,10 @@ static void smp_cross_call(const struct cpumask *target, unsigned int ipinr) > > > > > > static bool ipi_should_be_nmi(enum ipi_msg_type ipi) > > > { > > > - if (!system_uses_irq_prio_masking()) > > > + DECLARE_STATIC_KEY_FALSE(supports_pseudo_nmis); > > > + > > > + if (!system_uses_irq_prio_masking() || > > > + !static_branch_likely(&supports_pseudo_nmis)) > > > > One thought, actually, is whether we should actually change > > system_uses_irq_prio_masking() to return the correct value. What do > > you think? > > I don't think we should add this to system_uses_irq_prio_masking(); that's used > by the low-level flags manipulation code that gets inlined all over the place, > and that code will work regarldess of whether we actually use NMI priorities. > > If we want to avoid using PMR masking *at all* on these platforms, we'd need to > detect that within can_use_gic_priorities() or early_enable_pseudo_nmi(). I suspect that anyone trying to use PMR masking on these systems for any purpose will be unhappy. The issue is talked about in: https://issuetracker.google.com/281831288 ...where you can see that the firmware on these systems isn't properly saving/restoring some registers, including GICR_IPRIORITYR. -Doug
On Tue, Oct 03, 2023 at 06:43:07AM -0700, Doug Anderson wrote: > Hi, > > On Tue, Oct 3, 2023 at 5:29 AM Mark Rutland <mark.rutland@arm.com> wrote: > > > > On Mon, Oct 02, 2023 at 12:16:17PM -0700, Doug Anderson wrote: > > > Hi, > > > > > > On Mon, Oct 2, 2023 at 10:24 AM Mark Rutland <mark.rutland@arm.com> wrote: > > > > > > > > On Mon, Oct 02, 2023 at 09:45:29AM -0700, Douglas Anderson wrote: > > > > > Some mediatek devices have the property > > > > > "mediatek,broken-save-restore-fw" in their GIC. This means that, > > > > > although the hardware supports pseudo-NMI, the firmware has a bug > > > > > that blocks enabling it. When we're in this state, > > > > > system_uses_irq_prio_masking() will return true but we'll fail to > > > > > actually enable the IRQ in the GIC. > > > > > > > > > > Let's make the code handle this. We'll detect that we failed to > > > > > request an IPI as NMI and fallback to requesting it normally. Though > > > > > we expect that either all of our requests will fail or all will > > > > > succeed, it's just as cheap to keep a per-IPI bitmap and that keeps us > > > > > robust. > > > > > > > > > > Fixes: 331a1b3a836c ("arm64: smp: Add arch support for backtrace using pseudo-NMI") > > > > > Reported-by: Chen-Yu Tsai <wenst@chromium.org> > > > > > Closes: https://issuetracker.google.com/issues/197061987#comment68 > > > > > Signed-off-by: Douglas Anderson <dianders@chromium.org> > > > > > --- > > > > > > > > > > arch/arm64/kernel/smp.c | 19 ++++++++++++------- > > > > > 1 file changed, 12 insertions(+), 7 deletions(-) > > > > > > > > I'm not too keen on falling back here when we have no idea why the request failed. > > > > > > > > I'd prefer if we could check the `supports_pseudo_nmis` static key directly to > > > > account for the case of broken FW, e.g. as below. > > > > > > > > Mark. > > > > > > > > ---->8---- > > > > From 72fdec05c64a74f21871b44c7c760bbe07cac044 Mon Sep 17 00:00:00 2001 > > > > From: Mark Rutland <mark.rutland@arm.com> > > > > Date: Mon, 2 Oct 2023 18:00:36 +0100 > > > > Subject: [PATCH] arm64: smp: avoid NMI IPIs with broken MediaTek FW > > > > > > > > Some MediaTek devices have broken firmware which corrupts some GICR > > > > registers behind the back of the OS, and pseudo-NMIs cannot be used on > > > > these devices. For more details see commit: > > > > > > > > 44bd78dd2b8897f5 ("irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues") > > > > > > > > We did not take this problem into account in commit: > > > > > > > > 331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI") > > > > > > > > Since that commit arm64's SMP code will try to setup some IPIs as > > > > pseudo-NMIs, even on systems with broken FW. The GICv3 code will > > > > (rightly) reject attempts to request interrupts as pseudo-NMIs, > > > > resulting in boot-time failures. > > > > > > > > Avoid the problem by taking the broken FW into account when deciding to > > > > request IPIs as pseudo-NMIs. The GICv3 driver maintains a static_key > > > > named "supports_pseudo_nmis" which is false on systems with broken FW, > > > > and we can consult this within ipi_should_be_nmi(). > > > > > > > > Fixes: 331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI") > > > > Reported-by: Chen-Yu Tsai <wenst@chromium.org> > > > > Closes: https://issuetracker.google.com/issues/197061987#comment68 > > > > Signed-off-by: Mark Rutland <mark.rutland@arm.com> > > > > Cc: Douglas Anderson <dianders@chromium.org> > > > > Cc: Marc Zyngier <maz@kernel.org> > > > > --- > > > > arch/arm64/kernel/smp.c | 5 ++++- > > > > drivers/irqchip/irq-gic-v3.c | 2 +- > > > > 2 files changed, 5 insertions(+), 2 deletions(-) > > > > > > Sure, this is OK w/ me as long as folks don't mind accessing the > > > global here, it's OK w/ me: > > > > > > Reviewed-by: Douglas Anderson <dianders@chromium.org> > > > > > > It seems to work for me, thus: > > > > > > Tested-by: Douglas Anderson <dianders@chromium.org> > > > > > > > > > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c > > > > index 814d9aa93b21b..061c69160f90f 100644 > > > > --- a/arch/arm64/kernel/smp.c > > > > +++ b/arch/arm64/kernel/smp.c > > > > @@ -964,7 +964,10 @@ static void smp_cross_call(const struct cpumask *target, unsigned int ipinr) > > > > > > > > static bool ipi_should_be_nmi(enum ipi_msg_type ipi) > > > > { > > > > - if (!system_uses_irq_prio_masking()) > > > > + DECLARE_STATIC_KEY_FALSE(supports_pseudo_nmis); > > > > + > > > > + if (!system_uses_irq_prio_masking() || > > > > + !static_branch_likely(&supports_pseudo_nmis)) > > > > > > One thought, actually, is whether we should actually change > > > system_uses_irq_prio_masking() to return the correct value. What do > > > you think? > > > > I don't think we should add this to system_uses_irq_prio_masking(); that's used > > by the low-level flags manipulation code that gets inlined all over the place, > > and that code will work regarldess of whether we actually use NMI priorities. > > > > If we want to avoid using PMR masking *at all* on these platforms, we'd need to > > detect that within can_use_gic_priorities() or early_enable_pseudo_nmi(). > > I suspect that anyone trying to use PMR masking on these systems for > any purpose will be unhappy. The issue is talked about in: > > https://issuetracker.google.com/281831288 > > ...where you can see that the firmware on these systems isn't properly > saving/restoring some registers, including GICR_IPRIORITYR. Ok, then that's a latent bug even before the IPI changes, going back to the original workaround in commit: 44bd78dd2b8897f5 ("irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues") For the sake of those reading the archive, can we have a better description of what exactly happens on these boards? IIUC on these boards the firmware fails to save+restore (some?) GICR registers across (some?) PSCI CPU_SUSPEND idle states. Which registers does it save+restore? Does it reset other registers into a specific state? Thanks, Mark.
Hi, On Tue, Oct 3, 2023 at 9:32 AM Mark Rutland <mark.rutland@arm.com> wrote: > > On Tue, Oct 03, 2023 at 06:43:07AM -0700, Doug Anderson wrote: > > Hi, > > > > On Tue, Oct 3, 2023 at 5:29 AM Mark Rutland <mark.rutland@arm.com> wrote: > > > > > > On Mon, Oct 02, 2023 at 12:16:17PM -0700, Doug Anderson wrote: > > > > Hi, > > > > > > > > On Mon, Oct 2, 2023 at 10:24 AM Mark Rutland <mark.rutland@arm.com> wrote: > > > > > > > > > > On Mon, Oct 02, 2023 at 09:45:29AM -0700, Douglas Anderson wrote: > > > > > > Some mediatek devices have the property > > > > > > "mediatek,broken-save-restore-fw" in their GIC. This means that, > > > > > > although the hardware supports pseudo-NMI, the firmware has a bug > > > > > > that blocks enabling it. When we're in this state, > > > > > > system_uses_irq_prio_masking() will return true but we'll fail to > > > > > > actually enable the IRQ in the GIC. > > > > > > > > > > > > Let's make the code handle this. We'll detect that we failed to > > > > > > request an IPI as NMI and fallback to requesting it normally. Though > > > > > > we expect that either all of our requests will fail or all will > > > > > > succeed, it's just as cheap to keep a per-IPI bitmap and that keeps us > > > > > > robust. > > > > > > > > > > > > Fixes: 331a1b3a836c ("arm64: smp: Add arch support for backtrace using pseudo-NMI") > > > > > > Reported-by: Chen-Yu Tsai <wenst@chromium.org> > > > > > > Closes: https://issuetracker.google.com/issues/197061987#comment68 > > > > > > Signed-off-by: Douglas Anderson <dianders@chromium.org> > > > > > > --- > > > > > > > > > > > > arch/arm64/kernel/smp.c | 19 ++++++++++++------- > > > > > > 1 file changed, 12 insertions(+), 7 deletions(-) > > > > > > > > > > I'm not too keen on falling back here when we have no idea why the request failed. > > > > > > > > > > I'd prefer if we could check the `supports_pseudo_nmis` static key directly to > > > > > account for the case of broken FW, e.g. as below. > > > > > > > > > > Mark. > > > > > > > > > > ---->8---- > > > > > From 72fdec05c64a74f21871b44c7c760bbe07cac044 Mon Sep 17 00:00:00 2001 > > > > > From: Mark Rutland <mark.rutland@arm.com> > > > > > Date: Mon, 2 Oct 2023 18:00:36 +0100 > > > > > Subject: [PATCH] arm64: smp: avoid NMI IPIs with broken MediaTek FW > > > > > > > > > > Some MediaTek devices have broken firmware which corrupts some GICR > > > > > registers behind the back of the OS, and pseudo-NMIs cannot be used on > > > > > these devices. For more details see commit: > > > > > > > > > > 44bd78dd2b8897f5 ("irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues") > > > > > > > > > > We did not take this problem into account in commit: > > > > > > > > > > 331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI") > > > > > > > > > > Since that commit arm64's SMP code will try to setup some IPIs as > > > > > pseudo-NMIs, even on systems with broken FW. The GICv3 code will > > > > > (rightly) reject attempts to request interrupts as pseudo-NMIs, > > > > > resulting in boot-time failures. > > > > > > > > > > Avoid the problem by taking the broken FW into account when deciding to > > > > > request IPIs as pseudo-NMIs. The GICv3 driver maintains a static_key > > > > > named "supports_pseudo_nmis" which is false on systems with broken FW, > > > > > and we can consult this within ipi_should_be_nmi(). > > > > > > > > > > Fixes: 331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI") > > > > > Reported-by: Chen-Yu Tsai <wenst@chromium.org> > > > > > Closes: https://issuetracker.google.com/issues/197061987#comment68 > > > > > Signed-off-by: Mark Rutland <mark.rutland@arm.com> > > > > > Cc: Douglas Anderson <dianders@chromium.org> > > > > > Cc: Marc Zyngier <maz@kernel.org> > > > > > --- > > > > > arch/arm64/kernel/smp.c | 5 ++++- > > > > > drivers/irqchip/irq-gic-v3.c | 2 +- > > > > > 2 files changed, 5 insertions(+), 2 deletions(-) > > > > > > > > Sure, this is OK w/ me as long as folks don't mind accessing the > > > > global here, it's OK w/ me: > > > > > > > > Reviewed-by: Douglas Anderson <dianders@chromium.org> > > > > > > > > It seems to work for me, thus: > > > > > > > > Tested-by: Douglas Anderson <dianders@chromium.org> > > > > > > > > > > > > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c > > > > > index 814d9aa93b21b..061c69160f90f 100644 > > > > > --- a/arch/arm64/kernel/smp.c > > > > > +++ b/arch/arm64/kernel/smp.c > > > > > @@ -964,7 +964,10 @@ static void smp_cross_call(const struct cpumask *target, unsigned int ipinr) > > > > > > > > > > static bool ipi_should_be_nmi(enum ipi_msg_type ipi) > > > > > { > > > > > - if (!system_uses_irq_prio_masking()) > > > > > + DECLARE_STATIC_KEY_FALSE(supports_pseudo_nmis); > > > > > + > > > > > + if (!system_uses_irq_prio_masking() || > > > > > + !static_branch_likely(&supports_pseudo_nmis)) > > > > > > > > One thought, actually, is whether we should actually change > > > > system_uses_irq_prio_masking() to return the correct value. What do > > > > you think? > > > > > > I don't think we should add this to system_uses_irq_prio_masking(); that's used > > > by the low-level flags manipulation code that gets inlined all over the place, > > > and that code will work regarldess of whether we actually use NMI priorities. > > > > > > If we want to avoid using PMR masking *at all* on these platforms, we'd need to > > > detect that within can_use_gic_priorities() or early_enable_pseudo_nmi(). > > > > I suspect that anyone trying to use PMR masking on these systems for > > any purpose will be unhappy. The issue is talked about in: > > > > https://issuetracker.google.com/281831288 > > > > ...where you can see that the firmware on these systems isn't properly > > saving/restoring some registers, including GICR_IPRIORITYR. > > Ok, then that's a latent bug even before the IPI changes, going back to the > original workaround in commit: > > 44bd78dd2b8897f5 ("irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues") > > For the sake of those reading the archive, can we have a better description of > what exactly happens on these boards? > > IIUC on these boards the firmware fails to save+restore (some?) GICR registers > across (some?) PSCI CPU_SUSPEND idle states. > > Which registers does it save+restore? > > Does it reset other registers into a specific state? Though I'm not an expert in this area, my understanding is that in some of the deeper idle states then GICR registers are lost. That matches a thread [0] I found. In early investigation I found that I could comment out `cpu-idle-states` in the device tree to avoid the problems [1]. I believe this is fully expected which is why firmware is supposed to save/restore these registers whenever a low power is entered/exited. I'd presume that any register not properly saved/restored comes up in whatever its default state is. As far as pseudo-NMI was concerned, all I really needed to save/restore was "GICR_NUM_IPRIORITYR" [2], but Marc Zyngier looked at the code and identified [3] at least these in addition: * GICR_CTLR * GICR_ISPENDR0 * GICR_ISACTIVER0 * GICR_NSACR That list seems to match the Arm Trusted Firmware patch that fixed the issue [4]. ...but it will be impossible to ever get the fix rolled out to all devices. Even if we could get firmware spins Qualified for every device there will still be cases where we'll boot with the old firmware. Since we _don't_ bundle the device tree with the firmware, we believe that the quirk mechanism that we came up with (add a quirk in never device trees and firmware removes the quirk when we have a fix) is at least a robust/reliable way to detect the issue. The whole issue seems rather concerning, but (apparently) it never caused issues in the kernel until we tried to use pseudo-NMI. [0] https://github.com/ARM-software/tf-issues/issues/464 [1] https://issuetracker.google.com/issues/197061987#comment27 [2] https://crrev.com/c/4519877 [3] https://issuetracker.google.com/issues/281831288#comment5 [4] https://github.com/ARM-software/arm-trusted-firmware/commit/1c62cc7fbdf2ec2a7e69b3c027d507fcafdcaa12
On Tue, Oct 03, 2023 at 12:32:39PM -0700, Doug Anderson wrote: > Hi, > > On Tue, Oct 3, 2023 at 9:32 AM Mark Rutland <mark.rutland@arm.com> wrote: > > > > On Tue, Oct 03, 2023 at 06:43:07AM -0700, Doug Anderson wrote: > > > Hi, > > > > > > On Tue, Oct 3, 2023 at 5:29 AM Mark Rutland <mark.rutland@arm.com> wrote: > > > > > > > > On Mon, Oct 02, 2023 at 12:16:17PM -0700, Doug Anderson wrote: > > > > > Hi, > > > > > > > > > > On Mon, Oct 2, 2023 at 10:24 AM Mark Rutland <mark.rutland@arm.com> wrote: > > > > > > > > > > > > On Mon, Oct 02, 2023 at 09:45:29AM -0700, Douglas Anderson wrote: > > > > > > > Some mediatek devices have the property > > > > > > > "mediatek,broken-save-restore-fw" in their GIC. This means that, > > > > > > > although the hardware supports pseudo-NMI, the firmware has a bug > > > > > > > that blocks enabling it. When we're in this state, > > > > > > > system_uses_irq_prio_masking() will return true but we'll fail to > > > > > > > actually enable the IRQ in the GIC. > > > > > > > > > > > > > > Let's make the code handle this. We'll detect that we failed to > > > > > > > request an IPI as NMI and fallback to requesting it normally. Though > > > > > > > we expect that either all of our requests will fail or all will > > > > > > > succeed, it's just as cheap to keep a per-IPI bitmap and that keeps us > > > > > > > robust. > > > > > > > > > > > > > > Fixes: 331a1b3a836c ("arm64: smp: Add arch support for backtrace using pseudo-NMI") > > > > > > > Reported-by: Chen-Yu Tsai <wenst@chromium.org> > > > > > > > Closes: https://issuetracker.google.com/issues/197061987#comment68 > > > > > > > Signed-off-by: Douglas Anderson <dianders@chromium.org> > > > > > > > --- > > > > > > > > > > > > > > arch/arm64/kernel/smp.c | 19 ++++++++++++------- > > > > > > > 1 file changed, 12 insertions(+), 7 deletions(-) > > > > > > > > > > > > I'm not too keen on falling back here when we have no idea why the request failed. > > > > > > > > > > > > I'd prefer if we could check the `supports_pseudo_nmis` static key directly to > > > > > > account for the case of broken FW, e.g. as below. > > > > > > > > > > > > Mark. > > > > > > > > > > > > ---->8---- > > > > > > From 72fdec05c64a74f21871b44c7c760bbe07cac044 Mon Sep 17 00:00:00 2001 > > > > > > From: Mark Rutland <mark.rutland@arm.com> > > > > > > Date: Mon, 2 Oct 2023 18:00:36 +0100 > > > > > > Subject: [PATCH] arm64: smp: avoid NMI IPIs with broken MediaTek FW > > > > > > > > > > > > Some MediaTek devices have broken firmware which corrupts some GICR > > > > > > registers behind the back of the OS, and pseudo-NMIs cannot be used on > > > > > > these devices. For more details see commit: > > > > > > > > > > > > 44bd78dd2b8897f5 ("irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues") > > > > > > > > > > > > We did not take this problem into account in commit: > > > > > > > > > > > > 331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI") > > > > > > > > > > > > Since that commit arm64's SMP code will try to setup some IPIs as > > > > > > pseudo-NMIs, even on systems with broken FW. The GICv3 code will > > > > > > (rightly) reject attempts to request interrupts as pseudo-NMIs, > > > > > > resulting in boot-time failures. > > > > > > > > > > > > Avoid the problem by taking the broken FW into account when deciding to > > > > > > request IPIs as pseudo-NMIs. The GICv3 driver maintains a static_key > > > > > > named "supports_pseudo_nmis" which is false on systems with broken FW, > > > > > > and we can consult this within ipi_should_be_nmi(). > > > > > > > > > > > > Fixes: 331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI") > > > > > > Reported-by: Chen-Yu Tsai <wenst@chromium.org> > > > > > > Closes: https://issuetracker.google.com/issues/197061987#comment68 > > > > > > Signed-off-by: Mark Rutland <mark.rutland@arm.com> > > > > > > Cc: Douglas Anderson <dianders@chromium.org> > > > > > > Cc: Marc Zyngier <maz@kernel.org> > > > > > > --- > > > > > > arch/arm64/kernel/smp.c | 5 ++++- > > > > > > drivers/irqchip/irq-gic-v3.c | 2 +- > > > > > > 2 files changed, 5 insertions(+), 2 deletions(-) > > > > > > > > > > Sure, this is OK w/ me as long as folks don't mind accessing the > > > > > global here, it's OK w/ me: > > > > > > > > > > Reviewed-by: Douglas Anderson <dianders@chromium.org> > > > > > > > > > > It seems to work for me, thus: > > > > > > > > > > Tested-by: Douglas Anderson <dianders@chromium.org> > > > > > > > > > > > > > > > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c > > > > > > index 814d9aa93b21b..061c69160f90f 100644 > > > > > > --- a/arch/arm64/kernel/smp.c > > > > > > +++ b/arch/arm64/kernel/smp.c > > > > > > @@ -964,7 +964,10 @@ static void smp_cross_call(const struct cpumask *target, unsigned int ipinr) > > > > > > > > > > > > static bool ipi_should_be_nmi(enum ipi_msg_type ipi) > > > > > > { > > > > > > - if (!system_uses_irq_prio_masking()) > > > > > > + DECLARE_STATIC_KEY_FALSE(supports_pseudo_nmis); > > > > > > + > > > > > > + if (!system_uses_irq_prio_masking() || > > > > > > + !static_branch_likely(&supports_pseudo_nmis)) > > > > > > > > > > One thought, actually, is whether we should actually change > > > > > system_uses_irq_prio_masking() to return the correct value. What do > > > > > you think? > > > > > > > > I don't think we should add this to system_uses_irq_prio_masking(); that's used > > > > by the low-level flags manipulation code that gets inlined all over the place, > > > > and that code will work regarldess of whether we actually use NMI priorities. > > > > > > > > If we want to avoid using PMR masking *at all* on these platforms, we'd need to > > > > detect that within can_use_gic_priorities() or early_enable_pseudo_nmi(). > > > > > > I suspect that anyone trying to use PMR masking on these systems for > > > any purpose will be unhappy. The issue is talked about in: > > > > > > https://issuetracker.google.com/281831288 > > > > > > ...where you can see that the firmware on these systems isn't properly > > > saving/restoring some registers, including GICR_IPRIORITYR. > > > > Ok, then that's a latent bug even before the IPI changes, going back to the > > original workaround in commit: > > > > 44bd78dd2b8897f5 ("irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues") > > > > For the sake of those reading the archive, can we have a better description of > > what exactly happens on these boards? > > > > IIUC on these boards the firmware fails to save+restore (some?) GICR registers > > across (some?) PSCI CPU_SUSPEND idle states. > > > > Which registers does it save+restore? > > > > Does it reset other registers into a specific state? > > Though I'm not an expert in this area, my understanding is that in > some of the deeper idle states then GICR registers are lost. That > matches a thread [0] I found. In early investigation I found that I > could comment out `cpu-idle-states` in the device tree to avoid the > problems [1]. I believe this is fully expected which is why firmware > is supposed to save/restore these registers whenever a low power is > entered/exited. I'd presume that any register not properly > saved/restored comes up in whatever its default state is. > > As far as pseudo-NMI was concerned, all I really needed to > save/restore was "GICR_NUM_IPRIORITYR" [2], but Marc Zyngier looked at > the code and identified [3] at least these in addition: > * GICR_CTLR > * GICR_ISPENDR0 > * GICR_ISACTIVER0 > * GICR_NSACR Looking at the GIC spec (Arm IHI 0069H), page 12-673, I see for all of the GICR_IPRIORITYR<n>.Priority_offset_*B fields: | The reset behavior of this field is: | • On a GIC reset, this field resets to an architecturally UNKNOWN value. ... so at least per the architecture these could be reset to arbitrary values, and that priority might permit SGI/PPIs to be taken IRQs are priority-masked, or to prevent SGI/PPIs to be taken when priority-unmasked. I also see for GICR_CTLR that EnableLPIs would be reset to 0, and IIUC that means LPIs won't work on these parts, which seems like a problem. > That list seems to match the Arm Trusted Firmware patch that fixed the > issue [4]. ...but it will be impossible to ever get the fix rolled out > to all devices. Even if we could get firmware spins Qualified for > every device there will still be cases where we'll boot with the old > firmware. Since we _don't_ bundle the device tree with the firmware, > we believe that the quirk mechanism that we came up with (add a quirk > in never device trees and firmware removes the quirk when we have a > fix) is at least a robust/reliable way to detect the issue. > > The whole issue seems rather concerning, but (apparently) it never > caused issues in the kernel until we tried to use pseudo-NMI. Given you haven't seen any issues, I suspect those are getting reset to fixed values that happens to work out for us, but it is a bit worrisome more generally (e.g. the LPI case above). Mark. > > [0] https://github.com/ARM-software/tf-issues/issues/464 > [1] https://issuetracker.google.com/issues/197061987#comment27 > [2] https://crrev.com/c/4519877 > [3] https://issuetracker.google.com/issues/281831288#comment5 > [4] https://github.com/ARM-software/arm-trusted-firmware/commit/1c62cc7fbdf2ec2a7e69b3c027d507fcafdcaa12
On Wed, 04 Oct 2023 10:59:50 +0100, Mark Rutland <mark.rutland@arm.com> wrote: > > Given you haven't seen any issues, I suspect those are getting reset to fixed > values that happens to work out for us, but it is a bit worrisome more > generally (e.g. the LPI case above). It is likely that these SoCs don't even have an ITS. M.
Hi, On Wed, Oct 4, 2023 at 3:15 AM Marc Zyngier <maz@kernel.org> wrote: > > On Wed, 04 Oct 2023 10:59:50 +0100, > Mark Rutland <mark.rutland@arm.com> wrote: > > > > Given you haven't seen any issues, I suspect those are getting reset to fixed > > values that happens to work out for us, but it is a bit worrisome more > > generally (e.g. the LPI case above). > > It is likely that these SoCs don't even have an ITS. Right. That was what we decided [1] when Marc pointed this out earlier. Overall: we know that this firmware behavior is not good but we're stuck with it. :( At the very least, any new devices coming out will have this fixed. Presumably if old devices are working OK enough today (as long as you don't enable pseudo-NMI) then they can be made to keep working? So circling back: what patch should we actually land? As of right now only pseudo-NMI is broken, but it would be good to make sure that if the kernel later adds other features that would be broken on this hardware that it gets handled properly... [1] https://issuetracker.google.com/issues/281831288#comment4
On Wed, Oct 04, 2023 at 07:04:12AM -0700, Doug Anderson wrote: > On Wed, Oct 4, 2023 at 3:15 AM Marc Zyngier <maz@kernel.org> wrote: > > > > On Wed, 04 Oct 2023 10:59:50 +0100, > > Mark Rutland <mark.rutland@arm.com> wrote: > > > > > > Given you haven't seen any issues, I suspect those are getting reset to fixed > > > values that happens to work out for us, but it is a bit worrisome more > > > generally (e.g. the LPI case above). > > > > It is likely that these SoCs don't even have an ITS. > > Right. That was what we decided [1] when Marc pointed this out earlier. > > Overall: we know that this firmware behavior is not good but we're > stuck with it. :( At the very least, any new devices coming out will > have this fixed. Presumably if old devices are working OK enough today > (as long as you don't enable pseudo-NMI) then they can be made to keep > working? > > So circling back: what patch should we actually land? For now I'd prefer we took the patch I sent in: https://lore.kernel.org/linux-arm-kernel/ZRr8r7XMoyDKaitd@FVFF77S0Q05N.cambridge.arm.com/ ... as that leaves us no worse than before this series, and it's pretty simple. > As of right now only pseudo-NMI is broken, but it would be good to make sure > that if the kernel later adds other features that would be broken on this > hardware that it gets handled properly... Going further than the above, I think there are three options here: 1) Complete fix: depend on a working firmware, and throw this workaround away. IIUC from the above, that's not something you can commit to. 2) Partial fix: have the kernel save/restore everything. IIUC this is unpalatable. 3) Partial fix: make the ARM64_HAS_GIC_PRIO_MASKING cpucap depend on the absence of a "mediatek,broken-save-restore-fw" property in the DT. I believe we can check that in early_enable_pseudo_nmi() or can_use_gic_priorities(). That'll avoid potential issues if/when we change the priorities used for pNMI (which is something I've been looking at). I'm happy with (3) if Marc is. Mark.
Hi, On Thu, Oct 5, 2023 at 3:27 AM Mark Rutland <mark.rutland@arm.com> wrote: > > On Wed, Oct 04, 2023 at 07:04:12AM -0700, Doug Anderson wrote: > > On Wed, Oct 4, 2023 at 3:15 AM Marc Zyngier <maz@kernel.org> wrote: > > > > > > On Wed, 04 Oct 2023 10:59:50 +0100, > > > Mark Rutland <mark.rutland@arm.com> wrote: > > > > > > > > Given you haven't seen any issues, I suspect those are getting reset to fixed > > > > values that happens to work out for us, but it is a bit worrisome more > > > > generally (e.g. the LPI case above). > > > > > > It is likely that these SoCs don't even have an ITS. > > > > Right. That was what we decided [1] when Marc pointed this out earlier. > > > > Overall: we know that this firmware behavior is not good but we're > > stuck with it. :( At the very least, any new devices coming out will > > have this fixed. Presumably if old devices are working OK enough today > > (as long as you don't enable pseudo-NMI) then they can be made to keep > > working? > > > > So circling back: what patch should we actually land? > > For now I'd prefer we took the patch I sent in: > > https://lore.kernel.org/linux-arm-kernel/ZRr8r7XMoyDKaitd@FVFF77S0Q05N.cambridge.arm.com/ > > ... as that leaves us no worse than before this series, and it's pretty simple. Sounds good to me! Catalin / Will: Please yell if there's anything you need me to do. Otherwise I'll assume you'll pick up Mark's patch instead of my patch #1 and then you'll pick up my patch #2. > > As of right now only pseudo-NMI is broken, but it would be good to make sure > > that if the kernel later adds other features that would be broken on this > > hardware that it gets handled properly... > > Going further than the above, I think there are three options here: > > 1) Complete fix: depend on a working firmware, and throw this workaround away. > > IIUC from the above, that's not something you can commit to. Right. We've landed the fix in the firmware branch for many of the devices that had the issue but there's a whole process (and cost involved) in getting this actually rolled out. Each unique board needs to kick off a FW qual. Given that nothing is actually broken today it's hard to justify a FW qual just for this, so we're left piggybacking on the next reason for a FW qual (if there is one). ...even if we could kick them off all instantly, though, it's always best not to rely on a FW fix, especially if (as in this case) it's to keep us from crashing. There'll always be some case where someone tries to boot a new OS with an old firmware. One such case can happen when booting recovery images on ChromeOS where the device always boots from the Read-only (not updatable) firmware. > 2) Partial fix: have the kernel save/restore everything. > > IIUC this is unpalatable. Yeah, Marc already NAKed it. > 3) Partial fix: make the ARM64_HAS_GIC_PRIO_MASKING cpucap depend on the > absence of a "mediatek,broken-save-restore-fw" property in the DT. I believe > we can check that in early_enable_pseudo_nmi() or can_use_gic_priorities(). > > That'll avoid potential issues if/when we change the priorities used for > pNMI (which is something I've been looking at). > > I'm happy with (3) if Marc is. Yeah, that seems best to me as a long term solution, too. -Doug
On Mon, 02 Oct 2023 18:24:11 +0100, Mark Rutland <mark.rutland@arm.com> wrote: > > On Mon, Oct 02, 2023 at 09:45:29AM -0700, Douglas Anderson wrote: > > Some mediatek devices have the property > > "mediatek,broken-save-restore-fw" in their GIC. This means that, > > although the hardware supports pseudo-NMI, the firmware has a bug > > that blocks enabling it. When we're in this state, > > system_uses_irq_prio_masking() will return true but we'll fail to > > actually enable the IRQ in the GIC. > > > > Let's make the code handle this. We'll detect that we failed to > > request an IPI as NMI and fallback to requesting it normally. Though > > we expect that either all of our requests will fail or all will > > succeed, it's just as cheap to keep a per-IPI bitmap and that keeps us > > robust. > > > > Fixes: 331a1b3a836c ("arm64: smp: Add arch support for backtrace using pseudo-NMI") > > Reported-by: Chen-Yu Tsai <wenst@chromium.org> > > Closes: https://issuetracker.google.com/issues/197061987#comment68 > > Signed-off-by: Douglas Anderson <dianders@chromium.org> > > --- > > > > arch/arm64/kernel/smp.c | 19 ++++++++++++------- > > 1 file changed, 12 insertions(+), 7 deletions(-) > > I'm not too keen on falling back here when we have no idea why the request failed. > > I'd prefer if we could check the `supports_pseudo_nmis` static key directly to > account for the case of broken FW, e.g. as below. > > Mark. > > ---->8---- > From 72fdec05c64a74f21871b44c7c760bbe07cac044 Mon Sep 17 00:00:00 2001 > From: Mark Rutland <mark.rutland@arm.com> > Date: Mon, 2 Oct 2023 18:00:36 +0100 > Subject: [PATCH] arm64: smp: avoid NMI IPIs with broken MediaTek FW > > Some MediaTek devices have broken firmware which corrupts some GICR > registers behind the back of the OS, and pseudo-NMIs cannot be used on > these devices. For more details see commit: > > 44bd78dd2b8897f5 ("irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues") > > We did not take this problem into account in commit: > > 331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI") > > Since that commit arm64's SMP code will try to setup some IPIs as > pseudo-NMIs, even on systems with broken FW. The GICv3 code will > (rightly) reject attempts to request interrupts as pseudo-NMIs, > resulting in boot-time failures. > > Avoid the problem by taking the broken FW into account when deciding to > request IPIs as pseudo-NMIs. The GICv3 driver maintains a static_key > named "supports_pseudo_nmis" which is false on systems with broken FW, > and we can consult this within ipi_should_be_nmi(). > > Fixes: 331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI") > Reported-by: Chen-Yu Tsai <wenst@chromium.org> > Closes: https://issuetracker.google.com/issues/197061987#comment68 > Signed-off-by: Mark Rutland <mark.rutland@arm.com> > Cc: Douglas Anderson <dianders@chromium.org> > Cc: Marc Zyngier <maz@kernel.org> > --- > arch/arm64/kernel/smp.c | 5 ++++- > drivers/irqchip/irq-gic-v3.c | 2 +- > 2 files changed, 5 insertions(+), 2 deletions(-) > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c > index 814d9aa93b21b..061c69160f90f 100644 > --- a/arch/arm64/kernel/smp.c > +++ b/arch/arm64/kernel/smp.c > @@ -964,7 +964,10 @@ static void smp_cross_call(const struct cpumask *target, unsigned int ipinr) > > static bool ipi_should_be_nmi(enum ipi_msg_type ipi) > { > - if (!system_uses_irq_prio_masking()) > + DECLARE_STATIC_KEY_FALSE(supports_pseudo_nmis); > + > + if (!system_uses_irq_prio_masking() || > + !static_branch_likely(&supports_pseudo_nmis)) > return false; > > switch (ipi) { > diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c > index 787ccc880b22d..737da1b9aabf2 100644 > --- a/drivers/irqchip/irq-gic-v3.c > +++ b/drivers/irqchip/irq-gic-v3.c > @@ -106,7 +106,7 @@ static DEFINE_STATIC_KEY_TRUE(supports_deactivate_key); > * - Figure 4-7 Secure read of the priority field for a Non-secure Group 1 > * interrupt. > */ > -static DEFINE_STATIC_KEY_FALSE(supports_pseudo_nmis); > +DEFINE_STATIC_KEY_FALSE(supports_pseudo_nmis); > > DEFINE_STATIC_KEY_FALSE(gic_nonsecure_priorities); > EXPORT_SYMBOL(gic_nonsecure_priorities); This last hunk is going to result in more spam from the robots about global objects without a previous declaration. Not that I care the least, but worth mentioning. Other than that: Reviewed-by: Marc Zyngier <maz@kernel.org> Please take it via the arm64 tree with patch #1 Thanks, M.
On Tue, Oct 3, 2023 at 1:24 AM Mark Rutland <mark.rutland@arm.com> wrote: > > On Mon, Oct 02, 2023 at 09:45:29AM -0700, Douglas Anderson wrote: > > Some mediatek devices have the property > > "mediatek,broken-save-restore-fw" in their GIC. This means that, > > although the hardware supports pseudo-NMI, the firmware has a bug > > that blocks enabling it. When we're in this state, > > system_uses_irq_prio_masking() will return true but we'll fail to > > actually enable the IRQ in the GIC. > > > > Let's make the code handle this. We'll detect that we failed to > > request an IPI as NMI and fallback to requesting it normally. Though > > we expect that either all of our requests will fail or all will > > succeed, it's just as cheap to keep a per-IPI bitmap and that keeps us > > robust. > > > > Fixes: 331a1b3a836c ("arm64: smp: Add arch support for backtrace using pseudo-NMI") > > Reported-by: Chen-Yu Tsai <wenst@chromium.org> > > Closes: https://issuetracker.google.com/issues/197061987#comment68 > > Signed-off-by: Douglas Anderson <dianders@chromium.org> > > --- > > > > arch/arm64/kernel/smp.c | 19 ++++++++++++------- > > 1 file changed, 12 insertions(+), 7 deletions(-) > > I'm not too keen on falling back here when we have no idea why the request failed. > > I'd prefer if we could check the `supports_pseudo_nmis` static key directly to > account for the case of broken FW, e.g. as below. > > Mark. > > ---->8---- > From 72fdec05c64a74f21871b44c7c760bbe07cac044 Mon Sep 17 00:00:00 2001 > From: Mark Rutland <mark.rutland@arm.com> > Date: Mon, 2 Oct 2023 18:00:36 +0100 > Subject: [PATCH] arm64: smp: avoid NMI IPIs with broken MediaTek FW > > Some MediaTek devices have broken firmware which corrupts some GICR > registers behind the back of the OS, and pseudo-NMIs cannot be used on > these devices. For more details see commit: > > 44bd78dd2b8897f5 ("irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues") > > We did not take this problem into account in commit: > > 331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI") > > Since that commit arm64's SMP code will try to setup some IPIs as > pseudo-NMIs, even on systems with broken FW. The GICv3 code will > (rightly) reject attempts to request interrupts as pseudo-NMIs, > resulting in boot-time failures. > > Avoid the problem by taking the broken FW into account when deciding to > request IPIs as pseudo-NMIs. The GICv3 driver maintains a static_key > named "supports_pseudo_nmis" which is false on systems with broken FW, > and we can consult this within ipi_should_be_nmi(). > > Fixes: 331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI") > Reported-by: Chen-Yu Tsai <wenst@chromium.org> > Closes: https://issuetracker.google.com/issues/197061987#comment68 > Signed-off-by: Mark Rutland <mark.rutland@arm.com> > Cc: Douglas Anderson <dianders@chromium.org> > Cc: Marc Zyngier <maz@kernel.org> Tested-by: Chen-Yu Tsai <wenst@chromium.org> > --- > arch/arm64/kernel/smp.c | 5 ++++- > drivers/irqchip/irq-gic-v3.c | 2 +- > 2 files changed, 5 insertions(+), 2 deletions(-) > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c > index 814d9aa93b21b..061c69160f90f 100644 > --- a/arch/arm64/kernel/smp.c > +++ b/arch/arm64/kernel/smp.c > @@ -964,7 +964,10 @@ static void smp_cross_call(const struct cpumask *target, unsigned int ipinr) > > static bool ipi_should_be_nmi(enum ipi_msg_type ipi) > { > - if (!system_uses_irq_prio_masking()) > + DECLARE_STATIC_KEY_FALSE(supports_pseudo_nmis); > + > + if (!system_uses_irq_prio_masking() || > + !static_branch_likely(&supports_pseudo_nmis)) > return false; > > switch (ipi) { > diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c > index 787ccc880b22d..737da1b9aabf2 100644 > --- a/drivers/irqchip/irq-gic-v3.c > +++ b/drivers/irqchip/irq-gic-v3.c > @@ -106,7 +106,7 @@ static DEFINE_STATIC_KEY_TRUE(supports_deactivate_key); > * - Figure 4-7 Secure read of the priority field for a Non-secure Group 1 > * interrupt. > */ > -static DEFINE_STATIC_KEY_FALSE(supports_pseudo_nmis); > +DEFINE_STATIC_KEY_FALSE(supports_pseudo_nmis); > > DEFINE_STATIC_KEY_FALSE(gic_nonsecure_priorities); > EXPORT_SYMBOL(gic_nonsecure_priorities); > -- > 2.30.2 > >
On Thu, 05 Oct 2023 11:27:26 +0100, Mark Rutland <mark.rutland@arm.com> wrote: > 3) Partial fix: make the ARM64_HAS_GIC_PRIO_MASKING cpucap depend on the > absence of a "mediatek,broken-save-restore-fw" property in the DT. I believe > we can check that in early_enable_pseudo_nmi() or can_use_gic_priorities(). > > That'll avoid potential issues if/when we change the priorities used for > pNMI (which is something I've been looking at). > > I'm happy with (3) if Marc is. Definitely worth investigating if you have the bandwidth. M.
On Thu, Oct 05, 2023 at 08:34:56AM -0700, Doug Anderson wrote: > On Thu, Oct 5, 2023 at 3:27 AM Mark Rutland <mark.rutland@arm.com> wrote: > > On Wed, Oct 04, 2023 at 07:04:12AM -0700, Doug Anderson wrote: > > > On Wed, Oct 4, 2023 at 3:15 AM Marc Zyngier <maz@kernel.org> wrote: > > > > On Wed, 04 Oct 2023 10:59:50 +0100, > > > > Mark Rutland <mark.rutland@arm.com> wrote: > > > > > Given you haven't seen any issues, I suspect those are getting reset to fixed > > > > > values that happens to work out for us, but it is a bit worrisome more > > > > > generally (e.g. the LPI case above). > > > > > > > > It is likely that these SoCs don't even have an ITS. > > > > > > Right. That was what we decided [1] when Marc pointed this out earlier. > > > > > > Overall: we know that this firmware behavior is not good but we're > > > stuck with it. :( At the very least, any new devices coming out will > > > have this fixed. Presumably if old devices are working OK enough today > > > (as long as you don't enable pseudo-NMI) then they can be made to keep > > > working? > > > > > > So circling back: what patch should we actually land? > > > > For now I'd prefer we took the patch I sent in: > > > > https://lore.kernel.org/linux-arm-kernel/ZRr8r7XMoyDKaitd@FVFF77S0Q05N.cambridge.arm.com/ > > > > ... as that leaves us no worse than before this series, and it's pretty simple. > > Sounds good to me! > > Catalin / Will: Please yell if there's anything you need me to do. > Otherwise I'll assume you'll pick up Mark's patch instead of my patch > #1 and then you'll pick up my patch #2. I applied both to the arm64 for-next/backtrace-ipi branch. Thanks.
Hi, On Fri, Oct 6, 2023 at 3:20 AM Marc Zyngier <maz@kernel.org> wrote: > > On Thu, 05 Oct 2023 11:27:26 +0100, > Mark Rutland <mark.rutland@arm.com> wrote: > > > 3) Partial fix: make the ARM64_HAS_GIC_PRIO_MASKING cpucap depend on the > > absence of a "mediatek,broken-save-restore-fw" property in the DT. I believe > > we can check that in early_enable_pseudo_nmi() or can_use_gic_priorities(). > > > > That'll avoid potential issues if/when we change the priorities used for > > pNMI (which is something I've been looking at). > > > > I'm happy with (3) if Marc is. > > Definitely worth investigating if you have the bandwidth. I made an attempt at it. See: https://lore.kernel.org/20231006151547.1.Ide945748593cffd8ff0feb9ae22b795935b944d6@changeid
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c index 814d9aa93b21..0a6002243a8c 100644 --- a/arch/arm64/kernel/smp.c +++ b/arch/arm64/kernel/smp.c @@ -87,6 +87,7 @@ enum ipi_msg_type { static int ipi_irq_base __ro_after_init; static int nr_ipi __ro_after_init = NR_IPI; static struct irq_desc *ipi_desc[MAX_IPI] __ro_after_init; +DECLARE_BITMAP(ipi_is_nmi, MAX_IPI); static void ipi_setup(int cpu); @@ -986,7 +987,7 @@ static void ipi_setup(int cpu) return; for (i = 0; i < nr_ipi; i++) { - if (ipi_should_be_nmi(i)) { + if (test_bit(i, ipi_is_nmi)) { prepare_percpu_nmi(ipi_irq_base + i); enable_percpu_nmi(ipi_irq_base + i, 0); } else { @@ -1004,7 +1005,7 @@ static void ipi_teardown(int cpu) return; for (i = 0; i < nr_ipi; i++) { - if (ipi_should_be_nmi(i)) { + if (test_bit(i, ipi_is_nmi)) { disable_percpu_nmi(ipi_irq_base + i); teardown_percpu_nmi(ipi_irq_base + i); } else { @@ -1022,17 +1023,21 @@ void __init set_smp_ipi_range(int ipi_base, int n) nr_ipi = min(n, MAX_IPI); for (i = 0; i < nr_ipi; i++) { - int err; + int err = -EINVAL; if (ipi_should_be_nmi(i)) { err = request_percpu_nmi(ipi_base + i, ipi_handler, "IPI", &cpu_number); - WARN(err, "Could not request IPI %d as NMI, err=%d\n", - i, err); - } else { + if (err) + pr_info_once("NMI unavailable; fallback to IRQ\n"); + else + set_bit(i, ipi_is_nmi); + } + + if (err) { err = request_percpu_irq(ipi_base + i, ipi_handler, "IPI", &cpu_number); - WARN(err, "Could not request IPI %d as IRQ, err=%d\n", + WARN(err, "Could not request IPI %d, err=%d\n", i, err); }
Some mediatek devices have the property "mediatek,broken-save-restore-fw" in their GIC. This means that, although the hardware supports pseudo-NMI, the firmware has a bug that blocks enabling it. When we're in this state, system_uses_irq_prio_masking() will return true but we'll fail to actually enable the IRQ in the GIC. Let's make the code handle this. We'll detect that we failed to request an IPI as NMI and fallback to requesting it normally. Though we expect that either all of our requests will fail or all will succeed, it's just as cheap to keep a per-IPI bitmap and that keeps us robust. Fixes: 331a1b3a836c ("arm64: smp: Add arch support for backtrace using pseudo-NMI") Reported-by: Chen-Yu Tsai <wenst@chromium.org> Closes: https://issuetracker.google.com/issues/197061987#comment68 Signed-off-by: Douglas Anderson <dianders@chromium.org> --- arch/arm64/kernel/smp.c | 19 ++++++++++++------- 1 file changed, 12 insertions(+), 7 deletions(-)