Message ID | 20210721175432.2119-5-mdtipton@codeaurora.org (mailing list archive) |
---|---|
State | Not Applicable |
Series | interconnect: Fix sync-state issues |
Quoting Mike Tipton (2021-07-21 10:54:32)
> We're only adding BCMs to the commit list in aggregate(), but there are
> cases where pre_aggregate() is called without subsequently calling
> aggregate(). In particular, in icc_sync_state() when a node with initial
> BW has zero requests. Since BCMs aren't added to the commit list in
> these cases, we don't actually send the zero BW request to HW. So the
> resources remain on unnecessarily.
>
> Add BCMs to the commit list in pre_aggregate() instead, which is always
> called even when there are no requests.
>
> Fixes: 976daac4a1c5 ("interconnect: qcom: Consolidate interconnect RPMh support")
> Signed-off-by: Mike Tipton <mdtipton@codeaurora.org>
> ---

This patch breaks reboot for me on sc7180 Lazor

[ 107.136454] kvm: exiting hardware virtualization
[ 107.163741] platform video-firmware.0: Removing from iommu group 13
[ 107.193412] SError Interrupt on CPU1, code 0xbe000011 -- SError
[ 107.193428] CPU: 1 PID: 4289 Comm: reboot Not tainted 5.14.0-rc1+ #12
[ 107.193432] Hardware name: Google Lazor (rev3+) with KB Backlight (DT)
[ 107.193436] pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO BTYPE=--)
[ 107.193440] pc : el1_interrupt+0x20/0x60
[ 107.193443] lr : el1h_64_irq_handler+0x18/0x24
[ 107.193445] sp : ffffffc014093a10
[ 107.193448] x29: ffffffc014093a10 x28: ffffff8088295ec0 x27: 0000000000000000
[ 107.193465] x26: ffffff8080ed4c18 x25: ffffffd0beece000 x24: ffffffd0bef45000
[ 107.193476] x23: 0000000060400009 x22: ffffffd0be0bc1a0 x21: ffffffc014093b90
[ 107.193487] x20: ffffffd0bdc100f8 x19: ffffffc014093a40 x18: 000000000007d829
[ 107.193497] x17: ffffffd067412b54 x16: ffffffd0be0bc164 x15: ffffffd067413d0c
[ 107.193507] x14: ffffffd0bdd24fa4 x13: ffffffd0bdc26180 x12: ffffffd0bdc26260
[ 107.193517] x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
[ 107.193528] x8 : 00000000000000c0 x7 : bbbbbbbbbbbbbbbb x6 : ffffffd0bde488dc
[ 107.193539] x5 : 0000000000200017 x4 : ffffff809b5c4b40 x3 : 0000000000200018
[ 107.193549] x2 : ffffff8088295ec0 x1 : ffffffd0bdc100f8 x0 : ffffffc014093a40
[ 107.193561] Kernel panic - not syncing: Asynchronous SError Interrupt
[ 107.193564] CPU: 1 PID: 4289 Comm: reboot Not tainted 5.14.0-rc1+ #12
[ 107.193567] Hardware name: Google Lazor (rev3+) with KB Backlight (DT)
[ 107.193570] Call trace:
[ 107.193573] dump_backtrace+0x0/0x1c8
[ 107.193577] show_stack+0x24/0x30
[ 107.193579] dump_stack_lvl+0x64/0x7c
[ 107.193582] dump_stack+0x18/0x38
[ 107.193584] panic+0x158/0x39c
[ 107.193586] nmi_panic+0x88/0xa0
[ 107.193589] arm64_serror_panic+0x80/0x8c
[ 107.193593] do_serror+0x0/0x80
[ 107.193595] do_serror+0x58/0x80
[ 107.193597] el1h_64_error_handler+0x30/0x48
[ 107.193601] el1h_64_error+0x78/0x7c
[ 107.193603] el1_interrupt+0x20/0x60
[ 107.193606] el1h_64_irq_handler+0x18/0x24
[ 107.193609] el1h_64_irq+0x78/0x7c
[ 107.193612] refcount_dec_and_mutex_lock+0x3c/0xb4
[ 107.193616] ipa_clock_put+0x34/0x74 [ipa]
[ 107.193619] ipa_deconfig+0x64/0x74 [ipa]
[ 107.193622] ipa_remove+0xbc/0x110 [ipa]
[ 107.193625] ipa_shutdown+0x24/0x50 [ipa]
[ 107.193628] platform_shutdown+0x30/0x3c
[ 107.193631] device_shutdown+0x150/0x208
[ 107.193633] kernel_restart_prepare+0x44/0x50
[ 107.193637] kernel_restart+0x24/0x70
[ 107.193640] __arm64_sys_reboot+0x188/0x230
[ 107.193643] invoke_syscall+0x4c/0x120
[ 107.193646] el0_svc_common+0x84/0xe0
[ 107.193648] do_el0_svc_compat+0x2c/0x38
[ 107.193651] el0_svc_compat+0x20/0x30
[ 107.193654] el0t_32_sync_handler+0xc0/0xf0
[ 107.193657] el0t_32_sync+0x19c/0x1a0

Presumably some sort of interconnect is getting turned off earlier than
before?

>  drivers/interconnect/qcom/icc-rpmh.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/interconnect/qcom/icc-rpmh.c b/drivers/interconnect/qcom/icc-rpmh.c
> index f118f57eae37..b26fda0588e0 100644
> --- a/drivers/interconnect/qcom/icc-rpmh.c
> +++ b/drivers/interconnect/qcom/icc-rpmh.c
> @@ -20,13 +20,18 @@ void qcom_icc_pre_aggregate(struct icc_node *node)
>  {
>  	size_t i;
>  	struct qcom_icc_node *qn;
> +	struct qcom_icc_provider *qp;
>
>  	qn = node->data;
> +	qp = to_qcom_provider(node->provider);
>
>  	for (i = 0; i < QCOM_ICC_NUM_BUCKETS; i++) {
>  		qn->sum_avg[i] = 0;
>  		qn->max_peak[i] = 0;
>  	}
> +
> +	for (i = 0; i < qn->num_bcms; i++)
> +		qcom_icc_bcm_voter_add(qp->voter, qn->bcms[i]);
>  }
>  EXPORT_SYMBOL_GPL(qcom_icc_pre_aggregate);
>
> @@ -44,10 +49,8 @@ int qcom_icc_aggregate(struct icc_node *node, u32 tag, u32 avg_bw,
>  {
>  	size_t i;
>  	struct qcom_icc_node *qn;
> -	struct qcom_icc_provider *qp;
>
>  	qn = node->data;
> -	qp = to_qcom_provider(node->provider);
>
>  	if (!tag)
>  		tag = QCOM_ICC_TAG_ALWAYS;
> @@ -67,9 +70,6 @@ int qcom_icc_aggregate(struct icc_node *node, u32 tag, u32 avg_bw,
>  	*agg_avg += avg_bw;
>  	*agg_peak = max_t(u32, *agg_peak, peak_bw);
>
> -	for (i = 0; i < qn->num_bcms; i++)
> -		qcom_icc_bcm_voter_add(qp->voter, qn->bcms[i]);
> -
>  	return 0;
>  }
>  EXPORT_SYMBOL_GPL(qcom_icc_aggregate);
On Tue 10 Aug 18:31 CDT 2021, Stephen Boyd wrote:

> Quoting Mike Tipton (2021-07-21 10:54:32)
> > We're only adding BCMs to the commit list in aggregate(), but there are
> > cases where pre_aggregate() is called without subsequently calling
> > aggregate(). In particular, in icc_sync_state() when a node with initial
> > BW has zero requests. Since BCMs aren't added to the commit list in
> > these cases, we don't actually send the zero BW request to HW. So the
> > resources remain on unnecessarily.
> >
> > Add BCMs to the commit list in pre_aggregate() instead, which is always
> > called even when there are no requests.
> >
> > Fixes: 976daac4a1c5 ("interconnect: qcom: Consolidate interconnect RPMh support")
> > Signed-off-by: Mike Tipton <mdtipton@codeaurora.org>
> > ---
>
> This patch breaks reboot for me on sc7180 Lazor
>

FWIW, it prevents at least SM8150 from booting (need to check my other
boards as well), because it's no longer okay to have the interconnect
providers defined without having all client paths specified.

Regards,
Bjorn
Quoting Bjorn Andersson (2021-08-10 17:18:02)
> On Tue 10 Aug 18:31 CDT 2021, Stephen Boyd wrote:
>
> > Quoting Mike Tipton (2021-07-21 10:54:32)
> > > We're only adding BCMs to the commit list in aggregate(), but there are
> > > cases where pre_aggregate() is called without subsequently calling
> > > aggregate(). In particular, in icc_sync_state() when a node with initial
> > > BW has zero requests. Since BCMs aren't added to the commit list in
> > > these cases, we don't actually send the zero BW request to HW. So the
> > > resources remain on unnecessarily.
> > >
> > > Add BCMs to the commit list in pre_aggregate() instead, which is always
> > > called even when there are no requests.
> > >
> > > Fixes: 976daac4a1c5 ("interconnect: qcom: Consolidate interconnect RPMh support")
> > > Signed-off-by: Mike Tipton <mdtipton@codeaurora.org>
> > > ---
> >
> > This patch breaks reboot for me on sc7180 Lazor
> >
>
> FWIW, it prevents at least SM8150 from booting (need to check my other
> boards as well), because it's no longer okay to have the interconnect
> providers defined without having all client paths specified.

So maybe the best course of action is to revert this patch from Linus'
tree? It's not as big a deal as "can't boot", but it certainly makes
reboot annoying on sc7180.
On 8/10/21 6:31 PM, Stephen Boyd wrote:
> Quoting Mike Tipton (2021-07-21 10:54:32)
>> We're only adding BCMs to the commit list in aggregate(), but there are
>> cases where pre_aggregate() is called without subsequently calling
>> aggregate(). In particular, in icc_sync_state() when a node with initial
>> BW has zero requests. Since BCMs aren't added to the commit list in
>> these cases, we don't actually send the zero BW request to HW. So the
>> resources remain on unnecessarily.
>>
>> Add BCMs to the commit list in pre_aggregate() instead, which is always
>> called even when there are no requests.
>>
>> Fixes: 976daac4a1c5 ("interconnect: qcom: Consolidate interconnect RPMh support")
>> Signed-off-by: Mike Tipton <mdtipton@codeaurora.org>
>> ---
>
> This patch breaks reboot for me on sc7180 Lazor

If I am using the interface improperly or something in the
IPA driver, please let me know. I actually plan to switch
to using the bulk interfaces soon (FYI).

Thanks.

-Alex

> [ 107.136454] kvm: exiting hardware virtualization
> [ 107.163741] platform video-firmware.0: Removing from iommu group 13
> [ 107.193412] SError Interrupt on CPU1, code 0xbe000011 -- SError
> [ 107.193428] CPU: 1 PID: 4289 Comm: reboot Not tainted 5.14.0-rc1+ #12
> [ 107.193432] Hardware name: Google Lazor (rev3+) with KB Backlight (DT)
> [ 107.193436] pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO BTYPE=--)
> [ 107.193440] pc : el1_interrupt+0x20/0x60
> [ 107.193443] lr : el1h_64_irq_handler+0x18/0x24
> [ 107.193445] sp : ffffffc014093a10
> [ 107.193448] x29: ffffffc014093a10 x28: ffffff8088295ec0 x27: 0000000000000000
> [ 107.193465] x26: ffffff8080ed4c18 x25: ffffffd0beece000 x24: ffffffd0bef45000
> [ 107.193476] x23: 0000000060400009 x22: ffffffd0be0bc1a0 x21: ffffffc014093b90
> [ 107.193487] x20: ffffffd0bdc100f8 x19: ffffffc014093a40 x18: 000000000007d829
> [ 107.193497] x17: ffffffd067412b54 x16: ffffffd0be0bc164 x15: ffffffd067413d0c
> [ 107.193507] x14: ffffffd0bdd24fa4 x13: ffffffd0bdc26180 x12: ffffffd0bdc26260
> [ 107.193517] x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
> [ 107.193528] x8 : 00000000000000c0 x7 : bbbbbbbbbbbbbbbb x6 : ffffffd0bde488dc
> [ 107.193539] x5 : 0000000000200017 x4 : ffffff809b5c4b40 x3 : 0000000000200018
> [ 107.193549] x2 : ffffff8088295ec0 x1 : ffffffd0bdc100f8 x0 : ffffffc014093a40
> [ 107.193561] Kernel panic - not syncing: Asynchronous SError Interrupt
> [ 107.193564] CPU: 1 PID: 4289 Comm: reboot Not tainted 5.14.0-rc1+ #12
> [ 107.193567] Hardware name: Google Lazor (rev3+) with KB Backlight (DT)
> [ 107.193570] Call trace:
> [ 107.193573] dump_backtrace+0x0/0x1c8
> [ 107.193577] show_stack+0x24/0x30
> [ 107.193579] dump_stack_lvl+0x64/0x7c
> [ 107.193582] dump_stack+0x18/0x38
> [ 107.193584] panic+0x158/0x39c
> [ 107.193586] nmi_panic+0x88/0xa0
> [ 107.193589] arm64_serror_panic+0x80/0x8c
> [ 107.193593] do_serror+0x0/0x80
> [ 107.193595] do_serror+0x58/0x80
> [ 107.193597] el1h_64_error_handler+0x30/0x48
> [ 107.193601] el1h_64_error+0x78/0x7c
> [ 107.193603] el1_interrupt+0x20/0x60
> [ 107.193606] el1h_64_irq_handler+0x18/0x24
> [ 107.193609] el1h_64_irq+0x78/0x7c
> [ 107.193612] refcount_dec_and_mutex_lock+0x3c/0xb4
> [ 107.193616] ipa_clock_put+0x34/0x74 [ipa]
> [ 107.193619] ipa_deconfig+0x64/0x74 [ipa]
> [ 107.193622] ipa_remove+0xbc/0x110 [ipa]
> [ 107.193625] ipa_shutdown+0x24/0x50 [ipa]
> [ 107.193628] platform_shutdown+0x30/0x3c
> [ 107.193631] device_shutdown+0x150/0x208
> [ 107.193633] kernel_restart_prepare+0x44/0x50
> [ 107.193637] kernel_restart+0x24/0x70
> [ 107.193640] __arm64_sys_reboot+0x188/0x230
> [ 107.193643] invoke_syscall+0x4c/0x120
> [ 107.193646] el0_svc_common+0x84/0xe0
> [ 107.193648] do_el0_svc_compat+0x2c/0x38
> [ 107.193651] el0_svc_compat+0x20/0x30
> [ 107.193654] el0t_32_sync_handler+0xc0/0xf0
> [ 107.193657] el0t_32_sync+0x19c/0x1a0
>
> Presumably some sort of interconnect is getting turned off earlier than
> before?
>
>>  drivers/interconnect/qcom/icc-rpmh.c | 10 +++++-----
>>  1 file changed, 5 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/interconnect/qcom/icc-rpmh.c b/drivers/interconnect/qcom/icc-rpmh.c
>> index f118f57eae37..b26fda0588e0 100644
>> --- a/drivers/interconnect/qcom/icc-rpmh.c
>> +++ b/drivers/interconnect/qcom/icc-rpmh.c
>> @@ -20,13 +20,18 @@ void qcom_icc_pre_aggregate(struct icc_node *node)
>>  {
>>  	size_t i;
>>  	struct qcom_icc_node *qn;
>> +	struct qcom_icc_provider *qp;
>>
>>  	qn = node->data;
>> +	qp = to_qcom_provider(node->provider);
>>
>>  	for (i = 0; i < QCOM_ICC_NUM_BUCKETS; i++) {
>>  		qn->sum_avg[i] = 0;
>>  		qn->max_peak[i] = 0;
>>  	}
>> +
>> +	for (i = 0; i < qn->num_bcms; i++)
>> +		qcom_icc_bcm_voter_add(qp->voter, qn->bcms[i]);
>>  }
>>  EXPORT_SYMBOL_GPL(qcom_icc_pre_aggregate);
>>
>> @@ -44,10 +49,8 @@ int qcom_icc_aggregate(struct icc_node *node, u32 tag, u32 avg_bw,
>>  {
>>  	size_t i;
>>  	struct qcom_icc_node *qn;
>> -	struct qcom_icc_provider *qp;
>>
>>  	qn = node->data;
>> -	qp = to_qcom_provider(node->provider);
>>
>>  	if (!tag)
>>  		tag = QCOM_ICC_TAG_ALWAYS;
>> @@ -67,9 +70,6 @@ int qcom_icc_aggregate(struct icc_node *node, u32 tag, u32 avg_bw,
>>  	*agg_avg += avg_bw;
>>  	*agg_peak = max_t(u32, *agg_peak, peak_bw);
>>
>> -	for (i = 0; i < qn->num_bcms; i++)
>> -		qcom_icc_bcm_voter_add(qp->voter, qn->bcms[i]);
>> -
>>  	return 0;
>>  }
>>  EXPORT_SYMBOL_GPL(qcom_icc_aggregate);
Quoting Alex Elder (2021-08-11 09:01:27)
> On 8/10/21 6:31 PM, Stephen Boyd wrote:
> > Quoting Mike Tipton (2021-07-21 10:54:32)
> >> We're only adding BCMs to the commit list in aggregate(), but there are
> >> cases where pre_aggregate() is called without subsequently calling
> >> aggregate(). In particular, in icc_sync_state() when a node with initial
> >> BW has zero requests. Since BCMs aren't added to the commit list in
> >> these cases, we don't actually send the zero BW request to HW. So the
> >> resources remain on unnecessarily.
> >>
> >> Add BCMs to the commit list in pre_aggregate() instead, which is always
> >> called even when there are no requests.
> >>
> >> Fixes: 976daac4a1c5 ("interconnect: qcom: Consolidate interconnect RPMh support")
> >> Signed-off-by: Mike Tipton <mdtipton@codeaurora.org>
> >> ---
> >
> > This patch breaks reboot for me on sc7180 Lazor
>
> If I am using the interface improperly or something in the
> IPA driver, please let me know. I actually plan to switch
> to using the bulk interfaces soon (FYI).
>

I suspect I'm seeing a shutdown ordering issue, where we start dropping
interconnect requests in driver shutdown callbacks and then some bus
turns off and the CPU can't access a device. Maybe one way to fix this
problem (if reverting isn't an option) would be to add a shutdown hook to
rpmh-icc that effectively "props up" the bandwidth requests during
shutdown so that we don't have to think about finding the place that the
interconnect is turned off. We're shutting down/restarting anyway, so
there isn't much point in trying to be power efficient for the last few
moments of runtime.
On 8/10/2021 5:18 PM, Bjorn Andersson wrote:
> On Tue 10 Aug 18:31 CDT 2021, Stephen Boyd wrote:
>
>> Quoting Mike Tipton (2021-07-21 10:54:32)
>>> We're only adding BCMs to the commit list in aggregate(), but there are
>>> cases where pre_aggregate() is called without subsequently calling
>>> aggregate(). In particular, in icc_sync_state() when a node with initial
>>> BW has zero requests. Since BCMs aren't added to the commit list in
>>> these cases, we don't actually send the zero BW request to HW. So the
>>> resources remain on unnecessarily.
>>>
>>> Add BCMs to the commit list in pre_aggregate() instead, which is always
>>> called even when there are no requests.
>>>
>>> Fixes: 976daac4a1c5 ("interconnect: qcom: Consolidate interconnect RPMh support")
>>> Signed-off-by: Mike Tipton <mdtipton@codeaurora.org>
>>> ---
>>
>> This patch breaks reboot for me on sc7180 Lazor
>>
>
> FWIW, it prevents at least SM8150 from booting (need to check my other
> boards as well), because it's no longer okay to have the interconnect
> providers defined without having all client paths specified.

My testing was limited to sdm845, which didn't show any boot issues. But
it's not terribly surprising for this to cause problems on some targets.
Previously every node was enabled by default and left on permanently if
nobody explicitly voted for them. This would happen even if these nodes
weren't enabled in bootloaders, since most of the qcom providers aren't
defining a get_bw() callback and thus the framework defaults
init_avg/init_peak to INT_MAX. So any drivers relying on this default-on
behavior would break.

We can try to get dumps of the NOC error registers at the time of
failure to pinpoint the problematic access. Or we could try to narrow it
down by marking more BCMs as keepalive. If they're marked as keepalive
then we won't let them turn off even with this patch.

>
> Regards,
> Bjorn
>
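The default-on behavior described above can be illustrated with a small user-space model. This is only a sketch, not the framework code: `model_node`, `model_provider`, `model_node_init()` and `model_get_bw_off()` are hypothetical stand-ins, and the only behavior taken from the message is that a provider without a get_bw() callback ends up with init_avg/init_peak set to INT_MAX.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a framework node; not the kernel's struct icc_node. */
struct model_node {
    uint32_t init_avg;
    uint32_t init_peak;
};

/* Hypothetical stand-in for a provider with an optional get_bw() callback. */
struct model_provider {
    void (*get_bw)(struct model_node *node, uint32_t *avg, uint32_t *peak);
};

/*
 * Pick the initial bandwidth for a node. Without a get_bw() callback the
 * model falls back to INT_MAX, i.e. "assume the node is fully on".
 */
static void model_node_init(const struct model_provider *p, struct model_node *n)
{
    assert(p != NULL && n != NULL);

    if (p->get_bw) {
        p->get_bw(n, &n->init_avg, &n->init_peak);
    } else {
        n->init_avg = INT_MAX;
        n->init_peak = INT_MAX;
    }
}

/* Example callback: reports that the bootloader left the node off. */
static void model_get_bw_off(struct model_node *node, uint32_t *avg, uint32_t *peak)
{
    (void)node;
    *avg = 0;
    *peak = 0;
}
```

In this model a provider that leaves `get_bw` NULL always starts its nodes at INT_MAX, regardless of what the bootloader programmed, which is why a client that never votes could previously rely on the path staying on.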
On 8/11/2021 11:13 AM, Stephen Boyd wrote:
> Quoting Alex Elder (2021-08-11 09:01:27)
>> On 8/10/21 6:31 PM, Stephen Boyd wrote:
>>> Quoting Mike Tipton (2021-07-21 10:54:32)
>>>> We're only adding BCMs to the commit list in aggregate(), but there are
>>>> cases where pre_aggregate() is called without subsequently calling
>>>> aggregate(). In particular, in icc_sync_state() when a node with initial
>>>> BW has zero requests. Since BCMs aren't added to the commit list in
>>>> these cases, we don't actually send the zero BW request to HW. So the
>>>> resources remain on unnecessarily.
>>>>
>>>> Add BCMs to the commit list in pre_aggregate() instead, which is always
>>>> called even when there are no requests.
>>>>
>>>> Fixes: 976daac4a1c5 ("interconnect: qcom: Consolidate interconnect RPMh support")
>>>> Signed-off-by: Mike Tipton <mdtipton@codeaurora.org>
>>>> ---
>>>
>>> This patch breaks reboot for me on sc7180 Lazor
>>
>> If I am using the interface improperly or something in the
>> IPA driver, please let me know. I actually plan to switch
>> to using the bulk interfaces soon (FYI).
>>
>
> I suspect I'm seeing a shutdown ordering issue, where we start dropping
> interconnect requests in driver shutdown callbacks and then some bus
> turns off and the CPU can't access a device. Maybe one way to fix this
> problem (if reverting isn't an option) would be to add a shutdown hook to
> rpmh-icc that effectively "props up" the bandwidth requests during
> shutdown so that we don't have to think about finding the place that the
> interconnect is turned off. We're shutting down/restarting anyway, so
> there isn't much point in trying to be power efficient for the last few
> moments of runtime.
>

I wouldn't have expected this change to impact reboot, since this change
should only impact places where pre_aggregate() is called without
subsequently calling aggregate(). I don't think there are currently any
places that can happen other than icc_sync_state(). I suppose what could
be happening is we're now disabling certain paths in icc_sync_state()
and their associated drivers just aren't used, or don't attempt any
accesses, until they're torn down during reboot. That doesn't seem
particularly likely, but nothing else immediately comes to mind.

We already mark paths critical for the CPU as "keepalive" such that
they'll never turn off. This includes the CPU path to DDR and top-level
CSRs. Basically just paths that can't actually be turned off while SW is
running. That logic is unchanged in this patch.

So we generally shouldn't need any shutdown-specific callbacks to place
BW votes during this window. Client drivers should still ensure they're
sequencing their shutdown logic such that any bus accesses happen before
they remove their BW requests.
diff --git a/drivers/interconnect/qcom/icc-rpmh.c b/drivers/interconnect/qcom/icc-rpmh.c
index f118f57eae37..b26fda0588e0 100644
--- a/drivers/interconnect/qcom/icc-rpmh.c
+++ b/drivers/interconnect/qcom/icc-rpmh.c
@@ -20,13 +20,18 @@ void qcom_icc_pre_aggregate(struct icc_node *node)
 {
 	size_t i;
 	struct qcom_icc_node *qn;
+	struct qcom_icc_provider *qp;
 
 	qn = node->data;
+	qp = to_qcom_provider(node->provider);
 
 	for (i = 0; i < QCOM_ICC_NUM_BUCKETS; i++) {
 		qn->sum_avg[i] = 0;
 		qn->max_peak[i] = 0;
 	}
+
+	for (i = 0; i < qn->num_bcms; i++)
+		qcom_icc_bcm_voter_add(qp->voter, qn->bcms[i]);
 }
 EXPORT_SYMBOL_GPL(qcom_icc_pre_aggregate);
 
@@ -44,10 +49,8 @@ int qcom_icc_aggregate(struct icc_node *node, u32 tag, u32 avg_bw,
 {
 	size_t i;
 	struct qcom_icc_node *qn;
-	struct qcom_icc_provider *qp;
 
 	qn = node->data;
-	qp = to_qcom_provider(node->provider);
 
 	if (!tag)
 		tag = QCOM_ICC_TAG_ALWAYS;
@@ -67,9 +70,6 @@ int qcom_icc_aggregate(struct icc_node *node, u32 tag, u32 avg_bw,
 	*agg_avg += avg_bw;
 	*agg_peak = max_t(u32, *agg_peak, peak_bw);
 
-	for (i = 0; i < qn->num_bcms; i++)
-		qcom_icc_bcm_voter_add(qp->voter, qn->bcms[i]);
-
 	return 0;
 }
 EXPORT_SYMBOL_GPL(qcom_icc_aggregate);
We're only adding BCMs to the commit list in aggregate(), but there are
cases where pre_aggregate() is called without subsequently calling
aggregate(). In particular, in icc_sync_state() when a node with initial
BW has zero requests. Since BCMs aren't added to the commit list in
these cases, we don't actually send the zero BW request to HW. So the
resources remain on unnecessarily.

Add BCMs to the commit list in pre_aggregate() instead, which is always
called even when there are no requests.

Fixes: 976daac4a1c5 ("interconnect: qcom: Consolidate interconnect RPMh support")
Signed-off-by: Mike Tipton <mdtipton@codeaurora.org>
---
 drivers/interconnect/qcom/icc-rpmh.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)
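The effect of moving the commit-list update can be reproduced in a small user-space model. This is a sketch under stated assumptions, not the driver: every `model_*` name is hypothetical, the commit list is reduced to an array of flags, and the `patched` switch selects between queuing BCMs in aggregate() (old behavior) and in pre_aggregate() (behavior with this patch).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_BCMS 4

/* Hypothetical voter: tracks which BCM ids are queued for the next commit. */
struct model_voter {
    bool queued[MODEL_MAX_BCMS];
};

/* Hypothetical node: a few BCM ids plus aggregated bandwidth. */
struct model_node {
    size_t num_bcms;
    int bcms[MODEL_MAX_BCMS];
    uint32_t sum_avg;
    uint32_t max_peak;
};

/* One client request on a node. */
struct model_req {
    uint32_t avg;
    uint32_t peak;
};

/* Queue every BCM of the node on the voter's commit list. */
static void model_queue_bcms(struct model_voter *v, const struct model_node *n)
{
    for (size_t i = 0; i < n->num_bcms; i++) {
        assert(n->bcms[i] >= 0 && n->bcms[i] < MODEL_MAX_BCMS);
        v->queued[n->bcms[i]] = true;
    }
}

/* Always runs once per aggregation pass, even with zero requests. */
static void model_pre_aggregate(struct model_voter *v, struct model_node *n,
                                bool patched)
{
    n->sum_avg = 0;
    n->max_peak = 0;
    if (patched)
        model_queue_bcms(v, n);
}

/* Runs once per request, so never runs when there are no requests. */
static void model_aggregate(struct model_voter *v, struct model_node *n,
                            const struct model_req *r, bool patched)
{
    n->sum_avg += r->avg;
    if (r->peak > n->max_peak)
        n->max_peak = r->peak;
    if (!patched)
        model_queue_bcms(v, n);
}

/* One aggregation pass; returns how many BCMs ended up queued. */
static size_t model_aggregate_requests(struct model_voter *v,
                                       struct model_node *n,
                                       const struct model_req *reqs,
                                       size_t nreqs, bool patched)
{
    size_t queued = 0;

    model_pre_aggregate(v, n, patched);
    for (size_t i = 0; i < nreqs; i++)
        model_aggregate(v, n, &reqs[i], patched);

    for (size_t i = 0; i < MODEL_MAX_BCMS; i++)
        queued += v->queued[i] ? 1 : 0;
    return queued;
}
```

With zero requests the unpatched path queues nothing, so the zero vote never reaches the commit step, while the patched path queues every BCM of the node; with at least one request both paths queue the same BCMs.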