Message ID | 20240904031324.2901114-3-quic_sibis@quicinc.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | firmware: arm_scmi: Misc Fixes | expand |
On Wed, Sep 04, 2024 at 08:43:24AM +0530, Sibi Sankar wrote: > Ensure that the bad duplicates reported by the platform firmware doesn't > get added to the opp-tables. Please expand on why this is an issue on Qualcomm platforms, these entries aren't just "bad duplicates" if IIUC. Also here, please add (examples of) the warnings I reported. During boot of the x1e80100 crd, I see: [ 8.992956] cpu cpu4: _opp_is_duplicate: duplicate OPPs detected. Existing: freq: 3417600000, volt: 0, enabled: 1. New: freq: 3417600000, volt: 0, enabled: 1 [ 9.021940] cpu cpu4: _opp_is_duplicate: duplicate OPPs detected. Existing: freq: 3417600000, volt: 0, enabled: 1. New: freq: 3417600000, volt: 0, enabled: 1 [ 9.036171] cpu cpu8: _opp_is_duplicate: duplicate OPPs detected. Existing: freq: 3417600000, volt: 0, enabled: 1. New: freq: 3417600000, volt: 0, enabled: 1 [ 9.036177] cpu cpu8: _opp_is_duplicate: duplicate OPPs detected. Existing: freq: 3417600000, volt: 0, enabled: 1. New: freq: 3417600000, volt: 0, enabled: 1 and during resume: [ 85.286615] cpu cpu4: _opp_is_duplicate: duplicate OPPs detected. Existing: freq: 3417600000, volt: 0, enabled: 1. N ew: freq: 3417600000, volt: 0, enabled: 1 [ 85.319849] cpu cpu4: _opp_is_duplicate: duplicate OPPs detected. Existing: freq: 3417600000, volt: 0, enabled: 1. N ew: freq: 3417600000, volt: 0, enabled: 1 [ 85.334686] debugfs: File 'cpu5' in directory 'opp' already present! [ 85.341399] debugfs: File 'cpu6' in directory 'opp' already present! [ 85.348016] debugfs: File 'cpu7' in directory 'opp' already present! [ 85.443093] cpu cpu8: _opp_is_duplicate: duplicate OPPs detected. Existing: freq: 3417600000, volt: 0, enabled: 1. N ew: freq: 3417600000, volt: 0, enabled: 1 [ 85.476595] cpu cpu8: _opp_is_duplicate: duplicate OPPs detected. Existing: freq: 3417600000, volt: 0, enabled: 1. N ew: freq: 3417600000, volt: 0, enabled: 1 [ 85.491645] debugfs: File 'cpu9' in directory 'opp' already present! [ 85.498409] debugfs: File 'cpu10' in directory 'opp' already present! [ 85.505187] debugfs: File 'cpu11' in directory 'opp' already present! Please also add: Reported-by: Johan Hovold <johan+linaro@kernel.org> Link: https://lore.kernel.org/lkml/ZoQjAWse2YxwyRJv@hovoldconsulting.com/ > Signed-off-by: Sibi Sankar <quic_sibis@quicinc.com> But with this patch applied, instead of the above warnings I now get two *errors* at boot: [ 8.952173] cpu cpu4: EM: non-increasing freq: 0 [ 8.979460] cpu cpu8: EM: non-increasing freq: 0 Can you do something about that as well? At least make sure to highlight this in the commit message as this is information that is needed to be able to evaluate the patch. Johan
On Wed, Sep 04, 2024 at 09:09:56AM +0200, Johan Hovold wrote: > On Wed, Sep 04, 2024 at 08:43:24AM +0530, Sibi Sankar wrote: > > Ensure that the bad duplicates reported by the platform firmware doesn't > > get added to the opp-tables. > But with this patch applied, instead of the above warnings I now get two > *errors* at boot: > > [ 8.952173] cpu cpu4: EM: non-increasing freq: 0 > [ 8.979460] cpu cpu8: EM: non-increasing freq: 0 > > Can you do something about that as well? At least make sure to highlight > this in the commit message as this is information that is needed to be > able to evaluate the patch. I tried running with just this patch (i.e. without patch 1/2), but then cpufreq fails already during boot with eight of these errors: [ 9.258577] cpufreq: cpufreq_online: ->get() failed ... [ 9.405603] cpufreq: cpufreq_online: ->get() failed and seven of these messages: [ 12.322987] energy_model: Accessing cpu4 policy failed ... [ 18.462780] energy_model: Accessing cpu4 policy failed Johan
On 4.09.2024 5:13 AM, Sibi Sankar wrote: > Ensure that the bad duplicates reported by the platform firmware doesn't > get added to the opp-tables. > > Signed-off-by: Sibi Sankar <quic_sibis@quicinc.com> > --- > drivers/firmware/arm_scmi/perf.c | 13 +++++++++++-- > 1 file changed, 11 insertions(+), 2 deletions(-) > > diff --git a/drivers/firmware/arm_scmi/perf.c b/drivers/firmware/arm_scmi/perf.c > index 2d77b5f40ca7..114c3dd70ede 100644 > --- a/drivers/firmware/arm_scmi/perf.c > +++ b/drivers/firmware/arm_scmi/perf.c > @@ -386,9 +386,11 @@ process_response_opp(struct device *dev, struct perf_dom_info *dom, > le16_to_cpu(r->opp[loop_idx].transition_latency_us); > > ret = xa_insert(&dom->opps_by_lvl, opp->perf, opp, GFP_KERNEL); > - if (ret) > + if (ret) { > dev_warn(dev, "Failed to add opps_by_lvl at %d for %s - ret:%d\n", > opp->perf, dom->info.name, ret); > + opp->perf = 0; > + } > } > > static inline void > @@ -404,9 +406,12 @@ process_response_opp_v4(struct device *dev, struct perf_dom_info *dom, > le16_to_cpu(r->opp[loop_idx].transition_latency_us); > > ret = xa_insert(&dom->opps_by_lvl, opp->perf, opp, GFP_KERNEL); > - if (ret) > + if (ret) { > dev_warn(dev, "Failed to add opps_by_lvl at %d for %s - ret:%d\n", > opp->perf, dom->info.name, ret); > + opp->perf = 0; > + return; > + } > > /* Note that PERF v4 reports always five 32-bit words */ > opp->indicative_freq = le32_to_cpu(r->opp[loop_idx].indicative_freq); > @@ -871,6 +876,10 @@ static int scmi_dvfs_device_opps_add(const struct scmi_protocol_handle *ph, > else > freq = dom->opp[idx].indicative_freq * dom->mult_factor; > > + /* Skip all invalid frequencies reported by the firmware */ > + if (!freq) > + continue; Maybe something like this instead? (not tested) diff --git a/drivers/firmware/arm_scmi/perf.c b/drivers/firmware/arm_scmi/perf.c index 2d77b5f40ca7..530692119c79 100644 --- a/drivers/firmware/arm_scmi/perf.c +++ b/drivers/firmware/arm_scmi/perf.c @@ -431,8 +431,14 @@ iter_perf_levels_process_response(const struct scmi_protocol_handle *ph, { struct scmi_opp *opp; struct scmi_perf_ipriv *p = priv; + unsigned int idx = st->desc_index + st->loop_idx; + + opp = &p->perf_dom->opp[idx]; + + /* Avoid duplicate entries coming from buggy firmware */ + if (idx > 0 && opp->perf && p->perf_dom->opp[idx - 1].perf) + return 0; - opp = &p->perf_dom->opp[st->desc_index + st->loop_idx]; if (PROTOCOL_REV_MAJOR(p->version) <= 0x3) process_response_opp(ph->dev, p->perf_dom, opp, st->loop_idx, response); Konrad
On 4.09.2024 3:56 PM, Konrad Dybcio wrote: > On 4.09.2024 5:13 AM, Sibi Sankar wrote: >> Ensure that the bad duplicates reported by the platform firmware doesn't >> get added to the opp-tables. >> >> Signed-off-by: Sibi Sankar <quic_sibis@quicinc.com> >> --- >> drivers/firmware/arm_scmi/perf.c | 13 +++++++++++-- >> 1 file changed, 11 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/firmware/arm_scmi/perf.c b/drivers/firmware/arm_scmi/perf.c >> index 2d77b5f40ca7..114c3dd70ede 100644 >> --- a/drivers/firmware/arm_scmi/perf.c >> +++ b/drivers/firmware/arm_scmi/perf.c >> @@ -386,9 +386,11 @@ process_response_opp(struct device *dev, struct perf_dom_info *dom, >> le16_to_cpu(r->opp[loop_idx].transition_latency_us); >> >> ret = xa_insert(&dom->opps_by_lvl, opp->perf, opp, GFP_KERNEL); >> - if (ret) >> + if (ret) { >> dev_warn(dev, "Failed to add opps_by_lvl at %d for %s - ret:%d\n", >> opp->perf, dom->info.name, ret); >> + opp->perf = 0; >> + } >> } >> >> static inline void >> @@ -404,9 +406,12 @@ process_response_opp_v4(struct device *dev, struct perf_dom_info *dom, >> le16_to_cpu(r->opp[loop_idx].transition_latency_us); >> >> ret = xa_insert(&dom->opps_by_lvl, opp->perf, opp, GFP_KERNEL); >> - if (ret) >> + if (ret) { >> dev_warn(dev, "Failed to add opps_by_lvl at %d for %s - ret:%d\n", >> opp->perf, dom->info.name, ret); >> + opp->perf = 0; >> + return; >> + } >> >> /* Note that PERF v4 reports always five 32-bit words */ >> opp->indicative_freq = le32_to_cpu(r->opp[loop_idx].indicative_freq); >> @@ -871,6 +876,10 @@ static int scmi_dvfs_device_opps_add(const struct scmi_protocol_handle *ph, >> else >> freq = dom->opp[idx].indicative_freq * dom->mult_factor; >> >> + /* Skip all invalid frequencies reported by the firmware */ >> + if (!freq) >> + continue; > > Maybe something like this instead? (not tested) > > diff --git a/drivers/firmware/arm_scmi/perf.c b/drivers/firmware/arm_scmi/perf.c > index 2d77b5f40ca7..530692119c79 100644 > --- a/drivers/firmware/arm_scmi/perf.c > +++ b/drivers/firmware/arm_scmi/perf.c > @@ -431,8 +431,14 @@ iter_perf_levels_process_response(const struct scmi_protocol_handle *ph, > { > struct scmi_opp *opp; > struct scmi_perf_ipriv *p = priv; > + unsigned int idx = st->desc_index + st->loop_idx; > + > + opp = &p->perf_dom->opp[idx]; > + > + /* Avoid duplicate entries coming from buggy firmware */ > + if (idx > 0 && opp->perf && p->perf_dom->opp[idx - 1].perf) > + return 0; > > - opp = &p->perf_dom->opp[st->desc_index + st->loop_idx]; > if (PROTOCOL_REV_MAJOR(p->version) <= 0x3) > process_response_opp(ph->dev, p->perf_dom, opp, st->loop_idx, > response); No that won't work, perf_dom->opp has all the entries and that's used in e.g. scmi_dvfs_device_opps_add :/ Konrad
On Wed, Sep 04, 2024 at 08:43:24AM +0530, Sibi Sankar wrote: > Ensure that the bad duplicates reported by the platform firmware doesn't > get added to the opp-tables. > Hi Sibi, so if the idea is to make the code more robust when FW sends BAD duplicates, you necessarily need to properly drop opps in opp_count too. One other option would be to just loop with xa_for_each BUT opp_count is used in a number of places...so first of all let's try drop count properly. Can you try this patch down below, instead of your patch. If it solves, I will send a patch (after testing it a bit more :D) Thanks, Cristian P.S.: thanks for spotting this, I forgot NOT to trust FW replies as usual :P --->8--- diff --git a/drivers/firmware/arm_scmi/perf.c b/drivers/firmware/arm_scmi/perf.c index 397a39729e29..cbac29792d1e 100644 --- a/drivers/firmware/arm_scmi/perf.c +++ b/drivers/firmware/arm_scmi/perf.c @@ -340,7 +340,7 @@ static int iter_perf_levels_update_state(struct scmi_iterator_state *st, return 0; } -static inline void +static inline int process_response_opp(struct device *dev, struct scmi_perf_domain_info *dom, struct scmi_opp *opp, unsigned int loop_idx, const struct scmi_msg_resp_perf_describe_levels *r) @@ -353,12 +353,16 @@ process_response_opp(struct device *dev, struct scmi_perf_domain_info *dom, le16_to_cpu(r->opp[loop_idx].transition_latency_us); ret = xa_insert(&dom->opps_by_lvl, opp->perf, opp, GFP_KERNEL); - if (ret) + if (ret) { dev_warn(dev, "Failed to add opps_by_lvl at %d for %s - ret:%d\n", opp->perf, dom->name, ret); + return ret; + } + + return 0; } -static inline void +static inline int process_response_opp_v4(struct device *dev, struct scmi_perf_domain_info *dom, struct scmi_opp *opp, unsigned int loop_idx, const struct scmi_msg_resp_perf_describe_levels_v4 *r) @@ -371,9 +375,11 @@ process_response_opp_v4(struct device *dev, struct scmi_perf_domain_info *dom, le16_to_cpu(r->opp[loop_idx].transition_latency_us); ret = xa_insert(&dom->opps_by_lvl, opp->perf, opp, GFP_KERNEL); - if (ret) + if (ret) { dev_warn(dev, "Failed to add opps_by_lvl at %d for %s - ret:%d\n", opp->perf, dom->name, ret); + return ret; + } /* Note that PERF v4 reports always five 32-bit words */ opp->indicative_freq = le32_to_cpu(r->opp[loop_idx].indicative_freq); @@ -382,13 +388,21 @@ process_response_opp_v4(struct device *dev, struct scmi_perf_domain_info *dom, ret = xa_insert(&dom->opps_by_idx, opp->level_index, opp, GFP_KERNEL); - if (ret) + if (ret) { dev_warn(dev, "Failed to add opps_by_idx at %d for %s - ret:%d\n", opp->level_index, dom->name, ret); + /* Cleanup by_lvl too */ + xa_erase(&dom->opps_by_lvl, opp->perf); + + return ret; + } + hash_add(dom->opps_by_freq, &opp->hash, opp->indicative_freq); } + + return 0; } static int @@ -396,16 +410,22 @@ iter_perf_levels_process_response(const struct scmi_protocol_handle *ph, const void *response, struct scmi_iterator_state *st, void *priv) { + int ret; struct scmi_opp *opp; struct scmi_perf_ipriv *p = priv; opp = &p->perf_dom->opp[st->desc_index + st->loop_idx]; if (PROTOCOL_REV_MAJOR(p->version) <= 0x3) - process_response_opp(ph->dev, p->perf_dom, opp, st->loop_idx, - response); + ret = process_response_opp(ph->dev, p->perf_dom, opp, + st->loop_idx, response); else - process_response_opp_v4(ph->dev, p->perf_dom, opp, st->loop_idx, - response); + ret = process_response_opp_v4(ph->dev, p->perf_dom, opp, + st->loop_idx, response); + + /* Skip BAD duplicates received from firmware */ + if (ret) + return ret == -EBUSY ? 0 : ret; + p->perf_dom->opp_count++; dev_dbg(ph->dev, "Level %d Power %d Latency %dus Ifreq %d Index %d\n", ---8<---
On Wed, Sep 04, 2024 at 04:21:49PM +0100, Cristian Marussi wrote: > On Wed, Sep 04, 2024 at 08:43:24AM +0530, Sibi Sankar wrote: > > Ensure that the bad duplicates reported by the platform firmware doesn't > > get added to the opp-tables. > > > > Hi Sibi, > > so if the idea is to make the code more robust when FW sends BAD > duplicates, you necessarily need to properly drop opps in opp_count too. > > One other option would be to just loop with xa_for_each BUT opp_count is > used in a number of places...so first of all let's try drop count properly. > > Can you try this patch down below, instead of your patch. > If it solves, I will send a patch (after testing it a bit more :D) Hold on... I sent you a diff that does not apply probably on your tree due to some uncomitted local work of mine...my bad...let me resend. Thanks, Cristian
On Wed, Sep 04, 2024 at 04:30:24PM +0100, Cristian Marussi wrote: > On Wed, Sep 04, 2024 at 04:21:49PM +0100, Cristian Marussi wrote: > > On Wed, Sep 04, 2024 at 08:43:24AM +0530, Sibi Sankar wrote: > > > Ensure that the bad duplicates reported by the platform firmware doesn't > > > get added to the opp-tables. > > > > > > > Hi Sibi, > > > > so if the idea is to make the code more robust when FW sends BAD > > duplicates, you necessarily need to properly drop opps in opp_count too. > > > > One other option would be to just loop with xa_for_each BUT opp_count is > > used in a number of places...so first of all let's try drop count properly. > > > > Can you try this patch down below, instead of your patch. > > If it solves, I will send a patch (after testing it a bit more :D) > > Hold on... I sent you a diff that does not apply probably on your tree due > to some uncomitted local work of mine...my bad...let me resend. > This one should be good...apologies ---8<--- diff --git a/drivers/firmware/arm_scmi/perf.c b/drivers/firmware/arm_scmi/perf.c index 4b7f1cbb9b04..bca9c6e4a3ab 100644 --- a/drivers/firmware/arm_scmi/perf.c +++ b/drivers/firmware/arm_scmi/perf.c @@ -373,7 +373,7 @@ static int iter_perf_levels_update_state(struct scmi_iterator_state *st, return 0; } -static inline void +static inline int process_response_opp(struct device *dev, struct perf_dom_info *dom, struct scmi_opp *opp, unsigned int loop_idx, const struct scmi_msg_resp_perf_describe_levels *r) @@ -386,12 +386,16 @@ process_response_opp(struct device *dev, struct perf_dom_info *dom, le16_to_cpu(r->opp[loop_idx].transition_latency_us); ret = xa_insert(&dom->opps_by_lvl, opp->perf, opp, GFP_KERNEL); - if (ret) + if (ret) { dev_warn(dev, "Failed to add opps_by_lvl at %d for %s - ret:%d\n", opp->perf, dom->info.name, ret); + return ret; + } + + return 0; } -static inline void +static inline int process_response_opp_v4(struct device *dev, struct perf_dom_info *dom, struct scmi_opp *opp, unsigned int loop_idx, const struct scmi_msg_resp_perf_describe_levels_v4 *r) @@ -404,9 +408,11 @@ process_response_opp_v4(struct device *dev, struct perf_dom_info *dom, le16_to_cpu(r->opp[loop_idx].transition_latency_us); ret = xa_insert(&dom->opps_by_lvl, opp->perf, opp, GFP_KERNEL); - if (ret) + if (ret) { dev_warn(dev, "Failed to add opps_by_lvl at %d for %s - ret:%d\n", opp->perf, dom->info.name, ret); + return ret; + } /* Note that PERF v4 reports always five 32-bit words */ opp->indicative_freq = le32_to_cpu(r->opp[loop_idx].indicative_freq); @@ -415,13 +421,21 @@ process_response_opp_v4(struct device *dev, struct perf_dom_info *dom, ret = xa_insert(&dom->opps_by_idx, opp->level_index, opp, GFP_KERNEL); - if (ret) + if (ret) { dev_warn(dev, "Failed to add opps_by_idx at %d for %s - ret:%d\n", opp->level_index, dom->info.name, ret); + /* Cleanup by_lvl too */ + xa_erase(&dom->opps_by_lvl, opp->perf); + + return ret; + } + hash_add(dom->opps_by_freq, &opp->hash, opp->indicative_freq); } + + return 0; } static int @@ -429,16 +443,22 @@ iter_perf_levels_process_response(const struct scmi_protocol_handle *ph, const void *response, struct scmi_iterator_state *st, void *priv) { + int ret; struct scmi_opp *opp; struct scmi_perf_ipriv *p = priv; opp = &p->perf_dom->opp[st->desc_index + st->loop_idx]; if (PROTOCOL_REV_MAJOR(p->version) <= 0x3) - process_response_opp(ph->dev, p->perf_dom, opp, st->loop_idx, - response); + ret = process_response_opp(ph->dev, p->perf_dom, opp, + st->loop_idx, response); else - process_response_opp_v4(ph->dev, p->perf_dom, opp, st->loop_idx, - response); + ret = process_response_opp_v4(ph->dev, p->perf_dom, opp, + st->loop_idx, response); + + /* Skip BAD duplicates received from firmware */ + if (ret) + return ret == -EBUSY ? 0 : ret; + p->perf_dom->opp_count++; dev_dbg(ph->dev, "Level %d Power %d Latency %dus Ifreq %d Index %d\n", --->8----
On Wed, Sep 04, 2024 at 08:43:24AM +0530, Sibi Sankar wrote: > Ensure that the bad duplicates reported by the platform firmware doesn't > get added to the opp-tables. > I am really interested to know if the platform firmware is presenting duplicates intentionally for some unknown reasons and we are just speculating it to be broken firmware or is it really broken firmware. For me, it is very hard to digest something like OPP tables which is there for a very long time now is not very well understood by firmware authors. How many duplicates are we seeing on this platform really ? If it is just one I can understand. More than one is hard to miss from the OPP tables in the firmware. While I am not opposing to make the driver handle these duplicates, I am just worried if they are put there intentionally for reasons we don't understand yet or not published. -- Regards, Sudeep
On Wed, Sep 04, 2024 at 05:12:29PM +0100, Sudeep Holla wrote: > On Wed, Sep 04, 2024 at 08:43:24AM +0530, Sibi Sankar wrote: > > Ensure that the bad duplicates reported by the platform firmware doesn't > > get added to the opp-tables. > > > > I am really interested to know if the platform firmware is presenting > duplicates intentionally for some unknown reasons and we are just speculating > it to be broken firmware or is it really broken firmware. > > For me, it is very hard to digest something like OPP tables which is there > for a very long time now is not very well understood by firmware authors. > How many duplicates are we seeing on this platform really ? If it is > just one I can understand. More than one is hard to miss from the OPP > tables in the firmware. > > While I am not opposing to make the driver handle these duplicates, > I am just worried if they are put there intentionally for reasons we > don't understand yet or not published. > The number of duplicates reported in logs makes me suspect the same...seems like intentional/by_design .... but at first I stick to the general issue of handling bad fw replies and how to survive kernel side at first...but I indeed share your same concerns... Thanks, Cristian
On Wed, Sep 04, 2024 at 04:46:08PM +0100, Cristian Marussi wrote: > On Wed, Sep 04, 2024 at 04:30:24PM +0100, Cristian Marussi wrote: > > On Wed, Sep 04, 2024 at 04:21:49PM +0100, Cristian Marussi wrote: > > > On Wed, Sep 04, 2024 at 08:43:24AM +0530, Sibi Sankar wrote: > > > > Ensure that the bad duplicates reported by the platform firmware doesn't > > > > get added to the opp-tables. > > > > > > > > > > Hi Sibi, > > > > > > so if the idea is to make the code more robust when FW sends BAD > > > duplicates, you necessarily need to properly drop opps in opp_count too. > > > > > > One other option would be to just loop with xa_for_each BUT opp_count is > > > used in a number of places...so first of all let's try drop count properly. > > > > > > Can you try this patch down below, instead of your patch. > > > If it solves, I will send a patch (after testing it a bit more :D) > > > > Hold on... I sent you a diff that does not apply probably on your tree due > > to some uncomitted local work of mine...my bad...let me resend. > > > > This one should be good...apologies > As a side comment about this, even though we certain can and should make the Kernel more robust to skip possible bad replies from FW (like with this or similar patches), if the bad replies comes by design since, as I have grasped from the other thread, the FW just ALSO exposes resources/OPPs that are just NOT usable by the Kernel OSPM/agent (but maybe by other agents), you should fix your FW to fully avoid the warnings...since the warnings in SCMI/perf are there exactly to catch this kind of situations... (even though we can deal better with the replies as with this patch) IOW, the SCMI server should expose to its agents only the subset of resources and characteristics that are supposed to be used by those respective agents (server can expose a per-agent view of the system)... ...because, even if innocuous and even though most of the time we can cope with such "alien" resources (with the FW simply replying with a DENY to any request not meant to be touched or the Kernel spotting such anomalies and ignoring them) all of these "alien" resources will generate some sort of uneeded background SCMI traffic (of DENY replies) Thanks, Cristian
On 9/4/24 21:50, Cristian Marussi wrote: > On Wed, Sep 04, 2024 at 05:12:29PM +0100, Sudeep Holla wrote: >> On Wed, Sep 04, 2024 at 08:43:24AM +0530, Sibi Sankar wrote: >>> Ensure that the bad duplicates reported by the platform firmware doesn't >>> get added to the opp-tables. >>> >> >> I am really interested to know if the platform firmware is presenting >> duplicates intentionally for some unknown reasons and we are just speculating >> it to be broken firmware or is it really broken firmware. >> >> For me, it is very hard to digest something like OPP tables which is there >> for a very long time now is not very well understood by firmware authors. >> How many duplicates are we seeing on this platform really ? If it is >> just one I can understand. More than one is hard to miss from the OPP >> tables in the firmware. >> >> While I am not opposing to make the driver handle these duplicates, >> I am just worried if they are put there intentionally for reasons we >> don't understand yet or not published. >> > > The number of duplicates reported in logs makes me suspect the same...seems > like intentional/by_design .... but at first I stick to the general issue > of handling bad fw replies and how to survive kernel side at first...but I > indeed share your same concerns... Hey Cristian/Sudeep, The number of opps being duplicated is limited to the max sustainable frequency before we see the turbo frequency. This was pretty much the case in older non scmi perf qc cpufreq drivers. They just filter it there, but I've gotten word that this will get fixed in firmware for this SoC and any future ones planning to use scmi-perf for cpufreq. -Sibi > > Thanks, > Cristian
On 9/4/24 21:00, Cristian Marussi wrote: > On Wed, Sep 04, 2024 at 04:21:49PM +0100, Cristian Marussi wrote: >> On Wed, Sep 04, 2024 at 08:43:24AM +0530, Sibi Sankar wrote: >>> Ensure that the bad duplicates reported by the platform firmware doesn't >>> get added to the opp-tables. >>> >> >> Hi Sibi, >> >> so if the idea is to make the code more robust when FW sends BAD >> duplicates, you necessarily need to properly drop opps in opp_count too. >> >> One other option would be to just loop with xa_for_each BUT opp_count is >> used in a number of places...so first of all let's try drop count properly. >> >> Can you try this patch down below, instead of your patch. >> If it solves, I will send a patch (after testing it a bit more :D) > > Hold on... I sent you a diff that does not apply probably on your tree due > to some uncomitted local work of mine...my bad...let me resend. Hey Cristian, Thanks for taking time to send out the diff. I thought this would be enough but there will still be a disconnect between opp_count and idx of the opp we populate. Consider a case were we get to have a valid opp just after duplicate opp. The opp_count will still limit us on what levels we are allowed to see. -Sibi > > Thanks, > Cristian
On Mon, Oct 07, 2024 at 12:30:14PM +0530, Sibi Sankar wrote: > > > On 9/4/24 21:00, Cristian Marussi wrote: > > On Wed, Sep 04, 2024 at 04:21:49PM +0100, Cristian Marussi wrote: > > > On Wed, Sep 04, 2024 at 08:43:24AM +0530, Sibi Sankar wrote: > > > > Ensure that the bad duplicates reported by the platform firmware doesn't > > > > get added to the opp-tables. > > > > > > > > > > Hi Sibi, > > > > > > so if the idea is to make the code more robust when FW sends BAD > > > duplicates, you necessarily need to properly drop opps in opp_count too. > > > > > > One other option would be to just loop with xa_for_each BUT opp_count is > > > used in a number of places...so first of all let's try drop count properly. > > > > > > Can you try this patch down below, instead of your patch. > > > If it solves, I will send a patch (after testing it a bit more :D) > > > > Hold on... I sent you a diff that does not apply probably on your tree due > > to some uncomitted local work of mine...my bad...let me resend. > > Hey Cristian, > Thanks for taking time to send out the diff. I thought this would be > enough but there will still be a disconnect between opp_count and idx > of the opp we populate. Consider a case were we get to have a valid > opp just after duplicate opp. The opp_count will still limit us on what > levels we are allowed to see. > Ah right...indeed... I missed that the opp_count is used also to loop on the opps arrays and OPPs are not only accessed by xa_load.... ...anyway the index in the dom->opp arrauy is NOT related to index/level indexing, so we just have to have the bad oop dupicates also in the array and NOT only in the XArray... I am sending you, as a reply to this patch, a new version of my fix with just a one-line difference tthat should solve completely the issue also in the usecase that you describe. Thanks, Cristian
diff --git a/drivers/firmware/arm_scmi/perf.c b/drivers/firmware/arm_scmi/perf.c index 2d77b5f40ca7..114c3dd70ede 100644 --- a/drivers/firmware/arm_scmi/perf.c +++ b/drivers/firmware/arm_scmi/perf.c @@ -386,9 +386,11 @@ process_response_opp(struct device *dev, struct perf_dom_info *dom, le16_to_cpu(r->opp[loop_idx].transition_latency_us); ret = xa_insert(&dom->opps_by_lvl, opp->perf, opp, GFP_KERNEL); - if (ret) + if (ret) { dev_warn(dev, "Failed to add opps_by_lvl at %d for %s - ret:%d\n", opp->perf, dom->info.name, ret); + opp->perf = 0; + } } static inline void @@ -404,9 +406,12 @@ process_response_opp_v4(struct device *dev, struct perf_dom_info *dom, le16_to_cpu(r->opp[loop_idx].transition_latency_us); ret = xa_insert(&dom->opps_by_lvl, opp->perf, opp, GFP_KERNEL); - if (ret) + if (ret) { dev_warn(dev, "Failed to add opps_by_lvl at %d for %s - ret:%d\n", opp->perf, dom->info.name, ret); + opp->perf = 0; + return; + } /* Note that PERF v4 reports always five 32-bit words */ opp->indicative_freq = le32_to_cpu(r->opp[loop_idx].indicative_freq); @@ -871,6 +876,10 @@ static int scmi_dvfs_device_opps_add(const struct scmi_protocol_handle *ph, else freq = dom->opp[idx].indicative_freq * dom->mult_factor; + /* Skip all invalid frequencies reported by the firmware */ + if (!freq) + continue; + /* All OPPs above the sustained frequency are treated as turbo */ data.turbo = freq > dom->sustained_freq_khz * 1000;
Ensure that the bad duplicates reported by the platform firmware doesn't get added to the opp-tables. Signed-off-by: Sibi Sankar <quic_sibis@quicinc.com> --- drivers/firmware/arm_scmi/perf.c | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-)