Message ID | 20240523-fix-smn-bad-read-v3-3-aa44c622de39@amd.com (mailing list archive) |
---|---|
State | Handled Elsewhere |
Series | Enhance AMD SMN Error Checking |
> Check the return value of amd_smn_read() before saving a value.
> This ensures invalid values aren't saved or used.
…

Does such information indicate a need for the tag “Fixes” once more?

Regards,
Markus
On Thu, May 23, 2024 at 01:26:54PM -0500, Yazen Ghannam wrote:
> Cc: stable@vger.kernel.org
So yeah, I'll drop the CC:stable tagging in all patches unless we're
talking about a concrete issue. You need to think about the downstream,
distro folks who need to go through gazillion of patches and wonder
whether they really need to backport them.
And I don't think misusing the stable process like that is the right
way.
Thx.
On 6/5/24 8:20 AM, Borislav Petkov wrote:
> On Thu, May 23, 2024 at 01:26:54PM -0500, Yazen Ghannam wrote:
>> Cc: stable@vger.kernel.org
>
> So yeah, I'll drop the CC:stable tagging in all patches unless we're
> talking about a concrete issue. You need to think about the downstream,
> distro folks who need to go through gazillion of patches and wonder
> whether they really need to backport them.
>
> And I don't think misusing the stable process like that is the right
> way.
>

I agree that patches 1-3 are not stable-worthy on their own. But I think
patch 4 is, and it requires 1-3 to avoid build errors.

Is there a preferred way to highlight this while patches are in review?

Thanks,
Yazen
On Wed, Jun 05, 2024 at 09:41:51AM -0400, Yazen Ghannam wrote:
> I agree that patches 1-3 are not stable-worthy on their own. But I think
> patch 4 is, and it requires 1-3 to avoid build errors.

Which of the rules in the first section of
Documentation/process/stable-kernel-rules.rst apply for patch 4?

Because I don't see it.
On 6/5/24 12:12 PM, Borislav Petkov wrote:
> On Wed, Jun 05, 2024 at 09:41:51AM -0400, Yazen Ghannam wrote:
>> I agree that patches 1-3 are not stable-worthy on their own. But I think
>> patch 4 is, and it requires 1-3 to avoid build errors.
>
> Which of the rules in the first section of
> Documentation/process/stable-kernel-rules.rst apply for patch 4?
>
> Because I don't see it.
>

"It fixes a problem like ... a hardware quirk ..."

This is described in patch 4:
---
Most systems will return 0 for SMN addresses that are not accessible.
This is in line with AMD convention that unavailable registers are
Read-as-Zero/Writes-Ignored.

However, some systems will return a "PCI Error Response" instead. This
value, along with an error code of 0 from the PCI config access, will
confuse callers of the amd_smn_read() function.
---

But I think it's fine to drop the stable tag after reading through the
rules again. I'll do option 2 or 3 if there's interest for specific
branches. And the cherry-pick thing should be easy to do if all the
prerequisites are already upstream.

Thanks,
Yazen
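The failure mode quoted from patch 4 (an access that reports success but hands back an all-ones value) can be illustrated with a small userspace sketch. Everything here is hypothetical and for illustration only: `raw_read_fn`, `checked_read()`, the two mock readers, and the choice of `-ENODEV` are assumptions, not the kernel's actual amd_smn_read() implementation.

```c
#include <errno.h>
#include <stdint.h>

/* All-ones value that a failed PCI config read can return. */
#define PCI_ERROR_VALUE 0xffffffffu

/* Hypothetical stand-in for the raw register access (not a kernel API). */
typedef int (*raw_read_fn)(uint32_t address, uint32_t *value);

/*
 * Wrap a raw read so that callers see an error in both failure cases:
 * a nonzero return code, or a "successful" access that returned all ones.
 * The error code -ENODEV is an assumption made for this sketch.
 */
static int checked_read(raw_read_fn raw_read, uint32_t address, uint32_t *value)
{
	int err = raw_read(address, value);

	if (!err && *value == PCI_ERROR_VALUE)
		err = -ENODEV;

	/* Never leave a bogus value behind for the caller to use. */
	if (err)
		*value = 0;

	return err;
}

/* Mock: system that answers an inaccessible address with all ones. */
static int mock_read_all_ones(uint32_t address, uint32_t *value)
{
	(void)address;
	*value = PCI_ERROR_VALUE;
	return 0;
}

/* Mock: system that follows the Read-as-Zero convention. */
static int mock_read_zero(uint32_t address, uint32_t *value)
{
	(void)address;
	*value = 0;
	return 0;
}
```

With the first mock, `checked_read()` reports an error and zeroes the value; with the second, it reports success with a value of 0, so callers that already handle Read-as-Zero keep working unchanged.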
On Wed, Jun 05, 2024 at 12:30:35PM -0400, Yazen Ghannam wrote:
> "It fixes a problem like ... a hardware quirk ..."

I'm pretty sure that means a patch which sets a magic bit in some MSR or
does something else to make the hardware work again. Errata fix and some
other hackery we get to do from time to time. Or my favourite - fix
a BIOS f*ckup.

> Most systems will return 0 for SMN addresses that are not accessible.
> This is in line with AMD convention that unavailable registers are
> Read-as-Zero/Writes-Ignored.
>
> However, some systems will return a "PCI Error Response" instead. This
> value, along with an error code of 0 from the PCI config access, will
> confuse callers of the amd_smn_read() function.

Yes, but it hasn't so far. It is all pretty-much, a hypothetical, "what
if" thing.

Sure, if that error would cause a serious issue on some system, by any
means. But just because it might potentially happen... Meh.

> But I think it's fine to drop the stable tag after reading through the
> rules again. I'll do option 2 or 3 if there's interest for specific
> branches. And the cherry-pick thing should be easy to do if all the
> prerequisites are already upstream.

Just wait until some real issue happens. Otherwise, you'll be pretty
much wasting time and energy.

And, btw, people should upgrade their kernels on a regular basis - not
run old, Frankenstein backported crap and think they've got the best of
both worlds.

Thx.
On 6/5/24 12:45 PM, Borislav Petkov wrote:
> On Wed, Jun 05, 2024 at 12:30:35PM -0400, Yazen Ghannam wrote:
>> "It fixes a problem like ... a hardware quirk ..."
>
> I'm pretty sure that means a patch which sets a magic bit in some MSR or
> does something else to make the hardware work again. Errata fix and some
> other hackery we get to do from time to time. Or my favourite - fix
> a BIOS f*ckup.
>

Yeah, makes sense. I agree.

>> Most systems will return 0 for SMN addresses that are not accessible.
>> This is in line with AMD convention that unavailable registers are
>> Read-as-Zero/Writes-Ignored.
>>
>> However, some systems will return a "PCI Error Response" instead. This
>> value, along with an error code of 0 from the PCI config access, will
>> confuse callers of the amd_smn_read() function.
>
> Yes, but it hasn't so far. It is all pretty-much, a hypothetical, "what
> if" thing.
>
> Sure, if that error would cause a serious issue on some system, by any
> means. But just because it might potentially happen... Meh.
>
>> But I think it's fine to drop the stable tag after reading through the
>> rules again. I'll do option 2 or 3 if there's interest for specific
>> branches. And the cherry-pick thing should be easy to do if all the
>> prerequisites are already upstream.
>
> Just wait until some real issue happens. Otherwise, you'll be pretty
> much wasting time and energy.
>
> And, btw, people should upgrade their kernels on a regular basis - not
> run old, Frankenstein backported crap and think they've got the best of
> both worlds.
>

Okay, no problem.

Thanks,
Yazen
diff --git a/drivers/hwmon/k10temp.c b/drivers/hwmon/k10temp.c
index 8092312c0a87..6cad35e7f182 100644
--- a/drivers/hwmon/k10temp.c
+++ b/drivers/hwmon/k10temp.c
@@ -153,8 +153,9 @@ static void read_tempreg_nb_f15(struct pci_dev *pdev, u32 *regval)
 
 static void read_tempreg_nb_zen(struct pci_dev *pdev, u32 *regval)
 {
-	amd_smn_read(amd_pci_dev_to_node_id(pdev),
-		     ZEN_REPORTED_TEMP_CTRL_BASE, regval);
+	if (amd_smn_read(amd_pci_dev_to_node_id(pdev),
+			 ZEN_REPORTED_TEMP_CTRL_BASE, regval))
+		*regval = 0;
 }
 
 static long get_raw_temp(struct k10temp_data *data)
@@ -205,6 +206,7 @@ static int k10temp_read_temp(struct device *dev, u32 attr, int channel,
 			     long *val)
 {
 	struct k10temp_data *data = dev_get_drvdata(dev);
+	int ret = -EOPNOTSUPP;
 	u32 regval;
 
 	switch (attr) {
@@ -221,13 +223,17 @@ static int k10temp_read_temp(struct device *dev, u32 attr, int channel,
 				*val = 0;
 			break;
 		case 2 ... 13:		/* Tccd{1-12} */
-			amd_smn_read(amd_pci_dev_to_node_id(data->pdev),
-				     ZEN_CCD_TEMP(data->ccd_offset, channel - 2),
-				     &regval);
+			ret = amd_smn_read(amd_pci_dev_to_node_id(data->pdev),
+					   ZEN_CCD_TEMP(data->ccd_offset, channel - 2),
+					   &regval);
+
+			if (ret)
+				return ret;
+
 			*val = (regval & ZEN_CCD_TEMP_MASK) * 125 - 49000;
 			break;
 		default:
-			return -EOPNOTSUPP;
+			return ret;
 		}
 		break;
 	case hwmon_temp_max:
@@ -243,7 +249,7 @@ static int k10temp_read_temp(struct device *dev, u32 attr, int channel,
 			- ((regval >> 24) & 0xf)) * 500 + 52000;
 		break;
 	default:
-		return -EOPNOTSUPP;
+		return ret;
 	}
 	return 0;
 }
@@ -381,8 +387,20 @@ static void k10temp_get_ccd_support(struct pci_dev *pdev,
 	int i;
 
 	for (i = 0; i < limit; i++) {
-		amd_smn_read(amd_pci_dev_to_node_id(pdev),
-			     ZEN_CCD_TEMP(data->ccd_offset, i), &regval);
+		/*
+		 * Ignore inaccessible CCDs.
+		 *
+		 * Some systems will return a register value of 0, and the TEMP_VALID
+		 * bit check below will naturally fail.
+		 *
+		 * Other systems will return a PCI_ERROR_RESPONSE (0xFFFFFFFF) for
+		 * the register value. And this will incorrectly pass the TEMP_VALID
+		 * bit check.
+		 */
+		if (amd_smn_read(amd_pci_dev_to_node_id(pdev),
+				 ZEN_CCD_TEMP(data->ccd_offset, i), &regval))
+			continue;
+
 		if (regval & ZEN_CCD_TEMP_VALID)
 			data->show_temp |= BIT(TCCD_BIT(i));
 	}
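The comment in the last hunk can be made concrete with a small userspace sketch. It shows why an all-ones register value passes a single-bit validity check, and what temperature the conversion from the diff would then report. The bit position of the valid flag and the width of the temperature field are assumptions chosen for illustration (check k10temp.c for the real definitions), and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout for illustration: valid flag in bit 11, temperature in bits 10:0. */
#define CCD_TEMP_VALID  (1u << 11)
#define CCD_TEMP_MASK   0x7ffu

/* All-ones value that a failed PCI config read can return. */
#define PCI_ERROR_VALUE 0xffffffffu

/* Behavior before the patch: only the register value is inspected. */
static bool ccd_present_unchecked(uint32_t regval)
{
	return (regval & CCD_TEMP_VALID) != 0;
}

/* Behavior after the patch: a failed read (nonzero err) skips the CCD. */
static bool ccd_present_checked(int err, uint32_t regval)
{
	if (err)
		return false;

	return (regval & CCD_TEMP_VALID) != 0;
}

/* Conversion taken from the diff: raw field to millidegrees Celsius. */
static long ccd_temp_millidegrees(uint32_t regval)
{
	return (long)(regval & CCD_TEMP_MASK) * 125 - 49000;
}
```

Under these assumptions, `ccd_present_unchecked(PCI_ERROR_VALUE)` is true even though no CCD answered, and `ccd_temp_millidegrees(PCI_ERROR_VALUE)` evaluates to 206875, that is 206.875 degrees Celsius. A register value of 0 fails the validity check on its own, which is why only the all-ones case needs the return-code check.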