diff mbox series

drm/i915/hwmon: Use 0 to designate disabled PL1 power limit

Message ID 20230328233543.1091127-1-ashutosh.dixit@intel.com (mailing list archive)
State New, archived
Series drm/i915/hwmon: Use 0 to designate disabled PL1 power limit

Commit Message

Dixit, Ashutosh March 28, 2023, 11:35 p.m. UTC
On ATSM the PL1 limit is disabled at power up. The previous uapi assumed
that the PL1 limit is always enabled and therefore did not have a notion of
a disabled PL1 limit. This results in erroneous PL1 limit values when the
PL1 limit is disabled. For example at power up, the disabled ATSM PL1 limit
was previously shown as 0 which means a low PL1 limit whereas the limit
being disabled actually implies a high effective PL1 limit value.

To get round this problem, the PL1 limit uapi is expanded to include a
special value 0 to designate a disabled PL1 limit.

Bug: https://gitlab.freedesktop.org/drm/intel/-/issues/8062
Bug: https://gitlab.freedesktop.org/drm/intel/-/issues/8060
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
---
 .../ABI/testing/sysfs-driver-intel-i915-hwmon |  3 ++-
 drivers/gpu/drm/i915/i915_hwmon.c             | 24 +++++++++++++++++++
 2 files changed, 26 insertions(+), 1 deletion(-)

Comments

Dixit, Ashutosh March 30, 2023, 5:50 a.m. UTC | #1
On Tue, 28 Mar 2023 16:35:43 -0700, Ashutosh Dixit wrote:
>
> On ATSM the PL1 limit is disabled at power up. The previous uapi assumed
> that the PL1 limit is always enabled and therefore did not have a notion of
> a disabled PL1 limit. This results in erroneous PL1 limit values when the
> PL1 limit is disabled. For example at power up, the disabled ATSM PL1 limit
> was previously shown as 0 which means a low PL1 limit whereas the limit
> being disabled actually implies a high effective PL1 limit value.
>
> To get round this problem, the PL1 limit uapi is expanded to include a
> special value 0 to designate a disabled PL1 limit.

This patch is another attempt to show when the PL1 power limit is disabled
and to allow disabling it when needed. Previous, abandoned attempts to do
this are [1] and [2].

The preferred way to do this was [2] but that was NAK'd by hwmon folks (see
[2]). That is why here we fall back on the approach in [1].

This patch is identical to [1] except that the value used to disable the
PL1 limit has been changed to 0 (from -1 in [1]) as was suggested in [2]
(both -1 and 0 seem ok for the purpose).

> Bug: https://gitlab.freedesktop.org/drm/intel/-/issues/8062
> Bug: https://gitlab.freedesktop.org/drm/intel/-/issues/8060

The link between this patch and these pretty serious bugs might not be
immediately clear so here's an explanation:

* Because on ATSM the PL1 power limit is disabled on power up and there
  were no means to enable it, in 6fd3d8bf89fc we implemented the means to
  enable the limit when the PL1 hwmon entry (power1_max) was written to.

* Now there is an IGT, igt@i915_hwmon@hwmon_write, which (a) reads the
  original value from all hwmon sysfs entries, (b) does a bunch of random
  writes and finally (c) restores the original value read. On ATSM, since
  the original value was 0, when the IGT restores the 0 value, the PL1
  limit is now enabled with a value of 0.

* A PL1 limit of 0 implies a low PL1 limit, which causes the GPU freq to
  fall to 100 MHz. This causes GuC FW load and several IGTs to start
  timing out and gives rise to the above (and even more) bugs about GuC FW
  load timing out.

* After this patch, writing 0 would disable the PL1 limit instead of
  enabling it, avoiding the freq drop issue above, and resolving this Intel
  CI issue.
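
The sequence in the bullets above can be sketched as a small standalone C
model. This is only an illustration under stated assumptions, not driver
code: struct pl1_model, save_restore() and the old_*/new_* helpers are
hypothetical names, the register is reduced to an enable flag plus a limit
value, and the min/max clamping and scaling done by the real driver are
ignored.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define PL1_DISABLE 0

/* Hypothetical stand-in for the PL1 part of the package RAPL limit register */
struct pl1_model {
	bool enabled;		/* models the PKG_PWR_LIM_1_EN bit */
	uint32_t limit_uw;	/* models the PKG_PWR_LIM_1 field, in microwatts */
	bool can_disable;	/* PL1 cannot be disabled on all platforms */
};

/* Old uapi: the enable bit is ignored, so a disabled limit reads as 0 */
static long old_read(const struct pl1_model *m)
{
	return (long)m->limit_uw;
}

/* Old uapi: every write enables the limit, including a write of 0 */
static int old_write(struct pl1_model *m, long val)
{
	m->limit_uw = (uint32_t)val;
	m->enabled = true;
	return 0;
}

/* New uapi: a disabled limit is reported as the special value 0 */
static long new_read(const struct pl1_model *m)
{
	return m->enabled ? (long)m->limit_uw : PL1_DISABLE;
}

/* New uapi: writing 0 disables the limit if the platform allows it */
static int new_write(struct pl1_model *m, long val)
{
	if (val == PL1_DISABLE) {
		if (!m->can_disable)
			return -EPERM;
		m->enabled = false;
		return 0;
	}
	m->limit_uw = (uint32_t)val;
	m->enabled = true;
	return 0;
}

/* Mimics the IGT pattern: read original value, write something, restore */
static void save_restore(struct pl1_model *m,
			 long (*rd)(const struct pl1_model *),
			 int (*wr)(struct pl1_model *, long))
{
	long orig = rd(m);

	wr(m, 25000000);	/* arbitrary intermediate write */
	wr(m, orig);		/* restore the value read at the start */
}

/* Runs both uapi variants on a limit that is disabled at power up */
int pl1_demo(void)
{
	struct pl1_model before = { false, 0, true };
	struct pl1_model after = { false, 0, true };

	save_restore(&before, old_read, old_write);
	assert(before.enabled && before.limit_uw == 0);	/* enabled at 0 */

	save_restore(&after, new_read, new_write);
	assert(!after.enabled);				/* disabled again */
	return 0;
}
```

In this model the old read/write pair leaves the limit enabled with a value
of 0 after the save/restore cycle, while the new pair returns it to the
disabled state it had at power up.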

Thanks.
--
Ashutosh

[1] https://patchwork.freedesktop.org/patch/522612/?series=113972&rev=1
[2] https://patchwork.freedesktop.org/patch/522652/?series=113984&rev=1
Rodrigo Vivi March 30, 2023, 3:44 p.m. UTC | #2
On Wed, Mar 29, 2023 at 10:50:09PM -0700, Dixit, Ashutosh wrote:
> On Tue, 28 Mar 2023 16:35:43 -0700, Ashutosh Dixit wrote:
> >
> > On ATSM the PL1 limit is disabled at power up. The previous uapi assumed
> > that the PL1 limit is always enabled and therefore did not have a notion of
> > a disabled PL1 limit. This results in erroneous PL1 limit values when the
> > PL1 limit is disabled. For example at power up, the disabled ATSM PL1 limit
> > was previously shown as 0 which means a low PL1 limit whereas the limit
> > being disabled actually implies a high effective PL1 limit value.
> >
> > To get round this problem, the PL1 limit uapi is expanded to include a
> > special value 0 to designate a disabled PL1 limit.
> 
> This patch is another attempt to show when the PL1 power limit is disabled
> and to disable it when it needs to. Previous abandoned attempts to do this
> are [1] and [2].
> 
> The preferred way to do this was [2] but that was NAK'd by hwmon folks (see
> [2]). That is why here we fall back on the approach in [1].

I still don't get it, but let's move on...

> 
> This patch is identical to [1] except that the value used to disable the
> PL1 limit has been changed to 0 (from -1 in [1]) as was suggested in [2]
> (both -1 and 0 seem ok for the purpose).
> 
> > Bug: https://gitlab.freedesktop.org/drm/intel/-/issues/8062
> > Bug: https://gitlab.freedesktop.org/drm/intel/-/issues/8060
> 
> The link between this patch and these pretty serious bugs might not be
> immediately clear so here's an explanation:
> 
> * Because on ATSM the PL1 power limit is disabled on power up and there
>   were no means to enable it, in 6fd3d8bf89fc we implemented the means to
>   enable the limit when the PL1 hwmon entry (power1_max) was written to.
> 
> * Now there is an IGT igt@i915_hwmon@hwmon_write which (a) reads orig value
>   from all hwmon sysfs  (b) does a bunch of random writes and finally (c)
>   restores the orig value read. On ATSM since the orig value was 0, when
>   the IGT restores the 0 value, the PL1 limit is now enabled with a value
>   of 0.
> 
> * PL1 limit of 0 implies a low PL1 limit which causes GPU freq to fall to
>   100 MHz. This causes GuC FW load and several IGT's to start timing out
>   and gives rise to the above (and even more) bugs about GuC FW load timing
>   out.

I believe these 3 bullets are key information that deserves to be in
the commit message itself.

With that there,

Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>


> 
> * After this patch, writing 0 would disable the PL1 limit instead of
>   enabling it, avoiding the freq drop issue above, and resolving this Intel
>   CI issue.
> 
> Thanks.
> --
> Ashutosh
> 
> [1] https://patchwork.freedesktop.org/patch/522612/?series=113972&rev=1
> [2] https://patchwork.freedesktop.org/patch/522652/?series=113984&rev=1
Dixit, Ashutosh March 31, 2023, 2:17 a.m. UTC | #3
On Thu, 30 Mar 2023 08:44:34 -0700, Rodrigo Vivi wrote:
>
> On Wed, Mar 29, 2023 at 10:50:09PM -0700, Dixit, Ashutosh wrote:
> > On Tue, 28 Mar 2023 16:35:43 -0700, Ashutosh Dixit wrote:
> > >
> > > On ATSM the PL1 limit is disabled at power up. The previous uapi assumed
> > > that the PL1 limit is always enabled and therefore did not have a notion of
> > > a disabled PL1 limit. This results in erroneous PL1 limit values when the
> > > PL1 limit is disabled. For example at power up, the disabled ATSM PL1 limit
> > > was previously shown as 0 which means a low PL1 limit whereas the limit
> > > being disabled actually implies a high effective PL1 limit value.
> > >
> > > To get round this problem, the PL1 limit uapi is expanded to include a
> > > special value 0 to designate a disabled PL1 limit.
> >
> > This patch is another attempt to show when the PL1 power limit is disabled
> > and to disable it when it needs to. Previous abandoned attempts to do this
> > are [1] and [2].
> >
> > The preferred way to do this was [2] but that was NAK'd by hwmon folks (see
> > [2]). That is why here we fall back on the approach in [1].
>
> I still don't get it, but let's move on...
>
> >
> > This patch is identical to [1] except that the value used to disable the
> > PL1 limit has been changed to 0 (from -1 in [1]) as was suggested in [2]
> > (both -1 and 0 seem ok for the purpose).
> >
> > > Bug: https://gitlab.freedesktop.org/drm/intel/-/issues/8062
> > > Bug: https://gitlab.freedesktop.org/drm/intel/-/issues/8060
> >
> > The link between this patch and these pretty serious bugs might not be
> > immediately clear so here's an explanation:
> >
> > * Because on ATSM the PL1 power limit is disabled on power up and there
> >   were no means to enable it, in 6fd3d8bf89fc we implemented the means to
> >   enable the limit when the PL1 hwmon entry (power1_max) was written to.
> >
> > * Now there is an IGT igt@i915_hwmon@hwmon_write which (a) reads orig value
> >   from all hwmon sysfs  (b) does a bunch of random writes and finally (c)
> >   restores the orig value read. On ATSM since the orig value was 0, when
> >   the IGT restores the 0 value, the PL1 limit is now enabled with a value
> >   of 0.
> >
> > * PL1 limit of 0 implies a low PL1 limit which causes GPU freq to fall to
> >   100 MHz. This causes GuC FW load and several IGT's to start timing out
> >   and gives rise to the above (and even more) bugs about GuC FW load timing
> >   out.
>
> I believe these 3 bullets are key information that deserves to be in
> the commit message itself.

Done in v2.

>
> With that there,
>
> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

Thanks.
--
Ashutosh


>
>
> >
> > * After this patch, writing 0 would disable the PL1 limit instead of
> >   enabling it, avoiding the freq drop issue above, and resolving this Intel
> >   CI issue.
> >
> > Thanks.
> > --
> > Ashutosh
> >
> > [1] https://patchwork.freedesktop.org/patch/522612/?series=113972&rev=1
> > [2] https://patchwork.freedesktop.org/patch/522652/?series=113984&rev=1

Patch

diff --git a/Documentation/ABI/testing/sysfs-driver-intel-i915-hwmon b/Documentation/ABI/testing/sysfs-driver-intel-i915-hwmon
index 2d6a472eef885..96fec0bb74c2c 100644
--- a/Documentation/ABI/testing/sysfs-driver-intel-i915-hwmon
+++ b/Documentation/ABI/testing/sysfs-driver-intel-i915-hwmon
@@ -14,7 +14,8 @@  Description:	RW. Card reactive sustained  (PL1/Tau) power limit in microwatts.
 
 		The power controller will throttle the operating frequency
 		if the power averaged over a window (typically seconds)
-		exceeds this limit.
+		exceeds this limit. A read value of 0 means that the PL1 power
+		limit is disabled. Writing 0 disables the limit if possible.
 
 		Only supported for particular Intel i915 graphics platforms.
 
diff --git a/drivers/gpu/drm/i915/i915_hwmon.c b/drivers/gpu/drm/i915/i915_hwmon.c
index 596dd2c070106..c099057888914 100644
--- a/drivers/gpu/drm/i915/i915_hwmon.c
+++ b/drivers/gpu/drm/i915/i915_hwmon.c
@@ -349,6 +349,8 @@  hwm_power_is_visible(const struct hwm_drvdata *ddat, u32 attr, int chan)
 	}
 }
 
+#define PL1_DISABLE 0
+
 /*
  * HW allows arbitrary PL1 limits to be set but silently clamps these values to
  * "typical but not guaranteed" min/max values in rg.pkg_power_sku. Follow the
@@ -362,6 +364,14 @@  hwm_power_max_read(struct hwm_drvdata *ddat, long *val)
 	intel_wakeref_t wakeref;
 	u64 r, min, max;
 
+	/* Check if PL1 limit is disabled */
+	with_intel_runtime_pm(ddat->uncore->rpm, wakeref)
+		r = intel_uncore_read(ddat->uncore, hwmon->rg.pkg_rapl_limit);
+	if (!(r & PKG_PWR_LIM_1_EN)) {
+		*val = PL1_DISABLE;
+		return 0;
+	}
+
 	*val = hwm_field_read_and_scale(ddat,
 					hwmon->rg.pkg_rapl_limit,
 					PKG_PWR_LIM_1,
@@ -385,8 +395,22 @@  static int
 hwm_power_max_write(struct hwm_drvdata *ddat, long val)
 {
 	struct i915_hwmon *hwmon = ddat->hwmon;
+	intel_wakeref_t wakeref;
 	u32 nval;
 
+	if (val == PL1_DISABLE) {
+		/* Disable PL1 limit */
+		hwm_locked_with_pm_intel_uncore_rmw(ddat, hwmon->rg.pkg_rapl_limit,
+						    PKG_PWR_LIM_1_EN, 0);
+
+		/* Verify, because PL1 limit cannot be disabled on all platforms */
+		with_intel_runtime_pm(ddat->uncore->rpm, wakeref)
+			nval = intel_uncore_read(ddat->uncore, hwmon->rg.pkg_rapl_limit);
+		if (nval & PKG_PWR_LIM_1_EN)
+			return -EPERM;
+		return 0;
+	}
+
 	/* Computation in 64-bits to avoid overflow. Round to nearest. */
 	nval = DIV_ROUND_CLOSEST_ULL((u64)val << hwmon->scl_shift_power, SF_POWER);
 	nval = PKG_PWR_LIM_1_EN | REG_FIELD_PREP(PKG_PWR_LIM_1, nval);