Message ID | 20240515214258.59209-9-lucas.demarchi@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | drm/xe: Per client usage | expand |
On 15/05/2024 22:42, Lucas De Marchi wrote: > Print the accumulated runtime for client when printing fdinfo. > Each time a query is done it first does 2 things: > > 1) loop through all the exec queues for the current client and > accumulate the runtime, per engine class. CTX_TIMESTAMP is used for > that, being read from the context image. > > 2) Read a "GPU timestamp" that can be used for considering "how much GPU > time has passed" and that has the same unit/refclock as the one > recording the runtime. RING_TIMESTAMP is used for that via MMIO. > > Since for all current platforms RING_TIMESTAMP follows the same > refclock, just read it once, using any first engine available. > > This is exported to userspace as 2 numbers in fdinfo: > > drm-cycles-<class>: <RUNTIME> > drm-total-cycles-<class>: <TIMESTAMP> > > Userspace is expected to collect at least 2 samples, which allows to > know the client engine busyness as per: > > RUNTIME1 - RUNTIME0 > busyness = --------------------- > T1 - T0 > > Since drm-cycles-<class> always starts at 0, it's also possible to know > if and engine was ever used by a client. > > It's expected that userspace will read any 2 samples every few seconds. > Given the update frequency of the counters involved and that > CTX_TIMESTAMP is 32-bits, the counter for each exec_queue can wrap > around (assuming 100% utilization) after ~200s. The wraparound is not > perceived by userspace since it's just accumulated for all the > exec_queues in a 64-bit counter) but the measurement will not be > accurate if the samples are too far apart. > > This could be mitigated by adding a workqueue to accumulate the counters > every so often, but it's additional complexity for something that is > done already by userspace every few seconds in tools like gputop (from > igt), htop, nvtop, etc, with none of them really defaulting to 1 sample > per minute or more. > > Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> > --- > Documentation/gpu/drm-usage-stats.rst | 21 +++- > Documentation/gpu/xe/index.rst | 1 + > Documentation/gpu/xe/xe-drm-usage-stats.rst | 10 ++ > drivers/gpu/drm/xe/xe_drm_client.c | 121 +++++++++++++++++++- > 4 files changed, 150 insertions(+), 3 deletions(-) > create mode 100644 Documentation/gpu/xe/xe-drm-usage-stats.rst > > diff --git a/Documentation/gpu/drm-usage-stats.rst b/Documentation/gpu/drm-usage-stats.rst > index 6dc299343b48..a80f95ca1b2f 100644 > --- a/Documentation/gpu/drm-usage-stats.rst > +++ b/Documentation/gpu/drm-usage-stats.rst > @@ -112,6 +112,19 @@ larger value within a reasonable period. Upon observing a value lower than what > was previously read, userspace is expected to stay with that larger previous > value until a monotonic update is seen. > > +- drm-total-cycles-<keystr>: <uint> > + > +Engine identifier string must be the same as the one specified in the > +drm-cycles-<keystr> tag and shall contain the total number cycles for the given > +engine. > + > +This is a timestamp in GPU unspecified unit that matches the update rate > +of drm-cycles-<keystr>. For drivers that implement this interface, the engine > +utilization can be calculated entirely on the GPU clock domain, without > +considering the CPU sleep time between 2 samples. > + > +A driver may implement either this key or drm-maxfreq-<keystr>, but not both. > + > - drm-maxfreq-<keystr>: <uint> [Hz|MHz|KHz] > > Engine identifier string must be the same as the one specified in the > @@ -121,6 +134,9 @@ percentage utilization of the engine, whereas drm-engine-<keystr> only reflects > time active without considering what frequency the engine is operating as a > percentage of its maximum frequency. > > +A driver may implement either this key or drm-total-cycles-<keystr>, but not > +both. > + For the spec part: Acked-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Some minor comments and questions below. > Memory > ^^^^^^ > > @@ -168,5 +184,6 @@ be documented above and where possible, aligned with other drivers. > Driver specific implementations > ------------------------------- > > -:ref:`i915-usage-stats` > -:ref:`panfrost-usage-stats` > +* :ref:`i915-usage-stats` > +* :ref:`panfrost-usage-stats` > +* :ref:`xe-usage-stats` > diff --git a/Documentation/gpu/xe/index.rst b/Documentation/gpu/xe/index.rst > index c224ecaee81e..3f07aa3b5432 100644 > --- a/Documentation/gpu/xe/index.rst > +++ b/Documentation/gpu/xe/index.rst > @@ -23,3 +23,4 @@ DG2, etc is provided to prototype the driver. > xe_firmware > xe_tile > xe_debugging > + xe-drm-usage-stats.rst > diff --git a/Documentation/gpu/xe/xe-drm-usage-stats.rst b/Documentation/gpu/xe/xe-drm-usage-stats.rst > new file mode 100644 > index 000000000000..482d503ae68a > --- /dev/null > +++ b/Documentation/gpu/xe/xe-drm-usage-stats.rst > @@ -0,0 +1,10 @@ > +.. SPDX-License-Identifier: GPL-2.0+ > + > +.. _xe-usage-stats: > + > +======================================== > +Xe DRM client usage stats implementation > +======================================== > + > +.. kernel-doc:: drivers/gpu/drm/xe/xe_drm_client.c > + :doc: DRM Client usage stats > diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c > index 08f0b7c95901..952b0cc87708 100644 > --- a/drivers/gpu/drm/xe/xe_drm_client.c > +++ b/drivers/gpu/drm/xe/xe_drm_client.c > @@ -2,6 +2,7 @@ > /* > * Copyright © 2023 Intel Corporation > */ > +#include "xe_drm_client.h" > > #include <drm/drm_print.h> > #include <drm/xe_drm.h> > @@ -12,9 +13,66 @@ > #include "xe_bo.h" > #include "xe_bo_types.h" > #include "xe_device_types.h" > -#include "xe_drm_client.h" > +#include "xe_exec_queue.h" > +#include "xe_force_wake.h" > +#include "xe_gt.h" > +#include "xe_hw_engine.h" > +#include "xe_pm.h" > #include "xe_trace.h" > > +/** > + * DOC: DRM Client usage stats > + * > + * The drm/xe driver implements the DRM client usage stats specification as > + * documented in :ref:`drm-client-usage-stats`. > + * > + * Example of the output showing the implemented key value pairs and entirety of > + * the currently possible format options: > + * > + * :: > + * > + * pos: 0 > + * flags: 0100002 > + * mnt_id: 26 > + * ino: 685 > + * drm-driver: xe > + * drm-client-id: 3 > + * drm-pdev: 0000:03:00.0 > + * drm-total-system: 0 > + * drm-shared-system: 0 > + * drm-active-system: 0 > + * drm-resident-system: 0 > + * drm-purgeable-system: 0 > + * drm-total-gtt: 192 KiB > + * drm-shared-gtt: 0 > + * drm-active-gtt: 0 > + * drm-resident-gtt: 192 KiB > + * drm-total-vram0: 23992 KiB > + * drm-shared-vram0: 16 MiB > + * drm-active-vram0: 0 > + * drm-resident-vram0: 23992 KiB > + * drm-total-stolen: 0 > + * drm-shared-stolen: 0 > + * drm-active-stolen: 0 > + * drm-resident-stolen: 0 > + * drm-cycles-rcs: 28257900 > + * drm-total-cycles-rcs: 7655183225 > + * drm-cycles-bcs: 0 > + * drm-total-cycles-bcs: 7655183225 > + * drm-cycles-vcs: 0 > + * drm-total-cycles-vcs: 7655183225 > + * drm-engine-capacity-vcs: 2 > + * drm-cycles-vecs: 0 > + * drm-total-cycles-vecs: 7655183225 > + * drm-engine-capacity-vecs: 2 > + * drm-cycles-ccs: 0 > + * drm-total-cycles-ccs: 7655183225 > + * drm-engine-capacity-ccs: 4 > + * > + * Possible `drm-cycles-` key names are: `rcs`, `ccs`, `bcs`, `vcs`, `vecs` and > + * "other". > + */ > + > /** > * xe_drm_client_alloc() - Allocate drm client > * @void: No arg > @@ -179,6 +237,66 @@ static void show_meminfo(struct drm_printer *p, struct drm_file *file) > } > } > > +static void show_runtime(struct drm_printer *p, struct drm_file *file) > +{ > + unsigned long class, i, gt_id, capacity[XE_ENGINE_CLASS_MAX] = { }; Why long? But it doesn't matter really, at least on the architectures I know. > + struct xe_file *xef = file->driver_priv; > + struct xe_device *xe = xef->xe; > + struct xe_gt *gt; > + struct xe_hw_engine *hwe; > + struct xe_exec_queue *q; > + u64 gpu_timestamp; > + > + xe_pm_runtime_get(xe); > + > + /* Accumulate all the exec queues from this client */ > + mutex_lock(&xef->exec_queue.lock); > + xa_for_each(&xef->exec_queue.xa, i, q) > + xe_exec_queue_update_runtime(q); > + mutex_unlock(&xef->exec_queue.lock); > + > + /* Get the total GPU cycles */ > + for_each_gt(gt, xe, gt_id) { > + hwe = xe_gt_any_hw_engine(gt); > + if (!hwe) > + continue; > + > + xe_force_wake_get(gt_to_fw(gt), XE_FW_GT); > + gpu_timestamp = xe_hw_engine_read_timestamp(hwe); > + xe_force_wake_put(gt_to_fw(gt), XE_FW_GT); > + break; > + } Runtime pm get/put could be just around this block, right? But overall now it reads super clean, thanks for accomodating my ideas! Regards, Tvrtko > + > + if (unlikely(!hwe)) > + return; > + > + for (class = 0; class < XE_ENGINE_CLASS_MAX; class++) { > + const char *class_name; > + > + for_each_gt(gt, xe, gt_id) > + capacity[class] += gt->user_engines.instances_per_class[class]; > + > + /* > + * Engines may be fused off or not exposed to userspace. Don't > + * return anything if this entire class is not available > + */ > + if (!capacity[class]) > + continue; > + > + class_name = xe_hw_engine_class_to_str(class); > + drm_printf(p, "drm-cycles-%s:\t%llu\n", > + class_name, xef->runtime[class]); > + drm_printf(p, "drm-total-cycles-%s:\t%llu\n", > + class_name, gpu_timestamp); > + > + if (capacity[class] > 1) > + drm_printf(p, "drm-engine-capacity-%s:\t%lu\n", > + class_name, capacity[class]); > + } > + > + xe_pm_runtime_put(xe); > +} > + > /** > * xe_drm_client_fdinfo() - Callback for fdinfo interface > * @p: The drm_printer ptr > @@ -192,5 +310,6 @@ static void show_meminfo(struct drm_printer *p, struct drm_file *file) > void xe_drm_client_fdinfo(struct drm_printer *p, struct drm_file *file) > { > show_meminfo(p, file); > + show_runtime(p, file); > } > #endif
On Thu, May 16, 2024 at 08:57:48AM GMT, Tvrtko Ursulin wrote: > >On 15/05/2024 22:42, Lucas De Marchi wrote: >>Print the accumulated runtime for client when printing fdinfo. >>Each time a query is done it first does 2 things: >> >>1) loop through all the exec queues for the current client and >> accumulate the runtime, per engine class. CTX_TIMESTAMP is used for >> that, being read from the context image. >> >>2) Read a "GPU timestamp" that can be used for considering "how much GPU >> time has passed" and that has the same unit/refclock as the one >> recording the runtime. RING_TIMESTAMP is used for that via MMIO. >> >>Since for all current platforms RING_TIMESTAMP follows the same >>refclock, just read it once, using any first engine available. >> >>This is exported to userspace as 2 numbers in fdinfo: >> >> drm-cycles-<class>: <RUNTIME> >> drm-total-cycles-<class>: <TIMESTAMP> >> >>Userspace is expected to collect at least 2 samples, which allows to >>know the client engine busyness as per: >> >> RUNTIME1 - RUNTIME0 >> busyness = --------------------- >> T1 - T0 >> >>Since drm-cycles-<class> always starts at 0, it's also possible to know >>if and engine was ever used by a client. >> >>It's expected that userspace will read any 2 samples every few seconds. >>Given the update frequency of the counters involved and that >>CTX_TIMESTAMP is 32-bits, the counter for each exec_queue can wrap >>around (assuming 100% utilization) after ~200s. The wraparound is not >>perceived by userspace since it's just accumulated for all the >>exec_queues in a 64-bit counter) but the measurement will not be >>accurate if the samples are too far apart. >> >>This could be mitigated by adding a workqueue to accumulate the counters >>every so often, but it's additional complexity for something that is >>done already by userspace every few seconds in tools like gputop (from >>igt), htop, nvtop, etc, with none of them really defaulting to 1 sample >>per minute or more. >> >>Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> >>--- >> Documentation/gpu/drm-usage-stats.rst | 21 +++- >> Documentation/gpu/xe/index.rst | 1 + >> Documentation/gpu/xe/xe-drm-usage-stats.rst | 10 ++ >> drivers/gpu/drm/xe/xe_drm_client.c | 121 +++++++++++++++++++- >> 4 files changed, 150 insertions(+), 3 deletions(-) >> create mode 100644 Documentation/gpu/xe/xe-drm-usage-stats.rst >> >>diff --git a/Documentation/gpu/drm-usage-stats.rst b/Documentation/gpu/drm-usage-stats.rst >>index 6dc299343b48..a80f95ca1b2f 100644 >>--- a/Documentation/gpu/drm-usage-stats.rst >>+++ b/Documentation/gpu/drm-usage-stats.rst >>@@ -112,6 +112,19 @@ larger value within a reasonable period. Upon observing a value lower than what >> was previously read, userspace is expected to stay with that larger previous >> value until a monotonic update is seen. >>+- drm-total-cycles-<keystr>: <uint> >>+ >>+Engine identifier string must be the same as the one specified in the >>+drm-cycles-<keystr> tag and shall contain the total number cycles for the given >>+engine. >>+ >>+This is a timestamp in GPU unspecified unit that matches the update rate >>+of drm-cycles-<keystr>. For drivers that implement this interface, the engine >>+utilization can be calculated entirely on the GPU clock domain, without >>+considering the CPU sleep time between 2 samples. >>+ >>+A driver may implement either this key or drm-maxfreq-<keystr>, but not both. >>+ >> - drm-maxfreq-<keystr>: <uint> [Hz|MHz|KHz] >> Engine identifier string must be the same as the one specified in the >>@@ -121,6 +134,9 @@ percentage utilization of the engine, whereas drm-engine-<keystr> only reflects >> time active without considering what frequency the engine is operating as a >> percentage of its maximum frequency. >>+A driver may implement either this key or drm-total-cycles-<keystr>, but not >>+both. >>+ > >For the spec part: > >Acked-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> thanks > >Some minor comments and questions below. > >> Memory >> ^^^^^^ >>@@ -168,5 +184,6 @@ be documented above and where possible, aligned with other drivers. >> Driver specific implementations >> ------------------------------- >>-:ref:`i915-usage-stats` >>-:ref:`panfrost-usage-stats` >>+* :ref:`i915-usage-stats` >>+* :ref:`panfrost-usage-stats` >>+* :ref:`xe-usage-stats` >>diff --git a/Documentation/gpu/xe/index.rst b/Documentation/gpu/xe/index.rst >>index c224ecaee81e..3f07aa3b5432 100644 >>--- a/Documentation/gpu/xe/index.rst >>+++ b/Documentation/gpu/xe/index.rst >>@@ -23,3 +23,4 @@ DG2, etc is provided to prototype the driver. >> xe_firmware >> xe_tile >> xe_debugging >>+ xe-drm-usage-stats.rst >>diff --git a/Documentation/gpu/xe/xe-drm-usage-stats.rst b/Documentation/gpu/xe/xe-drm-usage-stats.rst >>new file mode 100644 >>index 000000000000..482d503ae68a >>--- /dev/null >>+++ b/Documentation/gpu/xe/xe-drm-usage-stats.rst >>@@ -0,0 +1,10 @@ >>+.. SPDX-License-Identifier: GPL-2.0+ >>+ >>+.. _xe-usage-stats: >>+ >>+======================================== >>+Xe DRM client usage stats implementation >>+======================================== >>+ >>+.. kernel-doc:: drivers/gpu/drm/xe/xe_drm_client.c >>+ :doc: DRM Client usage stats >>diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c >>index 08f0b7c95901..952b0cc87708 100644 >>--- a/drivers/gpu/drm/xe/xe_drm_client.c >>+++ b/drivers/gpu/drm/xe/xe_drm_client.c >>@@ -2,6 +2,7 @@ >> /* >> * Copyright © 2023 Intel Corporation >> */ >>+#include "xe_drm_client.h" >> #include <drm/drm_print.h> >> #include <drm/xe_drm.h> >>@@ -12,9 +13,66 @@ >> #include "xe_bo.h" >> #include "xe_bo_types.h" >> #include "xe_device_types.h" >>-#include "xe_drm_client.h" >>+#include "xe_exec_queue.h" >>+#include "xe_force_wake.h" >>+#include "xe_gt.h" >>+#include "xe_hw_engine.h" >>+#include "xe_pm.h" >> #include "xe_trace.h" >>+/** >>+ * DOC: DRM Client usage stats >>+ * >>+ * The drm/xe driver implements the DRM client usage stats specification as >>+ * documented in :ref:`drm-client-usage-stats`. >>+ * >>+ * Example of the output showing the implemented key value pairs and entirety of >>+ * the currently possible format options: >>+ * >>+ * :: >>+ * >>+ * pos: 0 >>+ * flags: 0100002 >>+ * mnt_id: 26 >>+ * ino: 685 >>+ * drm-driver: xe >>+ * drm-client-id: 3 >>+ * drm-pdev: 0000:03:00.0 >>+ * drm-total-system: 0 >>+ * drm-shared-system: 0 >>+ * drm-active-system: 0 >>+ * drm-resident-system: 0 >>+ * drm-purgeable-system: 0 >>+ * drm-total-gtt: 192 KiB >>+ * drm-shared-gtt: 0 >>+ * drm-active-gtt: 0 >>+ * drm-resident-gtt: 192 KiB >>+ * drm-total-vram0: 23992 KiB >>+ * drm-shared-vram0: 16 MiB >>+ * drm-active-vram0: 0 >>+ * drm-resident-vram0: 23992 KiB >>+ * drm-total-stolen: 0 >>+ * drm-shared-stolen: 0 >>+ * drm-active-stolen: 0 >>+ * drm-resident-stolen: 0 >>+ * drm-cycles-rcs: 28257900 >>+ * drm-total-cycles-rcs: 7655183225 >>+ * drm-cycles-bcs: 0 >>+ * drm-total-cycles-bcs: 7655183225 >>+ * drm-cycles-vcs: 0 >>+ * drm-total-cycles-vcs: 7655183225 >>+ * drm-engine-capacity-vcs: 2 >>+ * drm-cycles-vecs: 0 >>+ * drm-total-cycles-vecs: 7655183225 >>+ * drm-engine-capacity-vecs: 2 >>+ * drm-cycles-ccs: 0 >>+ * drm-total-cycles-ccs: 7655183225 >>+ * drm-engine-capacity-ccs: 4 >>+ * >>+ * Possible `drm-cycles-` key names are: `rcs`, `ccs`, `bcs`, `vcs`, `vecs` and >>+ * "other". >>+ */ >>+ >> /** >> * xe_drm_client_alloc() - Allocate drm client >> * @void: No arg >>@@ -179,6 +237,66 @@ static void show_meminfo(struct drm_printer *p, struct drm_file *file) >> } >> } >>+static void show_runtime(struct drm_printer *p, struct drm_file *file) >>+{ >>+ unsigned long class, i, gt_id, capacity[XE_ENGINE_CLASS_MAX] = { }; > >Why long? But it doesn't matter really, at least on the architectures >I know. `i` should be long for the xa_for_each(). The other ones it's just because it didn't really matter. Are you suggesting to use unsigned int for capacity and class? > >>+ struct xe_file *xef = file->driver_priv; >>+ struct xe_device *xe = xef->xe; >>+ struct xe_gt *gt; >>+ struct xe_hw_engine *hwe; >>+ struct xe_exec_queue *q; >>+ u64 gpu_timestamp; >>+ >>+ xe_pm_runtime_get(xe); >>+ >>+ /* Accumulate all the exec queues from this client */ >>+ mutex_lock(&xef->exec_queue.lock); >>+ xa_for_each(&xef->exec_queue.xa, i, q) >>+ xe_exec_queue_update_runtime(q); >>+ mutex_unlock(&xef->exec_queue.lock); >>+ >>+ /* Get the total GPU cycles */ >>+ for_each_gt(gt, xe, gt_id) { >>+ hwe = xe_gt_any_hw_engine(gt); >>+ if (!hwe) >>+ continue; >>+ >>+ xe_force_wake_get(gt_to_fw(gt), XE_FW_GT); >>+ gpu_timestamp = xe_hw_engine_read_timestamp(hwe); >>+ xe_force_wake_put(gt_to_fw(gt), XE_FW_GT); >>+ break; >>+ } > >Runtime pm get/put could be just around this block, right? yes, I will update just this patch moving the xe_pm_runtime_put() if there are no other changes for next version. thanks Lucas De Marchi > >But overall now it reads super clean, thanks for accomodating my ideas! > >Regards, > >Tvrtko > >>+ >>+ if (unlikely(!hwe)) >>+ return; >>+ >>+ for (class = 0; class < XE_ENGINE_CLASS_MAX; class++) { >>+ const char *class_name; >>+ >>+ for_each_gt(gt, xe, gt_id) >>+ capacity[class] += gt->user_engines.instances_per_class[class]; >>+ >>+ /* >>+ * Engines may be fused off or not exposed to userspace. Don't >>+ * return anything if this entire class is not available >>+ */ >>+ if (!capacity[class]) >>+ continue; >>+ >>+ class_name = xe_hw_engine_class_to_str(class); >>+ drm_printf(p, "drm-cycles-%s:\t%llu\n", >>+ class_name, xef->runtime[class]); >>+ drm_printf(p, "drm-total-cycles-%s:\t%llu\n", >>+ class_name, gpu_timestamp); >>+ >>+ if (capacity[class] > 1) >>+ drm_printf(p, "drm-engine-capacity-%s:\t%lu\n", >>+ class_name, capacity[class]); >>+ } >>+ >>+ xe_pm_runtime_put(xe); >>+} >>+ >> /** >> * xe_drm_client_fdinfo() - Callback for fdinfo interface >> * @p: The drm_printer ptr >>@@ -192,5 +310,6 @@ static void show_meminfo(struct drm_printer *p, struct drm_file *file) >> void xe_drm_client_fdinfo(struct drm_printer *p, struct drm_file *file) >> { >> show_meminfo(p, file); >>+ show_runtime(p, file); >> } >> #endif
On Wed, May 15, 2024 at 02:42:58PM -0700, Lucas De Marchi wrote: >Print the accumulated runtime for client when printing fdinfo. >Each time a query is done it first does 2 things: > >1) loop through all the exec queues for the current client and > accumulate the runtime, per engine class. CTX_TIMESTAMP is used for > that, being read from the context image. > >2) Read a "GPU timestamp" that can be used for considering "how much GPU > time has passed" and that has the same unit/refclock as the one > recording the runtime. RING_TIMESTAMP is used for that via MMIO. > >Since for all current platforms RING_TIMESTAMP follows the same >refclock, just read it once, using any first engine available. > >This is exported to userspace as 2 numbers in fdinfo: > > drm-cycles-<class>: <RUNTIME> > drm-total-cycles-<class>: <TIMESTAMP> > >Userspace is expected to collect at least 2 samples, which allows to >know the client engine busyness as per: > > RUNTIME1 - RUNTIME0 > busyness = --------------------- > T1 - T0 > >Since drm-cycles-<class> always starts at 0, it's also possible to know >if and engine was ever used by a client. > >It's expected that userspace will read any 2 samples every few seconds. >Given the update frequency of the counters involved and that >CTX_TIMESTAMP is 32-bits, the counter for each exec_queue can wrap >around (assuming 100% utilization) after ~200s. The wraparound is not >perceived by userspace since it's just accumulated for all the >exec_queues in a 64-bit counter) but the measurement will not be >accurate if the samples are too far apart. > >This could be mitigated by adding a workqueue to accumulate the counters >every so often, but it's additional complexity for something that is >done already by userspace every few seconds in tools like gputop (from >igt), htop, nvtop, etc, with none of them really defaulting to 1 sample >per minute or more. > >Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> LGTM, Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Thanks, Umesh >--- > Documentation/gpu/drm-usage-stats.rst | 21 +++- > Documentation/gpu/xe/index.rst | 1 + > Documentation/gpu/xe/xe-drm-usage-stats.rst | 10 ++ > drivers/gpu/drm/xe/xe_drm_client.c | 121 +++++++++++++++++++- > 4 files changed, 150 insertions(+), 3 deletions(-) > create mode 100644 Documentation/gpu/xe/xe-drm-usage-stats.rst > >diff --git a/Documentation/gpu/drm-usage-stats.rst b/Documentation/gpu/drm-usage-stats.rst >index 6dc299343b48..a80f95ca1b2f 100644 >--- a/Documentation/gpu/drm-usage-stats.rst >+++ b/Documentation/gpu/drm-usage-stats.rst >@@ -112,6 +112,19 @@ larger value within a reasonable period. Upon observing a value lower than what > was previously read, userspace is expected to stay with that larger previous > value until a monotonic update is seen. > >+- drm-total-cycles-<keystr>: <uint> >+ >+Engine identifier string must be the same as the one specified in the >+drm-cycles-<keystr> tag and shall contain the total number cycles for the given >+engine. >+ >+This is a timestamp in GPU unspecified unit that matches the update rate >+of drm-cycles-<keystr>. For drivers that implement this interface, the engine >+utilization can be calculated entirely on the GPU clock domain, without >+considering the CPU sleep time between 2 samples. >+ >+A driver may implement either this key or drm-maxfreq-<keystr>, but not both. >+ > - drm-maxfreq-<keystr>: <uint> [Hz|MHz|KHz] > > Engine identifier string must be the same as the one specified in the >@@ -121,6 +134,9 @@ percentage utilization of the engine, whereas drm-engine-<keystr> only reflects > time active without considering what frequency the engine is operating as a > percentage of its maximum frequency. > >+A driver may implement either this key or drm-total-cycles-<keystr>, but not >+both. >+ > Memory > ^^^^^^ > >@@ -168,5 +184,6 @@ be documented above and where possible, aligned with other drivers. > Driver specific implementations > ------------------------------- > >-:ref:`i915-usage-stats` >-:ref:`panfrost-usage-stats` >+* :ref:`i915-usage-stats` >+* :ref:`panfrost-usage-stats` >+* :ref:`xe-usage-stats` >diff --git a/Documentation/gpu/xe/index.rst b/Documentation/gpu/xe/index.rst >index c224ecaee81e..3f07aa3b5432 100644 >--- a/Documentation/gpu/xe/index.rst >+++ b/Documentation/gpu/xe/index.rst >@@ -23,3 +23,4 @@ DG2, etc is provided to prototype the driver. > xe_firmware > xe_tile > xe_debugging >+ xe-drm-usage-stats.rst >diff --git a/Documentation/gpu/xe/xe-drm-usage-stats.rst b/Documentation/gpu/xe/xe-drm-usage-stats.rst >new file mode 100644 >index 000000000000..482d503ae68a >--- /dev/null >+++ b/Documentation/gpu/xe/xe-drm-usage-stats.rst >@@ -0,0 +1,10 @@ >+.. SPDX-License-Identifier: GPL-2.0+ >+ >+.. _xe-usage-stats: >+ >+======================================== >+Xe DRM client usage stats implementation >+======================================== >+ >+.. kernel-doc:: drivers/gpu/drm/xe/xe_drm_client.c >+ :doc: DRM Client usage stats >diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c >index 08f0b7c95901..952b0cc87708 100644 >--- a/drivers/gpu/drm/xe/xe_drm_client.c >+++ b/drivers/gpu/drm/xe/xe_drm_client.c >@@ -2,6 +2,7 @@ > /* > * Copyright © 2023 Intel Corporation > */ >+#include "xe_drm_client.h" > > #include <drm/drm_print.h> > #include <drm/xe_drm.h> >@@ -12,9 +13,66 @@ > #include "xe_bo.h" > #include "xe_bo_types.h" > #include "xe_device_types.h" >-#include "xe_drm_client.h" >+#include "xe_exec_queue.h" >+#include "xe_force_wake.h" >+#include "xe_gt.h" >+#include "xe_hw_engine.h" >+#include "xe_pm.h" > #include "xe_trace.h" > >+/** >+ * DOC: DRM Client usage stats >+ * >+ * The drm/xe driver implements the DRM client usage stats specification as >+ * documented in :ref:`drm-client-usage-stats`. >+ * >+ * Example of the output showing the implemented key value pairs and entirety of >+ * the currently possible format options: >+ * >+ * :: >+ * >+ * pos: 0 >+ * flags: 0100002 >+ * mnt_id: 26 >+ * ino: 685 >+ * drm-driver: xe >+ * drm-client-id: 3 >+ * drm-pdev: 0000:03:00.0 >+ * drm-total-system: 0 >+ * drm-shared-system: 0 >+ * drm-active-system: 0 >+ * drm-resident-system: 0 >+ * drm-purgeable-system: 0 >+ * drm-total-gtt: 192 KiB >+ * drm-shared-gtt: 0 >+ * drm-active-gtt: 0 >+ * drm-resident-gtt: 192 KiB >+ * drm-total-vram0: 23992 KiB >+ * drm-shared-vram0: 16 MiB >+ * drm-active-vram0: 0 >+ * drm-resident-vram0: 23992 KiB >+ * drm-total-stolen: 0 >+ * drm-shared-stolen: 0 >+ * drm-active-stolen: 0 >+ * drm-resident-stolen: 0 >+ * drm-cycles-rcs: 28257900 >+ * drm-total-cycles-rcs: 7655183225 >+ * drm-cycles-bcs: 0 >+ * drm-total-cycles-bcs: 7655183225 >+ * drm-cycles-vcs: 0 >+ * drm-total-cycles-vcs: 7655183225 >+ * drm-engine-capacity-vcs: 2 >+ * drm-cycles-vecs: 0 >+ * drm-total-cycles-vecs: 7655183225 >+ * drm-engine-capacity-vecs: 2 >+ * drm-cycles-ccs: 0 >+ * drm-total-cycles-ccs: 7655183225 >+ * drm-engine-capacity-ccs: 4 >+ * >+ * Possible `drm-cycles-` key names are: `rcs`, `ccs`, `bcs`, `vcs`, `vecs` and >+ * "other". >+ */ >+ > /** > * xe_drm_client_alloc() - Allocate drm client > * @void: No arg >@@ -179,6 +237,66 @@ static void show_meminfo(struct drm_printer *p, struct drm_file *file) > } > } > >+static void show_runtime(struct drm_printer *p, struct drm_file *file) >+{ >+ unsigned long class, i, gt_id, capacity[XE_ENGINE_CLASS_MAX] = { }; >+ struct xe_file *xef = file->driver_priv; >+ struct xe_device *xe = xef->xe; >+ struct xe_gt *gt; >+ struct xe_hw_engine *hwe; >+ struct xe_exec_queue *q; >+ u64 gpu_timestamp; >+ >+ xe_pm_runtime_get(xe); >+ >+ /* Accumulate all the exec queues from this client */ >+ mutex_lock(&xef->exec_queue.lock); >+ xa_for_each(&xef->exec_queue.xa, i, q) >+ xe_exec_queue_update_runtime(q); >+ mutex_unlock(&xef->exec_queue.lock); >+ >+ /* Get the total GPU cycles */ >+ for_each_gt(gt, xe, gt_id) { >+ hwe = xe_gt_any_hw_engine(gt); >+ if (!hwe) >+ continue; >+ >+ xe_force_wake_get(gt_to_fw(gt), XE_FW_GT); >+ gpu_timestamp = xe_hw_engine_read_timestamp(hwe); >+ xe_force_wake_put(gt_to_fw(gt), XE_FW_GT); >+ break; >+ } >+ >+ if (unlikely(!hwe)) >+ return; >+ >+ for (class = 0; class < XE_ENGINE_CLASS_MAX; class++) { >+ const char *class_name; >+ >+ for_each_gt(gt, xe, gt_id) >+ capacity[class] += gt->user_engines.instances_per_class[class]; >+ >+ /* >+ * Engines may be fused off or not exposed to userspace. Don't >+ * return anything if this entire class is not available >+ */ >+ if (!capacity[class]) >+ continue; >+ >+ class_name = xe_hw_engine_class_to_str(class); >+ drm_printf(p, "drm-cycles-%s:\t%llu\n", >+ class_name, xef->runtime[class]); >+ drm_printf(p, "drm-total-cycles-%s:\t%llu\n", >+ class_name, gpu_timestamp); >+ >+ if (capacity[class] > 1) >+ drm_printf(p, "drm-engine-capacity-%s:\t%lu\n", >+ class_name, capacity[class]); >+ } >+ >+ xe_pm_runtime_put(xe); >+} >+ > /** > * xe_drm_client_fdinfo() - Callback for fdinfo interface > * @p: The drm_printer ptr >@@ -192,5 +310,6 @@ static void show_meminfo(struct drm_printer *p, struct drm_file *file) > void xe_drm_client_fdinfo(struct drm_printer *p, struct drm_file *file) > { > show_meminfo(p, file); >+ show_runtime(p, file); > } > #endif >-- >2.43.0 >
diff --git a/Documentation/gpu/drm-usage-stats.rst b/Documentation/gpu/drm-usage-stats.rst index 6dc299343b48..a80f95ca1b2f 100644 --- a/Documentation/gpu/drm-usage-stats.rst +++ b/Documentation/gpu/drm-usage-stats.rst @@ -112,6 +112,19 @@ larger value within a reasonable period. Upon observing a value lower than what was previously read, userspace is expected to stay with that larger previous value until a monotonic update is seen. +- drm-total-cycles-<keystr>: <uint> + +Engine identifier string must be the same as the one specified in the +drm-cycles-<keystr> tag and shall contain the total number cycles for the given +engine. + +This is a timestamp in GPU unspecified unit that matches the update rate +of drm-cycles-<keystr>. For drivers that implement this interface, the engine +utilization can be calculated entirely on the GPU clock domain, without +considering the CPU sleep time between 2 samples. + +A driver may implement either this key or drm-maxfreq-<keystr>, but not both. + - drm-maxfreq-<keystr>: <uint> [Hz|MHz|KHz] Engine identifier string must be the same as the one specified in the @@ -121,6 +134,9 @@ percentage utilization of the engine, whereas drm-engine-<keystr> only reflects time active without considering what frequency the engine is operating as a percentage of its maximum frequency. +A driver may implement either this key or drm-total-cycles-<keystr>, but not +both. + Memory ^^^^^^ @@ -168,5 +184,6 @@ be documented above and where possible, aligned with other drivers. Driver specific implementations ------------------------------- -:ref:`i915-usage-stats` -:ref:`panfrost-usage-stats` +* :ref:`i915-usage-stats` +* :ref:`panfrost-usage-stats` +* :ref:`xe-usage-stats` diff --git a/Documentation/gpu/xe/index.rst b/Documentation/gpu/xe/index.rst index c224ecaee81e..3f07aa3b5432 100644 --- a/Documentation/gpu/xe/index.rst +++ b/Documentation/gpu/xe/index.rst @@ -23,3 +23,4 @@ DG2, etc is provided to prototype the driver. xe_firmware xe_tile xe_debugging + xe-drm-usage-stats.rst diff --git a/Documentation/gpu/xe/xe-drm-usage-stats.rst b/Documentation/gpu/xe/xe-drm-usage-stats.rst new file mode 100644 index 000000000000..482d503ae68a --- /dev/null +++ b/Documentation/gpu/xe/xe-drm-usage-stats.rst @@ -0,0 +1,10 @@ +.. SPDX-License-Identifier: GPL-2.0+ + +.. _xe-usage-stats: + +======================================== +Xe DRM client usage stats implementation +======================================== + +.. kernel-doc:: drivers/gpu/drm/xe/xe_drm_client.c + :doc: DRM Client usage stats diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c index 08f0b7c95901..952b0cc87708 100644 --- a/drivers/gpu/drm/xe/xe_drm_client.c +++ b/drivers/gpu/drm/xe/xe_drm_client.c @@ -2,6 +2,7 @@ /* * Copyright © 2023 Intel Corporation */ +#include "xe_drm_client.h" #include <drm/drm_print.h> #include <drm/xe_drm.h> @@ -12,9 +13,66 @@ #include "xe_bo.h" #include "xe_bo_types.h" #include "xe_device_types.h" -#include "xe_drm_client.h" +#include "xe_exec_queue.h" +#include "xe_force_wake.h" +#include "xe_gt.h" +#include "xe_hw_engine.h" +#include "xe_pm.h" #include "xe_trace.h" +/** + * DOC: DRM Client usage stats + * + * The drm/xe driver implements the DRM client usage stats specification as + * documented in :ref:`drm-client-usage-stats`. + * + * Example of the output showing the implemented key value pairs and entirety of + * the currently possible format options: + * + * :: + * + * pos: 0 + * flags: 0100002 + * mnt_id: 26 + * ino: 685 + * drm-driver: xe + * drm-client-id: 3 + * drm-pdev: 0000:03:00.0 + * drm-total-system: 0 + * drm-shared-system: 0 + * drm-active-system: 0 + * drm-resident-system: 0 + * drm-purgeable-system: 0 + * drm-total-gtt: 192 KiB + * drm-shared-gtt: 0 + * drm-active-gtt: 0 + * drm-resident-gtt: 192 KiB + * drm-total-vram0: 23992 KiB + * drm-shared-vram0: 16 MiB + * drm-active-vram0: 0 + * drm-resident-vram0: 23992 KiB + * drm-total-stolen: 0 + * drm-shared-stolen: 0 + * drm-active-stolen: 0 + * drm-resident-stolen: 0 + * drm-cycles-rcs: 28257900 + * drm-total-cycles-rcs: 7655183225 + * drm-cycles-bcs: 0 + * drm-total-cycles-bcs: 7655183225 + * drm-cycles-vcs: 0 + * drm-total-cycles-vcs: 7655183225 + * drm-engine-capacity-vcs: 2 + * drm-cycles-vecs: 0 + * drm-total-cycles-vecs: 7655183225 + * drm-engine-capacity-vecs: 2 + * drm-cycles-ccs: 0 + * drm-total-cycles-ccs: 7655183225 + * drm-engine-capacity-ccs: 4 + * + * Possible `drm-cycles-` key names are: `rcs`, `ccs`, `bcs`, `vcs`, `vecs` and + * "other". + */ + /** * xe_drm_client_alloc() - Allocate drm client * @void: No arg @@ -179,6 +237,66 @@ static void show_meminfo(struct drm_printer *p, struct drm_file *file) } } +static void show_runtime(struct drm_printer *p, struct drm_file *file) +{ + unsigned long class, i, gt_id, capacity[XE_ENGINE_CLASS_MAX] = { }; + struct xe_file *xef = file->driver_priv; + struct xe_device *xe = xef->xe; + struct xe_gt *gt; + struct xe_hw_engine *hwe; + struct xe_exec_queue *q; + u64 gpu_timestamp; + + xe_pm_runtime_get(xe); + + /* Accumulate all the exec queues from this client */ + mutex_lock(&xef->exec_queue.lock); + xa_for_each(&xef->exec_queue.xa, i, q) + xe_exec_queue_update_runtime(q); + mutex_unlock(&xef->exec_queue.lock); + + /* Get the total GPU cycles */ + for_each_gt(gt, xe, gt_id) { + hwe = xe_gt_any_hw_engine(gt); + if (!hwe) + continue; + + xe_force_wake_get(gt_to_fw(gt), XE_FW_GT); + gpu_timestamp = xe_hw_engine_read_timestamp(hwe); + xe_force_wake_put(gt_to_fw(gt), XE_FW_GT); + break; + } + + if (unlikely(!hwe)) + return; + + for (class = 0; class < XE_ENGINE_CLASS_MAX; class++) { + const char *class_name; + + for_each_gt(gt, xe, gt_id) + capacity[class] += gt->user_engines.instances_per_class[class]; + + /* + * Engines may be fused off or not exposed to userspace. Don't + * return anything if this entire class is not available + */ + if (!capacity[class]) + continue; + + class_name = xe_hw_engine_class_to_str(class); + drm_printf(p, "drm-cycles-%s:\t%llu\n", + class_name, xef->runtime[class]); + drm_printf(p, "drm-total-cycles-%s:\t%llu\n", + class_name, gpu_timestamp); + + if (capacity[class] > 1) + drm_printf(p, "drm-engine-capacity-%s:\t%lu\n", + class_name, capacity[class]); + } + + xe_pm_runtime_put(xe); +} + /** * xe_drm_client_fdinfo() - Callback for fdinfo interface * @p: The drm_printer ptr @@ -192,5 +310,6 @@ static void show_meminfo(struct drm_printer *p, struct drm_file *file) void xe_drm_client_fdinfo(struct drm_printer *p, struct drm_file *file) { show_meminfo(p, file); + show_runtime(p, file); } #endif
Print the accumulated runtime for client when printing fdinfo. Each time a query is done it first does 2 things: 1) loop through all the exec queues for the current client and accumulate the runtime, per engine class. CTX_TIMESTAMP is used for that, being read from the context image. 2) Read a "GPU timestamp" that can be used for considering "how much GPU time has passed" and that has the same unit/refclock as the one recording the runtime. RING_TIMESTAMP is used for that via MMIO. Since for all current platforms RING_TIMESTAMP follows the same refclock, just read it once, using any first engine available. This is exported to userspace as 2 numbers in fdinfo: drm-cycles-<class>: <RUNTIME> drm-total-cycles-<class>: <TIMESTAMP> Userspace is expected to collect at least 2 samples, which allows to know the client engine busyness as per: RUNTIME1 - RUNTIME0 busyness = --------------------- T1 - T0 Since drm-cycles-<class> always starts at 0, it's also possible to know if and engine was ever used by a client. It's expected that userspace will read any 2 samples every few seconds. Given the update frequency of the counters involved and that CTX_TIMESTAMP is 32-bits, the counter for each exec_queue can wrap around (assuming 100% utilization) after ~200s. The wraparound is not perceived by userspace since it's just accumulated for all the exec_queues in a 64-bit counter) but the measurement will not be accurate if the samples are too far apart. This could be mitigated by adding a workqueue to accumulate the counters every so often, but it's additional complexity for something that is done already by userspace every few seconds in tools like gputop (from igt), htop, nvtop, etc, with none of them really defaulting to 1 sample per minute or more. Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> --- Documentation/gpu/drm-usage-stats.rst | 21 +++- Documentation/gpu/xe/index.rst | 1 + Documentation/gpu/xe/xe-drm-usage-stats.rst | 10 ++ drivers/gpu/drm/xe/xe_drm_client.c | 121 +++++++++++++++++++- 4 files changed, 150 insertions(+), 3 deletions(-) create mode 100644 Documentation/gpu/xe/xe-drm-usage-stats.rst