[i-g-t,v6] intel-gpu-top: Rewrite the tool to be safe to use

Message ID 20180404152649.19382-1-tvrtko.ursulin@linux.intel.com (mailing list archive)
State New, archived

Commit Message

Tvrtko Ursulin April 4, 2018, 3:26 p.m. UTC
From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

intel-gpu-top is a dangerous tool which can hang machines due to unsafe MMIO
register access. This patch rewrites it to use only PMU.

Only overall command streamer busyness and GPU global data such as power
and frequencies are included in this new version.

For access to more GPU functional unit level data, an OA metric based tool
like gpu-top should be used instead.

v2:
 * Sort engines by class and instance.
 * Do not wait for one sampling period to display something on screen.
 * Move code out of the asserts. (Rinat Ibragimov)
 * Continuously adapt to terminal size. (Rinat Ibragimov)

v3:
 * Change layout and precision of some fields. (Chris Wilson)
 Eero Tamminen:
 * Use more user friendly engine names.
 * Don't error out if a counter is missing.
 * Add IMC read/write bandwidth.
 * Report minimum required kernel version.

v4:
 * Really support 4.16 by skipping missing engines.
 * Simpler and less hacky float printing.
 * Preserve copyright header. (Antonio Argenziano)
 * Simplify engines_ptr macro. (Rinat Ibragimov)

v5:
 * Get RAPL unit from sysfs.
 * Consolidate sysfs paths with a macro.
 * Tidy error handling by carrying over and reporting errno.
 * Check against console height on all prints.
 * More readable minimum kernel version message. (Eero Tamminen)
 * Column banner for per engine stats. (Eero Tamminen)

v6:
 * Man page update. (Eero Tamminen)

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: Petri Latvala <petri.latvala@intel.com>
Cc: Eero Tamminen <eero.t.tamminen@intel.com>
Cc: Rinat Ibragimov <ibragimovrinat@mail.ru>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> # v1
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> # v0.5
---
 lib/igt_perf.c        |    6 +
 lib/igt_perf.h        |    1 +
 man/intel_gpu_top.rst |   41 +-
 tools/Makefile.am     |    2 +
 tools/intel_gpu_top.c | 1250 +++++++++++++++++++++++++++----------------------
 tools/meson.build     |    6 +-
 6 files changed, 719 insertions(+), 587 deletions(-)

Comments

Tvrtko Ursulin April 9, 2018, 12:26 p.m. UTC | #1
[Adding some people to Cc for more ack/nack type feedback.]

The executive question is whether to ack or nack replacing intel_gpu_top
with a new implementation which uses only perf PMU for counter gathering.

A short history on how this came to be:

There was a recent external patch contribution from Rinat Ibragimov to
support more platforms in the existing intel_gpu_top. But as the tool
is not safe to use, Chris Wilson suggested simply replacing it.

As it happens, I had a good starting point for doing this quickly and
cheaply: a recent prototype of mine, which only needed some bits ripped
out and the rest polished.

Eero and Rinat kindly did a lot of platform coverage testing and the 
rewrite seems ready for next steps.

I need to stress that, as the commit message notes, the new tool has a
slightly different scope: it doesn't expose GPU functional unit level data,
only overall stats like power, frequencies, RC6, interrupts, IMC memory
bandwidth and per command streamer busyness, mi_semaphore and mi_event
waits. My thinking was that gpu-top (OA) should be used for more
functional level profiling.
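
As a rough illustration of how a per command streamer busyness figure can
be derived from PMU data, the sketch below turns two successive samples of
a busy-time counter, plus the matching timestamps, into a percentage. The
names sample_pair and busy_percent are made up for this example and are not
taken from the patch, and it assumes both the counter and the timestamps
are in the same time unit (for example nanoseconds).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type for this example; not from the patch. */
struct sample_pair {
	uint64_t prev;
	uint64_t cur;
};

/*
 * Convert the growth of a busy-time counter over one sampling period
 * into a percentage of that period, clamped to 100.
 * Returns 0 if no time has elapsed between the two samples.
 */
static double busy_percent(const struct sample_pair *busy,
			   const struct sample_pair *ts)
{
	double elapsed, pct;

	assert(busy && ts);

	elapsed = (double)(ts->cur - ts->prev);
	if (elapsed <= 0.0)
		return 0.0;

	pct = 100.0 * (double)(busy->cur - busy->prev) / elapsed;

	/* Sampling jitter can push the ratio slightly above 100%. */
	return pct > 100.0 ? 100.0 : pct;
}
```

Clamping keeps the displayed value sane when the counter and the timestamp
are not sampled at exactly the same instant.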

Also, the "run a command" and CSV output features are not supported,
since both can be done directly via perf stat.

Regards,

Tvrtko

On 04/04/2018 16:26, Tvrtko Ursulin wrote:
> diff --git a/lib/igt_perf.c b/lib/igt_perf.c
> index 99d82ea51c9b..e3dec2cc29c7 100644
> --- a/lib/igt_perf.c
> +++ b/lib/igt_perf.c
> @@ -69,3 +69,9 @@ int igt_perf_open(uint64_t type, uint64_t config)
>   	return _perf_open(type, config, -1,
>   			  PERF_FORMAT_TOTAL_TIME_ENABLED);
>   }
> +
> +int igt_perf_open_group(uint64_t type, uint64_t config, int group)
> +{
> +	return _perf_open(type, config, group,
> +			  PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_GROUP);
> +}
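
For readers unfamiliar with grouped perf reads: with PERF_FORMAT_GROUP |
PERF_FORMAT_TOTAL_TIME_ENABLED and no other format bits set, a read() on
the group leader returns the number of events, then the enabled time, then
one value per event, as documented in perf_event_open(2). A minimal sketch
of unpacking such a buffer follows. unpack_group_read is a hypothetical
name used only here, and the buffer is assumed to have been filled by
read() already.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Unpack a buffer read from a perf group leader opened with
 * PERF_FORMAT_GROUP | PERF_FORMAT_TOTAL_TIME_ENABLED:
 *
 *   buf[0]      number of events in the group
 *   buf[1]      total time enabled
 *   buf[2 + i]  value of event i
 *
 * Copies at most max_vals values into val and stores the enabled time
 * in *time_enabled. Returns the number of values copied, or -1 if the
 * buffer is too short for the event count it claims to hold.
 */
static int unpack_group_read(const uint64_t *buf, size_t buf_len,
			     uint64_t *val, size_t max_vals,
			     uint64_t *time_enabled)
{
	uint64_t nr;
	size_t i;

	assert(buf && val && time_enabled);

	if (buf_len < 2)
		return -1;

	nr = buf[0];
	if (nr > buf_len - 2)
		return -1;
	if (nr > max_vals)
		nr = max_vals;

	*time_enabled = buf[1];
	for (i = 0; i < nr; i++)
		val[i] = buf[2 + i];

	return (int)nr;
}
```

Checking the event count against the buffer length avoids reading past the
end of the buffer if fewer counters were opened than the caller expected.
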
> diff --git a/lib/igt_perf.h b/lib/igt_perf.h
> index 614ea5d23fa6..e00718f4769a 100644
> --- a/lib/igt_perf.h
> +++ b/lib/igt_perf.h
> @@ -55,5 +55,6 @@ uint64_t i915_type_id(void);
>   int perf_i915_open(uint64_t config);
>   int perf_i915_open_group(uint64_t config, int group);
>   int igt_perf_open(uint64_t type, uint64_t config);
> +int igt_perf_open_group(uint64_t type, uint64_t config, int group);
>   
>   #endif /* I915_PERF_H */
> diff --git a/man/intel_gpu_top.rst b/man/intel_gpu_top.rst
> index a5f7175bb1a0..19c712307d28 100644
> --- a/man/intel_gpu_top.rst
> +++ b/man/intel_gpu_top.rst
> @@ -7,9 +7,9 @@ Display a top-like summary of Intel GPU usage
>   ---------------------------------------------
>   .. include:: defs.rst
>   :Author: IGT Developers <igt-dev@lists.freedesktop.org>
> -:Date: 2016-03-01
> +:Date: 2018-04-04
>   :Version: |PACKAGE_STRING|
> -:Copyright: 2009,2011,2012,2016 Intel Corporation
> +:Copyright: 2009,2011,2012,2016,2018 Intel Corporation
>   :Manual section: |MANUAL_SECTION|
>   :Manual group: |MANUAL_GROUP|
>   
> @@ -21,42 +21,25 @@ SYNOPSIS
>   DESCRIPTION
>   ===========
>   
> -**intel_gpu_top** is a tool to display usage information of an Intel GPU. It
> -requires root privilege to map the graphics device.
> +**intel_gpu_top** is a tool to display usage information on Intel GPUs.
> +
> +The tool gathers data using perf performance counters (PMU) exposed by i915 and other platform drivers like RAPL (power) and Uncore IMC (memory bandwidth).
>   
>   OPTIONS
>   =======
>   
> --s SAMPLES
> -    Number of samples to acquire per second.
> -
> --o FILE
> -    Collect usage statistics to FILE. If file is "-", run non-interactively
> -    and output statistics to stdout.
> -
> --e COMMAND
> -    Execute COMMAND to profile, and leave when it is finished. Note that the
> -    entire command with all parameters should be included as one parameter.
> +-s <ms>
> +    Refresh period in milliseconds.
>   
>   -h
> -    Show usage notes.
> +    Show help text.
>   
> -EXAMPLES
> -========
> -
> -intel_gpu_top -o "cairo-trace-gvim.log" -s 100 -e "cairo-perf-trace /tmp/gvim"
> -    Run cairo-perf-trace with /tmp/gvim trace, non-interactively, saving the
> -    statistics into cairo-trace-gvim.log file, and collecting 100 samples per
> -    second.
> -
> -Note that idle units are not displayed, so an entirely idle GPU will only
> -display the ring status and header.
> +LIMITATIONS
> +===========
>   
> -BUGS
> -====
> +* Not all metrics are supported on all platforms. Where a metric is unsupported its value will be replaced by a dashed line.
>   
> -Some GPUs report some units as busy when they aren't, such that even when idle
> -and not hung, it will show up as 100% busy.
> +* Non-root access to perf counters is controlled by the *perf_event_paranoid* sysctl.
>   
>   REPORTING BUGS
>   ==============
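
Regarding the perf_event_paranoid limitation: a tool could read the
current setting and print it as a hint when opening counters fails. The
sketch below is not part of the patch. read_int_file and PARANOID_PATH are
names invented for this example, and the path is the usual procfs location
of the sysctl on Linux.

```c
#include <assert.h>
#include <stdio.h>

/* Usual procfs location of the sysctl on Linux. */
#define PARANOID_PATH "/proc/sys/kernel/perf_event_paranoid"

/*
 * Read a single decimal integer (possibly negative) from a file such
 * as PARANOID_PATH.
 * Returns 0 and stores the value in *out on success, -1 otherwise.
 */
static int read_int_file(const char *path, int *out)
{
	FILE *f;
	int ok;

	assert(path && out);

	f = fopen(path, "r");
	if (!f)
		return -1;

	ok = fscanf(f, "%d", out) == 1;
	fclose(f);

	return ok ? 0 : -1;
}
```

A caller would pass PARANOID_PATH and, on success, include the value in
its error message so the user knows which setting to look at.
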
> diff --git a/tools/Makefile.am b/tools/Makefile.am
> index 09b6dbcc3ece..a0b016ddd7ff 100644
> --- a/tools/Makefile.am
> +++ b/tools/Makefile.am
> @@ -28,6 +28,8 @@ intel_aubdump_la_LDFLAGS = -module -avoid-version -no-undefined
>   intel_aubdump_la_SOURCES = aubdump.c
>   intel_aubdump_la_LIBADD = $(top_builddir)/lib/libintel_tools.la -ldl
>   
> +intel_gpu_top_LDADD = $(top_builddir)/lib/libigt_perf.la
> +
>   bin_SCRIPTS = intel_aubdump
>   CLEANFILES = $(bin_SCRIPTS)
>   
> diff --git a/tools/intel_gpu_top.c b/tools/intel_gpu_top.c
> index 098e6ce3ff86..b923c3cfbe97 100644
> --- a/tools/intel_gpu_top.c
> +++ b/tools/intel_gpu_top.c
> @@ -1,6 +1,5 @@
>   /*
> - * Copyright © 2007 Intel Corporation
> - * Copyright © 2011 Intel Corporation
> + * Copyright © 2007-2018 Intel Corporation
>    *
>    * Permission is hereby granted, free of charge, to any person obtaining a
>    * copy of this software and associated documentation files (the "Software"),
> @@ -24,695 +23,832 @@
>    * Authors:
>    *    Eric Anholt <eric@anholt.net>
>    *    Eugeni Dodonov <eugeni.dodonov@intel.com>
> - *
>    */
>   
> -#include "config.h"
> -
> -#include <inttypes.h>
> -#include <unistd.h>
> -#include <stdlib.h>
>   #include <stdio.h>
> -#include <err.h>
> -#include <sys/ioctl.h>
> -#include <sys/time.h>
> -#include <sys/wait.h>
> +#include <sys/types.h>
> +#include <dirent.h>
> +#include <stdint.h>
> +#include <assert.h>
>   #include <string.h>
> -#ifdef HAVE_TERMIOS_H
> -#include <termios.h>
> -#endif
> -#include "intel_io.h"
> -#include "instdone.h"
> -#include "intel_reg.h"
> -#include "intel_chipset.h"
> -#include "drmtest.h"
> -
> -#define  FORCEWAKE	    0xA18C
> -#define  FORCEWAKE_ACK	    0x130090
> -
> -#define SAMPLES_PER_SEC             10000
> -#define SAMPLES_TO_PERCENT_RATIO    (SAMPLES_PER_SEC / 100)
> -
> -#define MAX_NUM_TOP_BITS            100
> -
> -#define HAS_STATS_REGS(devid)		IS_965(devid)
> -
> -struct top_bit {
> -	struct instdone_bit *bit;
> -	int count;
> -} top_bits[MAX_NUM_TOP_BITS];
> -struct top_bit *top_bits_sorted[MAX_NUM_TOP_BITS];
> -
> -static uint32_t instdone, instdone1;
> -
> -static const char *bars[] = {
> -	" ",
> -	"▏",
> -	"▎",
> -	"▍",
> -	"▌",
> -	"▋",
> -	"▊",
> -	"▉",
> -	"█"
> -};
> +#include <ctype.h>
> +#include <stdlib.h>
> +#include <unistd.h>
> +#include <sys/stat.h>
> +#include <fcntl.h>
> +#include <inttypes.h>
> +#include <sys/ioctl.h>
> +#include <errno.h>
> +#include <math.h>
> +#include <locale.h>
> +
> +#include "igt_perf.h"
>   
> -enum stats_counts {
> -	IA_VERTICES,
> -	IA_PRIMITIVES,
> -	VS_INVOCATION,
> -	GS_INVOCATION,
> -	GS_PRIMITIVES,
> -	CL_INVOCATION,
> -	CL_PRIMITIVES,
> -	PS_INVOCATION,
> -	PS_DEPTH,
> -	STATS_COUNT
> +struct pmu_pair {
> +	uint64_t cur;
> +	uint64_t prev;
>   };
>   
> -const uint32_t stats_regs[STATS_COUNT] = {
> -	IA_VERTICES_COUNT_QW,
> -	IA_PRIMITIVES_COUNT_QW,
> -	VS_INVOCATION_COUNT_QW,
> -	GS_INVOCATION_COUNT_QW,
> -	GS_PRIMITIVES_COUNT_QW,
> -	CL_INVOCATION_COUNT_QW,
> -	CL_PRIMITIVES_COUNT_QW,
> -	PS_INVOCATION_COUNT_QW,
> -	PS_DEPTH_COUNT_QW,
> +struct pmu_counter {
> +	bool present;
> +	uint64_t config;
> +	unsigned int idx;
> +	struct pmu_pair val;
>   };
>   
> -const char *stats_reg_names[STATS_COUNT] = {
> -	"vert fetch",
> -	"prim fetch",
> -	"VS invocations",
> -	"GS invocations",
> -	"GS prims",
> -	"CL invocations",
> -	"CL prims",
> -	"PS invocations",
> -	"PS depth pass",
> +struct engine {
> +	const char *name;
> +	const char *display_name;
> +
> +	unsigned int class;
> +	unsigned int instance;
> +
> +	unsigned int num_counters;
> +
> +	struct pmu_counter busy;
> +	struct pmu_counter wait;
> +	struct pmu_counter sema;
>   };
>   
> -uint64_t stats[STATS_COUNT];
> -uint64_t last_stats[STATS_COUNT];
> +struct engines {
> +	unsigned int num_engines;
> +	unsigned int num_counters;
> +	DIR *root;
> +	int fd;
> +	struct pmu_pair ts;
> +
> +	int rapl_fd;
> +	double rapl_scale;
> +	const char *rapl_unit;
> +
> +	int imc_fd;
> +	double imc_reads_scale;
> +	const char *imc_reads_unit;
> +	double imc_writes_scale;
> +	const char *imc_writes_unit;
> +
> +	struct pmu_counter freq_req;
> +	struct pmu_counter freq_act;
> +	struct pmu_counter irq;
> +	struct pmu_counter rc6;
> +	struct pmu_counter rapl;
> +	struct pmu_counter imc_reads;
> +	struct pmu_counter imc_writes;
> +
> +	struct engine engine;
> +};
>   
> -static unsigned long
> -gettime(void)
> +static uint64_t
> +get_pmu_config(int dirfd, const char *name, const char *counter)
>   {
> -    struct timeval t;
> -    gettimeofday(&t, NULL);
> -    return (t.tv_usec + (t.tv_sec * 1000000));
> -}
> +	char buf[128], *p;
> +	int fd, ret;
>   
> -static int
> -top_bits_sort(const void *a, const void *b)
> -{
> -	struct top_bit * const *bit_a = a;
> -	struct top_bit * const *bit_b = b;
> -	int a_count = (*bit_a)->count;
> -	int b_count = (*bit_b)->count;
> +	ret = snprintf(buf, sizeof(buf), "%s-%s", name, counter);
> +	if (ret < 0 || ret == sizeof(buf))
> +		return -1;
>   
> -	if (a_count < b_count)
> -		return 1;
> -	else if (a_count == b_count)
> -		return 0;
> -	else
> +	fd = openat(dirfd, buf, O_RDONLY);
> +	if (fd < 0)
>   		return -1;
> -}
>   
> -static void
> -update_idle_bit(struct top_bit *top_bit)
> -{
> -	uint32_t reg_val;
> +	ret = read(fd, buf, sizeof(buf));
> +	close(fd);
> +	if (ret <= 0)
> +		return -1;
>   
> -	if (top_bit->bit->reg == INSTDONE_1)
> -		reg_val = instdone1;
> -	else
> -		reg_val = instdone;
> +	p = index(buf, '0');
> +	if (!p)
> +		return -1;
>   
> -	if ((reg_val & top_bit->bit->bit) == 0)
> -		top_bit->count++;
> +	return strtoul(p, NULL, 0);
>   }
>   
> -static void
> -print_clock(const char *name, int clock) {
> -	if (clock == -1)
> -		printf("%s clock: unknown", name);
> +#define engine_ptr(engines, n) (&engines->engine + (n))
> +
> +static const char *class_display_name(unsigned int class)
> +{
> +	switch (class) {
> +	case I915_ENGINE_CLASS_RENDER:
> +		return "Render/3D";
> +	case I915_ENGINE_CLASS_COPY:
> +		return "Blitter";
> +	case I915_ENGINE_CLASS_VIDEO:
> +		return "Video";
> +	case I915_ENGINE_CLASS_VIDEO_ENHANCE:
> +		return "VideoEnhance";
> +	default:
> +		return "[unknown]";
> +	}
> +}
> +
> +static int engine_cmp(const void *__a, const void *__b)
> +{
> +	const struct engine *a = (struct engine *)__a;
> +	const struct engine *b = (struct engine *)__b;
> +
> +	if (a->class != b->class)
> +		return a->class - b->class;
>   	else
> -		printf("%s clock: %d Mhz", name, clock);
> +		return a->instance - b->instance;
>   }
>   
> -static int
> -print_clock_info(struct pci_device *pci_dev)
> +static struct engines *discover_engines(void)
>   {
> -	uint32_t devid = pci_dev->device_id;
> -	uint16_t gcfgc;
> +	const char *sysfs_root = "/sys/devices/i915/events";
> +	struct engines *engines;
> +	struct dirent *dent;
> +	int ret = 0;
> +	DIR *d;
>   
> -	if (IS_GM45(devid)) {
> -		int core_clock = -1;
> +	engines = malloc(sizeof(struct engines));
> +	if (!engines)
> +		return NULL;
>   
> -		pci_device_cfg_read_u16(pci_dev, &gcfgc, I915_GCFGC);
> +	memset(engines, 0, sizeof(*engines));
>   
> -		switch (gcfgc & 0xf) {
> -		case 8:
> -			core_clock = 266;
> -			break;
> -		case 9:
> -			core_clock = 320;
> -			break;
> -		case 11:
> -			core_clock = 400;
> -			break;
> -		case 13:
> -			core_clock = 533;
> -			break;
> -		}
> -		print_clock("core", core_clock);
> -	} else if (IS_965(devid) && IS_MOBILE(devid)) {
> -		int render_clock = -1, sampler_clock = -1;
> +	engines->num_engines = 0;
>   
> -		pci_device_cfg_read_u16(pci_dev, &gcfgc, I915_GCFGC);
> +	d = opendir(sysfs_root);
> +	if (!d)
> +		return NULL;
>   
> -		switch (gcfgc & 0xf) {
> -		case 2:
> -			render_clock = 250; sampler_clock = 267;
> -			break;
> -		case 3:
> -			render_clock = 320; sampler_clock = 333;
> -			break;
> -		case 4:
> -			render_clock = 400; sampler_clock = 444;
> -			break;
> -		case 5:
> -			render_clock = 500; sampler_clock = 533;
> +	while ((dent = readdir(d)) != NULL) {
> +		const char *endswith = "-busy";
> +		const unsigned int endlen = strlen(endswith);
> +		struct engine *engine =
> +				engine_ptr(engines, engines->num_engines);
> +		char buf[256];
> +
> +		if (dent->d_type != DT_REG)
> +			continue;
> +
> +		if (strlen(dent->d_name) >= sizeof(buf)) {
> +			ret = ENAMETOOLONG;
>   			break;
>   		}
>   
> -		print_clock("render", render_clock);
> -		printf("  ");
> -		print_clock("sampler", sampler_clock);
> -	} else if (IS_945(devid) && IS_MOBILE(devid)) {
> -		int render_clock = -1, display_clock = -1;
> +		strcpy(buf, dent->d_name);
>   
> -		pci_device_cfg_read_u16(pci_dev, &gcfgc, I915_GCFGC);
> +		/* xxxN-busy */
> +		if (strlen(buf) < (endlen + 4))
> +			continue;
> +		if (strcmp(&buf[strlen(buf) - endlen], endswith))
> +			continue;
>   
> -		switch (gcfgc & 0x7) {
> -		case 0:
> -			render_clock = 166;
> -			break;
> -		case 1:
> -			render_clock = 200;
> -			break;
> -		case 3:
> -			render_clock = 250;
> -			break;
> -		case 5:
> -			render_clock = 400;
> +		memset(engine, 0, sizeof(*engine));
> +
> +		buf[strlen(buf) - endlen] = 0;
> +		engine->name = strdup(buf);
> +		if (!engine->name) {
> +			ret = errno;
>   			break;
>   		}
>   
> -		switch (gcfgc & 0x70) {
> -		case 0:
> -			display_clock = 200;
> -			break;
> -		case 4:
> -			display_clock = 320;
> +		engine->busy.config = get_pmu_config(dirfd(d), engine->name,
> +						     "busy");
> +		if (engine->busy.config == -1) {
> +			ret = ENOENT;
>   			break;
>   		}
> -		if (gcfgc & (1 << 7))
> -		    display_clock = 133;
>   
> -		print_clock("render", render_clock);
> -		printf("  ");
> -		print_clock("display", display_clock);
> -	} else if (IS_915(devid) && IS_MOBILE(devid)) {
> -		int render_clock = -1, display_clock = -1;
> +		engine->class = (engine->busy.config &
> +				 (__I915_PMU_OTHER(0) - 1)) >>
> +				I915_PMU_CLASS_SHIFT;
>   
> -		pci_device_cfg_read_u16(pci_dev, &gcfgc, I915_GCFGC);
> +		engine->instance = (engine->busy.config >>
> +				    I915_PMU_SAMPLE_BITS) &
> +				    ((1 << I915_PMU_SAMPLE_INSTANCE_BITS) - 1);
>   
> -		switch (gcfgc & 0x7) {
> -		case 0:
> -			render_clock = 160;
> -			break;
> -		case 1:
> -			render_clock = 190;
> -			break;
> -		case 4:
> -			render_clock = 333;
> +		ret = snprintf(buf, sizeof(buf), "%s/%u",
> +			       class_display_name(engine->class),
> +			       engine->instance);
> +		if (ret < 0 || ret == sizeof(buf)) {
> +			ret = ENOBUFS;
>   			break;
>   		}
> -		if (gcfgc & (1 << 13))
> -		    render_clock = 133;
> +		ret = 0;
>   
> -		switch (gcfgc & 0x70) {
> -		case 0:
> -			display_clock = 190;
> +		engine->display_name = strdup(buf);
> +		if (!engine->display_name) {
> +			ret = errno;
>   			break;
> -		case 4:
> -			display_clock = 333;
> +		}
> +
> +		engines->num_engines++;
> +		engines = realloc(engines, sizeof(struct engines) +
> +				  engines->num_engines * sizeof(struct engine));
> +		if (!engines) {
> +			ret = errno;
>   			break;
>   		}
> -		if (gcfgc & (1 << 7))
> -		    display_clock = 133;
> +	}
> +
> +	if (ret) {
> +		free(engines);
> +		errno = ret;
>   
> -		print_clock("render", render_clock);
> -		printf("  ");
> -		print_clock("display", display_clock);
> +		return NULL;
>   	}
>   
> +	qsort(engine_ptr(engines, 0), engines->num_engines,
> +	      sizeof(struct engine), engine_cmp);
> +
> +	engines->root = d;
>   
> -	printf("\n");
> -	return -1;
> +	return engines;
>   }
>   
> -#define STATS_LEN (20)
> -#define PERCENTAGE_BAR_END	(79 - STATS_LEN)
> +static int
> +filename_to_buf(const char *filename, char *buf, unsigned int bufsize)
> +{
> +	int fd, err;
> +	ssize_t ret;
>   
> -static void
> -print_percentage_bar(float percent, int cur_line_len)
> +	fd = open(filename, O_RDONLY);
> +	if (fd < 0)
> +		return -1;
> +
> +	ret = read(fd, buf, bufsize - 1);
> +	err = errno;
> +	close(fd);
> +	if (ret < 1) {
> +		errno = ret < 0 ? err : ENOMSG;
> +
> +		return -1;
> +	}
> +
> +	if (ret > 1 && buf[ret - 1] == '\n')
> +		buf[ret - 1] = '\0';
> +	else
> +		buf[ret] = '\0';
> +
> +	return 0;
> +}
> +
> +static uint64_t filename_to_u64(const char *filename, int base)
>   {
> -	int bar_avail_len = (PERCENTAGE_BAR_END - cur_line_len - 1) * 8;
> -	int bar_len = bar_avail_len * (percent + .5) / 100.0;
> -	int i;
> +	char buf[64], *b;
>   
> -	for (i = bar_len; i >= 8; i -= 8) {
> -		printf("%s", bars[8]);
> -		cur_line_len++;
> +	if (filename_to_buf(filename, buf, sizeof(buf)))
> +		return 0;
> +
> +	/*
> +	 * Handle both single integer and key=value formats by skipping
> +	 * leading non-digits.
> +	 */
> +	b = buf;
> +	while (*b && !isdigit(*b))
> +		b++;
> +
> +	return strtoull(b, NULL, base);
> +}
> +
> +static double filename_to_double(const char *filename)
> +{
> +	char *oldlocale;
> +	char buf[80];
> +	double v;
> +
> +	if (filename_to_buf(filename, buf, sizeof(buf)))
> +		return 0;
> +
> +	oldlocale = setlocale(LC_ALL, "C");
> +	v = strtod(buf, NULL);
> +	setlocale(LC_ALL, oldlocale);
> +
> +	return v;
> +}
> +
> +#define RAPL_ROOT "/sys/devices/power/"
> +#define RAPL_EVENT "/sys/devices/power/events/"
> +
> +static uint64_t rapl_type_id(void)
> +{
> +	return filename_to_u64(RAPL_ROOT "type", 10);
> +}
> +
> +static uint64_t rapl_gpu_power(void)
> +{
> +	return filename_to_u64(RAPL_EVENT "energy-gpu", 0);
> +}
> +
> +static double rapl_gpu_power_scale(void)
> +{
> +	return filename_to_double(RAPL_EVENT "energy-gpu.scale");
> +}
> +
> +static const char *rapl_gpu_power_unit(void)
> +{
> +	char buf[32];
> +
> +	if (filename_to_buf(RAPL_EVENT "energy-gpu.unit",
> +			    buf, sizeof(buf)) == 0)
> +		if (!strcmp(buf, "Joules"))
> +			return strdup("Watts");
> +		else
> +			return strdup(buf);
> +	else
> +		return NULL;
> +}
> +
> +#define IMC_ROOT "/sys/devices/uncore_imc/"
> +#define IMC_EVENT "/sys/devices/uncore_imc/events/"
> +
> +static uint64_t imc_type_id(void)
> +{
> +	return filename_to_u64(IMC_ROOT "type", 10);
> +}
> +
> +static uint64_t imc_data_reads(void)
> +{
> +	return filename_to_u64(IMC_EVENT "data_reads", 0);
> +}
> +
> +static double imc_data_reads_scale(void)
> +{
> +	return filename_to_double(IMC_EVENT "data_reads.scale");
> +}
> +
> +static const char *imc_data_reads_unit(void)
> +{
> +	char buf[32];
> +
> +	if (filename_to_buf(IMC_EVENT "data_reads.unit", buf, sizeof(buf)) == 0)
> +		return strdup(buf);
> +	else
> +		return NULL;
> +}
> +
> +static uint64_t imc_data_writes(void)
> +{
> +	return filename_to_u64(IMC_EVENT "data_writes", 0);
> +}
> +
> +static double imc_data_writes_scale(void)
> +{
> +	return filename_to_double(IMC_EVENT "data_writes.scale");
> +}
> +
> +static const char *imc_data_writes_unit(void)
> +{
> +	char buf[32];
> +
> +	if (filename_to_buf(IMC_EVENT "data_writes.unit",
> +			    buf, sizeof(buf)) == 0)
> +		return strdup(buf);
> +	else
> +		return NULL;
> +}
> +
> +#define _open_pmu(cnt, pmu, fd) \
> +({ \
> +	int fd__; \
> +\
> +	fd__ = perf_i915_open_group((pmu)->config, (fd)); \
> +	if (fd__ >= 0) { \
> +		if ((fd) == -1) \
> +			(fd) = fd__; \
> +		(pmu)->present = true; \
> +		(pmu)->idx = (cnt)++; \
> +	} \
> +\
> +	fd__; \
> +})
> +
> +#define _open_imc(cnt, pmu, fd) \
> +({ \
> +	int fd__; \
> +\
> +	fd__ = igt_perf_open_group(imc_type_id(), (pmu)->config, (fd)); \
> +	if (fd__ >= 0) { \
> +		if ((fd) == -1) \
> +			(fd) = fd__; \
> +		(pmu)->present = true; \
> +		(pmu)->idx = (cnt)++; \
> +	} \
> +\
> +	fd__; \
> +})
> +
> +static int pmu_init(struct engines *engines)
> +{
> +	unsigned int i;
> +	int fd;
> +
> +	engines->fd = -1;
> +	engines->num_counters = 0;
> +
> +	engines->irq.config = I915_PMU_INTERRUPTS;
> +	fd = _open_pmu(engines->num_counters, &engines->irq, engines->fd);
> +	if (fd < 0)
> +		return -1;
> +
> +	engines->freq_req.config = I915_PMU_REQUESTED_FREQUENCY;
> +	_open_pmu(engines->num_counters, &engines->freq_req, engines->fd);
> +
> +	engines->freq_act.config = I915_PMU_ACTUAL_FREQUENCY;
> +	_open_pmu(engines->num_counters, &engines->freq_act, engines->fd);
> +
> +	engines->rc6.config = I915_PMU_RC6_RESIDENCY;
> +	_open_pmu(engines->num_counters, &engines->rc6, engines->fd);
> +
> +	for (i = 0; i < engines->num_engines; i++) {
> +		struct engine *engine = engine_ptr(engines, i);
> +		struct {
> +			struct pmu_counter *pmu;
> +			const char *counter;
> +		} *cnt, counters[] = {
> +			{ .pmu = &engine->busy, .counter = "busy" },
> +			{ .pmu = &engine->wait, .counter = "wait" },
> +			{ .pmu = &engine->sema, .counter = "sema" },
> +			{ .pmu = NULL, .counter = NULL },
> +		};
> +
> +		for (cnt = counters; cnt->pmu; cnt++) {
> +			if (!cnt->pmu->config)
> +				cnt->pmu->config =
> +					get_pmu_config(dirfd(engines->root),
> +						       engine->name,
> +						       cnt->counter);
> +			fd = _open_pmu(engines->num_counters, cnt->pmu,
> +				       engines->fd);
> +			if (fd >= 0)
> +				engine->num_counters++;
> +		}
>   	}
> -	if (i) {
> -		printf("%s", bars[i]);
> -		cur_line_len++;
> +
> +	engines->rapl_fd = -1;
> +	if (rapl_type_id()) {
> +		engines->rapl_scale = rapl_gpu_power_scale();
> +		engines->rapl_unit = rapl_gpu_power_unit();
> +		if (!engines->rapl_unit)
> +			return -1;
> +
> +		engines->rapl.config = rapl_gpu_power();
> +		if (!engines->rapl.config)
> +			return -1;
> +
> +		engines->rapl_fd = igt_perf_open(rapl_type_id(),
> +						 engines->rapl.config);
> +		if (engines->rapl_fd < 0)
> +			return -1;
> +
> +		engines->rapl.present = true;
>   	}
>   
> -	/* NB: We can't use a field width with utf8 so we manually
> -	* guarantee a field with of 45 chars for any bar. */
> -	printf("%*s", PERCENTAGE_BAR_END - cur_line_len, "");
> -}
> +	engines->imc_fd = -1;
> +	if (imc_type_id()) {
> +		unsigned int num = 0;
>   
> -struct ring {
> -	const char *name;
> -	uint32_t mmio;
> -	int head, tail, size;
> -	uint64_t full;
> -	int idle;
> -};
> +		engines->imc_reads_scale = imc_data_reads_scale();
> +		engines->imc_writes_scale = imc_data_writes_scale();
> +
> +		engines->imc_reads_unit = imc_data_reads_unit();
> +		if (!engines->imc_reads_unit)
> +			return -1;
> +
> +		engines->imc_writes_unit = imc_data_writes_unit();
> +		if (!engines->imc_writes_unit)
> +			return -1;
> +
> +		engines->imc_reads.config = imc_data_reads();
> +		if (!engines->imc_reads.config)
> +			return -1;
> +
> +		engines->imc_writes.config = imc_data_writes();
> +		if (!engines->imc_writes.config)
> +			return -1;
> +
> +		fd = _open_imc(num, &engines->imc_reads, engines->imc_fd);
> +		if (fd < 0)
> +			return -1;
> +		fd = _open_imc(num, &engines->imc_writes, engines->imc_fd);
> +		if (fd < 0)
> +			return -1;
> +
> +		engines->imc_reads.present = true;
> +		engines->imc_writes.present = true;
> +	}
> +
> +	return 0;
> +}
>   
> -static uint32_t ring_read(struct ring *ring, uint32_t reg)
> +static uint64_t pmu_read_multi(int fd, unsigned int num, uint64_t *val)
>   {
> -	return INREG(ring->mmio + reg);
> +	uint64_t buf[2 + num];
> +	unsigned int i;
> +	ssize_t len;
> +
> +	memset(buf, 0, sizeof(buf));
> +
> +	len = read(fd, buf, sizeof(buf));
> +	assert(len == sizeof(buf));
> +
> +	for (i = 0; i < num; i++)
> +		val[i] = buf[2 + i];
> +
> +	return buf[1];
>   }
>   
> -static void ring_init(struct ring *ring)
> +static double __pmu_calc(struct pmu_pair *p, double d, double t, double s)
>   {
> -	ring->size = (((ring_read(ring, RING_LEN) & RING_NR_PAGES) >> 12) + 1) * 4096;
> +	double v;
> +
> +	v = p->cur - p->prev;
> +	v /= d;
> +	v /= t;
> +	v *= s;
> +
> +	if (s == 100.0 && v > 100.0)
> +		v = 100.0;
> +
> +	return v;
>   }
>   
> -static void ring_reset(struct ring *ring)
> +static void fill_str(char *buf, unsigned int bufsz, char c, unsigned int num)
>   {
> -	ring->idle = ring->full = 0;
> +	unsigned int i;
> +
> +	for (i = 0; i < num && i < (bufsz - 1); i++)
> +		*buf++ = c;
> +
> +	*buf = 0;
>   }
>   
> -static void ring_sample(struct ring *ring)
> +static void pmu_calc(struct pmu_counter *cnt,
> +		     char *buf, unsigned int bufsz,
> +		     unsigned int width, unsigned width_dec,
> +		     double d, double t, double s)
>   {
> -	int full;
> +	double val;
> +	int len;
> +
> +	assert(bufsz >= (width + width_dec + 1));
> +
> +	if (!cnt->present) {
> +		fill_str(buf, bufsz, '-', width + width_dec);
> +		return;
> +	}
>   
> -	if (!ring->size)
> +	val = __pmu_calc(&cnt->val, d, t, s);
> +
> +	len = snprintf(buf, bufsz, "%*.*f", width + width_dec, width_dec, val);
> +	if (len < 0 || len == bufsz) {
> +		fill_str(buf, bufsz, 'X', width + width_dec);
>   		return;
> +	}
> +}
> +
> +static uint64_t __pmu_read_single(int fd, uint64_t *ts)
> +{
> +	uint64_t data[2] = { };
> +	ssize_t len;
>   
> -	ring->head = ring_read(ring, RING_HEAD) & HEAD_ADDR;
> -	ring->tail = ring_read(ring, RING_TAIL) & TAIL_ADDR;
> +	len = read(fd, data, sizeof(data));
> +	assert(len == sizeof(data));
>   
> -	if (ring->tail == ring->head)
> -		ring->idle++;
> +	if (ts)
> +		*ts = data[1];
>   
> -	full = ring->tail - ring->head;
> -	if (full < 0)
> -		full += ring->size;
> -	ring->full += full;
> +	return data[0];
>   }
>   
> -static void ring_print_header(FILE *out, struct ring *ring)
> +static uint64_t pmu_read_single(int fd)
>   {
> -    fprintf(out, "%.6s%%\tops\t",
> -            ring->name
> -          );
> +	return __pmu_read_single(fd, NULL);
>   }
>   
> -static void ring_print(struct ring *ring, unsigned long samples_per_sec)
> +static void __update_sample(struct pmu_counter *counter, uint64_t val)
>   {
> -	int percent_busy, len;
> +	counter->val.prev = counter->val.cur;
> +	counter->val.cur = val;
> +}
>   
> -	if (!ring->size)
> -		return;
> +static void update_sample(struct pmu_counter *counter, uint64_t *val)
> +{
> +	if (counter->present)
> +		__update_sample(counter, val[counter->idx]);
> +}
> +
> +static void pmu_sample(struct engines *engines)
> +{
> +	const int num_val = engines->num_counters;
> +	uint64_t val[2 + num_val];
> +	unsigned int i;
> +
> +	engines->ts.prev = engines->ts.cur;
> +
> +	if (engines->rapl_fd >= 0)
> +		__update_sample(&engines->rapl,
> +				pmu_read_single(engines->rapl_fd));
> +
> +	if (engines->imc_fd >= 0) {
> +		pmu_read_multi(engines->imc_fd, 2, val);
> +		update_sample(&engines->imc_reads, val);
> +		update_sample(&engines->imc_writes, val);
> +	}
>   
> -	percent_busy = 100 - 100 * ring->idle / samples_per_sec;
> +	engines->ts.cur = pmu_read_multi(engines->fd, num_val, val);
>   
> -	len = printf("%25s busy: %3d%%: ", ring->name, percent_busy);
> -	print_percentage_bar (percent_busy, len);
> -	printf("%24s space: %d/%d\n",
> -		   ring->name,
> -		   (int)(ring->full / samples_per_sec),
> -		   ring->size);
> +	update_sample(&engines->freq_req, val);
> +	update_sample(&engines->freq_act, val);
> +	update_sample(&engines->irq, val);
> +	update_sample(&engines->rc6, val);
> +
> +	for (i = 0; i < engines->num_engines; i++) {
> +		struct engine *engine = engine_ptr(engines, i);
> +
> +		update_sample(&engine->busy, val);
> +		update_sample(&engine->sema, val);
> +		update_sample(&engine->wait, val);
> +	}
>   }
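The prev/cur pairs that pmu_sample() maintains are consumed by pmu_calc() and __pmu_calc(), whose bodies are not part of this hunk. Judging only from the call sites further down (a divisor such as 1e9, the elapsed time t, and a scale such as 100), the conversion is probably along the lines of the sketch below. The helper name counter_rate() and the clamping to 100% are assumptions made for illustration, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Same layout as struct pmu_pair in the patch. */
struct pmu_pair {
	uint64_t cur;
	uint64_t prev;
};

/*
 * Hypothetical helper: turn a counter delta into a displayed value.
 *   d - divisor converting raw counter units (e.g. 1e9 for nanoseconds)
 *   t - elapsed time between the two samples, in seconds
 *   s - output scale (e.g. 100 for a percentage)
 */
static double counter_rate(const struct pmu_pair *p, double d, double t,
			   double s)
{
	double v;

	assert(p && d > 0.0 && t > 0.0);

	v = (double)(p->cur - p->prev);
	v = v / d / t * s;

	/* Assumption: percentages are capped so jitter never shows >100%. */
	if (s == 100.0 && v > 100.0)
		v = 100.0;

	return v;
}
```

With a busyness counter that advanced by 0.5e9 ns over a one second period, this yields 50 (percent); with an interrupt counter, d = 1 and s = 1 give plain events per second.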
>   
> -static void ring_log(struct ring *ring, unsigned long samples_per_sec,
> -		FILE *output)
> +static const char *bars[] = { " ", "▏", "▎", "▍", "▌", "▋", "▊", "▉", "█" };
> +
> +static void
> +print_percentage_bar(double percent, int max_len)
>   {
> -	if (ring->size)
> -		fprintf(output, "%3d\t%d\t",
> -			(int)(100 - 100 * ring->idle / samples_per_sec),
> -			(int)(ring->full / samples_per_sec));
> -	else
> -		fprintf(output, "-1\t-1\t");
> +	int bar_len = percent * (8 * (max_len - 2)) / 100.0;
> +	int i;
> +
> +	putchar('|');
> +
> +	for (i = bar_len; i >= 8; i -= 8)
> +		printf("%s", bars[8]);
> +	if (i)
> +		printf("%s", bars[i]);
> +
> +	for (i = 0; i < (max_len - 2 - (bar_len + 7) / 8); i++)
> +		putchar(' ');
> +
> +	putchar('|');
>   }
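To make the eighth-block arithmetic in print_percentage_bar() easier to check, here is a minimal sketch that renders the same bar into a buffer instead of stdout. format_percentage_bar() is a hypothetical name, and the clamping of percent to 0..100 plus the buffer size check are additions for the sake of the example; the glyph table and the bar_len and padding formulas follow the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same glyph table as in the patch: index n draws n/8 of a cell. */
static const char *bars[] = { " ", "▏", "▎", "▍", "▌", "▋", "▊", "▉", "█" };

/*
 * Hypothetical helper: write "|<bar><padding>|" into buf.
 * Returns the number of bytes written (without the terminator),
 * or 0 if buf is too small.
 */
static size_t format_percentage_bar(char *buf, size_t size,
				    double percent, int max_len)
{
	size_t pos = 0;
	int cells, bar_len, i;

	assert(buf && max_len >= 2);
	cells = max_len - 2;

	/* Each cell needs at most 3 bytes of UTF-8, plus two '|' and NUL. */
	if (size < (size_t)cells * 3 + 3)
		return 0;

	/* Assumption: clamp the input so the bar never overruns. */
	if (percent < 0.0)
		percent = 0.0;
	if (percent > 100.0)
		percent = 100.0;

	/* Bar length in eighths of a character cell. */
	bar_len = (int)(percent * (8 * cells) / 100.0);

	buf[pos++] = '|';

	/* Full cells first, then one partial cell if anything is left. */
	for (i = bar_len; i >= 8; i -= 8) {
		memcpy(buf + pos, bars[8], strlen(bars[8]));
		pos += strlen(bars[8]);
	}
	if (i) {
		memcpy(buf + pos, bars[i], strlen(bars[i]));
		pos += strlen(bars[i]);
	}

	/* Pad the remaining cells with spaces. */
	for (i = 0; i < cells - (bar_len + 7) / 8; i++)
		buf[pos++] = ' ';

	buf[pos++] = '|';
	buf[pos] = '\0';

	return pos;
}
```

For max_len = 6 there are four cells, so 50% comes out as two full blocks followed by two spaces between the delimiters.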
>   
> +#define DEFAULT_PERIOD_MS (1000)
> +
>   static void
>   usage(const char *appname)
>   {
>   	printf("intel_gpu_top - Display a top-like summary of Intel GPU usage\n"
> -			"\n"
> -			"usage: %s [parameters]\n"
> -			"\n"
> -			"The following parameters apply:\n"
> -			"[-s <samples>]       samples per seconds (default %d)\n"
> -			"[-e <command>]       command to profile\n"
> -			"[-o <file>]          output statistics to file. If file is '-',"
> -			"                     run in batch mode and output statistics to stdio only \n"
> -			"[-h]                 show this help screen\n"
> -			"\n",
> -			appname,
> -			SAMPLES_PER_SEC
> -		  );
> -	return;
> +		"\n"
> +		"Usage: %s [parameters]\n"
> +		"\n"
> +		"\tThe following parameters are optional:\n\n"
> +		"\t[-s <ms>]       Refresh period in milliseconds (default %ums).\n"
> +		"\t[-h]            Show this help text.\n"
> +		"\n",
> +		appname, DEFAULT_PERIOD_MS);
>   }
>   
>   int main(int argc, char **argv)
>   {
> -	uint32_t devid;
> -	struct pci_device *pci_dev;
> -	struct ring render_ring = {
> -		.name = "render",
> -		.mmio = 0x2030,
> -	}, bsd_ring = {
> -		.name = "bitstream",
> -		.mmio = 0x4030,
> -	}, bsd6_ring = {
> -		.name = "bitstream",
> -		.mmio = 0x12030,
> -	}, blt_ring = {
> -		.name = "blitter",
> -		.mmio = 0x22030,
> -	};
> -	int i, ch;
> -	int samples_per_sec = SAMPLES_PER_SEC;
> -	FILE *output = NULL;
> -	double elapsed_time=0;
> -	int print_headers=1;
> -	pid_t child_pid=-1;
> -	int child_stat;
> -	char *cmd=NULL;
> -	int interactive=1;
> -
> -	/* Parse options? */
> -	while ((ch = getopt(argc, argv, "s:o:e:h")) != -1) {
> +	unsigned int period_us = DEFAULT_PERIOD_MS * 1000;
> +	int con_w = -1, con_h = -1;
> +	struct engines *engines;
> +	unsigned int i;
> +	int ret, ch;
> +
> +	/* Parse options */
> +	while ((ch = getopt(argc, argv, "s:h")) != -1) {
>   		switch (ch) {
> -		case 'e': cmd = strdup(optarg);
> -			break;
> -		case 's': samples_per_sec = atoi(optarg);
> -			if (samples_per_sec < 100) {
> -				fprintf(stderr, "Error: samples per second must be >= 100\n");
> -				exit(1);
> -			}
> -			break;
> -		case 'o':
> -			if (!strcmp(optarg, "-")) {
> -				/* Running in non-interactive mode */
> -				interactive = 0;
> -				output = stdout;
> -			}
> -			else
> -				output = fopen(optarg, "w");
> -			if (!output)
> -			{
> -				perror("fopen");
> -				exit(1);
> -			}
> +		case 's':
> +			period_us = atoi(optarg) * 1000;
>   			break;
>   		case 'h':
>   			usage(argv[0]);
>   			exit(0);
> -			break;
>   		default:
> -			fprintf(stderr, "Invalid flag %c!\n", (char)optopt);
> +			fprintf(stderr, "Invalid option %c!\n", (char)optopt);
>   			usage(argv[0]);
>   			exit(1);
> -			break;
>   		}
>   	}
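The -s handling above feeds atoi() straight into the multiplication, so "-s 0", "-s abc" or a negative value all end up as a zero or wrapped period. One possible way to tighten this, shown only as a sketch: parse_period_us() is a hypothetical helper and the accepted range (at least 1 ms, no overflow of unsigned int microseconds) is an assumption, not something the patch specifies.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a refresh period given in milliseconds and
 * store it in microseconds. Returns 0 on success, -1 on invalid input.
 */
static int parse_period_us(const char *arg, unsigned int *period_us)
{
	char *end;
	long ms;

	assert(arg && period_us);

	errno = 0;
	ms = strtol(arg, &end, 10);

	/* Reject overflow, empty input and trailing garbage. */
	if (errno || end == arg || *end != '\0')
		return -1;

	/* Reject zero, negatives and values that overflow in microseconds. */
	if (ms < 1 || (unsigned long)ms > UINT_MAX / 1000u)
		return -1;

	*period_us = (unsigned int)ms * 1000u;

	return 0;
}
```

In the option loop this would replace the atoi() line, with a failure leading to usage() and a non-zero exit in the same way as the default case.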
>   
> -	pci_dev = intel_get_pci_device();
> -	devid = pci_dev->device_id;
> -	intel_mmio_use_pci_bar(pci_dev);
> -	init_instdone_definitions(devid);
> -
> -	/* Do we have a command to run? */
> -	if (cmd != NULL) {
> -		if (output) {
> -			fprintf(output, "# Profiling: %s\n", cmd);
> -			fflush(output);
> -		}
> -		child_pid = fork();
> -		if (child_pid < 0) {
> -			perror("fork");
> -			exit(1);
> -		}
> -		else if (child_pid == 0) {
> -			int res;
> -			res = system(cmd);
> -			if (res < 0)
> -				perror("running command");
> -			if (output) {
> -				fflush(output);
> -				fprintf(output, "# %s exited with status %d\n", cmd, res);
> -				fflush(output);
> -			}
> -			free(cmd);
> -			exit(0);
> -		} else {
> -			free(cmd);
> -		}
> +	engines = discover_engines();
> +	if (!engines) {
> +		fprintf(stderr,
> +			"Failed to detect engines! (%s)\n(Kernel 4.16 or newer is required for i915 PMU support.)\n",
> +			strerror(errno));
> +		return 1;
>   	}
>   
> -	for (i = 0; i < num_instdone_bits; i++) {
> -		top_bits[i].bit = &instdone_bits[i];
> -		top_bits[i].count = 0;
> -		top_bits_sorted[i] = &top_bits[i];
> +	ret = pmu_init(engines);
> +	if (ret) {
> +		fprintf(stderr,
> +			"Failed to initialize PMU! (%s)\n", strerror(errno));
> +		return 1;
>   	}
>   
> -	/* Grab access to the registers */
> -	intel_register_access_init(pci_dev, 0, -1);
> +	pmu_sample(engines);
>   
> -	ring_init(&render_ring);
> -	if (IS_GEN4(devid) || IS_GEN5(devid))
> -		ring_init(&bsd_ring);
> -	if (IS_GEN6(devid) || IS_GEN7(devid)) {
> -		ring_init(&bsd6_ring);
> -		ring_init(&blt_ring);
> -	}
> +	for (;;) {
> +		double t;
> +#define BUFSZ 16
> +		char freq[BUFSZ];
> +		char fact[BUFSZ];
> +		char irq[BUFSZ];
> +		char rc6[BUFSZ];
> +		char power[BUFSZ];
> +		char reads[BUFSZ];
> +		char writes[BUFSZ];
> +		struct winsize ws;
> +		int lines = 0;
>   
> -	/* Initialize GPU stats */
> -	if (HAS_STATS_REGS(devid)) {
> -		for (i = 0; i < STATS_COUNT; i++) {
> -			uint32_t stats_high, stats_low, stats_high_2;
> +		/* Update terminal size. */
> +		if (ioctl(0, TIOCGWINSZ, &ws) != -1) {
> +			con_w = ws.ws_col;
> +			con_h = ws.ws_row;
> +		}
>   
> -			do {
> -				stats_high = INREG(stats_regs[i] + 4);
> -				stats_low = INREG(stats_regs[i]);
> -				stats_high_2 = INREG(stats_regs[i] + 4);
> -			} while (stats_high != stats_high_2);
> +		pmu_sample(engines);
> +		t = (double)(engines->ts.cur - engines->ts.prev) / 1e9;
>   
> -			last_stats[i] = (uint64_t)stats_high << 32 |
> -				stats_low;
> -		}
> -	}
> +		printf("\033[H\033[J");
>   
> -	for (;;) {
> -		int j;
> -		unsigned long long t1, ti, tf, t2;
> -		unsigned long long def_sleep = 1000000 / samples_per_sec;
> -		unsigned long long last_samples_per_sec = samples_per_sec;
> -		unsigned short int max_lines;
> -		struct winsize ws;
> -		char clear_screen[] = {0x1b, '[', 'H',
> -				       0x1b, '[', 'J',
> -				       0x0};
> -		int percent;
> -		int len;
> -
> -		t1 = gettime();
> -
> -		ring_reset(&render_ring);
> -		ring_reset(&bsd_ring);
> -		ring_reset(&bsd6_ring);
> -		ring_reset(&blt_ring);
> -
> -		for (i = 0; i < samples_per_sec; i++) {
> -			long long interval;
> -			ti = gettime();
> -			if (IS_965(devid)) {
> -				instdone = INREG(INSTDONE_I965);
> -				instdone1 = INREG(INSTDONE_1);
> -			} else
> -				instdone = INREG(INSTDONE);
> -
> -			for (j = 0; j < num_instdone_bits; j++)
> -				update_idle_bit(&top_bits[j]);
> -
> -			ring_sample(&render_ring);
> -			ring_sample(&bsd_ring);
> -			ring_sample(&bsd6_ring);
> -			ring_sample(&blt_ring);
> -
> -			tf = gettime();
> -			if (tf - t1 >= 1000000) {
> -				/* We are out of sync, bail out */
> -				last_samples_per_sec = i+1;
> -				break;
> -			}
> -			interval = def_sleep - (tf - ti);
> -			if (interval > 0)
> -				usleep(interval);
> -		}
> +		pmu_calc(&engines->freq_req, freq, BUFSZ, 4, 0, 1.0, t, 1);
> +		pmu_calc(&engines->freq_act, fact, BUFSZ, 4, 0, 1.0, t, 1);
> +		pmu_calc(&engines->irq, irq, BUFSZ, 8, 0, 1.0, t, 1);
> +		pmu_calc(&engines->rc6, rc6, BUFSZ, 3, 0, 1e9, t, 100);
> +		pmu_calc(&engines->rapl, power, BUFSZ, 4, 2, 1.0, t,
> +			 engines->rapl_scale);
> +		pmu_calc(&engines->imc_reads, reads, BUFSZ, 6, 0, 1.0, t,
> +			 engines->imc_reads_scale);
> +		pmu_calc(&engines->imc_writes, writes, BUFSZ, 6, 0, 1.0, t,
> +			 engines->imc_writes_scale);
>   
> -		if (HAS_STATS_REGS(devid)) {
> -			for (i = 0; i < STATS_COUNT; i++) {
> -				uint32_t stats_high, stats_low, stats_high_2;
> +		if (lines++ < con_h)
> +			printf("intel-gpu-top - %s/%s MHz;  %s%% RC6; %s %s; %s irqs/s\n",
> +			       fact, freq, rc6, power, engines->rapl_unit, irq);
>   
> -				do {
> -					stats_high = INREG(stats_regs[i] + 4);
> -					stats_low = INREG(stats_regs[i]);
> -					stats_high_2 = INREG(stats_regs[i] + 4);
> -				} while (stats_high != stats_high_2);
> +		if (lines++ < con_h)
> +			printf("\n");
>   
> -				stats[i] = (uint64_t)stats_high << 32 |
> -					stats_low;
> -			}
> -		}
> +		if (engines->imc_fd) {
> +			if (lines++ < con_h)
> +				printf("      IMC reads:   %s %s/s\n",
> +				       reads, engines->imc_reads_unit);
> +
> +			if (lines++ < con_h)
> +				printf("     IMC writes:   %s %s/s\n",
> +				       writes, engines->imc_writes_unit);
>   
> -		qsort(top_bits_sorted, num_instdone_bits,
> -		      sizeof(struct top_bit *), top_bits_sort);
> -
> -		/* Limit the number of lines printed to the terminal height so the
> -		 * most important info (at the top) will stay on screen. */
> -		max_lines = -1;
> -		if (ioctl(0, TIOCGWINSZ, &ws) != -1)
> -			max_lines = ws.ws_row - 6; /* exclude header lines */
> -		if (max_lines >= num_instdone_bits)
> -			max_lines = num_instdone_bits;
> -
> -		t2 = gettime();
> -		elapsed_time += (t2 - t1) / 1000000.0;
> -
> -		if (interactive) {
> -			printf("%s", clear_screen);
> -			print_clock_info(pci_dev);
> -
> -			ring_print(&render_ring, last_samples_per_sec);
> -			ring_print(&bsd_ring, last_samples_per_sec);
> -			ring_print(&bsd6_ring, last_samples_per_sec);
> -			ring_print(&blt_ring, last_samples_per_sec);
> -
> -			printf("\n%30s  %s\n", "task", "percent busy");
> -			for (i = 0; i < max_lines; i++) {
> -				if (top_bits_sorted[i]->count > 0) {
> -					percent = (top_bits_sorted[i]->count * 100) /
> -						last_samples_per_sec;
> -					len = printf("%30s: %3d%%: ",
> -							 top_bits_sorted[i]->bit->name,
> -							 percent);
> -					print_percentage_bar (percent, len);
> -				} else {
> -					printf("%*s", PERCENTAGE_BAR_END, "");
> -				}
> -
> -				if (i < STATS_COUNT && HAS_STATS_REGS(devid)) {
> -					printf("%13s: %llu (%lld/sec)",
> -						   stats_reg_names[i],
> -						   (long long)stats[i],
> -						   (long long)(stats[i] - last_stats[i]));
> -					last_stats[i] = stats[i];
> -				} else {
> -					if (!top_bits_sorted[i]->count)
> -						break;
> -				}
> +			if (++lines < con_h)
>   				printf("\n");
> -			}
>   		}
> -		if (output) {
> -			/* Print headers for columns at first run */
> -			if (print_headers) {
> -				fprintf(output, "# time\t");
> -				ring_print_header(output, &render_ring);
> -				ring_print_header(output, &bsd_ring);
> -				ring_print_header(output, &bsd6_ring);
> -				ring_print_header(output, &blt_ring);
> -				for (i = 0; i < MAX_NUM_TOP_BITS; i++) {
> -					if (i < STATS_COUNT && HAS_STATS_REGS(devid)) {
> -						fprintf(output, "%.6s\t",
> -							   stats_reg_names[i]
> -							   );
> -					}
> -					if (!top_bits[i].count)
> -						continue;
> -				}
> -				fprintf(output, "\n");
> -				print_headers = 0;
> -			}
>   
> -			/* Print statistics */
> -			fprintf(output, "%.2f\t", elapsed_time);
> -			ring_log(&render_ring, last_samples_per_sec, output);
> -			ring_log(&bsd_ring, last_samples_per_sec, output);
> -			ring_log(&bsd6_ring, last_samples_per_sec, output);
> -			ring_log(&blt_ring, last_samples_per_sec, output);
> -
> -			for (i = 0; i < MAX_NUM_TOP_BITS; i++) {
> -				if (i < STATS_COUNT && HAS_STATS_REGS(devid)) {
> -					fprintf(output, "%"PRIu64"\t",
> -						   stats[i] - last_stats[i]);
> -					last_stats[i] = stats[i];
> -				}
> -					if (!top_bits[i].count)
> -						continue;
> -			}
> -			fprintf(output, "\n");
> -			fflush(output);
> -		}
> +		for (i = 0; i < engines->num_engines; i++) {
> +			struct engine *engine = engine_ptr(engines, i);
>   
> -		for (i = 0; i < num_instdone_bits; i++) {
> -			top_bits_sorted[i]->count = 0;
> +			if (engine->num_counters && lines < con_h) {
> +				const char *a = "          ENGINE      BUSY ";
> +				const char *b = " MI_SEMA MI_WAIT";
>   
> -			if (i < STATS_COUNT)
> -				last_stats[i] = stats[i];
> +				printf("\033[7m%s%*s%s\033[0m\n",
> +				       a,
> +				       (int)(con_w - 1 - strlen(a) - strlen(b)),
> +				       " ", b);
> +				lines++;
> +				break;
> +			}
>   		}
>   
> -		/* Check if child has gone */
> -		if (child_pid > 0) {
> -			int res;
> -			if ((res = waitpid(child_pid, &child_stat, WNOHANG)) == -1) {
> -				perror("waitpid");
> -				exit(1);
> -			}
> -			if (res == 0)
> +		for (i = 0; i < engines->num_engines && lines < con_h; i++) {
> +			struct engine *engine = engine_ptr(engines, i);
> +			unsigned int max_w = con_w - 1;
> +			unsigned int len;
> +			char sema[BUFSZ];
> +			char wait[BUFSZ];
> +			char busy[BUFSZ];
> +			char buf[128];
> +			double val;
> +
> +			if (!engine->num_counters)
>   				continue;
> -			if (WIFEXITED(child_stat))
> -				break;
> +
> +			pmu_calc(&engine->sema, sema, BUFSZ, 3, 0, 1e9, t, 100);
> +			pmu_calc(&engine->wait, wait, BUFSZ, 3, 0, 1e9, t, 100);
> +			len = snprintf(buf, sizeof(buf), "    %s%%    %s%%",
> +				       sema, wait);
> +
> +			pmu_calc(&engine->busy, busy, BUFSZ, 6, 2, 1e9, t,
> +				 100);
> +			len += printf("%16s %s%% ", engine->display_name, busy);
> +
> +			val = __pmu_calc(&engine->busy.val, 1e9, t, 100);
> +			print_percentage_bar(val, max_w - len);
> +
> +			printf("%s\n", buf);
> +
> +			lines++;
>   		}
> -	}
>   
> -	fclose(output);
> +		if (lines++ < con_h)
> +			printf("\n");
> +
> +		usleep(period_us);
> +	}
>   
> -	intel_register_access_fini();
>   	return 0;
>   }
> diff --git a/tools/meson.build b/tools/meson.build
> index bd2d313d5156..a918eeb0bef1 100644
> --- a/tools/meson.build
> +++ b/tools/meson.build
> @@ -23,7 +23,6 @@ tools_progs = [
>   	'intel_gpu_frequency',
>   	'intel_firmware_decode',
>   	'intel_gpu_time',
> -	'intel_gpu_top',
>   	'intel_gtt',
>   	'intel_guc_logger',
>   	'intel_infoframes',
> @@ -117,6 +116,11 @@ shared_library('intel_aubdump', 'aubdump.c',
>   	       name_prefix : '',
>   	       install : true)
>   
> +executable('intel_gpu_top', 'intel_gpu_top.c',
> +	   install : true,
> +	   install_rpath : rpathdir,
> +	   dependencies : tool_deps + [ lib_igt_perf ])
> +
>   conf_data = configuration_data()
>   conf_data.set('prefix', prefix)
>   conf_data.set('exec_prefix', '${prefix}')
>
Rinat Ibragimov April 23, 2018, 12:18 p.m. UTC | #2
Ping?


>Monday, 9 April 2018, 15:26 +03:00 from Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>:
>
>
>[Adding some people to Cc for more ack/nack type feedback.]
>
>The executive question is ack or nack on replacing intel_gpu_top with a new 
>implementation which uses only the perf PMU for counter gathering.
>
>A short history on how this came to be:
>
>There was a recent external patch contribution from Rinat Ibragimov to 
>support more platforms in the existing intel_gpu_top. But as the tool 
>is not safe to use, Chris Wilson suggested simply replacing it.
>
>As it happens I had a good start to do this quickly and cheaply, in the 
>form of one prototype I did recently, which only needed ripping some 
>bits out, and polishing the rest.
>
>Eero and Rinat kindly did a lot of platform coverage testing and the 
>rewrite seems ready for next steps.
>
>I need to stress that, as the commit message notes, the new tool has a 
>slightly different scope in that it doesn't expose GPU functional level 
>data, but only overall stats like power, frequencies, RC6, interrupts, 
>IMC memory bandwidth and per command streamer busyness, mi_semaphore and 
>mi_event waits. My thinking was that for more functional level profiling 
>gpu-top (OA) should be used.
>
>Also the "run a command" and CSV output features are not supported 
>since both can be done directly via perf stat.
>
>Regards,
>
>Tvrtko
>
>On 04/04/2018 16:26, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin < tvrtko.ursulin@intel.com >
>> 
>> intel-gpu-top is a dangerous tool which can hang machines due to unsafe mmio
>> register access. This patch rewrites it to use only PMU.
>> 
>> Only overall command streamer busyness and GPU global data such as power
>> and frequencies are included in this new version.
>> 
>> For access to more GPU functional unit level data, an OA metric based tool
>> like gpu-top should be used instead.
>> 
>> v2:
>>   * Sort engines by class and instance.
>>   * Do not wait for one sampling period to display something on screen.
>>   * Move code out of the asserts. (Rinat Ibragimov)
>>   * Continuously adapt to terminal size. (Rinat Ibragimov)
>> 
>> v3:
>>   * Change layout and precision of some field. (Chris Wilson)
>>   Eero Tamminen:
>>   * Use more user friendly engine names.
>>   * Don't error out if a counter is missing.
>>   * Add IMC read/write bandwidth.
>>   * Report minimum required kernel version.
>> 
>> v4:
>>   * Really support 4.16 by skipping of missing engines.
>>   * Simpler and less hacky float printing.
>>   * Preserve copyright header. (Antonio Argenziano)
>>   * Simplify engines_ptr macro. (Rinat Ibragimov)
>> 
>> v5:
>>   * Get RAPL unit from sysfs.
>>   * Consolidate sysfs paths with a macro.
>>   * Tidy error handling by carrying over and reporting errno.
>>   * Check against console height on all prints.
>>   * More readable minimum kernel version message. (Eero Tamminen)
>>   * Column banner for per engine stats. (Eero Tamminen)
>> 
>> v6:
>>   * Man page update. (Eero Tamminen)
>> 
>> Signed-off-by: Tvrtko Ursulin < tvrtko.ursulin@intel.com >
>> Cc: Chris Wilson < chris@chris-wilson.co.uk >
>> Cc: Lionel Landwerlin < lionel.g.landwerlin@intel.com >
>> Cc: Petri Latvala < petri.latvala@intel.com >
>> Cc: Eero Tamminen < eero.t.tamminen@intel.com >
>> Cc: Rinat Ibragimov < ibragimovrinat@mail.ru >
>> Reviewed-by: Lionel Landwerlin < lionel.g.landwerlin@intel.com > # v1
>> Reviewed-by: Chris Wilson < chris@chris-wilson.co.uk > # v0.5
>> ---
>>   lib/igt_perf.c        |    6 +
>>   lib/igt_perf.h        |    1 +
>>   man/intel_gpu_top.rst |   41 +-
>>   tools/Makefile.am     |    2 +
>>   tools/intel_gpu_top.c | 1250 +++++++++++++++++++++++++++----------------------
>>   tools/meson.build     |    6 +-
>>   6 files changed, 719 insertions(+), 587 deletions(-)
>> 
>> diff --git a/lib/igt_perf.c b/lib/igt_perf.c
>> index 99d82ea51c9b..e3dec2cc29c7 100644
>> --- a/lib/igt_perf.c
>> +++ b/lib/igt_perf.c
>> @@ -69,3 +69,9 @@ int igt_perf_open(uint64_t type, uint64_t config)
>>   return _perf_open(type, config, -1,
>>     PERF_FORMAT_TOTAL_TIME_ENABLED);
>>   }
>> +
>> +int igt_perf_open_group(uint64_t type, uint64_t config, int group)
>> +{
>> +return _perf_open(type, config, group,
>> +  PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_GROUP);
>> +}
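For readers less familiar with perf group reads: with PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_GROUP, a read() on the group leader returns the number of events, then the enabled time, then one value per event. A minimal sketch of unpacking such a buffer follows. unpack_group_read() is a hypothetical helper, and returning 0 on a count mismatch is a choice made for this example; the patch's own pmu_read_multi() is not shown in this part of the thread and may well differ.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: unpack a buffer read from a perf group leader
 * opened with PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_GROUP.
 *
 * Layout: buf[0] = number of events in the group
 *         buf[1] = time enabled, in nanoseconds
 *         buf[2 + i] = value of the i-th event
 *
 * Copies the values into val[] and returns the timestamp, or returns 0
 * if the group does not hold the expected number of events.
 */
static uint64_t unpack_group_read(const uint64_t *buf, unsigned int expected,
				  uint64_t *val)
{
	assert(buf && val);

	if (buf[0] != expected)
		return 0;

	memcpy(val, &buf[2], expected * sizeof(*val));

	return buf[1];
}
```

This also explains the "2 + num_val" sizing seen in pmu_sample(): two header words precede the per-event values.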
>> diff --git a/lib/igt_perf.h b/lib/igt_perf.h
>> index 614ea5d23fa6..e00718f4769a 100644
>> --- a/lib/igt_perf.h
>> +++ b/lib/igt_perf.h
>> @@ -55,5 +55,6 @@ uint64_t i915_type_id(void);
>>   int perf_i915_open(uint64_t config);
>>   int perf_i915_open_group(uint64_t config, int group);
>>   int igt_perf_open(uint64_t type, uint64_t config);
>> +int igt_perf_open_group(uint64_t type, uint64_t config, int group);
>> 
>>   #endif /* I915_PERF_H */
>> diff --git a/man/intel_gpu_top.rst b/man/intel_gpu_top.rst
>> index a5f7175bb1a0..19c712307d28 100644
>> --- a/man/intel_gpu_top.rst
>> +++ b/man/intel_gpu_top.rst
>> @@ -7,9 +7,9 @@ Display a top-like summary of Intel GPU usage
>>   ---------------------------------------------
>>   .. include:: defs.rst
>>   :Author: IGT Developers < igt-dev@lists.freedesktop.org >
>> -:Date: 2016-03-01
>> +:Date: 2018-04-04
>>   :Version: |PACKAGE_STRING|
>> -:Copyright: 2009,2011,2012,2016 Intel Corporation
>> +:Copyright: 2009,2011,2012,2016,2018 Intel Corporation
>>   :Manual section: |MANUAL_SECTION|
>>   :Manual group: |MANUAL_GROUP|
>> 
>> @@ -21,42 +21,25 @@ SYNOPSIS
>>   DESCRIPTION
>>   ===========
>> 
>> -**intel_gpu_top** is a tool to display usage information of an Intel GPU. It
>> -requires root privilege to map the graphics device.
>> +**intel_gpu_top** is a tool to display usage information on Intel GPU's.
>> +
>> +The tool gathers data using perf performance counters (PMU) exposed by i915 and other platform drivers like RAPL (power) and Uncore IMC (memory bandwidth).
>> 
>>   OPTIONS
>>   =======
>> 
>> --s SAMPLES
>> -    Number of samples to acquire per second.
>> -
>> --o FILE
>> -    Collect usage statistics to FILE. If file is "-", run non-interactively
>> -    and output statistics to stdout.
>> -
>> --e COMMAND
>> -    Execute COMMAND to profile, and leave when it is finished. Note that the
>> -    entire command with all parameters should be included as one parameter.
>> +-s <ms>
>> +    Refresh period in milliseconds.
>> 
>>   -h
>> -    Show usage notes.
>> +    Show help text.
>> 
>> -EXAMPLES
>> -========
>> -
>> -intel_gpu_top -o "cairo-trace-gvim.log" -s 100 -e "cairo-perf-trace /tmp/gvim"
>> -    Run cairo-perf-trace with /tmp/gvim trace, non-interactively, saving the
>> -    statistics into cairo-trace-gvim.log file, and collecting 100 samples per
>> -    second.
>> -
>> -Note that idle units are not displayed, so an entirely idle GPU will only
>> -display the ring status and header.
>> +LIMITATIONS
>> +===========
>> 
>> -BUGS
>> -====
>> +* Not all metrics are supported on all platforms. Where a metric is unsupported it's value will be replaced by a dashed line.
>> 
>> -Some GPUs report some units as busy when they aren't, such that even when idle
>> -and not hung, it will show up as 100% busy.
>> +* Non-root access to perf counters is controlled by the *perf_event_paranoid* sysctl.
>> 
>>   REPORTING BUGS
>>   ==============
>> diff --git a/tools/Makefile.am b/tools/Makefile.am
>> index 09b6dbcc3ece..a0b016ddd7ff 100644
>> --- a/tools/Makefile.am
>> +++ b/tools/Makefile.am
>> @@ -28,6 +28,8 @@ intel_aubdump_la_LDFLAGS = -module -avoid-version -no-undefined
>>   intel_aubdump_la_SOURCES = aubdump.c
>>   intel_aubdump_la_LIBADD = $(top_builddir)/lib/libintel_tools.la -ldl
>> 
>> +intel_gpu_top_LDADD = $(top_builddir)/lib/libigt_perf.la
>> +
>>   bin_SCRIPTS = intel_aubdump
>>   CLEANFILES = $(bin_SCRIPTS)
>> 
>> diff --git a/tools/intel_gpu_top.c b/tools/intel_gpu_top.c
>> index 098e6ce3ff86..b923c3cfbe97 100644
>> --- a/tools/intel_gpu_top.c
>> +++ b/tools/intel_gpu_top.c
>> @@ -1,6 +1,5 @@
>>   /*
>> - * Copyright © 2007 Intel Corporation
>> - * Copyright © 2011 Intel Corporation
>> + * Copyright © 2007-2018 Intel Corporation
>>    *
>>    * Permission is hereby granted, free of charge, to any person obtaining a
>>    * copy of this software and associated documentation files (the "Software"),
>> @@ -24,695 +23,832 @@
>>    * Authors:
>>    *    Eric Anholt < eric@anholt.net >
>>    *    Eugeni Dodonov < eugeni.dodonov@intel.com >
>> - *
>>    */
>> 
>> -#include "config.h"
>> -
>> -#include <inttypes.h>
>> -#include <unistd.h>
>> -#include <stdlib.h>
>>   #include <stdio.h>
>> -#include <err.h>
>> -#include <sys/ioctl.h>
>> -#include <sys/time.h>
>> -#include <sys/wait.h>
>> +#include <sys/types.h>
>> +#include <dirent.h>
>> +#include <stdint.h>
>> +#include <assert.h>
>>   #include <string.h>
>> -#ifdef HAVE_TERMIOS_H
>> -#include <termios.h>
>> -#endif
>> -#include "intel_io.h"
>> -#include "instdone.h"
>> -#include "intel_reg.h"
>> -#include "intel_chipset.h"
>> -#include "drmtest.h"
>> -
>> -#define  FORCEWAKE    0xA18C
>> -#define  FORCEWAKE_ACK    0x130090
>> -
>> -#define SAMPLES_PER_SEC             10000
>> -#define SAMPLES_TO_PERCENT_RATIO    (SAMPLES_PER_SEC / 100)
>> -
>> -#define MAX_NUM_TOP_BITS            100
>> -
>> -#define HAS_STATS_REGS(devid)IS_965(devid)
>> -
>> -struct top_bit {
>> -struct instdone_bit *bit;
>> -int count;
>> -} top_bits[MAX_NUM_TOP_BITS];
>> -struct top_bit *top_bits_sorted[MAX_NUM_TOP_BITS];
>> -
>> -static uint32_t instdone, instdone1;
>> -
>> -static const char *bars[] = {
>> -" ",
>> -"▏",
>> -"▎",
>> -"▍",
>> -"▌",
>> -"▋",
>> -"▊",
>> -"▉",
>> -"█"
>> -};
>> +#include <ctype.h>
>> +#include <stdlib.h>
>> +#include <unistd.h>
>> +#include <sys/stat.h>
>> +#include <fcntl.h>
>> +#include <inttypes.h>
>> +#include <sys/ioctl.h>
>> +#include <errno.h>
>> +#include <math.h>
>> +#include <locale.h>
>> +
>> +#include "igt_perf.h"
>> 
>> -enum stats_counts {
>> -IA_VERTICES,
>> -IA_PRIMITIVES,
>> -VS_INVOCATION,
>> -GS_INVOCATION,
>> -GS_PRIMITIVES,
>> -CL_INVOCATION,
>> -CL_PRIMITIVES,
>> -PS_INVOCATION,
>> -PS_DEPTH,
>> -STATS_COUNT
>> +struct pmu_pair {
>> +uint64_t cur;
>> +uint64_t prev;
>>   };
>> 
>> -const uint32_t stats_regs[STATS_COUNT] = {
>> -IA_VERTICES_COUNT_QW,
>> -IA_PRIMITIVES_COUNT_QW,
>> -VS_INVOCATION_COUNT_QW,
>> -GS_INVOCATION_COUNT_QW,
>> -GS_PRIMITIVES_COUNT_QW,
>> -CL_INVOCATION_COUNT_QW,
>> -CL_PRIMITIVES_COUNT_QW,
>> -PS_INVOCATION_COUNT_QW,
>> -PS_DEPTH_COUNT_QW,
>> +struct pmu_counter {
>> +bool present;
>> +uint64_t config;
>> +unsigned int idx;
>> +struct pmu_pair val;
>>   };
>> 
>> -const char *stats_reg_names[STATS_COUNT] = {
>> -"vert fetch",
>> -"prim fetch",
>> -"VS invocations",
>> -"GS invocations",
>> -"GS prims",
>> -"CL invocations",
>> -"CL prims",
>> -"PS invocations",
>> -"PS depth pass",
>> +struct engine {
>> +const char *name;
>> +const char *display_name;
>> +
>> +unsigned int class;
>> +unsigned int instance;
>> +
>> +unsigned int num_counters;
>> +
>> +struct pmu_counter busy;
>> +struct pmu_counter wait;
>> +struct pmu_counter sema;
>>   };
>> 
>> -uint64_t stats[STATS_COUNT];
>> -uint64_t last_stats[STATS_COUNT];
>> +struct engines {
>> +unsigned int num_engines;
>> +unsigned int num_counters;
>> +DIR *root;
>> +int fd;
>> +struct pmu_pair ts;
>> +
>> +int rapl_fd;
>> +double rapl_scale;
>> +const char *rapl_unit;
>> +
>> +int imc_fd;
>> +double imc_reads_scale;
>> +const char *imc_reads_unit;
>> +double imc_writes_scale;
>> +const char *imc_writes_unit;
>> +
>> +struct pmu_counter freq_req;
>> +struct pmu_counter freq_act;
>> +struct pmu_counter irq;
>> +struct pmu_counter rc6;
>> +struct pmu_counter rapl;
>> +struct pmu_counter imc_reads;
>> +struct pmu_counter imc_writes;
>> +
>> +struct engine engine;
>> +};
>> 
>> -static unsigned long
>> -gettime(void)
>> +static uint64_t
>> +get_pmu_config(int dirfd, const char *name, const char *counter)
>>   {
>> -    struct timeval t;
>> -    gettimeofday(&t, NULL);
>> -    return (t.tv_usec + (t.tv_sec * 1000000));
>> -}
>> +char buf[128], *p;
>> +int fd, ret;
>> 
>> -static int
>> -top_bits_sort(const void *a, const void *b)
>> -{
>> -struct top_bit * const *bit_a = a;
>> -struct top_bit * const *bit_b = b;
>> -int a_count = (*bit_a)->count;
>> -int b_count = (*bit_b)->count;
>> +ret = snprintf(buf, sizeof(buf), "%s-%s", name, counter);
>> +if (ret < 0 || ret == sizeof(buf))
>> +return -1;
>> 
>> -if (a_count < b_count)
>> -return 1;
>> -else if (a_count == b_count)
>> -return 0;
>> -else
>> +fd = openat(dirfd, buf, O_RDONLY);
>> +if (fd < 0)
>>   return -1;
>> -}
>> 
>> -static void
>> -update_idle_bit(struct top_bit *top_bit)
>> -{
>> -uint32_t reg_val;
>> +ret = read(fd, buf, sizeof(buf));
>> +close(fd);
>> +if (ret <= 0)
>> +return -1;
>> 
>> -if (top_bit->bit->reg == INSTDONE_1)
>> -reg_val = instdone1;
>> -else
>> -reg_val = instdone;
>> +p = index(buf, '0');
>> +if (!p)
>> +return -1;
>> 
>> -if ((reg_val & top_bit->bit->bit) == 0)
>> -top_bit->count++;
>> +return strtoul(p, NULL, 0);
>>   }
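A note on the parsing in get_pmu_config(): index(buf, '0') picks the first '0' character anywhere in the file, buf is not explicitly NUL-terminated after read(), and strtoul() may truncate on 32-bit builds. A slightly more defensive variant is sketched below, assuming the sysfs event file contains a single "key=value" line such as "config=0x10002". Both that format and the helper name parse_event_config() are assumptions for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: extract the numeric value from a NUL-terminated
 * "key=value" string. Returns 0 and stores the value on success,
 * -1 if the text does not have the assumed shape.
 */
static int parse_event_config(const char *text, uint64_t *config)
{
	unsigned long long v;
	const char *p;
	char *end;

	assert(text && config);

	/* The value starts right after the first '='. */
	p = strchr(text, '=');
	if (!p)
		return -1;
	p++;

	/* Base 0 accepts both decimal and 0x-prefixed hexadecimal. */
	errno = 0;
	v = strtoull(p, &end, 0);
	if (errno || end == p)
		return -1;

	*config = v;

	return 0;
}
```

Returning the status separately from the value also avoids overloading -1 as an error marker in a uint64_t return.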
>> 
>> -static void
>> -print_clock(const char *name, int clock) {
>> -if (clock == -1)
>> -printf("%s clock: unknown", name);
>> +#define engine_ptr(engines, n) (&engines->engine + (n))
>> +
>> +static const char *class_display_name(unsigned int class)
>> +{
>> +switch (class) {
>> +case I915_ENGINE_CLASS_RENDER:
>> +return "Render/3D";
>> +case I915_ENGINE_CLASS_COPY:
>> +return "Blitter";
>> +case I915_ENGINE_CLASS_VIDEO:
>> +return "Video";
>> +case I915_ENGINE_CLASS_VIDEO_ENHANCE:
>> +return "VideoEnhance";
>> +default:
>> +return "[unknown]";
>> +}
>> +}
>> +
>> +static int engine_cmp(const void *__a, const void *__b)
>> +{
>> +const struct engine *a = (struct engine *)__a;
>> +const struct engine *b = (struct engine *)__b;
>> +
>> +if (a->class != b->class)
>> +return a->class - b->class;
>>   else
>> -printf("%s clock: %d Mhz", name, clock);
>> +return a->instance - b->instance;
>>   }
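engine_cmp() returns the difference of two unsigned ints, which is fine for the small class and instance numbers in play but relies on the implicit conversion to int. A comparator written with explicit comparisons sidesteps that; the sketch below uses a cut-down struct engine_key (hypothetical, just the two sort keys) so that it can be compiled and tried on its own.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reduced struct holding only the two sort keys. */
struct engine_key {
	unsigned int class;
	unsigned int instance;
};

/* Order by class first, then by instance, without unsigned subtraction. */
static int engine_key_cmp(const void *__a, const void *__b)
{
	const struct engine_key *a = __a;
	const struct engine_key *b = __b;

	if (a->class != b->class)
		return a->class < b->class ? -1 : 1;

	if (a->instance != b->instance)
		return a->instance < b->instance ? -1 : 1;

	return 0;
}

/* Convenience wrapper so callers do not repeat the qsort() arguments. */
static void sort_engine_keys(struct engine_key *keys, size_t count)
{
	assert(keys || count == 0);
	qsort(keys, count, sizeof(*keys), engine_key_cmp);
}
```

The same comparator body would drop into the qsort() call in discover_engines() with struct engine in place of the reduced struct.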
>> 
>> -static int
>> -print_clock_info(struct pci_device *pci_dev)
>> +static struct engines *discover_engines(void)
>>   {
>> -uint32_t devid = pci_dev->device_id;
>> -uint16_t gcfgc;
>> +const char *sysfs_root = "/sys/devices/i915/events";
>> +struct engines *engines;
>> +struct dirent *dent;
>> +int ret = 0;
>> +DIR *d;
>> 
>> -if (IS_GM45(devid)) {
>> -int core_clock = -1;
>> +engines = malloc(sizeof(struct engines));
>> +if (!engines)
>> +return NULL;
>> 
>> -pci_device_cfg_read_u16(pci_dev, &gcfgc, I915_GCFGC);
>> +memset(engines, 0, sizeof(*engines));
>> 
>> -switch (gcfgc & 0xf) {
>> -case 8:
>> -core_clock = 266;
>> -break;
>> -case 9:
>> -core_clock = 320;
>> -break;
>> -case 11:
>> -core_clock = 400;
>> -break;
>> -case 13:
>> -core_clock = 533;
>> -break;
>> -}
>> -print_clock("core", core_clock);
>> -} else if (IS_965(devid) && IS_MOBILE(devid)) {
>> -int render_clock = -1, sampler_clock = -1;
>> +engines->num_engines = 0;
>> 
>> -pci_device_cfg_read_u16(pci_dev, &gcfgc, I915_GCFGC);
>> +d = opendir(sysfs_root);
>> +if (!d)
>> +return NULL;
>> 
>> -switch (gcfgc & 0xf) {
>> -case 2:
>> -render_clock = 250; sampler_clock = 267;
>> -break;
>> -case 3:
>> -render_clock = 320; sampler_clock = 333;
>> -break;
>> -case 4:
>> -render_clock = 400; sampler_clock = 444;
>> -break;
>> -case 5:
>> -render_clock = 500; sampler_clock = 533;
>> +while ((dent = readdir(d)) != NULL) {
>> +const char *endswith = "-busy";
>> +const unsigned int endlen = strlen(endswith);
>> +struct engine *engine =
>> +engine_ptr(engines, engines->num_engines);
>> +char buf[256];
>> +
>> +if (dent->d_type != DT_REG)
>> +continue;
>> +
>> +if (strlen(dent->d_name) >= sizeof(buf)) {
>> +ret = ENAMETOOLONG;
>>   break;
>>   }
>> 
>> -print_clock("render", render_clock);
>> -printf("  ");
>> -print_clock("sampler", sampler_clock);
>> -} else if (IS_945(devid) && IS_MOBILE(devid)) {
>> -int render_clock = -1, display_clock = -1;
>> +strcpy(buf, dent->d_name);
>> 
>> -pci_device_cfg_read_u16(pci_dev, &gcfgc, I915_GCFGC);
>> +/* xxxN-busy */
>> +if (strlen(buf) < (endlen + 4))
>> +continue;
>> +if (strcmp(&buf[strlen(buf) - endlen], endswith))
>> +continue;
>> 
>> -switch (gcfgc & 0x7) {
>> -case 0:
>> -render_clock = 166;
>> -break;
>> -case 1:
>> -render_clock = 200;
>> -break;
>> -case 3:
>> -render_clock = 250;
>> -break;
>> -case 5:
>> -render_clock = 400;
>> +memset(engine, 0, sizeof(*engine));
>> +
>> +buf[strlen(buf) - endlen] = 0;
>> +engine->name = strdup(buf);
>> +if (!engine->name) {
>> +ret = errno;
>>   break;
>>   }
>> 
>> -switch (gcfgc & 0x70) {
>> -case 0:
>> -display_clock = 200;
>> -break;
>> -case 4:
>> -display_clock = 320;
>> +engine->busy.config = get_pmu_config(dirfd(d), engine->name,
>> +     "busy");
>> +if (engine->busy.config == -1) {
>> +ret = ENOENT;
>>   break;
>>   }
>> -if (gcfgc & (1 << 7))
>> -    display_clock = 133;
>> 
>> -print_clock("render", render_clock);
>> -printf("  ");
>> -print_clock("display", display_clock);
>> -} else if (IS_915(devid) && IS_MOBILE(devid)) {
>> -int render_clock = -1, display_clock = -1;
>> +engine->class = (engine->busy.config &
>> + (__I915_PMU_OTHER(0) - 1)) >>
>> +I915_PMU_CLASS_SHIFT;
>> 
>> -pci_device_cfg_read_u16(pci_dev, &gcfgc, I915_GCFGC);
>> +engine->instance = (engine->busy.config >>
>> +    I915_PMU_SAMPLE_BITS) &
>> +    ((1 << I915_PMU_SAMPLE_INSTANCE_BITS) - 1);
>> 
>> -switch (gcfgc & 0x7) {
>> -case 0:
>> -render_clock = 160;
>> -break;
>> -case 1:
>> -render_clock = 190;
>> -break;
>> -case 4:
>> -render_clock = 333;
>> +ret = snprintf(buf, sizeof(buf), "%s/%u",
>> +       class_display_name(engine->class),
>> +       engine->instance);
>> +if (ret < 0 || ret == sizeof(buf)) {
>> +ret = ENOBUFS;
>>   break;
>>   }
>> -if (gcfgc & (1 << 13))
>> -    render_clock = 133;
>> +ret = 0;
>> 
>> -switch (gcfgc & 0x70) {
>> -case 0:
>> -display_clock = 190;
>> +engine->display_name = strdup(buf);
>> +if (!engine->display_name) {
>> +ret = errno;
>>   break;
>> -case 4:
>> -display_clock = 333;
>> +}
>> +
>> +engines->num_engines++;
>> +engines = realloc(engines, sizeof(struct engines) +
>> +  engines->num_engines * sizeof(struct engine));
>> +if (!engines) {
>> +ret = errno;
>>   break;
>>   }
>> -if (gcfgc & (1 << 7))
>> -    display_clock = 133;
>> +}
>> +
>> +if (ret) {
>> +free(engines);
>> +errno = ret;
>> 
>> -print_clock("render", render_clock);
>> -printf("  ");
>> -print_clock("display", display_clock);
>> +return NULL;
>>   }
>> 
>> +qsort(engine_ptr(engines, 0), engines->num_engines,
>> +      sizeof(struct engine), engine_cmp);
>> +
>> +engines->root = d;
>> 
>> -printf("\n");
>> -return -1;
>> +return engines;
>>   }
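
For anyone following the "xxxN-busy" matching in the discovery loop above, here is a minimal standalone sketch of the same suffix check. strip_suffix() and its min_stem parameter are hypothetical names used only for illustration; they are not part of the patch.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not from the patch: if 'name' ends with 'suffix'
 * and has at least 'min_stem' characters in front of it, return a
 * malloc'ed copy of the stem. Otherwise (or on allocation failure)
 * return NULL. The caller frees the result.
 */
static char *strip_suffix(const char *name, const char *suffix,
			  size_t min_stem)
{
	size_t len = strlen(name);
	size_t slen = strlen(suffix);
	char *stem;

	/* Too short to hold both the stem and the suffix. */
	if (len < slen + min_stem)
		return NULL;

	/* Compare only the tail of the name against the suffix. */
	if (strcmp(name + len - slen, suffix) != 0)
		return NULL;

	stem = malloc(len - slen + 1);
	if (!stem)
		return NULL;

	memcpy(stem, name, len - slen);
	stem[len - slen] = '\0';

	return stem;
}
```

With min_stem set to 4, as in the patch, strip_suffix("rcs0-busy", "-busy", 4) yields "rcs0", while "rcs0-wait" and "x-busy" are both rejected.
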
>> 
>> -#define STATS_LEN (20)
>> -#define PERCENTAGE_BAR_END(79 - STATS_LEN)
>> +static int
>> +filename_to_buf(const char *filename, char *buf, unsigned int bufsize)
>> +{
>> +int fd, err;
>> +ssize_t ret;
>> 
>> -static void
>> -print_percentage_bar(float percent, int cur_line_len)
>> +fd = open(filename, O_RDONLY);
>> +if (fd < 0)
>> +return -1;
>> +
>> +ret = read(fd, buf, bufsize - 1);
>> +err = errno;
>> +close(fd);
>> +if (ret < 1) {
>> +errno = ret < 0 ? err : ENOMSG;
>> +
>> +return -1;
>> +}
>> +
>> +if (ret > 1 && buf[ret - 1] == '\n')
>> +buf[ret - 1] = '\0';
>> +else
>> +buf[ret] = '\0';
>> +
>> +return 0;
>> +}
>> +
>> +static uint64_t filename_to_u64(const char *filename, int base)
>>   {
>> -int bar_avail_len = (PERCENTAGE_BAR_END - cur_line_len - 1) * 8;
>> -int bar_len = bar_avail_len * (percent + .5) / 100.0;
>> -int i;
>> +char buf[64], *b;
>> 
>> -for (i = bar_len; i >= 8; i -= 8) {
>> -printf("%s", bars[8]);
>> -cur_line_len++;
>> +if (filename_to_buf(filename, buf, sizeof(buf)))
>> +return 0;
>> +
>> +/*
>> + * Handle both single integer and key=value formats by skipping
>> + * leading non-digits.
>> + */
>> +b = buf;
>> +while (*b && !isdigit(*b))
>> +b++;
>> +
>> +return strtoull(b, NULL, base);
>> +}
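
To illustrate the "skip leading non-digits" approach described in the comment above, a small sketch that works on a string instead of a file. parse_u64_value() is a hypothetical name, and the cast to unsigned char is my own defensive addition rather than something the patch does.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not from the patch: parse the first number found
 * in 's', which may be a bare integer ("18") or a key=value pair
 * ("event=0x01"). Base 0 lets strtoull() detect a 0x prefix itself.
 */
static uint64_t parse_u64_value(const char *s, int base)
{
	/* Skip everything up to the first decimal digit. */
	while (*s && !isdigit((unsigned char)*s))
		s++;

	/* strtoull() stops at the first character it cannot convert. */
	return strtoull(s, NULL, base);
}
```

One caveat of this approach: a key that itself contains a digit (for example a made-up "cfg1=5") would be parsed as 1 rather than 5, so it only suits files whose keys are purely alphabetic.
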
>> +
>> +static double filename_to_double(const char *filename)
>> +{
>> +char *oldlocale;
>> +char buf[80];
>> +double v;
>> +
>> +if (filename_to_buf(filename, buf, sizeof(buf)))
>> +return 0;
>> +
>> +oldlocale = setlocale(LC_ALL, "C");
>> +v = strtod(buf, NULL);
>> +setlocale(LC_ALL, oldlocale);
>> +
>> +return v;
>> +}
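
A note on the locale handling above, offered as a sketch rather than a required change. As far as I read the C standard, setlocale(LC_ALL, "C") returns the string for the newly set locale, not the previous one, and the returned string may be overwritten by a later setlocale() call. A defensive variant queries the current locale first and keeps its own copy. strtod_c_locale() is a hypothetical name.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not from the patch: parse a double using the "C"
 * locale (so '.' is always the decimal separator), then restore whatever
 * locale was active before the call.
 */
static double strtod_c_locale(const char *s)
{
	/* A NULL locale argument only queries the current setting. */
	const char *cur = setlocale(LC_ALL, NULL);
	char *saved = NULL;
	double v;

	if (cur) {
		size_t n = strlen(cur) + 1;

		/* Private copy, since setlocale() may reuse its buffer. */
		saved = malloc(n);
		if (saved)
			memcpy(saved, cur, n);
	}

	setlocale(LC_ALL, "C");
	v = strtod(s, NULL);

	if (saved) {
		setlocale(LC_ALL, saved);
		free(saved);
	}

	return v;
}
```

If the tool never switches away from the default "C" locale, the difference is not observable, so this only matters if a setlocale(LC_ALL, "") call is added at some point.
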
>> +
>> +#define RAPL_ROOT "/sys/devices/power/"
>> +#define RAPL_EVENT "/sys/devices/power/events/"
>> +
>> +static uint64_t rapl_type_id(void)
>> +{
>> +return filename_to_u64(RAPL_ROOT "type", 10);
>> +}
>> +
>> +static uint64_t rapl_gpu_power(void)
>> +{
>> +return filename_to_u64(RAPL_EVENT "energy-gpu", 0);
>> +}
>> +
>> +static double rapl_gpu_power_scale(void)
>> +{
>> +return filename_to_double(RAPL_EVENT "energy-gpu.scale");
>> +}
>> +
>> +static const char *rapl_gpu_power_unit(void)
>> +{
>> +char buf[32];
>> +
>> +if (filename_to_buf(RAPL_EVENT "energy-gpu.unit",
>> +    buf, sizeof(buf)) == 0)
>> +if (!strcmp(buf, "Joules"))
>> +return strdup("Watts");
>> +else
>> +return strdup(buf);
>> +else
>> +return NULL;
>> +}
>> +
>> +#define IMC_ROOT "/sys/devices/uncore_imc/"
>> +#define IMC_EVENT "/sys/devices/uncore_imc/events/"
>> +
>> +static uint64_t imc_type_id(void)
>> +{
>> +return filename_to_u64(IMC_ROOT "type", 10);
>> +}
>> +
>> +static uint64_t imc_data_reads(void)
>> +{
>> +return filename_to_u64(IMC_EVENT "data_reads", 0);
>> +}
>> +
>> +static double imc_data_reads_scale(void)
>> +{
>> +return filename_to_double(IMC_EVENT "data_reads.scale");
>> +}
>> +
>> +static const char *imc_data_reads_unit(void)
>> +{
>> +char buf[32];
>> +
>> +if (filename_to_buf(IMC_EVENT "data_reads.unit", buf, sizeof(buf)) == 0)
>> +return strdup(buf);
>> +else
>> +return NULL;
>> +}
>> +
>> +static uint64_t imc_data_writes(void)
>> +{
>> +return filename_to_u64(IMC_EVENT "data_writes", 0);
>> +}
>> +
>> +static double imc_data_writes_scale(void)
>> +{
>> +return filename_to_double(IMC_EVENT "data_writes.scale");
>> +}
>> +
>> +static const char *imc_data_writes_unit(void)
>> +{
>> +char buf[32];
>> +
>> +if (filename_to_buf(IMC_EVENT "data_writes.unit",
>> +    buf, sizeof(buf)) == 0)
>> +return strdup(buf);
>> +else
>> +return NULL;
>> +}
>> +
>> +#define _open_pmu(cnt, pmu, fd) \
>> +({ \
>> +int fd__; \
>> +\
>> +fd__ = perf_i915_open_group((pmu)->config, (fd)); \
>> +if (fd__ >= 0) { \
>> +if ((fd) == -1) \
>> +(fd) = fd__; \
>> +(pmu)->present = true; \
>> +(pmu)->idx = (cnt)++; \
>> +} \
>> +\
>> +fd__; \
>> +})
>> +
>> +#define _open_imc(cnt, pmu, fd) \
>> +({ \
>> +int fd__; \
>> +\
>> +fd__ = igt_perf_open_group(imc_type_id(), (pmu)->config, (fd)); \
>> +if (fd__ >= 0) { \
>> +if ((fd) == -1) \
>> +(fd) = fd__; \
>> +(pmu)->present = true; \
>> +(pmu)->idx = (cnt)++; \
>> +} \
>> +\
>> +fd__; \
>> +})
>> +
>> +static int pmu_init(struct engines *engines)
>> +{
>> +unsigned int i;
>> +int fd;
>> +
>> +engines->fd = -1;
>> +engines->num_counters = 0;
>> +
>> +engines->irq.config = I915_PMU_INTERRUPTS;
>> +fd = _open_pmu(engines->num_counters, &engines->irq, engines->fd);
>> +if (fd < 0)
>> +return -1;
>> +
>> +engines->freq_req.config = I915_PMU_REQUESTED_FREQUENCY;
>> +_open_pmu(engines->num_counters, &engines->freq_req, engines->fd);
>> +
>> +engines->freq_act.config = I915_PMU_ACTUAL_FREQUENCY;
>> +_open_pmu(engines->num_counters, &engines->freq_act, engines->fd);
>> +
>> +engines->rc6.config = I915_PMU_RC6_RESIDENCY;
>> +_open_pmu(engines->num_counters, &engines->rc6, engines->fd);
>> +
>> +for (i = 0; i < engines->num_engines; i++) {
>> +struct engine *engine = engine_ptr(engines, i);
>> +struct {
>> +struct pmu_counter *pmu;
>> +const char *counter;
>> +} *cnt, counters[] = {
>> +{ .pmu = &engine->busy, .counter = "busy" },
>> +{ .pmu = &engine->wait, .counter = "wait" },
>> +{ .pmu = &engine->sema, .counter = "sema" },
>> +{ .pmu = NULL, .counter = NULL },
>> +};
>> +
>> +for (cnt = counters; cnt->pmu; cnt++) {
>> +if (!cnt->pmu->config)
>> +cnt->pmu->config =
>> +get_pmu_config(dirfd(engines->root),
>> +       engine->name,
>> +       cnt->counter);
>> +fd = _open_pmu(engines->num_counters, cnt->pmu,
>> +       engines->fd);
>> +if (fd >= 0)
>> +engine->num_counters++;
>> +}
>>   }
>> -if (i) {
>> -printf("%s", bars[i]);
>> -cur_line_len++;
>> +
>> +engines->rapl_fd = -1;
>> +if (rapl_type_id()) {
>> +engines->rapl_scale = rapl_gpu_power_scale();
>> +engines->rapl_unit = rapl_gpu_power_unit();
>> +if (!engines->rapl_unit)
>> +return -1;
>> +
>> +engines->rapl.config = rapl_gpu_power();
>> +if (!engines->rapl.config)
>> +return -1;
>> +
>> +engines->rapl_fd = igt_perf_open(rapl_type_id(),
>> + engines->rapl.config);
>> +if (engines->rapl_fd < 0)
>> +return -1;
>> +
>> +engines->rapl.present = true;
>>   }
>> 
>> -/* NB: We can't use a field width with utf8 so we manually
>> -* guarantee a field with of 45 chars for any bar. */
>> -printf("%*s", PERCENTAGE_BAR_END - cur_line_len, "");
>> -}
>> +engines->imc_fd = -1;
>> +if (imc_type_id()) {
>> +unsigned int num = 0;
>> 
>> -struct ring {
>> -const char *name;
>> -uint32_t mmio;
>> -int head, tail, size;
>> -uint64_t full;
>> -int idle;
>> -};
>> +engines->imc_reads_scale = imc_data_reads_scale();
>> +engines->imc_writes_scale = imc_data_writes_scale();
>> +
>> +engines->imc_reads_unit = imc_data_reads_unit();
>> +if (!engines->imc_reads_unit)
>> +return -1;
>> +
>> +engines->imc_writes_unit = imc_data_writes_unit();
>> +if (!engines->imc_writes_unit)
>> +return -1;
>> +
>> +engines->imc_reads.config = imc_data_reads();
>> +if (!engines->imc_reads.config)
>> +return -1;
>> +
>> +engines->imc_writes.config = imc_data_writes();
>> +if (!engines->imc_writes.config)
>> +return -1;
>> +
>> +fd = _open_imc(num, &engines->imc_reads, engines->imc_fd);
>> +if (fd < 0)
>> +return -1;
>> +fd = _open_imc(num, &engines->imc_writes, engines->imc_fd);
>> +if (fd < 0)
>> +return -1;
>> +
>> +engines->imc_reads.present = true;
>> +engines->imc_writes.present = true;
>> +}
>> +
>> +return 0;
>> +}
>> 
>> -static uint32_t ring_read(struct ring *ring, uint32_t reg)
>> +static uint64_t pmu_read_multi(int fd, unsigned int num, uint64_t *val)
>>   {
>> -return INREG(ring->mmio + reg);
>> +uint64_t buf[2 + num];
>> +unsigned int i;
>> +ssize_t len;
>> +
>> +memset(buf, 0, sizeof(buf));
>> +
>> +len = read(fd, buf, sizeof(buf));
>> +assert(len == sizeof(buf));
>> +
>> +for (i = 0; i < num; i++)
>> +val[i] = buf[2 + i];
>> +
>> +return buf[1];
>>   }
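
For readers unfamiliar with the buffer layout pmu_read_multi() relies on: with PERF_FORMAT_GROUP | PERF_FORMAT_TOTAL_TIME_ENABLED, a group read returns the number of events, then the enabled time, then one value per event. The sketch below decodes such a buffer from memory. decode_group_read() is a hypothetical name, and the check of the event count is my addition; the patch asserts on the read length instead.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not from the patch. Expected layout in 'buf':
 *   buf[0]      number of events in the group
 *   buf[1]      total time enabled, in nanoseconds
 *   buf[2 + i]  value of event i
 * Copies 'num' values into 'val' and returns the enabled time, or 0 if
 * the buffer is too small or reports a different event count.
 */
static uint64_t decode_group_read(const uint64_t *buf, size_t buf_words,
				  unsigned int num, uint64_t *val)
{
	unsigned int i;

	if (buf_words < 2 + (size_t)num || buf[0] != num)
		return 0;

	for (i = 0; i < num; i++)
		val[i] = buf[2 + i];

	return buf[1];
}
```

The returned enabled time is what the patch uses as the sample timestamp, which is why the same read provides both the counter values and the time base for the rate calculation.
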
>> 
>> -static void ring_init(struct ring *ring)
>> +static double __pmu_calc(struct pmu_pair *p, double d, double t, double s)
>>   {
>> -ring->size = (((ring_read(ring, RING_LEN) & RING_NR_PAGES) >> 12) + 1) * 4096;
>> +double v;
>> +
>> +v = p->cur - p->prev;
>> +v /= d;
>> +v /= t;
>> +v *= s;
>> +
>> +if (s == 100.0 && v > 100.0)
>> +v = 100.0;
>> +
>> +return v;
>>   }
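
A worked example of the arithmetic in __pmu_calc() may help here: the counter delta is divided by the unit divisor d and the elapsed time t, then multiplied by the scale s, with a clamp applied only for percentages. The sketch below restates it as a free function; pmu_rate() is a hypothetical name and takes the two samples directly instead of a struct pmu_pair.

```c
#include <stdint.h>

/*
 * Hypothetical restatement of __pmu_calc(), not part of the patch.
 *   prev, cur  two successive counter samples
 *   d          divisor converting counter units (e.g. 1e9 for ns)
 *   t          elapsed time between the samples, in seconds
 *   s          output scale (100 for percent, or a sysfs scale factor)
 */
static double pmu_rate(uint64_t prev, uint64_t cur,
		       double d, double t, double s)
{
	double v = (double)(cur - prev);

	v /= d;
	v /= t;
	v *= s;

	/* Percentages can slightly overshoot due to sampling jitter. */
	if (s == 100.0 && v > 100.0)
		v = 100.0;

	return v;
}
```

For instance, an engine busy counter that advanced by 250,000,000 ns during a 0.5 s period gives 250e6 / 1e9 / 0.5 * 100 = 50% busy, and a delta of 2 s worth of busy time over a 1 s period is clamped to 100%.
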
>> 
>> -static void ring_reset(struct ring *ring)
>> +static void fill_str(char *buf, unsigned int bufsz, char c, unsigned int num)
>>   {
>> -ring->idle = ring->full = 0;
>> +unsigned int i;
>> +
>> +for (i = 0; i < num && i < (bufsz - 1); i++)
>> +*buf++ = c;
>> +
>> +*buf = 0;
>>   }
>> 
>> -static void ring_sample(struct ring *ring)
>> +static void pmu_calc(struct pmu_counter *cnt,
>> +     char *buf, unsigned int bufsz,
>> +     unsigned int width, unsigned width_dec,
>> +     double d, double t, double s)
>>   {
>> -int full;
>> +double val;
>> +int len;
>> +
>> +assert(bufsz >= (width + width_dec + 1));
>> +
>> +if (!cnt->present) {
>> +fill_str(buf, bufsz, '-', width + width_dec);
>> +return;
>> +}
>> 
>> -if (!ring->size)
>> +val = __pmu_calc(&cnt->val, d, t, s);
>> +
>> +len = snprintf(buf, bufsz, "%*.*f", width + width_dec, width_dec, val);
>> +if (len < 0 || len == bufsz) {
>> +fill_str(buf, bufsz, 'X', width + width_dec);
>>   return;
>> +}
>> +}
>> +
>> +static uint64_t __pmu_read_single(int fd, uint64_t *ts)
>> +{
>> +uint64_t data[2] = { };
>> +ssize_t len;
>> 
>> -ring->head = ring_read(ring, RING_HEAD) & HEAD_ADDR;
>> -ring->tail = ring_read(ring, RING_TAIL) & TAIL_ADDR;
>> +len = read(fd, data, sizeof(data));
>> +assert(len == sizeof(data));
>> 
>> -if (ring->tail == ring->head)
>> -ring->idle++;
>> +if (ts)
>> +*ts = data[1];
>> 
>> -full = ring->tail - ring->head;
>> -if (full < 0)
>> -full += ring->size;
>> -ring->full += full;
>> +return data[0];
>>   }
>> 
>> -static void ring_print_header(FILE *out, struct ring *ring)
>> +static uint64_t pmu_read_single(int fd)
>>   {
>> -    fprintf(out, "%.6s%%\tops\t",
>> -            ring->name
>> -          );
>> +return __pmu_read_single(fd, NULL);
>>   }
>> 
>> -static void ring_print(struct ring *ring, unsigned long samples_per_sec)
>> +static void __update_sample(struct pmu_counter *counter, uint64_t val)
>>   {
>> -int percent_busy, len;
>> +counter->val.prev = counter->val.cur;
>> +counter->val.cur = val;
>> +}
>> 
>> -if (!ring->size)
>> -return;
>> +static void update_sample(struct pmu_counter *counter, uint64_t *val)
>> +{
>> +if (counter->present)
>> +__update_sample(counter, val[counter->idx]);
>> +}
>> +
>> +static void pmu_sample(struct engines *engines)
>> +{
>> +const int num_val = engines->num_counters;
>> +uint64_t val[2 + num_val];
>> +unsigned int i;
>> +
>> +engines->ts.prev = engines->ts.cur;
>> +
>> +if (engines->rapl_fd >= 0)
>> +__update_sample(&engines->rapl,
>> +pmu_read_single(engines->rapl_fd));
>> +
>> +if (engines->imc_fd >= 0) {
>> +pmu_read_multi(engines->imc_fd, 2, val);
>> +update_sample(&engines->imc_reads, val);
>> +update_sample(&engines->imc_writes, val);
>> +}
>> 
>> -percent_busy = 100 - 100 * ring->idle / samples_per_sec;
>> +engines->ts.cur = pmu_read_multi(engines->fd, num_val, val);
>> 
>> -len = printf("%25s busy: %3d%%: ", ring->name, percent_busy);
>> -print_percentage_bar (percent_busy, len);
>> -printf("%24s space: %d/%d\n",
>> -   ring->name,
>> -   (int)(ring->full / samples_per_sec),
>> -   ring->size);
>> +update_sample(&engines->freq_req, val);
>> +update_sample(&engines->freq_act, val);
>> +update_sample(&engines->irq, val);
>> +update_sample(&engines->rc6, val);
>> +
>> +for (i = 0; i < engines->num_engines; i++) {
>> +struct engine *engine = engine_ptr(engines, i);
>> +
>> +update_sample(&engine->busy, val);
>> +update_sample(&engine->sema, val);
>> +update_sample(&engine->wait, val);
>> +}
>>   }
>> 
>> -static void ring_log(struct ring *ring, unsigned long samples_per_sec,
>> -FILE *output)
>> +static const char *bars[] = { " ", "▏", "▎", "▍", "▌", "▋", "▊", "▉", "█" };
>> +
>> +static void
>> +print_percentage_bar(double percent, int max_len)
>>   {
>> -if (ring->size)
>> -fprintf(output, "%3d\t%d\t",
>> -(int)(100 - 100 * ring->idle / samples_per_sec),
>> -(int)(ring->full / samples_per_sec));
>> -else
>> -fprintf(output, "-1\t-1\t");
>> +int bar_len = percent * (8 * (max_len - 2)) / 100.0;
>> +int i;
>> +
>> +putchar('|');
>> +
>> +for (i = bar_len; i >= 8; i -= 8)
>> +printf("%s", bars[8]);
>> +if (i)
>> +printf("%s", bars[i]);
>> +
>> +for (i = 0; i < (max_len - 2 - (bar_len + 7) / 8); i++)
>> +putchar(' ');
>> +
>> +putchar('|');
>>   }
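
To make the bar arithmetic above easier to check, here is a sketch that computes the same three quantities without printing anything: the number of full block glyphs, the index of the partial glyph (in eighths of a cell), and the trailing padding. struct bar_layout and compute_bar_layout() are hypothetical names introduced only for this illustration.

```c
/* Hypothetical types and helper, not part of the patch. */
struct bar_layout {
	int full;	/* number of full-width block glyphs */
	int partial;	/* 0 for none, else index 1..7 into bars[] */
	int pad;	/* spaces before the closing '|' */
};

static struct bar_layout compute_bar_layout(double percent, int max_len)
{
	struct bar_layout l;
	/* Two columns are taken by the '|' delimiters. */
	int inner = max_len - 2;
	/* Bar length in eighths of a character cell, truncated. */
	int bar_len = percent * (8 * inner) / 100.0;

	l.full = bar_len / 8;
	l.partial = bar_len % 8;
	/* A partial glyph still occupies one whole cell. */
	l.pad = inner - (bar_len + 7) / 8;

	return l;
}
```

With max_len = 12 there are 10 inner cells: 50% gives 5 full blocks and 5 spaces, while 33% gives 3 full blocks, a 2/8 partial glyph and 6 spaces, so the bar always fills exactly 10 cells. The sketch assumes max_len is at least 2, which very narrow terminals may not guarantee.
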
>> 
>> +#define DEFAULT_PERIOD_MS (1000)
>> +
>>   static void
>>   usage(const char *appname)
>>   {
>>   printf("intel_gpu_top - Display a top-like summary of Intel GPU usage\n"
>> -"\n"
>> -"usage: %s [parameters]\n"
>> -"\n"
>> -"The following parameters apply:\n"
>> -"[-s <samples>]       samples per seconds (default %d)\n"
>> -"[-e <command>]       command to profile\n"
>> -"[-o <file>]          output statistics to file. If file is '-',"
>> -"                     run in batch mode and output statistics to stdio only \n"
>> -"[-h]                 show this help screen\n"
>> -"\n",
>> -appname,
>> -SAMPLES_PER_SEC
>> -  );
>> -return;
>> +"\n"
>> +"Usage: %s [parameters]\n"
>> +"\n"
>> +"\tThe following parameters are optional:\n\n"
>> +"\t[-s <ms>]       Refresh period in milliseconds (default %ums).\n"
>> +"\t[-h]            Show this help text.\n"
>> +"\n",
>> +appname, DEFAULT_PERIOD_MS);
>>   }
>> 
>>   int main(int argc, char **argv)
>>   {
>> -uint32_t devid;
>> -struct pci_device *pci_dev;
>> -struct ring render_ring = {
>> -.name = "render",
>> -.mmio = 0x2030,
>> -}, bsd_ring = {
>> -.name = "bitstream",
>> -.mmio = 0x4030,
>> -}, bsd6_ring = {
>> -.name = "bitstream",
>> -.mmio = 0x12030,
>> -}, blt_ring = {
>> -.name = "blitter",
>> -.mmio = 0x22030,
>> -};
>> -int i, ch;
>> -int samples_per_sec = SAMPLES_PER_SEC;
>> -FILE *output = NULL;
>> -double elapsed_time=0;
>> -int print_headers=1;
>> -pid_t child_pid=-1;
>> -int child_stat;
>> -char *cmd=NULL;
>> -int interactive=1;
>> -
>> -/* Parse options? */
>> -while ((ch = getopt(argc, argv, "s:o:e:h")) != -1) {
>> +unsigned int period_us = DEFAULT_PERIOD_MS * 1000;
>> +int con_w = -1, con_h = -1;
>> +struct engines *engines;
>> +unsigned int i;
>> +int ret, ch;
>> +
>> +/* Parse options */
>> +while ((ch = getopt(argc, argv, "s:h")) != -1) {
>>   switch (ch) {
>> -case 'e': cmd = strdup(optarg);
>> -break;
>> -case 's': samples_per_sec = atoi(optarg);
>> -if (samples_per_sec < 100) {
>> -fprintf(stderr, "Error: samples per second must be >= 100\n");
>> -exit(1);
>> -}
>> -break;
>> -case 'o':
>> -if (!strcmp(optarg, "-")) {
>> -/* Running in non-interactive mode */
>> -interactive = 0;
>> -output = stdout;
>> -}
>> -else
>> -output = fopen(optarg, "w");
>> -if (!output)
>> -{
>> -perror("fopen");
>> -exit(1);
>> -}
>> +case 's':
>> +period_us = atoi(optarg) * 1000;
>>   break;
>>   case 'h':
>>   usage(argv[0]);
>>   exit(0);
>> -break;
>>   default:
>> -fprintf(stderr, "Invalid flag %c!\n", (char)optopt);
>> +fprintf(stderr, "Invalid option %c!\n", (char)optopt);
>>   usage(argv[0]);
>>   exit(1);
>> -break;
>>   }
>>   }
>> 
>> -pci_dev = intel_get_pci_device();
>> -devid = pci_dev->device_id;
>> -intel_mmio_use_pci_bar(pci_dev);
>> -init_instdone_definitions(devid);
>> -
>> -/* Do we have a command to run? */
>> -if (cmd != NULL) {
>> -if (output) {
>> -fprintf(output, "# Profiling: %s\n", cmd);
>> -fflush(output);
>> -}
>> -child_pid = fork();
>> -if (child_pid < 0) {
>> -perror("fork");
>> -exit(1);
>> -}
>> -else if (child_pid == 0) {
>> -int res;
>> -res = system(cmd);
>> -if (res < 0)
>> -perror("running command");
>> -if (output) {
>> -fflush(output);
>> -fprintf(output, "# %s exited with status %d\n", cmd, res);
>> -fflush(output);
>> -}
>> -free(cmd);
>> -exit(0);
>> -} else {
>> -free(cmd);
>> -}
>> +engines = discover_engines();
>> +if (!engines) {
>> +fprintf(stderr,
>> +"Failed to detect engines! (%s)\n(Kernel 4.16 or newer is required for i915 PMU support.)\n",
>> +strerror(errno));
>> +return 1;
>>   }
>> 
>> -for (i = 0; i < num_instdone_bits; i++) {
>> -top_bits[i].bit = &instdone_bits[i];
>> -top_bits[i].count = 0;
>> -top_bits_sorted[i] = &top_bits[i];
>> +ret = pmu_init(engines);
>> +if (ret) {
>> +fprintf(stderr,
>> +"Failed to initialize PMU! (%s)\n", strerror(errno));
>> +return 1;
>>   }
>> 
>> -/* Grab access to the registers */
>> -intel_register_access_init(pci_dev, 0, -1);
>> +pmu_sample(engines);
>> 
>> -ring_init(&render_ring);
>> -if (IS_GEN4(devid) || IS_GEN5(devid))
>> -ring_init(&bsd_ring);
>> -if (IS_GEN6(devid) || IS_GEN7(devid)) {
>> -ring_init(&bsd6_ring);
>> -ring_init(&blt_ring);
>> -}
>> +for (;;) {
>> +double t;
>> +#define BUFSZ 16
>> +char freq[BUFSZ];
>> +char fact[BUFSZ];
>> +char irq[BUFSZ];
>> +char rc6[BUFSZ];
>> +char power[BUFSZ];
>> +char reads[BUFSZ];
>> +char writes[BUFSZ];
>> +struct winsize ws;
>> +int lines = 0;
>> 
>> -/* Initialize GPU stats */
>> -if (HAS_STATS_REGS(devid)) {
>> -for (i = 0; i < STATS_COUNT; i++) {
>> -uint32_t stats_high, stats_low, stats_high_2;
>> +/* Update terminal size. */
>> +if (ioctl(0, TIOCGWINSZ, &ws) != -1) {
>> +con_w = ws.ws_col;
>> +con_h = ws.ws_row;
>> +}
>> 
>> -do {
>> -stats_high = INREG(stats_regs[i] + 4);
>> -stats_low = INREG(stats_regs[i]);
>> -stats_high_2 = INREG(stats_regs[i] + 4);
>> -} while (stats_high != stats_high_2);
>> +pmu_sample(engines);
>> +t = (double)(engines->ts.cur - engines->ts.prev) / 1e9;
>> 
>> -last_stats[i] = (uint64_t)stats_high << 32 |
>> -stats_low;
>> -}
>> -}
>> +printf("\033[H\033[J");
>> 
>> -for (;;) {
>> -int j;
>> -unsigned long long t1, ti, tf, t2;
>> -unsigned long long def_sleep = 1000000 / samples_per_sec;
>> -unsigned long long last_samples_per_sec = samples_per_sec;
>> -unsigned short int max_lines;
>> -struct winsize ws;
>> -char clear_screen[] = {0x1b, '[', 'H',
>> -       0x1b, '[', 'J',
>> -       0x0};
>> -int percent;
>> -int len;
>> -
>> -t1 = gettime();
>> -
>> -ring_reset(&render_ring);
>> -ring_reset(&bsd_ring);
>> -ring_reset(&bsd6_ring);
>> -ring_reset(&blt_ring);
>> -
>> -for (i = 0; i < samples_per_sec; i++) {
>> -long long interval;
>> -ti = gettime();
>> -if (IS_965(devid)) {
>> -instdone = INREG(INSTDONE_I965);
>> -instdone1 = INREG(INSTDONE_1);
>> -} else
>> -instdone = INREG(INSTDONE);
>> -
>> -for (j = 0; j < num_instdone_bits; j++)
>> -update_idle_bit(&top_bits[j]);
>> -
>> -ring_sample(&render_ring);
>> -ring_sample(&bsd_ring);
>> -ring_sample(&bsd6_ring);
>> -ring_sample(&blt_ring);
>> -
>> -tf = gettime();
>> -if (tf - t1 >= 1000000) {
>> -/* We are out of sync, bail out */
>> -last_samples_per_sec = i+1;
>> -break;
>> -}
>> -interval = def_sleep - (tf - ti);
>> -if (interval > 0)
>> -usleep(interval);
>> -}
>> +pmu_calc(&engines->freq_req, freq, BUFSZ, 4, 0, 1.0, t, 1);
>> +pmu_calc(&engines->freq_act, fact, BUFSZ, 4, 0, 1.0, t, 1);
>> +pmu_calc(&engines->irq, irq, BUFSZ, 8, 0, 1.0, t, 1);
>> +pmu_calc(&engines->rc6, rc6, BUFSZ, 3, 0, 1e9, t, 100);
>> +pmu_calc(&engines->rapl, power, BUFSZ, 4, 2, 1.0, t,
>> + engines->rapl_scale);
>> +pmu_calc(&engines->imc_reads, reads, BUFSZ, 6, 0, 1.0, t,
>> + engines->imc_reads_scale);
>> +pmu_calc(&engines->imc_writes, writes, BUFSZ, 6, 0, 1.0, t,
>> + engines->imc_writes_scale);
>> 
>> -if (HAS_STATS_REGS(devid)) {
>> -for (i = 0; i < STATS_COUNT; i++) {
>> -uint32_t stats_high, stats_low, stats_high_2;
>> +if (lines++ < con_h)
>> +printf("intel-gpu-top - %s/%s MHz;  %s%% RC6; %s %s; %s irqs/s\n",
>> +       fact, freq, rc6, power, engines->rapl_unit, irq);
>> 
>> -do {
>> -stats_high = INREG(stats_regs[i] + 4);
>> -stats_low = INREG(stats_regs[i]);
>> -stats_high_2 = INREG(stats_regs[i] + 4);
>> -} while (stats_high != stats_high_2);
>> +if (lines++ < con_h)
>> +printf("\n");
>> 
>> -stats[i] = (uint64_t)stats_high << 32 |
>> -stats_low;
>> -}
>> -}
>> +if (engines->imc_fd) {
>> +if (lines++ < con_h)
>> +printf("      IMC reads:   %s %s/s\n",
>> +       reads, engines->imc_reads_unit);
>> +
>> +if (lines++ < con_h)
>> +printf("     IMC writes:   %s %s/s\n",
>> +       writes, engines->imc_writes_unit);
>> 
>> -qsort(top_bits_sorted, num_instdone_bits,
>> -      sizeof(struct top_bit *), top_bits_sort);
>> -
>> -/* Limit the number of lines printed to the terminal height so the
>> - * most important info (at the top) will stay on screen. */
>> -max_lines = -1;
>> -if (ioctl(0, TIOCGWINSZ, &ws) != -1)
>> -max_lines = ws.ws_row - 6; /* exclude header lines */
>> -if (max_lines >= num_instdone_bits)
>> -max_lines = num_instdone_bits;
>> -
>> -t2 = gettime();
>> -elapsed_time += (t2 - t1) / 1000000.0;
>> -
>> -if (interactive) {
>> -printf("%s", clear_screen);
>> -print_clock_info(pci_dev);
>> -
>> -ring_print(&render_ring, last_samples_per_sec);
>> -ring_print(&bsd_ring, last_samples_per_sec);
>> -ring_print(&bsd6_ring, last_samples_per_sec);
>> -ring_print(&blt_ring, last_samples_per_sec);
>> -
>> -printf("\n%30s  %s\n", "task", "percent busy");
>> -for (i = 0; i < max_lines; i++) {
>> -if (top_bits_sorted[i]->count > 0) {
>> -percent = (top_bits_sorted[i]->count * 100) /
>> -last_samples_per_sec;
>> -len = printf("%30s: %3d%%: ",
>> - top_bits_sorted[i]->bit->name,
>> - percent);
>> -print_percentage_bar (percent, len);
>> -} else {
>> -printf("%*s", PERCENTAGE_BAR_END, "");
>> -}
>> -
>> -if (i < STATS_COUNT && HAS_STATS_REGS(devid)) {
>> -printf("%13s: %llu (%lld/sec)",
>> -   stats_reg_names[i],
>> -   (long long)stats[i],
>> -   (long long)(stats[i] - last_stats[i]));
>> -last_stats[i] = stats[i];
>> -} else {
>> -if (!top_bits_sorted[i]->count)
>> -break;
>> -}
>> +if (++lines < con_h)
>>   printf("\n");
>> -}
>>   }
>> -if (output) {
>> -/* Print headers for columns at first run */
>> -if (print_headers) {
>> -fprintf(output, "# time\t");
>> -ring_print_header(output, &render_ring);
>> -ring_print_header(output, &bsd_ring);
>> -ring_print_header(output, &bsd6_ring);
>> -ring_print_header(output, &blt_ring);
>> -for (i = 0; i < MAX_NUM_TOP_BITS; i++) {
>> -if (i < STATS_COUNT && HAS_STATS_REGS(devid)) {
>> -fprintf(output, "%.6s\t",
>> -   stats_reg_names[i]
>> -   );
>> -}
>> -if (!top_bits[i].count)
>> -continue;
>> -}
>> -fprintf(output, "\n");
>> -print_headers = 0;
>> -}
>> 
>> -/* Print statistics */
>> -fprintf(output, "%.2f\t", elapsed_time);
>> -ring_log(&render_ring, last_samples_per_sec, output);
>> -ring_log(&bsd_ring, last_samples_per_sec, output);
>> -ring_log(&bsd6_ring, last_samples_per_sec, output);
>> -ring_log(&blt_ring, last_samples_per_sec, output);
>> -
>> -for (i = 0; i < MAX_NUM_TOP_BITS; i++) {
>> -if (i < STATS_COUNT && HAS_STATS_REGS(devid)) {
>> -fprintf(output, "%"PRIu64"\t",
>> -   stats[i] - last_stats[i]);
>> -last_stats[i] = stats[i];
>> -}
>> -if (!top_bits[i].count)
>> -continue;
>> -}
>> -fprintf(output, "\n");
>> -fflush(output);
>> -}
>> +for (i = 0; i < engines->num_engines; i++) {
>> +struct engine *engine = engine_ptr(engines, i);
>> 
>> -for (i = 0; i < num_instdone_bits; i++) {
>> -top_bits_sorted[i]->count = 0;
>> +if (engine->num_counters && lines < con_h) {
>> +const char *a = "          ENGINE      BUSY ";
>> +const char *b = " MI_SEMA MI_WAIT";
>> 
>> -if (i < STATS_COUNT)
>> -last_stats[i] = stats[i];
>> +printf("\033[7m%s%*s%s\033[0m\n",
>> +       a,
>> +       (int)(con_w - 1 - strlen(a) - strlen(b)),
>> +       " ", b);
>> +lines++;
>> +break;
>> +}
>>   }
>> 
>> -/* Check if child has gone */
>> -if (child_pid > 0) {
>> -int res;
>> -if ((res = waitpid(child_pid, &child_stat, WNOHANG)) == -1) {
>> -perror("waitpid");
>> -exit(1);
>> -}
>> -if (res == 0)
>> +for (i = 0; i < engines->num_engines && lines < con_h; i++) {
>> +struct engine *engine = engine_ptr(engines, i);
>> +unsigned int max_w = con_w - 1;
>> +unsigned int len;
>> +char sema[BUFSZ];
>> +char wait[BUFSZ];
>> +char busy[BUFSZ];
>> +char buf[128];
>> +double val;
>> +
>> +if (!engine->num_counters)
>>   continue;
>> -if (WIFEXITED(child_stat))
>> -break;
>> +
>> +pmu_calc(&engine->sema, sema, BUFSZ, 3, 0, 1e9, t, 100);
>> +pmu_calc(&engine->wait, wait, BUFSZ, 3, 0, 1e9, t, 100);
>> +len = snprintf(buf, sizeof(buf), "    %s%%    %s%%",
>> +       sema, wait);
>> +
>> +pmu_calc(&engine->busy, busy, BUFSZ, 6, 2, 1e9, t,
>> + 100);
>> +len += printf("%16s %s%% ", engine->display_name, busy);
>> +
>> +val = __pmu_calc(&engine->busy.val, 1e9, t, 100);
>> +print_percentage_bar(val, max_w - len);
>> +
>> +printf("%s\n", buf);
>> +
>> +lines++;
>>   }
>> -}
>> 
>> -fclose(output);
>> +if (lines++ < con_h)
>> +printf("\n");
>> +
>> +usleep(period_us);
>> +}
>> 
>> -intel_register_access_fini();
>>   return 0;
>>   }
>> diff --git a/tools/meson.build b/tools/meson.build
>> index bd2d313d5156..a918eeb0bef1 100644
>> --- a/tools/meson.build
>> +++ b/tools/meson.build
>> @@ -23,7 +23,6 @@ tools_progs = [
>>   'intel_gpu_frequency',
>>   'intel_firmware_decode',
>>   'intel_gpu_time',
>> -'intel_gpu_top',
>>   'intel_gtt',
>>   'intel_guc_logger',
>>   'intel_infoframes',
>> @@ -117,6 +116,11 @@ shared_library('intel_aubdump', 'aubdump.c',
>>          name_prefix : '',
>>          install : true)
>> 
>> +executable('intel_gpu_top', 'intel_gpu_top.c',
>> +   install : true,
>> +   install_rpath : rpathdir,
>> +   dependencies : tool_deps + [ lib_igt_perf ])
>> +
>>   conf_data = configuration_data()
>>   conf_data.set('prefix', prefix)
>>   conf_data.set('exec_prefix', '${prefix}')
>> 


---
Rinat
Matthew Auld May 29, 2018, 9:58 a.m. UTC | #3
On 4 April 2018 at 16:26, Tvrtko Ursulin <tursulin@ursulin.net> wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>
> intel-gpu-top is a dangerous tool which can hang machines due unsafe mmio
> register access. This patch rewrites it to use only PMU.
>
> Only overall command streamer busyness and GPU global data such as power
> and frequencies are included in this new version.
>
> For access to more GPU functional unit level data, an OA metric based tool
> like gpu-top should be used instead.
>
> v2:
>  * Sort engines by class and instance.
>  * Do not wait for one sampling period to display something on screen.
>  * Move code out of the asserts. (Rinat Ibragimov)
>  * Continuously adapt to terminal size. (Rinat Ibragimov)
>
> v3:
>  * Change layout and precision of some field. (Chris Wilson)
>  Eero Tamminen:
>  * Use more user friendly engine names.
>  * Don't error out if a counter is missing.
>  * Add IMC read/write bandwidth.
>  * Report minimum required kernel version.
>
> v4:
>  * Really support 4.16 by skipping of missing engines.
>  * Simpler and less hacky float printing.
>  * Preserve copyright header. (Antonio Argenziano)
>  * Simplify engines_ptr macro. (Rinat Ibragimov)
>
> v5:
>  * Get RAPL unit from sysfs.
>  * Consolidate sysfs paths with a macro.
>  * Tidy error handling by carrying over and reporting errno.
>  * Check against console height on all prints.
>  * More readable minimum kernel version message. (Eero Tamminen)
>  * Column banner for per engine stats. (Eero Tamminen)
>
> v6:
>  * Man page update. (Eero Tamminen)
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
> Cc: Petri Latvala <petri.latvala@intel.com>
> Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> Cc: Rinat Ibragimov <ibragimovrinat@mail.ru>
> Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> # v1
> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> # v0.5
> ---
>  lib/igt_perf.c        |    6 +
>  lib/igt_perf.h        |    1 +
>  man/intel_gpu_top.rst |   41 +-
>  tools/Makefile.am     |    2 +
>  tools/intel_gpu_top.c | 1250 +++++++++++++++++++++++++++----------------------
>  tools/meson.build     |    6 +-
>  6 files changed, 719 insertions(+), 587 deletions(-)
>
> diff --git a/lib/igt_perf.c b/lib/igt_perf.c
> index 99d82ea51c9b..e3dec2cc29c7 100644
> --- a/lib/igt_perf.c
> +++ b/lib/igt_perf.c
> @@ -69,3 +69,9 @@ int igt_perf_open(uint64_t type, uint64_t config)
>         return _perf_open(type, config, -1,
>                           PERF_FORMAT_TOTAL_TIME_ENABLED);
>  }
> +
> +int igt_perf_open_group(uint64_t type, uint64_t config, int group)
> +{
> +       return _perf_open(type, config, group,
> +                         PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_GROUP);
> +}
> diff --git a/lib/igt_perf.h b/lib/igt_perf.h
> index 614ea5d23fa6..e00718f4769a 100644
> --- a/lib/igt_perf.h
> +++ b/lib/igt_perf.h
> @@ -55,5 +55,6 @@ uint64_t i915_type_id(void);
>  int perf_i915_open(uint64_t config);
>  int perf_i915_open_group(uint64_t config, int group);
>  int igt_perf_open(uint64_t type, uint64_t config);
> +int igt_perf_open_group(uint64_t type, uint64_t config, int group);
>
>  #endif /* I915_PERF_H */
> diff --git a/man/intel_gpu_top.rst b/man/intel_gpu_top.rst
> index a5f7175bb1a0..19c712307d28 100644
> --- a/man/intel_gpu_top.rst
> +++ b/man/intel_gpu_top.rst
> @@ -7,9 +7,9 @@ Display a top-like summary of Intel GPU usage
>  ---------------------------------------------
>  .. include:: defs.rst
>  :Author: IGT Developers <igt-dev@lists.freedesktop.org>
> -:Date: 2016-03-01
> +:Date: 2018-04-04
>  :Version: |PACKAGE_STRING|
> -:Copyright: 2009,2011,2012,2016 Intel Corporation
> +:Copyright: 2009,2011,2012,2016,2018 Intel Corporation
>  :Manual section: |MANUAL_SECTION|
>  :Manual group: |MANUAL_GROUP|
>
> @@ -21,42 +21,25 @@ SYNOPSIS
>  DESCRIPTION
>  ===========
>
> -**intel_gpu_top** is a tool to display usage information of an Intel GPU. It
> -requires root privilege to map the graphics device.
> +**intel_gpu_top** is a tool to display usage information on Intel GPU's.
> +
> +The tool gathers data using perf performance counters (PMU) exposed by i915 and other platform drivers like RAPL (power) and Uncore IMC (memory bandwidth).
>
>  OPTIONS
>  =======
>
> --s SAMPLES
> -    Number of samples to acquire per second.
> -
> --o FILE
> -    Collect usage statistics to FILE. If file is "-", run non-interactively
> -    and output statistics to stdout.
> -
> --e COMMAND
> -    Execute COMMAND to profile, and leave when it is finished. Note that the
> -    entire command with all parameters should be included as one parameter.
> +-s <ms>
> +    Refresh period in milliseconds.
>
>  -h
> -    Show usage notes.
> +    Show help text.
>
> -EXAMPLES
> -========
> -
> -intel_gpu_top -o "cairo-trace-gvim.log" -s 100 -e "cairo-perf-trace /tmp/gvim"
> -    Run cairo-perf-trace with /tmp/gvim trace, non-interactively, saving the
> -    statistics into cairo-trace-gvim.log file, and collecting 100 samples per
> -    second.
> -
> -Note that idle units are not displayed, so an entirely idle GPU will only
> -display the ring status and header.
> +LIMITATIONS
> +===========
>
> -BUGS
> -====
> +* Not all metrics are supported on all platforms. Where a metric is unsupported it's value will be replaced by a dashed line.

s/it's/its/

>
> -Some GPUs report some units as busy when they aren't, such that even when idle
> -and not hung, it will show up as 100% busy.
> +* Non-root access to perf counters is controlled by the *perf_event_paranoid* sysctl.
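
For readers of the man page, here is a minimal sketch of checking that sysctl before opening the counters. It assumes the usual /proc/sys/kernel/perf_event_paranoid path and that system-wide PMUs such as i915's need a level of 0 or lower for unprivileged users; that threshold is my reading of the kernel documentation, not something this patch states. All helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: parse the sysctl text (e.g. "2\n"); false on garbage. */
static bool parse_paranoid(const char *text, long *level)
{
	char *end;

	*level = strtol(text, &end, 10);

	return end != text && (*end == '\0' || *end == '\n');
}

/* Assumption: unprivileged system-wide events need a level of 0 or lower. */
static bool paranoid_allows_nonroot(long level)
{
	return level <= 0;
}

/* Read the live sysctl; returns 0 on success, -1 if unreadable or malformed. */
static int read_paranoid(long *level)
{
	char buf[16];
	FILE *f = fopen("/proc/sys/kernel/perf_event_paranoid", "r");

	if (!f)
		return -1;

	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	return parse_paranoid(buf, level) ? 0 : -1;
}
```

A caller would use read_paranoid() and, if paranoid_allows_nonroot() is false and the process is not root, print a hint instead of a bare open failure.
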
>
>  REPORTING BUGS
>  ==============
> diff --git a/tools/Makefile.am b/tools/Makefile.am
> index 09b6dbcc3ece..a0b016ddd7ff 100644
> --- a/tools/Makefile.am
> +++ b/tools/Makefile.am
> @@ -28,6 +28,8 @@ intel_aubdump_la_LDFLAGS = -module -avoid-version -no-undefined
>  intel_aubdump_la_SOURCES = aubdump.c
>  intel_aubdump_la_LIBADD = $(top_builddir)/lib/libintel_tools.la -ldl
>
> +intel_gpu_top_LDADD = $(top_builddir)/lib/libigt_perf.la
> +
>  bin_SCRIPTS = intel_aubdump
>  CLEANFILES = $(bin_SCRIPTS)
>
> diff --git a/tools/intel_gpu_top.c b/tools/intel_gpu_top.c
> index 098e6ce3ff86..b923c3cfbe97 100644
> --- a/tools/intel_gpu_top.c
> +++ b/tools/intel_gpu_top.c
> @@ -1,6 +1,5 @@
>  /*
> - * Copyright © 2007 Intel Corporation
> - * Copyright © 2011 Intel Corporation
> + * Copyright © 2007-2018 Intel Corporation
>   *
>   * Permission is hereby granted, free of charge, to any person obtaining a
>   * copy of this software and associated documentation files (the "Software"),
> @@ -24,695 +23,832 @@
>   * Authors:
>   *    Eric Anholt <eric@anholt.net>
>   *    Eugeni Dodonov <eugeni.dodonov@intel.com>
> - *
>   */
>
> -#include "config.h"
> -
> -#include <inttypes.h>
> -#include <unistd.h>
> -#include <stdlib.h>
>  #include <stdio.h>
> -#include <err.h>
> -#include <sys/ioctl.h>
> -#include <sys/time.h>
> -#include <sys/wait.h>
> +#include <sys/types.h>
> +#include <dirent.h>
> +#include <stdint.h>
> +#include <assert.h>
>  #include <string.h>
> -#ifdef HAVE_TERMIOS_H
> -#include <termios.h>
> -#endif
> -#include "intel_io.h"
> -#include "instdone.h"
> -#include "intel_reg.h"
> -#include "intel_chipset.h"
> -#include "drmtest.h"
> -
> -#define  FORCEWAKE         0xA18C
> -#define  FORCEWAKE_ACK     0x130090
> -
> -#define SAMPLES_PER_SEC             10000
> -#define SAMPLES_TO_PERCENT_RATIO    (SAMPLES_PER_SEC / 100)
> -
> -#define MAX_NUM_TOP_BITS            100
> -
> -#define HAS_STATS_REGS(devid)          IS_965(devid)
> -
> -struct top_bit {
> -       struct instdone_bit *bit;
> -       int count;
> -} top_bits[MAX_NUM_TOP_BITS];
> -struct top_bit *top_bits_sorted[MAX_NUM_TOP_BITS];
> -
> -static uint32_t instdone, instdone1;
> -
> -static const char *bars[] = {
> -       " ",
> -       "▏",
> -       "▎",
> -       "▍",
> -       "▌",
> -       "▋",
> -       "▊",
> -       "▉",
> -       "█"
> -};
> +#include <ctype.h>
> +#include <stdlib.h>
> +#include <unistd.h>
> +#include <sys/stat.h>
> +#include <fcntl.h>
> +#include <inttypes.h>
> +#include <sys/ioctl.h>
> +#include <errno.h>
> +#include <math.h>
> +#include <locale.h>
> +
> +#include "igt_perf.h"
>
> -enum stats_counts {
> -       IA_VERTICES,
> -       IA_PRIMITIVES,
> -       VS_INVOCATION,
> -       GS_INVOCATION,
> -       GS_PRIMITIVES,
> -       CL_INVOCATION,
> -       CL_PRIMITIVES,
> -       PS_INVOCATION,
> -       PS_DEPTH,
> -       STATS_COUNT
> +struct pmu_pair {
> +       uint64_t cur;
> +       uint64_t prev;
>  };
>
> -const uint32_t stats_regs[STATS_COUNT] = {
> -       IA_VERTICES_COUNT_QW,
> -       IA_PRIMITIVES_COUNT_QW,
> -       VS_INVOCATION_COUNT_QW,
> -       GS_INVOCATION_COUNT_QW,
> -       GS_PRIMITIVES_COUNT_QW,
> -       CL_INVOCATION_COUNT_QW,
> -       CL_PRIMITIVES_COUNT_QW,
> -       PS_INVOCATION_COUNT_QW,
> -       PS_DEPTH_COUNT_QW,
> +struct pmu_counter {
> +       bool present;
> +       uint64_t config;
> +       unsigned int idx;
> +       struct pmu_pair val;
>  };
>
> -const char *stats_reg_names[STATS_COUNT] = {
> -       "vert fetch",
> -       "prim fetch",
> -       "VS invocations",
> -       "GS invocations",
> -       "GS prims",
> -       "CL invocations",
> -       "CL prims",
> -       "PS invocations",
> -       "PS depth pass",
> +struct engine {
> +       const char *name;
> +       const char *display_name;
> +
> +       unsigned int class;
> +       unsigned int instance;
> +
> +       unsigned int num_counters;
> +
> +       struct pmu_counter busy;
> +       struct pmu_counter wait;
> +       struct pmu_counter sema;
>  };
>
> -uint64_t stats[STATS_COUNT];
> -uint64_t last_stats[STATS_COUNT];
> +struct engines {
> +       unsigned int num_engines;
> +       unsigned int num_counters;
> +       DIR *root;
> +       int fd;
> +       struct pmu_pair ts;
> +
> +       int rapl_fd;
> +       double rapl_scale;
> +       const char *rapl_unit;
> +
> +       int imc_fd;
> +       double imc_reads_scale;
> +       const char *imc_reads_unit;
> +       double imc_writes_scale;
> +       const char *imc_writes_unit;
> +
> +       struct pmu_counter freq_req;
> +       struct pmu_counter freq_act;
> +       struct pmu_counter irq;
> +       struct pmu_counter rc6;
> +       struct pmu_counter rapl;
> +       struct pmu_counter imc_reads;
> +       struct pmu_counter imc_writes;
> +
> +       struct engine engine;
> +};
>
> -static unsigned long
> -gettime(void)
> +static uint64_t
> +get_pmu_config(int dirfd, const char *name, const char *counter)
>  {
> -    struct timeval t;
> -    gettimeofday(&t, NULL);
> -    return (t.tv_usec + (t.tv_sec * 1000000));
> -}
> +       char buf[128], *p;
> +       int fd, ret;
>
> -static int
> -top_bits_sort(const void *a, const void *b)
> -{
> -       struct top_bit * const *bit_a = a;
> -       struct top_bit * const *bit_b = b;
> -       int a_count = (*bit_a)->count;
> -       int b_count = (*bit_b)->count;
> +       ret = snprintf(buf, sizeof(buf), "%s-%s", name, counter);
> +       if (ret < 0 || ret == sizeof(buf))
> +               return -1;
>
> -       if (a_count < b_count)
> -               return 1;
> -       else if (a_count == b_count)
> -               return 0;
> -       else
> +       fd = openat(dirfd, buf, O_RDONLY);
> +       if (fd < 0)
>                 return -1;
> -}
>
> -static void
> -update_idle_bit(struct top_bit *top_bit)
> -{
> -       uint32_t reg_val;
> +       ret = read(fd, buf, sizeof(buf));
> +       close(fd);
> +       if (ret <= 0)
> +               return -1;
>
> -       if (top_bit->bit->reg == INSTDONE_1)
> -               reg_val = instdone1;
> -       else
> -               reg_val = instdone;
> +       p = index(buf, '0');
> +       if (!p)
> +               return -1;
>
> -       if ((reg_val & top_bit->bit->bit) == 0)
> -               top_bit->count++;
> +       return strtoul(p, NULL, 0);
>  }
>
> -static void
> -print_clock(const char *name, int clock) {
> -       if (clock == -1)
> -               printf("%s clock: unknown", name);
> +#define engine_ptr(engines, n) (&engines->engine + (n))
> +
> +static const char *class_display_name(unsigned int class)
> +{
> +       switch (class) {
> +       case I915_ENGINE_CLASS_RENDER:
> +               return "Render/3D";
> +       case I915_ENGINE_CLASS_COPY:
> +               return "Blitter";
> +       case I915_ENGINE_CLASS_VIDEO:
> +               return "Video";
> +       case I915_ENGINE_CLASS_VIDEO_ENHANCE:
> +               return "VideoEnhance";
> +       default:
> +               return "[unknown]";
> +       }
> +}
> +
> +static int engine_cmp(const void *__a, const void *__b)
> +{
> +       const struct engine *a = (struct engine *)__a;
> +       const struct engine *b = (struct engine *)__b;
> +
> +       if (a->class != b->class)
> +               return a->class - b->class;
>         else
> -               printf("%s clock: %d Mhz", name, clock);
> +               return a->instance - b->instance;
>  }
>
> -static int
> -print_clock_info(struct pci_device *pci_dev)
> +static struct engines *discover_engines(void)
>  {
> -       uint32_t devid = pci_dev->device_id;
> -       uint16_t gcfgc;
> +       const char *sysfs_root = "/sys/devices/i915/events";
> +       struct engines *engines;
> +       struct dirent *dent;
> +       int ret = 0;
> +       DIR *d;
>
> -       if (IS_GM45(devid)) {
> -               int core_clock = -1;
> +       engines = malloc(sizeof(struct engines));
> +       if (!engines)
> +               return NULL;
>
> -               pci_device_cfg_read_u16(pci_dev, &gcfgc, I915_GCFGC);
> +       memset(engines, 0, sizeof(*engines));
>
> -               switch (gcfgc & 0xf) {
> -               case 8:
> -                       core_clock = 266;
> -                       break;
> -               case 9:
> -                       core_clock = 320;
> -                       break;
> -               case 11:
> -                       core_clock = 400;
> -                       break;
> -               case 13:
> -                       core_clock = 533;
> -                       break;
> -               }
> -               print_clock("core", core_clock);
> -       } else if (IS_965(devid) && IS_MOBILE(devid)) {
> -               int render_clock = -1, sampler_clock = -1;
> +       engines->num_engines = 0;
>
> -               pci_device_cfg_read_u16(pci_dev, &gcfgc, I915_GCFGC);
> +       d = opendir(sysfs_root);
> +       if (!d)
> +               return NULL;
>
> -               switch (gcfgc & 0xf) {
> -               case 2:
> -                       render_clock = 250; sampler_clock = 267;
> -                       break;
> -               case 3:
> -                       render_clock = 320; sampler_clock = 333;
> -                       break;
> -               case 4:
> -                       render_clock = 400; sampler_clock = 444;
> -                       break;
> -               case 5:
> -                       render_clock = 500; sampler_clock = 533;
> +       while ((dent = readdir(d)) != NULL) {
> +               const char *endswith = "-busy";
> +               const unsigned int endlen = strlen(endswith);
> +               struct engine *engine =
> +                               engine_ptr(engines, engines->num_engines);
> +               char buf[256];
> +
> +               if (dent->d_type != DT_REG)
> +                       continue;
> +
> +               if (strlen(dent->d_name) >= sizeof(buf)) {
> +                       ret = ENAMETOOLONG;
>                         break;
>                 }
>
> -               print_clock("render", render_clock);
> -               printf("  ");
> -               print_clock("sampler", sampler_clock);
> -       } else if (IS_945(devid) && IS_MOBILE(devid)) {
> -               int render_clock = -1, display_clock = -1;
> +               strcpy(buf, dent->d_name);
>
> -               pci_device_cfg_read_u16(pci_dev, &gcfgc, I915_GCFGC);
> +               /* xxxN-busy */
> +               if (strlen(buf) < (endlen + 4))
> +                       continue;
> +               if (strcmp(&buf[strlen(buf) - endlen], endswith))
> +                       continue;
>
> -               switch (gcfgc & 0x7) {
> -               case 0:
> -                       render_clock = 166;
> -                       break;
> -               case 1:
> -                       render_clock = 200;
> -                       break;
> -               case 3:
> -                       render_clock = 250;
> -                       break;
> -               case 5:
> -                       render_clock = 400;
> +               memset(engine, 0, sizeof(*engine));
> +
> +               buf[strlen(buf) - endlen] = 0;
> +               engine->name = strdup(buf);
> +               if (!engine->name) {
> +                       ret = errno;
>                         break;
>                 }
>
> -               switch (gcfgc & 0x70) {
> -               case 0:
> -                       display_clock = 200;
> -                       break;
> -               case 4:
> -                       display_clock = 320;
> +               engine->busy.config = get_pmu_config(dirfd(d), engine->name,
> +                                                    "busy");
> +               if (engine->busy.config == -1) {
> +                       ret = ENOENT;
>                         break;
>                 }
> -               if (gcfgc & (1 << 7))
> -                   display_clock = 133;
>
> -               print_clock("render", render_clock);
> -               printf("  ");
> -               print_clock("display", display_clock);
> -       } else if (IS_915(devid) && IS_MOBILE(devid)) {
> -               int render_clock = -1, display_clock = -1;
> +               engine->class = (engine->busy.config &
> +                                (__I915_PMU_OTHER(0) - 1)) >>
> +                               I915_PMU_CLASS_SHIFT;
>
> -               pci_device_cfg_read_u16(pci_dev, &gcfgc, I915_GCFGC);
> +               engine->instance = (engine->busy.config >>
> +                                   I915_PMU_SAMPLE_BITS) &
> +                                   ((1 << I915_PMU_SAMPLE_INSTANCE_BITS) - 1);
>
> -               switch (gcfgc & 0x7) {
> -               case 0:
> -                       render_clock = 160;
> -                       break;
> -               case 1:
> -                       render_clock = 190;
> -                       break;
> -               case 4:
> -                       render_clock = 333;
> +               ret = snprintf(buf, sizeof(buf), "%s/%u",
> +                              class_display_name(engine->class),
> +                              engine->instance);
> +               if (ret < 0 || ret == sizeof(buf)) {
> +                       ret = ENOBUFS;
>                         break;
>                 }
> -               if (gcfgc & (1 << 13))
> -                   render_clock = 133;
> +               ret = 0;
>
> -               switch (gcfgc & 0x70) {
> -               case 0:
> -                       display_clock = 190;
> +               engine->display_name = strdup(buf);
> +               if (!engine->display_name) {
> +                       ret = errno;
>                         break;
> -               case 4:
> -                       display_clock = 333;
> +               }
> +
> +               engines->num_engines++;
> +               engines = realloc(engines, sizeof(struct engines) +
> +                                 engines->num_engines * sizeof(struct engine));
> +               if (!engines) {
> +                       ret = errno;
>                         break;
>                 }
> -               if (gcfgc & (1 << 7))
> -                   display_clock = 133;
> +       }
> +
> +       if (ret) {
> +               free(engines);
> +               errno = ret;
>
> -               print_clock("render", render_clock);
> -               printf("  ");
> -               print_clock("display", display_clock);
> +               return NULL;
>         }
>
> +       qsort(engine_ptr(engines, 0), engines->num_engines,
> +             sizeof(struct engine), engine_cmp);
> +
> +       engines->root = d;
>
> -       printf("\n");
> -       return -1;
> +       return engines;
>  }
>
> -#define STATS_LEN (20)
> -#define PERCENTAGE_BAR_END     (79 - STATS_LEN)
> +static int
> +filename_to_buf(const char *filename, char *buf, unsigned int bufsize)
> +{
> +       int fd, err;
> +       ssize_t ret;
>
> -static void
> -print_percentage_bar(float percent, int cur_line_len)
> +       fd = open(filename, O_RDONLY);
> +       if (fd < 0)
> +               return -1;
> +
> +       ret = read(fd, buf, bufsize - 1);
> +       err = errno;
> +       close(fd);
> +       if (ret < 1) {
> +               errno = ret < 0 ? err : ENOMSG;
> +
> +               return -1;
> +       }
> +
> +       if (ret > 1 && buf[ret - 1] == '\n')
> +               buf[ret - 1] = '\0';
> +       else
> +               buf[ret] = '\0';
> +
> +       return 0;
> +}
> +
> +static uint64_t filename_to_u64(const char *filename, int base)
>  {
> -       int bar_avail_len = (PERCENTAGE_BAR_END - cur_line_len - 1) * 8;
> -       int bar_len = bar_avail_len * (percent + .5) / 100.0;
> -       int i;
> +       char buf[64], *b;
>
> -       for (i = bar_len; i >= 8; i -= 8) {
> -               printf("%s", bars[8]);
> -               cur_line_len++;
> +       if (filename_to_buf(filename, buf, sizeof(buf)))
> +               return 0;
> +
> +       /*
> +        * Handle both single integer and key=value formats by skipping
> +        * leading non-digits.
> +        */
> +       b = buf;
> +       while (*b && !isdigit(*b))
> +               b++;
> +
> +       return strtoull(b, NULL, base);
> +}
> +
> +static double filename_to_double(const char *filename)
> +{
> +       char *oldlocale;
> +       char buf[80];
> +       double v;
> +
> +       if (filename_to_buf(filename, buf, sizeof(buf)))
> +               return 0;
> +
> +       oldlocale = setlocale(LC_ALL, "C");
> +       v = strtod(buf, NULL);
> +       setlocale(LC_ALL, oldlocale);
> +
> +       return v;
> +}
> +
> +#define RAPL_ROOT "/sys/devices/power/"
> +#define RAPL_EVENT "/sys/devices/power/events/"
> +
> +static uint64_t rapl_type_id(void)
> +{
> +       return filename_to_u64(RAPL_ROOT "type", 10);
> +}
> +
> +static uint64_t rapl_gpu_power(void)
> +{
> +       return filename_to_u64(RAPL_EVENT "energy-gpu", 0);
> +}
> +
> +static double rapl_gpu_power_scale(void)
> +{
> +       return filename_to_double(RAPL_EVENT "energy-gpu.scale");
> +}
> +
> +static const char *rapl_gpu_power_unit(void)
> +{
> +       char buf[32];
> +
> +       if (filename_to_buf(RAPL_EVENT "energy-gpu.unit",
> +                           buf, sizeof(buf)) == 0)
> +               if (!strcmp(buf, "Joules"))
> +                       return strdup("Watts");
> +               else
> +                       return strdup(buf);
> +       else
> +               return NULL;
> +}
> +
> +#define IMC_ROOT "/sys/devices/uncore_imc/"
> +#define IMC_EVENT "/sys/devices/uncore_imc/events/"
> +
> +static uint64_t imc_type_id(void)
> +{
> +       return filename_to_u64(IMC_ROOT "type", 10);
> +}
> +
> +static uint64_t imc_data_reads(void)
> +{
> +       return filename_to_u64(IMC_EVENT "data_reads", 0);
> +}
> +
> +static double imc_data_reads_scale(void)
> +{
> +       return filename_to_double(IMC_EVENT "data_reads.scale");
> +}
> +
> +static const char *imc_data_reads_unit(void)
> +{
> +       char buf[32];
> +
> +       if (filename_to_buf(IMC_EVENT "data_reads.unit", buf, sizeof(buf)) == 0)
> +               return strdup(buf);
> +       else
> +               return NULL;
> +}
> +
> +static uint64_t imc_data_writes(void)
> +{
> +       return filename_to_u64(IMC_EVENT "data_writes", 0);
> +}
> +
> +static double imc_data_writes_scale(void)
> +{
> +       return filename_to_double(IMC_EVENT "data_writes.scale");
> +}
> +
> +static const char *imc_data_writes_unit(void)
> +{
> +       char buf[32];
> +
> +       if (filename_to_buf(IMC_EVENT "data_writes.unit",
> +                           buf, sizeof(buf)) == 0)
> +               return strdup(buf);
> +       else
> +               return NULL;
> +}
> +
> +#define _open_pmu(cnt, pmu, fd) \
> +({ \
> +       int fd__; \
> +\
> +       fd__ = perf_i915_open_group((pmu)->config, (fd)); \
> +       if (fd__ >= 0) { \
> +               if ((fd) == -1) \
> +                       (fd) = fd__; \
> +               (pmu)->present = true; \
> +               (pmu)->idx = (cnt)++; \
> +       } \
> +\
> +       fd__; \
> +})
> +
> +#define _open_imc(cnt, pmu, fd) \
> +({ \
> +       int fd__; \
> +\
> +       fd__ = igt_perf_open_group(imc_type_id(), (pmu)->config, (fd)); \
> +       if (fd__ >= 0) { \
> +               if ((fd) == -1) \
> +                       (fd) = fd__; \
> +               (pmu)->present = true; \
> +               (pmu)->idx = (cnt)++; \
> +       } \
> +\
> +       fd__; \
> +})
> +
> +static int pmu_init(struct engines *engines)
> +{
> +       unsigned int i;
> +       int fd;
> +
> +       engines->fd = -1;
> +       engines->num_counters = 0;
> +
> +       engines->irq.config = I915_PMU_INTERRUPTS;
> +       fd = _open_pmu(engines->num_counters, &engines->irq, engines->fd);
> +       if (fd < 0)
> +               return -1;
> +
> +       engines->freq_req.config = I915_PMU_REQUESTED_FREQUENCY;
> +       _open_pmu(engines->num_counters, &engines->freq_req, engines->fd);
> +
> +       engines->freq_act.config = I915_PMU_ACTUAL_FREQUENCY;
> +       _open_pmu(engines->num_counters, &engines->freq_act, engines->fd);
> +
> +       engines->rc6.config = I915_PMU_RC6_RESIDENCY;
> +       _open_pmu(engines->num_counters, &engines->rc6, engines->fd);
> +
> +       for (i = 0; i < engines->num_engines; i++) {
> +               struct engine *engine = engine_ptr(engines, i);
> +               struct {
> +                       struct pmu_counter *pmu;
> +                       const char *counter;
> +               } *cnt, counters[] = {
> +                       { .pmu = &engine->busy, .counter = "busy" },
> +                       { .pmu = &engine->wait, .counter = "wait" },
> +                       { .pmu = &engine->sema, .counter = "sema" },
> +                       { .pmu = NULL, .counter = NULL },
> +               };
> +
> +               for (cnt = counters; cnt->pmu; cnt++) {
> +                       if (!cnt->pmu->config)
> +                               cnt->pmu->config =
> +                                       get_pmu_config(dirfd(engines->root),
> +                                                      engine->name,
> +                                                      cnt->counter);
> +                       fd = _open_pmu(engines->num_counters, cnt->pmu,
> +                                      engines->fd);
> +                       if (fd >= 0)
> +                               engine->num_counters++;
> +               }
>         }
> -       if (i) {
> -               printf("%s", bars[i]);
> -               cur_line_len++;
> +
> +       engines->rapl_fd = -1;
> +       if (rapl_type_id()) {
> +               engines->rapl_scale = rapl_gpu_power_scale();
> +               engines->rapl_unit = rapl_gpu_power_unit();
> +               if (!engines->rapl_unit)
> +                       return -1;
> +
> +               engines->rapl.config = rapl_gpu_power();
> +               if (!engines->rapl.config)
> +                       return -1;
> +
> +               engines->rapl_fd = igt_perf_open(rapl_type_id(),
> +                                                engines->rapl.config);
> +               if (engines->rapl_fd < 0)
> +                       return -1;
> +
> +               engines->rapl.present = true;
>         }
>
> -       /* NB: We can't use a field width with utf8 so we manually
> -       * guarantee a field with of 45 chars for any bar. */
> -       printf("%*s", PERCENTAGE_BAR_END - cur_line_len, "");
> -}
> +       engines->imc_fd = -1;
> +       if (imc_type_id()) {
> +               unsigned int num = 0;
>
> -struct ring {
> -       const char *name;
> -       uint32_t mmio;
> -       int head, tail, size;
> -       uint64_t full;
> -       int idle;
> -};
> +               engines->imc_reads_scale = imc_data_reads_scale();
> +               engines->imc_writes_scale = imc_data_writes_scale();
> +
> +               engines->imc_reads_unit = imc_data_reads_unit();
> +               if (!engines->imc_reads_unit)
> +                       return -1;
> +
> +               engines->imc_writes_unit = imc_data_writes_unit();
> +               if (!engines->imc_writes_unit)
> +                       return -1;
> +
> +               engines->imc_reads.config = imc_data_reads();
> +               if (!engines->imc_reads.config)
> +                       return -1;
> +
> +               engines->imc_writes.config = imc_data_writes();
> +               if (!engines->imc_writes.config)
> +                       return -1;
> +
> +               fd = _open_imc(num, &engines->imc_reads, engines->imc_fd);
> +               if (fd < 0)
> +                       return -1;
> +               fd = _open_imc(num, &engines->imc_writes, engines->imc_fd);
> +               if (fd < 0)
> +                       return -1;
> +
> +               engines->imc_reads.present = true;
> +               engines->imc_writes.present = true;

We can drop these two assignments: _open_imc() already sets ->present when the open succeeds.
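
A reduced sketch of the open helper, written as a plain function with a made-up stub standing in for igt_perf_open_group(), shows that ->present is already true by the time a successful open returns to pmu_init(). The stub and the function name are illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

struct pmu_counter {
	bool present;
	uint64_t config;
	unsigned int idx;
};

/* Stub for illustration only: pretend config 0 cannot be opened. */
static int stub_open_group(uint64_t config, int group_fd)
{
	(void)group_fd;

	return config ? 3 : -1;
}

/* Same shape as _open_imc() in the patch, as a function instead of a macro. */
static int open_counter(unsigned int *cnt, struct pmu_counter *pmu, int *fd)
{
	int fd__ = stub_open_group(pmu->config, *fd);

	if (fd__ >= 0) {
		if (*fd == -1)
			*fd = fd__;	/* first counter becomes the group leader */
		pmu->present = true;	/* set here, so callers need not repeat it */
		pmu->idx = (*cnt)++;
	}

	return fd__;
}
```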

> +       }
> +
> +       return 0;
> +}
>
> -static uint32_t ring_read(struct ring *ring, uint32_t reg)
> +static uint64_t pmu_read_multi(int fd, unsigned int num, uint64_t *val)
>  {
> -       return INREG(ring->mmio + reg);
> +       uint64_t buf[2 + num];
> +       unsigned int i;
> +       ssize_t len;
> +
> +       memset(buf, 0, sizeof(buf));
> +
> +       len = read(fd, buf, sizeof(buf));
> +       assert(len == sizeof(buf));
> +
> +       for (i = 0; i < num; i++)
> +               val[i] = buf[2 + i];
> +
> +       return buf[1];
>  }
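
For reference, the indexing in pmu_read_multi() matches a perf group read with PERF_FORMAT_GROUP | PERF_FORMAT_TOTAL_TIME_ENABLED as I understand it: slot 0 is the counter count, slot 1 the enabled time, and the values follow. I have not checked which read_format igt_perf actually requests, so treat the layout as an assumption. A small sketch of the unpacking on a canned buffer (the function name is hypothetical):

```c
#include <stdint.h>

/*
 * Assumed layout of a group read with TOTAL_TIME_ENABLED:
 *   buf[0]   number of counters in the group
 *   buf[1]   time enabled, used by the tool as the sample timestamp
 *   buf[2..] one value per counter, in the order they were opened
 */
static uint64_t unpack_group(const uint64_t *buf, unsigned int num,
			     uint64_t *val)
{
	unsigned int i;

	for (i = 0; i < num; i++)
		val[i] = buf[2 + i];

	return buf[1];
}
```

This is also why each counter records its idx at open time: the index selects its slot in val[].
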
>
> -static void ring_init(struct ring *ring)
> +static double __pmu_calc(struct pmu_pair *p, double d, double t, double s)
>  {
> -       ring->size = (((ring_read(ring, RING_LEN) & RING_NR_PAGES) >> 12) + 1) * 4096;
> +       double v;
> +
> +       v = p->cur - p->prev;
> +       v /= d;
> +       v /= t;
> +       v *= s;
> +
> +       if (s == 100.0 && v > 100.0)
> +               v = 100.0;
> +
> +       return v;
>  }
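
A worked example of the scaling may help other reviewers. The call sites are not quoted here, so the argument meanings are my assumption: a busy counter that ticks in nanoseconds, d = 1e9 to convert to seconds, t = elapsed seconds, and s = 100 to get a percentage. The function below repeats the arithmetic of __pmu_calc() under a different, hypothetical name.

```c
#include <stdint.h>

struct pmu_pair {
	uint64_t cur;
	uint64_t prev;
};

/* Same arithmetic as __pmu_calc(): delta / d / t * s, clamped for percentages. */
static double pmu_calc_value(const struct pmu_pair *p,
			     double d, double t, double s)
{
	double v = p->cur - p->prev;

	v /= d;
	v /= t;
	v *= s;

	/* Only percentage-style outputs (s == 100) are capped at 100. */
	if (s == 100.0 && v > 100.0)
		v = 100.0;

	return v;
}
```

With a delta of 500000000 ns over one second this gives 50.0; a delta larger than the sampling period is capped at 100.0, while values scaled with any other s are passed through unclamped.
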
>
> -static void ring_reset(struct ring *ring)
> +static void fill_str(char *buf, unsigned int bufsz, char c, unsigned int num)
>  {
> -       ring->idle = ring->full = 0;
> +       unsigned int i;
> +
> +       for (i = 0; i < num && i < (bufsz - 1); i++)
> +               *buf++ = c;
> +
> +       *buf = 0;
>  }
>
> -static void ring_sample(struct ring *ring)
> +static void pmu_calc(struct pmu_counter *cnt,
> +                    char *buf, unsigned int bufsz,
> +                    unsigned int width, unsigned width_dec,
> +                    double d, double t, double s)
>  {
> -       int full;
> +       double val;
> +       int len;
> +
> +       assert(bufsz >= (width + width_dec + 1));
> +
> +       if (!cnt->present) {
> +               fill_str(buf, bufsz, '-', width + width_dec);
> +               return;
> +       }
>
> -       if (!ring->size)
> +       val = __pmu_calc(&cnt->val, d, t, s);
> +
> +       len = snprintf(buf, bufsz, "%*.*f", width + width_dec, width_dec, val);
> +       if (len < 0 || len == bufsz) {
> +               fill_str(buf, bufsz, 'X', width + width_dec);
>                 return;
> +       }
> +}
> +
> +static uint64_t __pmu_read_single(int fd, uint64_t *ts)
> +{
> +       uint64_t data[2] = { };
> +       ssize_t len;
>
> -       ring->head = ring_read(ring, RING_HEAD) & HEAD_ADDR;
> -       ring->tail = ring_read(ring, RING_TAIL) & TAIL_ADDR;
> +       len = read(fd, data, sizeof(data));
> +       assert(len == sizeof(data));
>
> -       if (ring->tail == ring->head)
> -               ring->idle++;
> +       if (ts)
> +               *ts = data[1];
>
> -       full = ring->tail - ring->head;
> -       if (full < 0)
> -               full += ring->size;
> -       ring->full += full;
> +       return data[0];
>  }
>
> -static void ring_print_header(FILE *out, struct ring *ring)
> +static uint64_t pmu_read_single(int fd)
>  {
> -    fprintf(out, "%.6s%%\tops\t",
> -            ring->name
> -          );
> +       return __pmu_read_single(fd, NULL);
>  }
>
> -static void ring_print(struct ring *ring, unsigned long samples_per_sec)
> +static void __update_sample(struct pmu_counter *counter, uint64_t val)
>  {
> -       int percent_busy, len;
> +       counter->val.prev = counter->val.cur;
> +       counter->val.cur = val;
> +}
>
> -       if (!ring->size)
> -               return;
> +static void update_sample(struct pmu_counter *counter, uint64_t *val)
> +{
> +       if (counter->present)
> +               __update_sample(counter, val[counter->idx]);
> +}
> +
> +static void pmu_sample(struct engines *engines)
> +{
> +       const int num_val = engines->num_counters;
> +       uint64_t val[2 + num_val];
> +       unsigned int i;
> +
> +       engines->ts.prev = engines->ts.cur;
> +
> +       if (engines->rapl_fd >= 0)
> +               __update_sample(&engines->rapl,
> +                               pmu_read_single(engines->rapl_fd));
> +
> +       if (engines->imc_fd >= 0) {
> +               pmu_read_multi(engines->imc_fd, 2, val);
> +               update_sample(&engines->imc_reads, val);
> +               update_sample(&engines->imc_writes, val);
> +       }
>
> -       percent_busy = 100 - 100 * ring->idle / samples_per_sec;
> +       engines->ts.cur = pmu_read_multi(engines->fd, num_val, val);
>
> -       len = printf("%25s busy: %3d%%: ", ring->name, percent_busy);
> -       print_percentage_bar (percent_busy, len);
> -       printf("%24s space: %d/%d\n",
> -                  ring->name,
> -                  (int)(ring->full / samples_per_sec),
> -                  ring->size);
> +       update_sample(&engines->freq_req, val);
> +       update_sample(&engines->freq_act, val);
> +       update_sample(&engines->irq, val);
> +       update_sample(&engines->rc6, val);
> +
> +       for (i = 0; i < engines->num_engines; i++) {
> +               struct engine *engine = engine_ptr(engines, i);
> +
> +               update_sample(&engine->busy, val);
> +               update_sample(&engine->sema, val);
> +               update_sample(&engine->wait, val);
> +       }
>  }
>
> -static void ring_log(struct ring *ring, unsigned long samples_per_sec,
> -               FILE *output)
> +static const char *bars[] = { " ", "▏", "▎", "▍", "▌", "▋", "▊", "▉", "█" };
> +
> +static void
> +print_percentage_bar(double percent, int max_len)
>  {
> -       if (ring->size)
> -               fprintf(output, "%3d\t%d\t",
> -                       (int)(100 - 100 * ring->idle / samples_per_sec),
> -                       (int)(ring->full / samples_per_sec));
> -       else
> -               fprintf(output, "-1\t-1\t");
> +       int bar_len = percent * (8 * (max_len - 2)) / 100.0;
> +       int i;
> +
> +       putchar('|');
> +
> +       for (i = bar_len; i >= 8; i -= 8)
> +               printf("%s", bars[8]);
> +       if (i)
> +               printf("%s", bars[i]);
> +
> +       for (i = 0; i < (max_len - 2 - (bar_len + 7) / 8); i++)
> +               putchar(' ');
> +
> +       putchar('|');
>  }
>
> +#define DEFAULT_PERIOD_MS (1000)
> +
>  static void
>  usage(const char *appname)
>  {
>         printf("intel_gpu_top - Display a top-like summary of Intel GPU usage\n"
> -                       "\n"
> -                       "usage: %s [parameters]\n"
> -                       "\n"
> -                       "The following parameters apply:\n"
> -                       "[-s <samples>]       samples per seconds (default %d)\n"
> -                       "[-e <command>]       command to profile\n"
> -                       "[-o <file>]          output statistics to file. If file is '-',"
> -                       "                     run in batch mode and output statistics to stdio only \n"
> -                       "[-h]                 show this help screen\n"
> -                       "\n",
> -                       appname,
> -                       SAMPLES_PER_SEC
> -                 );
> -       return;
> +               "\n"
> +               "Usage: %s [parameters]\n"
> +               "\n"
> +               "\tThe following parameters are optional:\n\n"
> +               "\t[-s <ms>]       Refresh period in milliseconds (default %ums).\n"
> +               "\t[-h]            Show this help text.\n"
> +               "\n",
> +               appname, DEFAULT_PERIOD_MS);
>  }
>
>  int main(int argc, char **argv)
>  {
> -       uint32_t devid;
> -       struct pci_device *pci_dev;
> -       struct ring render_ring = {
> -               .name = "render",
> -               .mmio = 0x2030,
> -       }, bsd_ring = {
> -               .name = "bitstream",
> -               .mmio = 0x4030,
> -       }, bsd6_ring = {
> -               .name = "bitstream",
> -               .mmio = 0x12030,
> -       }, blt_ring = {
> -               .name = "blitter",
> -               .mmio = 0x22030,
> -       };
> -       int i, ch;
> -       int samples_per_sec = SAMPLES_PER_SEC;
> -       FILE *output = NULL;
> -       double elapsed_time=0;
> -       int print_headers=1;
> -       pid_t child_pid=-1;
> -       int child_stat;
> -       char *cmd=NULL;
> -       int interactive=1;
> -
> -       /* Parse options? */
> -       while ((ch = getopt(argc, argv, "s:o:e:h")) != -1) {
> +       unsigned int period_us = DEFAULT_PERIOD_MS * 1000;
> +       int con_w = -1, con_h = -1;
> +       struct engines *engines;
> +       unsigned int i;
> +       int ret, ch;
> +
> +       /* Parse options */
> +       while ((ch = getopt(argc, argv, "s:h")) != -1) {
>                 switch (ch) {
> -               case 'e': cmd = strdup(optarg);
> -                       break;
> -               case 's': samples_per_sec = atoi(optarg);
> -                       if (samples_per_sec < 100) {
> -                               fprintf(stderr, "Error: samples per second must be >= 100\n");
> -                               exit(1);
> -                       }
> -                       break;
> -               case 'o':
> -                       if (!strcmp(optarg, "-")) {
> -                               /* Running in non-interactive mode */
> -                               interactive = 0;
> -                               output = stdout;
> -                       }
> -                       else
> -                               output = fopen(optarg, "w");
> -                       if (!output)
> -                       {
> -                               perror("fopen");
> -                               exit(1);
> -                       }
> +               case 's':
> +                       period_us = atoi(optarg) * 1000;
>                         break;
>                 case 'h':
>                         usage(argv[0]);
>                         exit(0);
> -                       break;
>                 default:
> -                       fprintf(stderr, "Invalid flag %c!\n", (char)optopt);
> +                       fprintf(stderr, "Invalid option %c!\n", (char)optopt);
>                         usage(argv[0]);
>                         exit(1);
> -                       break;
>                 }
>         }
>
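Not a blocker, but `atoi(optarg) * 1000` turns `-s foo` into a zero period and a negative argument into a very large unsigned one. A possible tightening, purely as a sketch: `parse_period_ms` is a hypothetical helper that is not part of the patch, and the upper bound is only there to keep the multiplication from overflowing `unsigned int`.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not part of the patch: parse the -s argument as a
 * positive number of milliseconds and convert it to microseconds.
 * Returns 0 on success, -1 if the string is not a usable period.
 */
static int parse_period_ms(const char *arg, unsigned int *period_us)
{
	char *end;
	long ms;

	errno = 0;
	ms = strtol(arg, &end, 10);

	/* Reject conversion errors, empty input and trailing junk. */
	if (errno || end == arg || *end != '\0')
		return -1;

	/* Reject zero, negative values and anything that would overflow. */
	if (ms <= 0 || (unsigned long)ms > UINT_MAX / 1000)
		return -1;

	*period_us = (unsigned int)ms * 1000;

	return 0;
}
```

The `'s'` case could then call this helper and fall through to `usage()` on failure, leaving `period_us` at its default.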
> -       pci_dev = intel_get_pci_device();
> -       devid = pci_dev->device_id;
> -       intel_mmio_use_pci_bar(pci_dev);
> -       init_instdone_definitions(devid);
> -
> -       /* Do we have a command to run? */
> -       if (cmd != NULL) {
> -               if (output) {
> -                       fprintf(output, "# Profiling: %s\n", cmd);
> -                       fflush(output);
> -               }
> -               child_pid = fork();
> -               if (child_pid < 0) {
> -                       perror("fork");
> -                       exit(1);
> -               }
> -               else if (child_pid == 0) {
> -                       int res;
> -                       res = system(cmd);
> -                       if (res < 0)
> -                               perror("running command");
> -                       if (output) {
> -                               fflush(output);
> -                               fprintf(output, "# %s exited with status %d\n", cmd, res);
> -                               fflush(output);
> -                       }
> -                       free(cmd);
> -                       exit(0);
> -               } else {
> -                       free(cmd);
> -               }
> +       engines = discover_engines();
> +       if (!engines) {
> +               fprintf(stderr,
> +                       "Failed to detect engines! (%s)\n(Kernel 4.16 or newer is required for i915 PMU support.)\n",
> +                       strerror(errno));
> +               return 1;
>         }
>
> -       for (i = 0; i < num_instdone_bits; i++) {
> -               top_bits[i].bit = &instdone_bits[i];
> -               top_bits[i].count = 0;
> -               top_bits_sorted[i] = &top_bits[i];
> +       ret = pmu_init(engines);
> +       if (ret) {
> +               fprintf(stderr,
> +                       "Failed to initialize PMU! (%s)\n", strerror(errno));
> +               return 1;
>         }
>
> -       /* Grab access to the registers */
> -       intel_register_access_init(pci_dev, 0, -1);
> +       pmu_sample(engines);
>
> -       ring_init(&render_ring);
> -       if (IS_GEN4(devid) || IS_GEN5(devid))
> -               ring_init(&bsd_ring);
> -       if (IS_GEN6(devid) || IS_GEN7(devid)) {
> -               ring_init(&bsd6_ring);
> -               ring_init(&blt_ring);
> -       }
> +       for (;;) {
> +               double t;
> +#define BUFSZ 16
> +               char freq[BUFSZ];
> +               char fact[BUFSZ];
> +               char irq[BUFSZ];
> +               char rc6[BUFSZ];
> +               char power[BUFSZ];
> +               char reads[BUFSZ];
> +               char writes[BUFSZ];
> +               struct winsize ws;
> +               int lines = 0;
>
> -       /* Initialize GPU stats */
> -       if (HAS_STATS_REGS(devid)) {
> -               for (i = 0; i < STATS_COUNT; i++) {
> -                       uint32_t stats_high, stats_low, stats_high_2;
> +               /* Update terminal size. */
> +               if (ioctl(0, TIOCGWINSZ, &ws) != -1) {
> +                       con_w = ws.ws_col;
> +                       con_h = ws.ws_row;
> +               }
>
> -                       do {
> -                               stats_high = INREG(stats_regs[i] + 4);
> -                               stats_low = INREG(stats_regs[i]);
> -                               stats_high_2 = INREG(stats_regs[i] + 4);
> -                       } while (stats_high != stats_high_2);
> +               pmu_sample(engines);
> +               t = (double)(engines->ts.cur - engines->ts.prev) / 1e9;
>
> -                       last_stats[i] = (uint64_t)stats_high << 32 |
> -                               stats_low;
> -               }
> -       }
> +               printf("\033[H\033[J");
>
> -       for (;;) {
> -               int j;
> -               unsigned long long t1, ti, tf, t2;
> -               unsigned long long def_sleep = 1000000 / samples_per_sec;
> -               unsigned long long last_samples_per_sec = samples_per_sec;
> -               unsigned short int max_lines;
> -               struct winsize ws;
> -               char clear_screen[] = {0x1b, '[', 'H',
> -                                      0x1b, '[', 'J',
> -                                      0x0};
> -               int percent;
> -               int len;
> -
> -               t1 = gettime();
> -
> -               ring_reset(&render_ring);
> -               ring_reset(&bsd_ring);
> -               ring_reset(&bsd6_ring);
> -               ring_reset(&blt_ring);
> -
> -               for (i = 0; i < samples_per_sec; i++) {
> -                       long long interval;
> -                       ti = gettime();
> -                       if (IS_965(devid)) {
> -                               instdone = INREG(INSTDONE_I965);
> -                               instdone1 = INREG(INSTDONE_1);
> -                       } else
> -                               instdone = INREG(INSTDONE);
> -
> -                       for (j = 0; j < num_instdone_bits; j++)
> -                               update_idle_bit(&top_bits[j]);
> -
> -                       ring_sample(&render_ring);
> -                       ring_sample(&bsd_ring);
> -                       ring_sample(&bsd6_ring);
> -                       ring_sample(&blt_ring);
> -
> -                       tf = gettime();
> -                       if (tf - t1 >= 1000000) {
> -                               /* We are out of sync, bail out */
> -                               last_samples_per_sec = i+1;
> -                               break;
> -                       }
> -                       interval = def_sleep - (tf - ti);
> -                       if (interval > 0)
> -                               usleep(interval);
> -               }
> +               pmu_calc(&engines->freq_req, freq, BUFSZ, 4, 0, 1.0, t, 1);
> +               pmu_calc(&engines->freq_act, fact, BUFSZ, 4, 0, 1.0, t, 1);
> +               pmu_calc(&engines->irq, irq, BUFSZ, 8, 0, 1.0, t, 1);
> +               pmu_calc(&engines->rc6, rc6, BUFSZ, 3, 0, 1e9, t, 100);
> +               pmu_calc(&engines->rapl, power, BUFSZ, 4, 2, 1.0, t,
> +                        engines->rapl_scale);
> +               pmu_calc(&engines->imc_reads, reads, BUFSZ, 6, 0, 1.0, t,
> +                        engines->imc_reads_scale);
> +               pmu_calc(&engines->imc_writes, writes, BUFSZ, 6, 0, 1.0, t,
> +                        engines->imc_writes_scale);
>
> -               if (HAS_STATS_REGS(devid)) {
> -                       for (i = 0; i < STATS_COUNT; i++) {
> -                               uint32_t stats_high, stats_low, stats_high_2;
> +               if (lines++ < con_h)
> +                       printf("intel-gpu-top - %s/%s MHz;  %s%% RC6; %s %s; %s irqs/s\n",
> +                              fact, freq, rc6, power, engines->rapl_unit, irq);
>
> -                               do {
> -                                       stats_high = INREG(stats_regs[i] + 4);
> -                                       stats_low = INREG(stats_regs[i]);
> -                                       stats_high_2 = INREG(stats_regs[i] + 4);
> -                               } while (stats_high != stats_high_2);
> +               if (lines++ < con_h)
> +                       printf("\n");
>
> -                               stats[i] = (uint64_t)stats_high << 32 |
> -                                       stats_low;
> -                       }
> -               }
> +               if (engines->imc_fd) {
> +                       if (lines++ < con_h)
> +                               printf("      IMC reads:   %s %s/s\n",
> +                                      reads, engines->imc_reads_unit);
> +
> +                       if (lines++ < con_h)
> +                               printf("     IMC writes:   %s %s/s\n",
> +                                      writes, engines->imc_writes_unit);
>
> -               qsort(top_bits_sorted, num_instdone_bits,
> -                     sizeof(struct top_bit *), top_bits_sort);
> -
> -               /* Limit the number of lines printed to the terminal height so the
> -                * most important info (at the top) will stay on screen. */
> -               max_lines = -1;
> -               if (ioctl(0, TIOCGWINSZ, &ws) != -1)
> -                       max_lines = ws.ws_row - 6; /* exclude header lines */
> -               if (max_lines >= num_instdone_bits)
> -                       max_lines = num_instdone_bits;
> -
> -               t2 = gettime();
> -               elapsed_time += (t2 - t1) / 1000000.0;
> -
> -               if (interactive) {
> -                       printf("%s", clear_screen);
> -                       print_clock_info(pci_dev);
> -
> -                       ring_print(&render_ring, last_samples_per_sec);
> -                       ring_print(&bsd_ring, last_samples_per_sec);
> -                       ring_print(&bsd6_ring, last_samples_per_sec);
> -                       ring_print(&blt_ring, last_samples_per_sec);
> -
> -                       printf("\n%30s  %s\n", "task", "percent busy");
> -                       for (i = 0; i < max_lines; i++) {
> -                               if (top_bits_sorted[i]->count > 0) {
> -                                       percent = (top_bits_sorted[i]->count * 100) /
> -                                               last_samples_per_sec;
> -                                       len = printf("%30s: %3d%%: ",
> -                                                        top_bits_sorted[i]->bit->name,
> -                                                        percent);
> -                                       print_percentage_bar (percent, len);
> -                               } else {
> -                                       printf("%*s", PERCENTAGE_BAR_END, "");
> -                               }
> -
> -                               if (i < STATS_COUNT && HAS_STATS_REGS(devid)) {
> -                                       printf("%13s: %llu (%lld/sec)",
> -                                                  stats_reg_names[i],
> -                                                  (long long)stats[i],
> -                                                  (long long)(stats[i] - last_stats[i]));
> -                                       last_stats[i] = stats[i];
> -                               } else {
> -                                       if (!top_bits_sorted[i]->count)
> -                                               break;
> -                               }
> +                       if (++lines < con_h)
>                                 printf("\n");
> -                       }
>                 }
> -               if (output) {
> -                       /* Print headers for columns at first run */
> -                       if (print_headers) {
> -                               fprintf(output, "# time\t");
> -                               ring_print_header(output, &render_ring);
> -                               ring_print_header(output, &bsd_ring);
> -                               ring_print_header(output, &bsd6_ring);
> -                               ring_print_header(output, &blt_ring);
> -                               for (i = 0; i < MAX_NUM_TOP_BITS; i++) {
> -                                       if (i < STATS_COUNT && HAS_STATS_REGS(devid)) {
> -                                               fprintf(output, "%.6s\t",
> -                                                          stats_reg_names[i]
> -                                                          );
> -                                       }
> -                                       if (!top_bits[i].count)
> -                                               continue;
> -                               }
> -                               fprintf(output, "\n");
> -                               print_headers = 0;
> -                       }
>
> -                       /* Print statistics */
> -                       fprintf(output, "%.2f\t", elapsed_time);
> -                       ring_log(&render_ring, last_samples_per_sec, output);
> -                       ring_log(&bsd_ring, last_samples_per_sec, output);
> -                       ring_log(&bsd6_ring, last_samples_per_sec, output);
> -                       ring_log(&blt_ring, last_samples_per_sec, output);
> -
> -                       for (i = 0; i < MAX_NUM_TOP_BITS; i++) {
> -                               if (i < STATS_COUNT && HAS_STATS_REGS(devid)) {
> -                                       fprintf(output, "%"PRIu64"\t",
> -                                                  stats[i] - last_stats[i]);
> -                                       last_stats[i] = stats[i];
> -                               }
> -                                       if (!top_bits[i].count)
> -                                               continue;
> -                       }
> -                       fprintf(output, "\n");
> -                       fflush(output);
> -               }
> +               for (i = 0; i < engines->num_engines; i++) {
> +                       struct engine *engine = engine_ptr(engines, i);
>
> -               for (i = 0; i < num_instdone_bits; i++) {
> -                       top_bits_sorted[i]->count = 0;
> +                       if (engine->num_counters && lines < con_h) {
> +                               const char *a = "          ENGINE      BUSY ";
> +                               const char *b = " MI_SEMA MI_WAIT";
>
> -                       if (i < STATS_COUNT)
> -                               last_stats[i] = stats[i];
> +                               printf("\033[7m%s%*s%s\033[0m\n",
> +                                      a,
> +                                      (int)(con_w - 1 - strlen(a) - strlen(b)),
> +                                      " ", b);
> +                               lines++;
> +                               break;
> +                       }
>                 }
>
> -               /* Check if child has gone */
> -               if (child_pid > 0) {
> -                       int res;
> -                       if ((res = waitpid(child_pid, &child_stat, WNOHANG)) == -1) {
> -                               perror("waitpid");
> -                               exit(1);
> -                       }
> -                       if (res == 0)
> +               for (i = 0; i < engines->num_engines && lines < con_h; i++) {
> +                       struct engine *engine = engine_ptr(engines, i);
> +                       unsigned int max_w = con_w - 1;
> +                       unsigned int len;
> +                       char sema[BUFSZ];
> +                       char wait[BUFSZ];
> +                       char busy[BUFSZ];
> +                       char buf[128];
> +                       double val;
> +
> +                       if (!engine->num_counters)
>                                 continue;
> -                       if (WIFEXITED(child_stat))
> -                               break;
> +
> +                       pmu_calc(&engine->sema, sema, BUFSZ, 3, 0, 1e9, t, 100);
> +                       pmu_calc(&engine->wait, wait, BUFSZ, 3, 0, 1e9, t, 100);
> +                       len = snprintf(buf, sizeof(buf), "    %s%%    %s%%",
> +                                      sema, wait);
> +
> +                       pmu_calc(&engine->busy, busy, BUFSZ, 6, 2, 1e9, t,
> +                                100);
> +                       len += printf("%16s %s%% ", engine->display_name, busy);
> +
> +                       val = __pmu_calc(&engine->busy.val, 1e9, t, 100);
> +                       print_percentage_bar(val, max_w - len);

if (max_w > len + 2)
     print_percentage_bar(val, max_w - len);

or something? Meh.
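
To spell out the concern behind that suggestion: `max_w` and `len` are both `unsigned int`, so on a terminal narrower than the text already printed, `max_w - len` wraps around to a huge value instead of going negative. A minimal sketch of the guard, with `bar_width` as a hypothetical helper name that is not in the patch:

```c
/*
 * Hypothetical helper: width left for the bar, or 0 when the line is
 * already too long for the two '|' delimiters plus at least one cell.
 */
static unsigned int bar_width(unsigned int max_w, unsigned int len)
{
	/* Compare before subtracting; max_w - len would wrap if len > max_w. */
	if (max_w > len + 2)
		return max_w - len;

	return 0;
}
```

The caller would then skip `print_percentage_bar()` whenever this returns 0.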

fwiw,
Reviewed-by: Matthew Auld <matthew.auld@intel.com>

Patch

diff --git a/lib/igt_perf.c b/lib/igt_perf.c
index 99d82ea51c9b..e3dec2cc29c7 100644
--- a/lib/igt_perf.c
+++ b/lib/igt_perf.c
@@ -69,3 +69,9 @@  int igt_perf_open(uint64_t type, uint64_t config)
 	return _perf_open(type, config, -1,
 			  PERF_FORMAT_TOTAL_TIME_ENABLED);
 }
+
+int igt_perf_open_group(uint64_t type, uint64_t config, int group)
+{
+	return _perf_open(type, config, group,
+			  PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_GROUP);
+}
diff --git a/lib/igt_perf.h b/lib/igt_perf.h
index 614ea5d23fa6..e00718f4769a 100644
--- a/lib/igt_perf.h
+++ b/lib/igt_perf.h
@@ -55,5 +55,6 @@  uint64_t i915_type_id(void);
 int perf_i915_open(uint64_t config);
 int perf_i915_open_group(uint64_t config, int group);
 int igt_perf_open(uint64_t type, uint64_t config);
+int igt_perf_open_group(uint64_t type, uint64_t config, int group);
 
 #endif /* I915_PERF_H */
diff --git a/man/intel_gpu_top.rst b/man/intel_gpu_top.rst
index a5f7175bb1a0..19c712307d28 100644
--- a/man/intel_gpu_top.rst
+++ b/man/intel_gpu_top.rst
@@ -7,9 +7,9 @@  Display a top-like summary of Intel GPU usage
 ---------------------------------------------
 .. include:: defs.rst
 :Author: IGT Developers <igt-dev@lists.freedesktop.org>
-:Date: 2016-03-01
+:Date: 2018-04-04
 :Version: |PACKAGE_STRING|
-:Copyright: 2009,2011,2012,2016 Intel Corporation
+:Copyright: 2009,2011,2012,2016,2018 Intel Corporation
 :Manual section: |MANUAL_SECTION|
 :Manual group: |MANUAL_GROUP|
 
@@ -21,42 +21,25 @@  SYNOPSIS
 DESCRIPTION
 ===========
 
-**intel_gpu_top** is a tool to display usage information of an Intel GPU. It
-requires root privilege to map the graphics device.
+**intel_gpu_top** is a tool to display usage information on Intel GPUs.
+
+The tool gathers data using perf performance counters (PMU) exposed by i915 and other platform drivers like RAPL (power) and Uncore IMC (memory bandwidth).
 
 OPTIONS
 =======
 
--s SAMPLES
-    Number of samples to acquire per second.
-
--o FILE
-    Collect usage statistics to FILE. If file is "-", run non-interactively
-    and output statistics to stdout.
-
--e COMMAND
-    Execute COMMAND to profile, and leave when it is finished. Note that the
-    entire command with all parameters should be included as one parameter.
+-s <ms>
+    Refresh period in milliseconds.
 
 -h
-    Show usage notes.
+    Show help text.
 
-EXAMPLES
-========
-
-intel_gpu_top -o "cairo-trace-gvim.log" -s 100 -e "cairo-perf-trace /tmp/gvim"
-    Run cairo-perf-trace with /tmp/gvim trace, non-interactively, saving the
-    statistics into cairo-trace-gvim.log file, and collecting 100 samples per
-    second.
-
-Note that idle units are not displayed, so an entirely idle GPU will only
-display the ring status and header.
+LIMITATIONS
+===========
 
-BUGS
-====
+* Not all metrics are supported on all platforms. Where a metric is unsupported, its value will be replaced by a dashed line.
 
-Some GPUs report some units as busy when they aren't, such that even when idle
-and not hung, it will show up as 100% busy.
+* Non-root access to perf counters is controlled by the *perf_event_paranoid* sysctl.
 
 REPORTING BUGS
 ==============
diff --git a/tools/Makefile.am b/tools/Makefile.am
index 09b6dbcc3ece..a0b016ddd7ff 100644
--- a/tools/Makefile.am
+++ b/tools/Makefile.am
@@ -28,6 +28,8 @@  intel_aubdump_la_LDFLAGS = -module -avoid-version -no-undefined
 intel_aubdump_la_SOURCES = aubdump.c
 intel_aubdump_la_LIBADD = $(top_builddir)/lib/libintel_tools.la -ldl
 
+intel_gpu_top_LDADD = $(top_builddir)/lib/libigt_perf.la
+
 bin_SCRIPTS = intel_aubdump
 CLEANFILES = $(bin_SCRIPTS)
 
diff --git a/tools/intel_gpu_top.c b/tools/intel_gpu_top.c
index 098e6ce3ff86..b923c3cfbe97 100644
--- a/tools/intel_gpu_top.c
+++ b/tools/intel_gpu_top.c
@@ -1,6 +1,5 @@ 
 /*
- * Copyright © 2007 Intel Corporation
- * Copyright © 2011 Intel Corporation
+ * Copyright © 2007-2018 Intel Corporation
  *
  * Permission is hereby granted, free of charge, to any person obtaining a
  * copy of this software and associated documentation files (the "Software"),
@@ -24,695 +23,832 @@ 
  * Authors:
  *    Eric Anholt <eric@anholt.net>
  *    Eugeni Dodonov <eugeni.dodonov@intel.com>
- *
  */
 
-#include "config.h"
-
-#include <inttypes.h>
-#include <unistd.h>
-#include <stdlib.h>
 #include <stdio.h>
-#include <err.h>
-#include <sys/ioctl.h>
-#include <sys/time.h>
-#include <sys/wait.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include <stdint.h>
+#include <assert.h>
 #include <string.h>
-#ifdef HAVE_TERMIOS_H
-#include <termios.h>
-#endif
-#include "intel_io.h"
-#include "instdone.h"
-#include "intel_reg.h"
-#include "intel_chipset.h"
-#include "drmtest.h"
-
-#define  FORCEWAKE	    0xA18C
-#define  FORCEWAKE_ACK	    0x130090
-
-#define SAMPLES_PER_SEC             10000
-#define SAMPLES_TO_PERCENT_RATIO    (SAMPLES_PER_SEC / 100)
-
-#define MAX_NUM_TOP_BITS            100
-
-#define HAS_STATS_REGS(devid)		IS_965(devid)
-
-struct top_bit {
-	struct instdone_bit *bit;
-	int count;
-} top_bits[MAX_NUM_TOP_BITS];
-struct top_bit *top_bits_sorted[MAX_NUM_TOP_BITS];
-
-static uint32_t instdone, instdone1;
-
-static const char *bars[] = {
-	" ",
-	"▏",
-	"▎",
-	"▍",
-	"▌",
-	"▋",
-	"▊",
-	"▉",
-	"█"
-};
+#include <ctype.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <sys/ioctl.h>
+#include <errno.h>
+#include <math.h>
+#include <locale.h>
+
+#include "igt_perf.h"
 
-enum stats_counts {
-	IA_VERTICES,
-	IA_PRIMITIVES,
-	VS_INVOCATION,
-	GS_INVOCATION,
-	GS_PRIMITIVES,
-	CL_INVOCATION,
-	CL_PRIMITIVES,
-	PS_INVOCATION,
-	PS_DEPTH,
-	STATS_COUNT
+struct pmu_pair {
+	uint64_t cur;
+	uint64_t prev;
 };
 
-const uint32_t stats_regs[STATS_COUNT] = {
-	IA_VERTICES_COUNT_QW,
-	IA_PRIMITIVES_COUNT_QW,
-	VS_INVOCATION_COUNT_QW,
-	GS_INVOCATION_COUNT_QW,
-	GS_PRIMITIVES_COUNT_QW,
-	CL_INVOCATION_COUNT_QW,
-	CL_PRIMITIVES_COUNT_QW,
-	PS_INVOCATION_COUNT_QW,
-	PS_DEPTH_COUNT_QW,
+struct pmu_counter {
+	bool present;
+	uint64_t config;
+	unsigned int idx;
+	struct pmu_pair val;
 };
 
-const char *stats_reg_names[STATS_COUNT] = {
-	"vert fetch",
-	"prim fetch",
-	"VS invocations",
-	"GS invocations",
-	"GS prims",
-	"CL invocations",
-	"CL prims",
-	"PS invocations",
-	"PS depth pass",
+struct engine {
+	const char *name;
+	const char *display_name;
+
+	unsigned int class;
+	unsigned int instance;
+
+	unsigned int num_counters;
+
+	struct pmu_counter busy;
+	struct pmu_counter wait;
+	struct pmu_counter sema;
 };
 
-uint64_t stats[STATS_COUNT];
-uint64_t last_stats[STATS_COUNT];
+struct engines {
+	unsigned int num_engines;
+	unsigned int num_counters;
+	DIR *root;
+	int fd;
+	struct pmu_pair ts;
+
+	int rapl_fd;
+	double rapl_scale;
+	const char *rapl_unit;
+
+	int imc_fd;
+	double imc_reads_scale;
+	const char *imc_reads_unit;
+	double imc_writes_scale;
+	const char *imc_writes_unit;
+
+	struct pmu_counter freq_req;
+	struct pmu_counter freq_act;
+	struct pmu_counter irq;
+	struct pmu_counter rc6;
+	struct pmu_counter rapl;
+	struct pmu_counter imc_reads;
+	struct pmu_counter imc_writes;
+
+	struct engine engine;
+};
 
-static unsigned long
-gettime(void)
+static uint64_t
+get_pmu_config(int dirfd, const char *name, const char *counter)
 {
-    struct timeval t;
-    gettimeofday(&t, NULL);
-    return (t.tv_usec + (t.tv_sec * 1000000));
-}
+	char buf[128], *p;
+	int fd, ret;
 
-static int
-top_bits_sort(const void *a, const void *b)
-{
-	struct top_bit * const *bit_a = a;
-	struct top_bit * const *bit_b = b;
-	int a_count = (*bit_a)->count;
-	int b_count = (*bit_b)->count;
+	ret = snprintf(buf, sizeof(buf), "%s-%s", name, counter);
+	if (ret < 0 || ret == sizeof(buf))
+		return -1;
 
-	if (a_count < b_count)
-		return 1;
-	else if (a_count == b_count)
-		return 0;
-	else
+	fd = openat(dirfd, buf, O_RDONLY);
+	if (fd < 0)
 		return -1;
-}
 
-static void
-update_idle_bit(struct top_bit *top_bit)
-{
-	uint32_t reg_val;
+	ret = read(fd, buf, sizeof(buf));
+	close(fd);
+	if (ret <= 0)
+		return -1;
 
-	if (top_bit->bit->reg == INSTDONE_1)
-		reg_val = instdone1;
-	else
-		reg_val = instdone;
+	p = index(buf, '0');
+	if (!p)
+		return -1;
 
-	if ((reg_val & top_bit->bit->bit) == 0)
-		top_bit->count++;
+	return strtoul(p, NULL, 0);
 }
 
-static void
-print_clock(const char *name, int clock) {
-	if (clock == -1)
-		printf("%s clock: unknown", name);
+#define engine_ptr(engines, n) (&engines->engine + (n))
+
+static const char *class_display_name(unsigned int class)
+{
+	switch (class) {
+	case I915_ENGINE_CLASS_RENDER:
+		return "Render/3D";
+	case I915_ENGINE_CLASS_COPY:
+		return "Blitter";
+	case I915_ENGINE_CLASS_VIDEO:
+		return "Video";
+	case I915_ENGINE_CLASS_VIDEO_ENHANCE:
+		return "VideoEnhance";
+	default:
+		return "[unknown]";
+	}
+}
+
+static int engine_cmp(const void *__a, const void *__b)
+{
+	const struct engine *a = (struct engine *)__a;
+	const struct engine *b = (struct engine *)__b;
+
+	if (a->class != b->class)
+		return a->class - b->class;
 	else
-		printf("%s clock: %d Mhz", name, clock);
+		return a->instance - b->instance;
 }
 
-static int
-print_clock_info(struct pci_device *pci_dev)
+static struct engines *discover_engines(void)
 {
-	uint32_t devid = pci_dev->device_id;
-	uint16_t gcfgc;
+	const char *sysfs_root = "/sys/devices/i915/events";
+	struct engines *engines;
+	struct dirent *dent;
+	int ret = 0;
+	DIR *d;
 
-	if (IS_GM45(devid)) {
-		int core_clock = -1;
+	engines = malloc(sizeof(struct engines));
+	if (!engines)
+		return NULL;
 
-		pci_device_cfg_read_u16(pci_dev, &gcfgc, I915_GCFGC);
+	memset(engines, 0, sizeof(*engines));
 
-		switch (gcfgc & 0xf) {
-		case 8:
-			core_clock = 266;
-			break;
-		case 9:
-			core_clock = 320;
-			break;
-		case 11:
-			core_clock = 400;
-			break;
-		case 13:
-			core_clock = 533;
-			break;
-		}
-		print_clock("core", core_clock);
-	} else if (IS_965(devid) && IS_MOBILE(devid)) {
-		int render_clock = -1, sampler_clock = -1;
+	engines->num_engines = 0;
 
-		pci_device_cfg_read_u16(pci_dev, &gcfgc, I915_GCFGC);
+	d = opendir(sysfs_root);
+	if (!d)
+		return NULL;
 
-		switch (gcfgc & 0xf) {
-		case 2:
-			render_clock = 250; sampler_clock = 267;
-			break;
-		case 3:
-			render_clock = 320; sampler_clock = 333;
-			break;
-		case 4:
-			render_clock = 400; sampler_clock = 444;
-			break;
-		case 5:
-			render_clock = 500; sampler_clock = 533;
+	while ((dent = readdir(d)) != NULL) {
+		const char *endswith = "-busy";
+		const unsigned int endlen = strlen(endswith);
+		struct engine *engine =
+				engine_ptr(engines, engines->num_engines);
+		char buf[256];
+
+		if (dent->d_type != DT_REG)
+			continue;
+
+		if (strlen(dent->d_name) >= sizeof(buf)) {
+			ret = ENAMETOOLONG;
 			break;
 		}
 
-		print_clock("render", render_clock);
-		printf("  ");
-		print_clock("sampler", sampler_clock);
-	} else if (IS_945(devid) && IS_MOBILE(devid)) {
-		int render_clock = -1, display_clock = -1;
+		strcpy(buf, dent->d_name);
 
-		pci_device_cfg_read_u16(pci_dev, &gcfgc, I915_GCFGC);
+		/* xxxN-busy */
+		if (strlen(buf) < (endlen + 4))
+			continue;
+		if (strcmp(&buf[strlen(buf) - endlen], endswith))
+			continue;
 
-		switch (gcfgc & 0x7) {
-		case 0:
-			render_clock = 166;
-			break;
-		case 1:
-			render_clock = 200;
-			break;
-		case 3:
-			render_clock = 250;
-			break;
-		case 5:
-			render_clock = 400;
+		memset(engine, 0, sizeof(*engine));
+
+		buf[strlen(buf) - endlen] = 0;
+		engine->name = strdup(buf);
+		if (!engine->name) {
+			ret = errno;
 			break;
 		}
 
-		switch (gcfgc & 0x70) {
-		case 0:
-			display_clock = 200;
-			break;
-		case 4:
-			display_clock = 320;
+		engine->busy.config = get_pmu_config(dirfd(d), engine->name,
+						     "busy");
+		if (engine->busy.config == -1) {
+			ret = ENOENT;
 			break;
 		}
-		if (gcfgc & (1 << 7))
-		    display_clock = 133;
 
-		print_clock("render", render_clock);
-		printf("  ");
-		print_clock("display", display_clock);
-	} else if (IS_915(devid) && IS_MOBILE(devid)) {
-		int render_clock = -1, display_clock = -1;
+		engine->class = (engine->busy.config &
+				 (__I915_PMU_OTHER(0) - 1)) >>
+				I915_PMU_CLASS_SHIFT;
 
-		pci_device_cfg_read_u16(pci_dev, &gcfgc, I915_GCFGC);
+		engine->instance = (engine->busy.config >>
+				    I915_PMU_SAMPLE_BITS) &
+				    ((1 << I915_PMU_SAMPLE_INSTANCE_BITS) - 1);
 
-		switch (gcfgc & 0x7) {
-		case 0:
-			render_clock = 160;
-			break;
-		case 1:
-			render_clock = 190;
-			break;
-		case 4:
-			render_clock = 333;
+		ret = snprintf(buf, sizeof(buf), "%s/%u",
+			       class_display_name(engine->class),
+			       engine->instance);
+		if (ret < 0 || ret == sizeof(buf)) {
+			ret = ENOBUFS;
 			break;
 		}
-		if (gcfgc & (1 << 13))
-		    render_clock = 133;
+		ret = 0;
 
-		switch (gcfgc & 0x70) {
-		case 0:
-			display_clock = 190;
+		engine->display_name = strdup(buf);
+		if (!engine->display_name) {
+			ret = errno;
 			break;
-		case 4:
-			display_clock = 333;
+		}
+
+		engines->num_engines++;
+		engines = realloc(engines, sizeof(struct engines) +
+				  engines->num_engines * sizeof(struct engine));
+		if (!engines) {
+			ret = errno;
 			break;
 		}
-		if (gcfgc & (1 << 7))
-		    display_clock = 133;
+	}
+
+	if (ret) {
+		free(engines);
+		errno = ret;
 
-		print_clock("render", render_clock);
-		printf("  ");
-		print_clock("display", display_clock);
+		return NULL;
 	}
 
+	qsort(engine_ptr(engines, 0), engines->num_engines,
+	      sizeof(struct engine), engine_cmp);
+
+	engines->root = d;
 
-	printf("\n");
-	return -1;
+	return engines;
 }
 
-#define STATS_LEN (20)
-#define PERCENTAGE_BAR_END	(79 - STATS_LEN)
+static int
+filename_to_buf(const char *filename, char *buf, unsigned int bufsize)
+{
+	int fd, err;
+	ssize_t ret;
 
-static void
-print_percentage_bar(float percent, int cur_line_len)
+	fd = open(filename, O_RDONLY);
+	if (fd < 0)
+		return -1;
+
+	ret = read(fd, buf, bufsize - 1);
+	err = errno;
+	close(fd);
+	if (ret < 1) {
+		errno = ret < 0 ? err : ENOMSG;
+
+		return -1;
+	}
+
+	if (ret > 1 && buf[ret - 1] == '\n')
+		buf[ret - 1] = '\0';
+	else
+		buf[ret] = '\0';
+
+	return 0;
+}
+
+static uint64_t filename_to_u64(const char *filename, int base)
 {
-	int bar_avail_len = (PERCENTAGE_BAR_END - cur_line_len - 1) * 8;
-	int bar_len = bar_avail_len * (percent + .5) / 100.0;
-	int i;
+	char buf[64], *b;
 
-	for (i = bar_len; i >= 8; i -= 8) {
-		printf("%s", bars[8]);
-		cur_line_len++;
+	if (filename_to_buf(filename, buf, sizeof(buf)))
+		return 0;
+
+	/*
+	 * Handle both single integer and key=value formats by skipping
+	 * leading non-digits.
+	 */
+	b = buf;
+	while (*b && !isdigit((unsigned char)*b))
+		b++;
+
+	return strtoull(b, NULL, base);
+}
+
+static double filename_to_double(const char *filename)
+{
+	char *oldlocale;
+	char buf[80];
+	double v;
+
+	if (filename_to_buf(filename, buf, sizeof(buf)))
+		return 0;
+
+	oldlocale = setlocale(LC_ALL, "C");
+	v = strtod(buf, NULL);
+	setlocale(LC_ALL, oldlocale);
+
+	return v;
+}
+
+#define RAPL_ROOT "/sys/devices/power/"
+#define RAPL_EVENT "/sys/devices/power/events/"
+
+static uint64_t rapl_type_id(void)
+{
+	return filename_to_u64(RAPL_ROOT "type", 10);
+}
+
+static uint64_t rapl_gpu_power(void)
+{
+	return filename_to_u64(RAPL_EVENT "energy-gpu", 0);
+}
+
+static double rapl_gpu_power_scale(void)
+{
+	return filename_to_double(RAPL_EVENT "energy-gpu.scale");
+}
+
+static const char *rapl_gpu_power_unit(void)
+{
+	char buf[32];
+
+	if (filename_to_buf(RAPL_EVENT "energy-gpu.unit",
+			    buf, sizeof(buf)) == 0)
+		if (!strcmp(buf, "Joules"))
+			return strdup("Watts");
+		else
+			return strdup(buf);
+	else
+		return NULL;
+}
+
+#define IMC_ROOT "/sys/devices/uncore_imc/"
+#define IMC_EVENT "/sys/devices/uncore_imc/events/"
+
+static uint64_t imc_type_id(void)
+{
+	return filename_to_u64(IMC_ROOT "type", 10);
+}
+
+static uint64_t imc_data_reads(void)
+{
+	return filename_to_u64(IMC_EVENT "data_reads", 0);
+}
+
+static double imc_data_reads_scale(void)
+{
+	return filename_to_double(IMC_EVENT "data_reads.scale");
+}
+
+static const char *imc_data_reads_unit(void)
+{
+	char buf[32];
+
+	if (filename_to_buf(IMC_EVENT "data_reads.unit", buf, sizeof(buf)) == 0)
+		return strdup(buf);
+	else
+		return NULL;
+}
+
+static uint64_t imc_data_writes(void)
+{
+	return filename_to_u64(IMC_EVENT "data_writes", 0);
+}
+
+static double imc_data_writes_scale(void)
+{
+	return filename_to_double(IMC_EVENT "data_writes.scale");
+}
+
+static const char *imc_data_writes_unit(void)
+{
+	char buf[32];
+
+	if (filename_to_buf(IMC_EVENT "data_writes.unit",
+			    buf, sizeof(buf)) == 0)
+		return strdup(buf);
+	else
+		return NULL;
+}
+
+#define _open_pmu(cnt, pmu, fd) \
+({ \
+	int fd__; \
+\
+	fd__ = perf_i915_open_group((pmu)->config, (fd)); \
+	if (fd__ >= 0) { \
+		if ((fd) == -1) \
+			(fd) = fd__; \
+		(pmu)->present = true; \
+		(pmu)->idx = (cnt)++; \
+	} \
+\
+	fd__; \
+})
+
+#define _open_imc(cnt, pmu, fd) \
+({ \
+	int fd__; \
+\
+	fd__ = igt_perf_open_group(imc_type_id(), (pmu)->config, (fd)); \
+	if (fd__ >= 0) { \
+		if ((fd) == -1) \
+			(fd) = fd__; \
+		(pmu)->present = true; \
+		(pmu)->idx = (cnt)++; \
+	} \
+\
+	fd__; \
+})
+
+static int pmu_init(struct engines *engines)
+{
+	unsigned int i;
+	int fd;
+
+	engines->fd = -1;
+	engines->num_counters = 0;
+
+	engines->irq.config = I915_PMU_INTERRUPTS;
+	fd = _open_pmu(engines->num_counters, &engines->irq, engines->fd);
+	if (fd < 0)
+		return -1;
+
+	engines->freq_req.config = I915_PMU_REQUESTED_FREQUENCY;
+	_open_pmu(engines->num_counters, &engines->freq_req, engines->fd);
+
+	engines->freq_act.config = I915_PMU_ACTUAL_FREQUENCY;
+	_open_pmu(engines->num_counters, &engines->freq_act, engines->fd);
+
+	engines->rc6.config = I915_PMU_RC6_RESIDENCY;
+	_open_pmu(engines->num_counters, &engines->rc6, engines->fd);
+
+	for (i = 0; i < engines->num_engines; i++) {
+		struct engine *engine = engine_ptr(engines, i);
+		struct {
+			struct pmu_counter *pmu;
+			const char *counter;
+		} *cnt, counters[] = {
+			{ .pmu = &engine->busy, .counter = "busy" },
+			{ .pmu = &engine->wait, .counter = "wait" },
+			{ .pmu = &engine->sema, .counter = "sema" },
+			{ .pmu = NULL, .counter = NULL },
+		};
+
+		for (cnt = counters; cnt->pmu; cnt++) {
+			if (!cnt->pmu->config)
+				cnt->pmu->config =
+					get_pmu_config(dirfd(engines->root),
+						       engine->name,
+						       cnt->counter);
+			fd = _open_pmu(engines->num_counters, cnt->pmu,
+				       engines->fd);
+			if (fd >= 0)
+				engine->num_counters++;
+		}
 	}
-	if (i) {
-		printf("%s", bars[i]);
-		cur_line_len++;
+
+	engines->rapl_fd = -1;
+	if (rapl_type_id()) {
+		engines->rapl_scale = rapl_gpu_power_scale();
+		engines->rapl_unit = rapl_gpu_power_unit();
+		if (!engines->rapl_unit)
+			return -1;
+
+		engines->rapl.config = rapl_gpu_power();
+		if (!engines->rapl.config)
+			return -1;
+
+		engines->rapl_fd = igt_perf_open(rapl_type_id(),
+						 engines->rapl.config);
+		if (engines->rapl_fd < 0)
+			return -1;
+
+		engines->rapl.present = true;
 	}
 
-	/* NB: We can't use a field width with utf8 so we manually
-	* guarantee a field with of 45 chars for any bar. */
-	printf("%*s", PERCENTAGE_BAR_END - cur_line_len, "");
-}
+	engines->imc_fd = -1;
+	if (imc_type_id()) {
+		unsigned int num = 0;
 
-struct ring {
-	const char *name;
-	uint32_t mmio;
-	int head, tail, size;
-	uint64_t full;
-	int idle;
-};
+		engines->imc_reads_scale = imc_data_reads_scale();
+		engines->imc_writes_scale = imc_data_writes_scale();
+
+		engines->imc_reads_unit = imc_data_reads_unit();
+		if (!engines->imc_reads_unit)
+			return -1;
+
+		engines->imc_writes_unit = imc_data_writes_unit();
+		if (!engines->imc_writes_unit)
+			return -1;
+
+		engines->imc_reads.config = imc_data_reads();
+		if (!engines->imc_reads.config)
+			return -1;
+
+		engines->imc_writes.config = imc_data_writes();
+		if (!engines->imc_writes.config)
+			return -1;
+
+		fd = _open_imc(num, &engines->imc_reads, engines->imc_fd);
+		if (fd < 0)
+			return -1;
+		fd = _open_imc(num, &engines->imc_writes, engines->imc_fd);
+		if (fd < 0)
+			return -1;
+
+		engines->imc_reads.present = true;
+		engines->imc_writes.present = true;
+	}
+
+	return 0;
+}
 
-static uint32_t ring_read(struct ring *ring, uint32_t reg)
+static uint64_t pmu_read_multi(int fd, unsigned int num, uint64_t *val)
 {
-	return INREG(ring->mmio + reg);
+	uint64_t buf[2 + num];
+	unsigned int i;
+	ssize_t len;
+
+	memset(buf, 0, sizeof(buf));
+
+	len = read(fd, buf, sizeof(buf));
+	assert(len == sizeof(buf));
+
+	for (i = 0; i < num; i++)
+		val[i] = buf[2 + i];
+
+	return buf[1];
 }
 
-static void ring_init(struct ring *ring)
+static double __pmu_calc(struct pmu_pair *p, double d, double t, double s)
 {
-	ring->size = (((ring_read(ring, RING_LEN) & RING_NR_PAGES) >> 12) + 1) * 4096;
+	double v;
+
+	v = p->cur - p->prev;
+	v /= d;
+	v /= t;
+	v *= s;
+
+	if (s == 100.0 && v > 100.0)
+		v = 100.0;
+
+	return v;
 }
 
-static void ring_reset(struct ring *ring)
+static void fill_str(char *buf, unsigned int bufsz, char c, unsigned int num)
 {
-	ring->idle = ring->full = 0;
+	unsigned int i;
+
+	for (i = 0; i < num && i < (bufsz - 1); i++)
+		*buf++ = c;
+
+	*buf = 0;
 }
 
-static void ring_sample(struct ring *ring)
+static void pmu_calc(struct pmu_counter *cnt,
+		     char *buf, unsigned int bufsz,
+		     unsigned int width, unsigned width_dec,
+		     double d, double t, double s)
 {
-	int full;
+	double val;
+	int len;
+
+	assert(bufsz >= (width + width_dec + 1));
+
+	if (!cnt->present) {
+		fill_str(buf, bufsz, '-', width + width_dec);
+		return;
+	}
 
-	if (!ring->size)
+	val = __pmu_calc(&cnt->val, d, t, s);
+
+	len = snprintf(buf, bufsz, "%*.*f", width + width_dec, width_dec, val);
+	if (len < 0 || (unsigned int)len >= bufsz) {
+		fill_str(buf, bufsz, 'X', width + width_dec);
 		return;
+	}
+}
+
+static uint64_t __pmu_read_single(int fd, uint64_t *ts)
+{
+	uint64_t data[2] = { };
+	ssize_t len;
 
-	ring->head = ring_read(ring, RING_HEAD) & HEAD_ADDR;
-	ring->tail = ring_read(ring, RING_TAIL) & TAIL_ADDR;
+	len = read(fd, data, sizeof(data));
+	assert(len == sizeof(data));
 
-	if (ring->tail == ring->head)
-		ring->idle++;
+	if (ts)
+		*ts = data[1];
 
-	full = ring->tail - ring->head;
-	if (full < 0)
-		full += ring->size;
-	ring->full += full;
+	return data[0];
 }
 
-static void ring_print_header(FILE *out, struct ring *ring)
+static uint64_t pmu_read_single(int fd)
 {
-    fprintf(out, "%.6s%%\tops\t",
-            ring->name
-          );
+	return __pmu_read_single(fd, NULL);
 }
 
-static void ring_print(struct ring *ring, unsigned long samples_per_sec)
+static void __update_sample(struct pmu_counter *counter, uint64_t val)
 {
-	int percent_busy, len;
+	counter->val.prev = counter->val.cur;
+	counter->val.cur = val;
+}
 
-	if (!ring->size)
-		return;
+static void update_sample(struct pmu_counter *counter, uint64_t *val)
+{
+	if (counter->present)
+		__update_sample(counter, val[counter->idx]);
+}
+
+static void pmu_sample(struct engines *engines)
+{
+	const int num_val = engines->num_counters;
+	uint64_t val[2 + num_val];
+	unsigned int i;
+
+	engines->ts.prev = engines->ts.cur;
+
+	if (engines->rapl_fd >= 0)
+		__update_sample(&engines->rapl,
+				pmu_read_single(engines->rapl_fd));
+
+	if (engines->imc_fd >= 0) {
+		pmu_read_multi(engines->imc_fd, 2, val);
+		update_sample(&engines->imc_reads, val);
+		update_sample(&engines->imc_writes, val);
+	}
 
-	percent_busy = 100 - 100 * ring->idle / samples_per_sec;
+	engines->ts.cur = pmu_read_multi(engines->fd, num_val, val);
 
-	len = printf("%25s busy: %3d%%: ", ring->name, percent_busy);
-	print_percentage_bar (percent_busy, len);
-	printf("%24s space: %d/%d\n",
-		   ring->name,
-		   (int)(ring->full / samples_per_sec),
-		   ring->size);
+	update_sample(&engines->freq_req, val);
+	update_sample(&engines->freq_act, val);
+	update_sample(&engines->irq, val);
+	update_sample(&engines->rc6, val);
+
+	for (i = 0; i < engines->num_engines; i++) {
+		struct engine *engine = engine_ptr(engines, i);
+
+		update_sample(&engine->busy, val);
+		update_sample(&engine->sema, val);
+		update_sample(&engine->wait, val);
+	}
 }
 
-static void ring_log(struct ring *ring, unsigned long samples_per_sec,
-		FILE *output)
+static const char *bars[] = { " ", "▏", "▎", "▍", "▌", "▋", "▊", "▉", "█" };
+
+static void
+print_percentage_bar(double percent, int max_len)
 {
-	if (ring->size)
-		fprintf(output, "%3d\t%d\t",
-			(int)(100 - 100 * ring->idle / samples_per_sec),
-			(int)(ring->full / samples_per_sec));
-	else
-		fprintf(output, "-1\t-1\t");
+	int bar_len = percent * (8 * (max_len - 2)) / 100.0;
+	int i;
+
+	putchar('|');
+
+	for (i = bar_len; i >= 8; i -= 8)
+		printf("%s", bars[8]);
+	if (i)
+		printf("%s", bars[i]);
+
+	for (i = 0; i < (max_len - 2 - (bar_len + 7) / 8); i++)
+		putchar(' ');
+
+	putchar('|');
 }
 
+#define DEFAULT_PERIOD_MS (1000)
+
 static void
 usage(const char *appname)
 {
 	printf("intel_gpu_top - Display a top-like summary of Intel GPU usage\n"
-			"\n"
-			"usage: %s [parameters]\n"
-			"\n"
-			"The following parameters apply:\n"
-			"[-s <samples>]       samples per seconds (default %d)\n"
-			"[-e <command>]       command to profile\n"
-			"[-o <file>]          output statistics to file. If file is '-',"
-			"                     run in batch mode and output statistics to stdio only \n"
-			"[-h]                 show this help screen\n"
-			"\n",
-			appname,
-			SAMPLES_PER_SEC
-		  );
-	return;
+		"\n"
+		"Usage: %s [parameters]\n"
+		"\n"
+		"\tThe following parameters are optional:\n\n"
+		"\t[-s <ms>]       Refresh period in milliseconds (default %ums).\n"
+		"\t[-h]            Show this help text.\n"
+		"\n",
+		appname, DEFAULT_PERIOD_MS);
 }
 
 int main(int argc, char **argv)
 {
-	uint32_t devid;
-	struct pci_device *pci_dev;
-	struct ring render_ring = {
-		.name = "render",
-		.mmio = 0x2030,
-	}, bsd_ring = {
-		.name = "bitstream",
-		.mmio = 0x4030,
-	}, bsd6_ring = {
-		.name = "bitstream",
-		.mmio = 0x12030,
-	}, blt_ring = {
-		.name = "blitter",
-		.mmio = 0x22030,
-	};
-	int i, ch;
-	int samples_per_sec = SAMPLES_PER_SEC;
-	FILE *output = NULL;
-	double elapsed_time=0;
-	int print_headers=1;
-	pid_t child_pid=-1;
-	int child_stat;
-	char *cmd=NULL;
-	int interactive=1;
-
-	/* Parse options? */
-	while ((ch = getopt(argc, argv, "s:o:e:h")) != -1) {
+	unsigned int period_us = DEFAULT_PERIOD_MS * 1000;
+	int con_w = -1, con_h = -1;
+	struct engines *engines;
+	unsigned int i;
+	int ret, ch;
+
+	/* Parse options */
+	while ((ch = getopt(argc, argv, "s:h")) != -1) {
 		switch (ch) {
-		case 'e': cmd = strdup(optarg);
-			break;
-		case 's': samples_per_sec = atoi(optarg);
-			if (samples_per_sec < 100) {
-				fprintf(stderr, "Error: samples per second must be >= 100\n");
-				exit(1);
-			}
-			break;
-		case 'o':
-			if (!strcmp(optarg, "-")) {
-				/* Running in non-interactive mode */
-				interactive = 0;
-				output = stdout;
-			}
-			else
-				output = fopen(optarg, "w");
-			if (!output)
-			{
-				perror("fopen");
-				exit(1);
-			}
+		case 's':
+			period_us = atoi(optarg) * 1000;
 			break;
 		case 'h':
 			usage(argv[0]);
 			exit(0);
-			break;
 		default:
-			fprintf(stderr, "Invalid flag %c!\n", (char)optopt);
+			fprintf(stderr, "Invalid option %c!\n", (char)optopt);
 			usage(argv[0]);
 			exit(1);
-			break;
 		}
 	}
 
-	pci_dev = intel_get_pci_device();
-	devid = pci_dev->device_id;
-	intel_mmio_use_pci_bar(pci_dev);
-	init_instdone_definitions(devid);
-
-	/* Do we have a command to run? */
-	if (cmd != NULL) {
-		if (output) {
-			fprintf(output, "# Profiling: %s\n", cmd);
-			fflush(output);
-		}
-		child_pid = fork();
-		if (child_pid < 0) {
-			perror("fork");
-			exit(1);
-		}
-		else if (child_pid == 0) {
-			int res;
-			res = system(cmd);
-			if (res < 0)
-				perror("running command");
-			if (output) {
-				fflush(output);
-				fprintf(output, "# %s exited with status %d\n", cmd, res);
-				fflush(output);
-			}
-			free(cmd);
-			exit(0);
-		} else {
-			free(cmd);
-		}
+	engines = discover_engines();
+	if (!engines) {
+		fprintf(stderr,
+			"Failed to detect engines! (%s)\n(Kernel 4.16 or newer is required for i915 PMU support.)\n",
+			strerror(errno));
+		return 1;
 	}
 
-	for (i = 0; i < num_instdone_bits; i++) {
-		top_bits[i].bit = &instdone_bits[i];
-		top_bits[i].count = 0;
-		top_bits_sorted[i] = &top_bits[i];
+	ret = pmu_init(engines);
+	if (ret) {
+		fprintf(stderr,
+			"Failed to initialize PMU! (%s)\n", strerror(errno));
+		return 1;
 	}
 
-	/* Grab access to the registers */
-	intel_register_access_init(pci_dev, 0, -1);
+	pmu_sample(engines);
 
-	ring_init(&render_ring);
-	if (IS_GEN4(devid) || IS_GEN5(devid))
-		ring_init(&bsd_ring);
-	if (IS_GEN6(devid) || IS_GEN7(devid)) {
-		ring_init(&bsd6_ring);
-		ring_init(&blt_ring);
-	}
+	for (;;) {
+		double t;
+#define BUFSZ 16
+		char freq[BUFSZ];
+		char fact[BUFSZ];
+		char irq[BUFSZ];
+		char rc6[BUFSZ];
+		char power[BUFSZ];
+		char reads[BUFSZ];
+		char writes[BUFSZ];
+		struct winsize ws;
+		int lines = 0;
 
-	/* Initialize GPU stats */
-	if (HAS_STATS_REGS(devid)) {
-		for (i = 0; i < STATS_COUNT; i++) {
-			uint32_t stats_high, stats_low, stats_high_2;
+		/* Update terminal size. */
+		if (ioctl(0, TIOCGWINSZ, &ws) != -1) {
+			con_w = ws.ws_col;
+			con_h = ws.ws_row;
+		}
 
-			do {
-				stats_high = INREG(stats_regs[i] + 4);
-				stats_low = INREG(stats_regs[i]);
-				stats_high_2 = INREG(stats_regs[i] + 4);
-			} while (stats_high != stats_high_2);
+		pmu_sample(engines);
+		t = (double)(engines->ts.cur - engines->ts.prev) / 1e9;
 
-			last_stats[i] = (uint64_t)stats_high << 32 |
-				stats_low;
-		}
-	}
+		printf("\033[H\033[J");
 
-	for (;;) {
-		int j;
-		unsigned long long t1, ti, tf, t2;
-		unsigned long long def_sleep = 1000000 / samples_per_sec;
-		unsigned long long last_samples_per_sec = samples_per_sec;
-		unsigned short int max_lines;
-		struct winsize ws;
-		char clear_screen[] = {0x1b, '[', 'H',
-				       0x1b, '[', 'J',
-				       0x0};
-		int percent;
-		int len;
-
-		t1 = gettime();
-
-		ring_reset(&render_ring);
-		ring_reset(&bsd_ring);
-		ring_reset(&bsd6_ring);
-		ring_reset(&blt_ring);
-
-		for (i = 0; i < samples_per_sec; i++) {
-			long long interval;
-			ti = gettime();
-			if (IS_965(devid)) {
-				instdone = INREG(INSTDONE_I965);
-				instdone1 = INREG(INSTDONE_1);
-			} else
-				instdone = INREG(INSTDONE);
-
-			for (j = 0; j < num_instdone_bits; j++)
-				update_idle_bit(&top_bits[j]);
-
-			ring_sample(&render_ring);
-			ring_sample(&bsd_ring);
-			ring_sample(&bsd6_ring);
-			ring_sample(&blt_ring);
-
-			tf = gettime();
-			if (tf - t1 >= 1000000) {
-				/* We are out of sync, bail out */
-				last_samples_per_sec = i+1;
-				break;
-			}
-			interval = def_sleep - (tf - ti);
-			if (interval > 0)
-				usleep(interval);
-		}
+		pmu_calc(&engines->freq_req, freq, BUFSZ, 4, 0, 1.0, t, 1);
+		pmu_calc(&engines->freq_act, fact, BUFSZ, 4, 0, 1.0, t, 1);
+		pmu_calc(&engines->irq, irq, BUFSZ, 8, 0, 1.0, t, 1);
+		pmu_calc(&engines->rc6, rc6, BUFSZ, 3, 0, 1e9, t, 100);
+		pmu_calc(&engines->rapl, power, BUFSZ, 4, 2, 1.0, t,
+			 engines->rapl_scale);
+		pmu_calc(&engines->imc_reads, reads, BUFSZ, 6, 0, 1.0, t,
+			 engines->imc_reads_scale);
+		pmu_calc(&engines->imc_writes, writes, BUFSZ, 6, 0, 1.0, t,
+			 engines->imc_writes_scale);
 
-		if (HAS_STATS_REGS(devid)) {
-			for (i = 0; i < STATS_COUNT; i++) {
-				uint32_t stats_high, stats_low, stats_high_2;
+		if (lines++ < con_h)
+			printf("intel-gpu-top - %s/%s MHz;  %s%% RC6; %s %s; %s irqs/s\n",
+			       fact, freq, rc6, power, engines->rapl_unit ?: "", irq);
 
-				do {
-					stats_high = INREG(stats_regs[i] + 4);
-					stats_low = INREG(stats_regs[i]);
-					stats_high_2 = INREG(stats_regs[i] + 4);
-				} while (stats_high != stats_high_2);
+		if (lines++ < con_h)
+			printf("\n");
 
-				stats[i] = (uint64_t)stats_high << 32 |
-					stats_low;
-			}
-		}
+		if (engines->imc_fd >= 0) {
+			if (lines++ < con_h)
+				printf("      IMC reads:   %s %s/s\n",
+				       reads, engines->imc_reads_unit);
+
+			if (lines++ < con_h)
+				printf("     IMC writes:   %s %s/s\n",
+				       writes, engines->imc_writes_unit);
 
-		qsort(top_bits_sorted, num_instdone_bits,
-		      sizeof(struct top_bit *), top_bits_sort);
-
-		/* Limit the number of lines printed to the terminal height so the
-		 * most important info (at the top) will stay on screen. */
-		max_lines = -1;
-		if (ioctl(0, TIOCGWINSZ, &ws) != -1)
-			max_lines = ws.ws_row - 6; /* exclude header lines */
-		if (max_lines >= num_instdone_bits)
-			max_lines = num_instdone_bits;
-
-		t2 = gettime();
-		elapsed_time += (t2 - t1) / 1000000.0;
-
-		if (interactive) {
-			printf("%s", clear_screen);
-			print_clock_info(pci_dev);
-
-			ring_print(&render_ring, last_samples_per_sec);
-			ring_print(&bsd_ring, last_samples_per_sec);
-			ring_print(&bsd6_ring, last_samples_per_sec);
-			ring_print(&blt_ring, last_samples_per_sec);
-
-			printf("\n%30s  %s\n", "task", "percent busy");
-			for (i = 0; i < max_lines; i++) {
-				if (top_bits_sorted[i]->count > 0) {
-					percent = (top_bits_sorted[i]->count * 100) /
-						last_samples_per_sec;
-					len = printf("%30s: %3d%%: ",
-							 top_bits_sorted[i]->bit->name,
-							 percent);
-					print_percentage_bar (percent, len);
-				} else {
-					printf("%*s", PERCENTAGE_BAR_END, "");
-				}
-
-				if (i < STATS_COUNT && HAS_STATS_REGS(devid)) {
-					printf("%13s: %llu (%lld/sec)",
-						   stats_reg_names[i],
-						   (long long)stats[i],
-						   (long long)(stats[i] - last_stats[i]));
-					last_stats[i] = stats[i];
-				} else {
-					if (!top_bits_sorted[i]->count)
-						break;
-				}
+			if (lines++ < con_h)
 				printf("\n");
-			}
 		}
-		if (output) {
-			/* Print headers for columns at first run */
-			if (print_headers) {
-				fprintf(output, "# time\t");
-				ring_print_header(output, &render_ring);
-				ring_print_header(output, &bsd_ring);
-				ring_print_header(output, &bsd6_ring);
-				ring_print_header(output, &blt_ring);
-				for (i = 0; i < MAX_NUM_TOP_BITS; i++) {
-					if (i < STATS_COUNT && HAS_STATS_REGS(devid)) {
-						fprintf(output, "%.6s\t",
-							   stats_reg_names[i]
-							   );
-					}
-					if (!top_bits[i].count)
-						continue;
-				}
-				fprintf(output, "\n");
-				print_headers = 0;
-			}
 
-			/* Print statistics */
-			fprintf(output, "%.2f\t", elapsed_time);
-			ring_log(&render_ring, last_samples_per_sec, output);
-			ring_log(&bsd_ring, last_samples_per_sec, output);
-			ring_log(&bsd6_ring, last_samples_per_sec, output);
-			ring_log(&blt_ring, last_samples_per_sec, output);
-
-			for (i = 0; i < MAX_NUM_TOP_BITS; i++) {
-				if (i < STATS_COUNT && HAS_STATS_REGS(devid)) {
-					fprintf(output, "%"PRIu64"\t",
-						   stats[i] - last_stats[i]);
-					last_stats[i] = stats[i];
-				}
-					if (!top_bits[i].count)
-						continue;
-			}
-			fprintf(output, "\n");
-			fflush(output);
-		}
+		for (i = 0; i < engines->num_engines; i++) {
+			struct engine *engine = engine_ptr(engines, i);
 
-		for (i = 0; i < num_instdone_bits; i++) {
-			top_bits_sorted[i]->count = 0;
+			if (engine->num_counters && lines < con_h) {
+				const char *a = "          ENGINE      BUSY ";
+				const char *b = " MI_SEMA MI_WAIT";
 
-			if (i < STATS_COUNT)
-				last_stats[i] = stats[i];
+				printf("\033[7m%s%*s%s\033[0m\n",
+				       a,
+				       (int)(con_w - 1 - strlen(a) - strlen(b)),
+				       " ", b);
+				lines++;
+				break;
+			}
 		}
 
-		/* Check if child has gone */
-		if (child_pid > 0) {
-			int res;
-			if ((res = waitpid(child_pid, &child_stat, WNOHANG)) == -1) {
-				perror("waitpid");
-				exit(1);
-			}
-			if (res == 0)
+		for (i = 0; i < engines->num_engines && lines < con_h; i++) {
+			struct engine *engine = engine_ptr(engines, i);
+			unsigned int max_w = con_w - 1;
+			unsigned int len;
+			char sema[BUFSZ];
+			char wait[BUFSZ];
+			char busy[BUFSZ];
+			char buf[128];
+			double val;
+
+			if (!engine->num_counters)
 				continue;
-			if (WIFEXITED(child_stat))
-				break;
+
+			pmu_calc(&engine->sema, sema, BUFSZ, 3, 0, 1e9, t, 100);
+			pmu_calc(&engine->wait, wait, BUFSZ, 3, 0, 1e9, t, 100);
+			len = snprintf(buf, sizeof(buf), "    %s%%    %s%%",
+				       sema, wait);
+
+			pmu_calc(&engine->busy, busy, BUFSZ, 6, 2, 1e9, t,
+				 100);
+			len += printf("%16s %s%% ", engine->display_name, busy);
+
+			val = __pmu_calc(&engine->busy.val, 1e9, t, 100);
+			print_percentage_bar(val, max_w - len);
+
+			printf("%s\n", buf);
+
+			lines++;
 		}
-	}
 
-	fclose(output);
+		if (lines++ < con_h)
+			printf("\n");
+
+		usleep(period_us);
+	}
 
-	intel_register_access_fini();
 	return 0;
 }
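
For reference, the normalisation done by __pmu_calc() and the cell arithmetic in print_percentage_bar() can be sketched in isolation. This is a minimal standalone sketch, not code from the patch: counter_rate(), bar_cells() and struct bar_cells are hypothetical names introduced here for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring __pmu_calc(): turn a counter delta into a
 * rate. d is the unit divisor (1e9 for nanosecond counters), t is the
 * elapsed time in seconds and s is the output scale (100 for percent).
 */
static double counter_rate(uint64_t prev, uint64_t cur,
			   double d, double t, double s)
{
	double v = (double)(cur - prev) / d / t * s;

	/* Percentages are clamped to 100, as the patch does. */
	if (s == 100.0 && v > 100.0)
		v = 100.0;

	return v;
}

/* Hypothetical result type: how the inside of a bar is split into cells. */
struct bar_cells {
	int full;	/* cells drawn with the full block glyph */
	int partial;	/* eighths used by the one partial glyph, 0 if none */
	int pad;	/* blank cells up to the closing '|' */
};

/* max_len includes the two '|' delimiters, as in print_percentage_bar(). */
static struct bar_cells bar_cells(double percent, int max_len)
{
	struct bar_cells c;
	int eighths;

	assert(max_len >= 2);

	/* Each character cell is worth eight sub-steps. */
	eighths = percent * (8 * (max_len - 2)) / 100.0;

	c.full = eighths / 8;
	c.partial = eighths % 8;
	c.pad = max_len - 2 - (eighths + 7) / 8;

	return c;
}
```

With a nanosecond busyness counter that advanced by 500000000 over one second, counter_rate(0, 500000000, 1e9, 1.0, 100.0) gives 50%, and a 12 column bar at 50% splits into five full cells and five blanks between the delimiters.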
diff --git a/tools/meson.build b/tools/meson.build
index bd2d313d5156..a918eeb0bef1 100644
--- a/tools/meson.build
+++ b/tools/meson.build
@@ -23,7 +23,6 @@  tools_progs = [
 	'intel_gpu_frequency',
 	'intel_firmware_decode',
 	'intel_gpu_time',
-	'intel_gpu_top',
 	'intel_gtt',
 	'intel_guc_logger',
 	'intel_infoframes',
@@ -117,6 +116,11 @@  shared_library('intel_aubdump', 'aubdump.c',
 	       name_prefix : '',
 	       install : true)
 
+executable('intel_gpu_top', 'intel_gpu_top.c',
+	   install : true,
+	   install_rpath : rpathdir,
+	   dependencies : tool_deps + [ lib_igt_perf ])
+
 conf_data = configuration_data()
 conf_data.set('prefix', prefix)
 conf_data.set('exec_prefix', '${prefix}')
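
A note on filename_to_double() in intel_gpu_top.c: setlocale() returns the name of the locale it has just set, so storing the result of setlocale(LC_ALL, "C") does not preserve the previous locale, and the later restore call sets "C" again. One possible way to save and restore is sketched below. It is only a sketch under assumptions: parse_double_c_locale() is a hypothetical name, and it switches LC_NUMERIC rather than LC_ALL because only the decimal point handling of strtod() is of interest here.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: parse a double with the "C" locale's decimal point,
 * then restore whichever numeric locale was active before the call.
 */
static double parse_double_c_locale(const char *str)
{
	const char *cur;
	char *saved = NULL;
	double v;

	assert(str != NULL);

	/*
	 * setlocale(category, NULL) only queries the current locale. The
	 * result is copied because a later setlocale() call may overwrite
	 * the returned buffer.
	 */
	cur = setlocale(LC_NUMERIC, NULL);
	if (cur) {
		saved = malloc(strlen(cur) + 1);
		if (saved)
			strcpy(saved, cur);
	}

	setlocale(LC_NUMERIC, "C");
	v = strtod(str, NULL);

	/* If the copy failed, the locale is left as "C" (best effort). */
	if (saved) {
		setlocale(LC_NUMERIC, saved);
		free(saved);
	}

	return v;
}
```

The same pattern would apply unchanged with LC_ALL if the wider category is preferred.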