[v3,2/2] arm64: add micro test

Message ID	1516225596-27071-3-git-send-email-shihwei@cs.columbia.edu (mailing list archive)
State	New, archived
Headers	show Return-Path: <kvm-owner@kernel.org> From: Shih-Wei Li <shihwei@cs.columbia.edu> To: kvm@vger.kernel.org, kvmarm@lists.cs.columbia.edu Cc: christoffer.dall@linaro.org, marc.zyngier@arm.com, drjones@redhat.com, pbonzini@redhat.com, Shih-Wei Li <shihwei@cs.columbia.edu>, Christoffer Dall <cdall@cs.columbia.edu> Subject: [PATCH v3 2/2] arm64: add micro test Date: Wed, 17 Jan 2018 16:46:36 -0500 Message-Id: <1516225596-27071-3-git-send-email-shihwei@cs.columbia.edu> In-Reply-To: <1516225596-27071-1-git-send-email-shihwei@cs.columbia.edu> References: <1516225596-27071-1-git-send-email-shihwei@cs.columbia.edu> Sender: kvm-owner@vger.kernel.org Precedence: bulk

Shih-Wei Li Jan. 17, 2018, 9:46 p.m. UTC

Here we provide the support for measuring various micro level
operations on arm64. We iterate each of the tests and output
their average, minimum and maximum cost in microseconds.
Instruction barriers were used before taking timestamps to
avoid out-of-order execution or pipelining from skewing our
measurements.

The operations we currently supported and measured are mostly
self-explanatory by their function name. For IPI, we measured the
cost of sending IPI from a source VCPU to a target VCPU, until the
target VCPU receives the IPI.

Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
Signed-off-by: Christoffer Dall <cdall@cs.columbia.edu>
---
 arm/Makefile.common |   1 +
 arm/micro-test.c    | 252 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 arm/unittests.cfg   |   7 ++
 3 files changed, 260 insertions(+)
 create mode 100644 arm/micro-test.c

Andrew Jones Jan. 18, 2018, 9:30 a.m. UTC | #1

On Wed, Jan 17, 2018 at 04:46:36PM -0500, Shih-Wei Li wrote:
> Here we provide the support for measuring various micro level
> operations on arm64. We iterate each of the tests and output
> their average, minimum and maximum cost in microseconds.
> Instruction barriers were used before taking timestamps to
> avoid out-of-order execution or pipelining from skewing our
> measurements.
> 
> The operations we currently supported and measured are mostly
> self-explanatory by their function name. For IPI, we measured the
> cost of sending IPI from a source VCPU to a target VCPU, until the
> target VCPU receives the IPI.

I think the information above, and certainly much of the information
about pinning, etc. in the cover-letter is worth putting in a huge
comment in the code.

> 
> Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
> Signed-off-by: Christoffer Dall <cdall@cs.columbia.edu>
> ---
>  arm/Makefile.common |   1 +
>  arm/micro-test.c    | 252 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  arm/unittests.cfg   |   7 ++
>  3 files changed, 260 insertions(+)
>  create mode 100644 arm/micro-test.c
> 
> diff --git a/arm/Makefile.common b/arm/Makefile.common
> index 0a039cf..c7d5c27 100644
> --- a/arm/Makefile.common
> +++ b/arm/Makefile.common
> @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
>  tests-common += $(TEST_DIR)/gic.flat
>  tests-common += $(TEST_DIR)/psci.flat
>  tests-common += $(TEST_DIR)/sieve.flat
> +tests-common += $(TEST_DIR)/micro-test.flat
>  
>  tests-all = $(tests-common) $(tests)
>  all: directories $(tests-all)
> diff --git a/arm/micro-test.c b/arm/micro-test.c
> new file mode 100644
> index 0000000..407ce8b
> --- /dev/null
> +++ b/arm/micro-test.c
> @@ -0,0 +1,252 @@
> +/*
> + * Measure the cost of micro level operations.
> + *
> + * Copyright Columbia University
> + * Author: Shih-Wei Li <shihwei@cs.columbia.edu>
> + * Author: Christoffer Dall <cdall@cs.columbia.edu>
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2.
> + */
> +#include <asm/gic.h>
> +#include "libcflat.h"
> +#include <util.h>
> +#include <limits.h>

Minor nit: we usually put the <asm/...> includes under the non-asm
includes.

> +
> +static volatile bool ipi_received;
> +static volatile bool ipi_ready;
> +static volatile unsigned int cntfrq;
> +static volatile void *vgic_dist_addr;
> +void (*write_eoir)(u32 irqstat);

write_eoir should also be static

> +
> +#define IPI_IRQ		1
> +
> +#define TRIES	(1U << 28)
> +
> +#define MAX_FAILURES	1000

Super minor nit: white space could be improved in the defines,
the '1' is one tab to many and the blank lines probably aren't
necessary.

> +
> +/*
> + * The counter may not always start with zero, which means it could
> + * overflow after some period of time.
> + */
> +#define COUNT(c1, c2) \
> +	((c1) > (c2) ? 0 : (c2) - (c1))
> +
> +static uint64_t read_cc(void)
> +{
> +	isb();
> +	return read_sysreg(cntpct_el0);
> +}
> +
> +static void ipi_irq_handler(struct pt_regs *regs __unused)
> +{
> +	u32 ack;
> +	ipi_ready = false;
> +	ipi_received = true;
> +	ack = gic_read_iar();
> +	gic_write_eoir(ack);
> +	ipi_ready = true;
> +}
> +
> +static void ipi_test_secondary_entry(void *data __unused)
> +{
> +	enum vector v = EL1H_IRQ;
> +	install_irq_handler(v, ipi_irq_handler);
> +
> +	gic_enable_defaults();
> +

Another super minor white space nit: probably don't need blank lines
around gic_enable.

> +	local_irq_enable(); /* Enter small wait-loop */
> +	ipi_ready = true;
> +	while (true);
> +}
> +
> +static int test_init(void)
> +{
> +	int v;
> +
> +	v = gic_init();
> +	if (v == 2) {
> +		vgic_dist_addr = gicv2_dist_base();
> +		write_eoir = gicv2_write_eoir;
> +	} else if (v == 3) {
> +		vgic_dist_addr = gicv3_dist_base();
> +		write_eoir = gicv3_write_eoir;
> +	} else {
> +		printf("No supported gic present, skipping tests...\n");
> +		return 0;
> +	}
> +
> +	ipi_ready = false;
> +
> +	gic_enable_defaults();
> +	on_cpu_async(1, ipi_test_secondary_entry, 0);
> +
> +	cntfrq = get_cntfrq();
> +	printf("Timer Frequency %d Hz (Output in microseconds)\n", cntfrq);
> +
> +	return 1;
> +}
> +
> +static unsigned long ipi_test(void)
> +{
> +	unsigned int tries = TRIES;
> +	uint64_t c1, c2;
> +
> +	while (!ipi_ready && tries--);
> +	assert(ipi_ready);
> +
> +	ipi_received = false;
> +
> +	c1 = read_cc();
> +
> +	gic_ipi_send_single(IPI_IRQ, 1);
> +
> +	tries = TRIES;
> +	while (!ipi_received && tries--);
> +	assert(ipi_received);
> +
> +	c2 = read_cc();
> +	return COUNT(c1, c2);
> +}
> +
> +static unsigned long hvc_test(void)
> +{
> +	uint64_t c1, c2;
> +
> +	c1 = read_cc();
> +	asm volatile("mov w0, #0x4b000000; hvc #0" ::: "w0");
> +	c2 = read_cc();
> +	return COUNT(c1, c2);
> +}
> +
> +static unsigned long mmio_read_user(void)
> +{
> +	uint64_t c1, c2;
> +	/*
> +	 * FIXME: Read device-id in virtio mmio here. This address
> +	 * needs to be updated in the future if any relevent
> +	 * changes in QEMU test-dev are made.
> +	 */
> +	void *mmio_read_user_addr = (void*) 0x0a000008;
> +
> +	c1 = read_cc();
> +	readl(mmio_read_user_addr);
> +	c2 = read_cc();
> +	return COUNT(c1, c2);
> +}
> +
> +static unsigned long mmio_read_vgic(void)
> +{
> +	uint64_t c1, c2;
> +
> +	c1 = read_cc();
> +	readl(vgic_dist_addr + GICD_IIDR);
> +	c2 = read_cc();
> +	return COUNT(c1, c2);
> +}
> +
> +static unsigned long eoi_test(void)
> +{
> +	uint64_t c1, c2;
> +
> +	u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
> +
> +	/* Avoid measuring assert(..) in gic_write_eoir */
> +	c1 = read_cc();
> +	write_eoir(val);
> +	c2 = read_cc();
> +
> +	return COUNT(c1, c2);
> +}
> +
> +struct exit_test {
> +	const char *name;
> +	unsigned long (*test_fn)(void);
> +	bool run;
> +};
> +
> +static struct exit_test tests[] = {
> +	{"hvc",                hvc_test,           true},
> +	{"mmio_read_user",     mmio_read_user,     true},
> +	{"mmio_read_vgic",     mmio_read_vgic,     true},
> +	{"eoi",                eoi_test,           true},
> +	{"ipi",                ipi_test,           true},
> +};
> +
> +static void get_us_output(const char *name,
> +			  unsigned long cycles)
> +{
> +	unsigned int ns_per_cycle = 10^9U / cntfrq;

10^9U = 3, in C '^' is xor. You need a

 #define NS_PER_SEC (1000 * 1000 * 1000U)

Also, we should probably use the ceiling of ns/cycle:

 ns_per_cycle = NS_PER_SEC / cntfrq + !!(NS_PER_SEC % cntfrq);

> +	unsigned int ns, us, us_frac;
> +
> +	ns =  cycles * ns_per_cycle;
> +	us = ns / 1000;
> +	us_frac = (ns % 1000) / 100;
> +
> +	printf("%s %10d.%d\t", name, us, us_frac);
> +}
> +
> +static void output_result(const char *name,
> +			  unsigned long avg_cycle,
> +			  unsigned long min_cycle,
> +			  unsigned long max_cycle)
> +{
> +	printf("%10s:\t", name);

10 columns isn't big enough for the test names, 30 would be better

> +	get_us_output("avg", avg_cycle);
> +	get_us_output("min", min_cycle);
> +	get_us_output("max", max_cycle);
> +	printf("\n");
> +}
> +
> +static void loop_test(struct exit_test *test)
> +{
> +	unsigned long i, iterations = 32;
> +	unsigned long sample, cycles;
> +	unsigned long min = ULONG_MAX, max = 0;
> +	const unsigned long goal = (1ULL << 26);
> +	int failures = 0;
> +
> +	do {
> +		iterations *= 2;
> +		cycles = 0;
> +		i = 0;
> +		while (i < iterations) {
> +			sample = test->test_fn();
> +			if (sample == 0) {
> +				if (failures++ > MAX_FAILURES) {
> +				/*
> +				 * If the cost is smaller than a cycle count for
> +				 * over MAX_FAILURES of times, we simply ignore the test.
> +				 */

Comment block should also be indented.

> +					printf("%s: Too many cycle count overflows\n",

Should print "too many overflows or too small to measure (c1 == c2)",
or something like that

> +						test->name);
> +					return;
> +				}
> +				continue;
> +			}
> +			cycles += sample;
> +			if (min > sample)
> +				min = sample;
> +			if (max < sample)
> +				max = sample;
> +			++i;
> +		}
> +	} while (cycles < goal);
> +
> +	output_result(test->name, (cycles / iterations), min, max);	

Trailing tab on this line, maybe try running the kernel's checkpatch to
catch some coding styles issues

> +}
> +
> +int main(int argc, char **argv)
> +{
> +	int i, ret;
> +
> +	ret = test_init();
> +	assert(ret);
> +
> +	for (i = 0; i < ARRAY_SIZE(tests); i++) {
> +		if (!tests[i].run)
> +			continue;
> +		loop_test(&tests[i]);
> +	}
> +
> +	return 0;
> +}
> diff --git a/arm/unittests.cfg b/arm/unittests.cfg
> index 44b98cf..5759fa8 100644
> --- a/arm/unittests.cfg
> +++ b/arm/unittests.cfg
> @@ -116,3 +116,10 @@ file = timer.flat
>  groups = timer
>  timeout = 2s
>  arch = arm64
> +
> +# Exit tests
> +[micro-test]
> +file = micro-test.flat
> +smp = 2
> +groups = nodefault,micro-test
> +accel = kvm

It should be pretty easy to make this test work for arm32 too, but if
you don't modify the few places that need to be modified then we should
add 'arch = arm64' here and you should list the test in Makefile.arm64,
not Makefile.common.

Thanks,
drew

Yury Norov Jan. 18, 2018, 10:09 a.m. UTC | #2

On Wed, Jan 17, 2018 at 04:46:36PM -0500, Shih-Wei Li wrote:
> Here we provide the support for measuring various micro level
> operations on arm64. We iterate each of the tests and output
> their average, minimum and maximum cost in microseconds.
> Instruction barriers were used before taking timestamps to
> avoid out-of-order execution or pipelining from skewing our
> measurements.
> 
> The operations we currently supported and measured are mostly
> self-explanatory by their function name. For IPI, we measured the
> cost of sending IPI from a source VCPU to a target VCPU, until the
> target VCPU receives the IPI.
> 
> Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
> Signed-off-by: Christoffer Dall <cdall@cs.columbia.edu>
> ---
>  arm/Makefile.common |   1 +
>  arm/micro-test.c    | 252 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  arm/unittests.cfg   |   7 ++
>  3 files changed, 260 insertions(+)
>  create mode 100644 arm/micro-test.c
> 
> diff --git a/arm/Makefile.common b/arm/Makefile.common
> index 0a039cf..c7d5c27 100644
> --- a/arm/Makefile.common
> +++ b/arm/Makefile.common
> @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
>  tests-common += $(TEST_DIR)/gic.flat
>  tests-common += $(TEST_DIR)/psci.flat
>  tests-common += $(TEST_DIR)/sieve.flat
> +tests-common += $(TEST_DIR)/micro-test.flat
>  
>  tests-all = $(tests-common) $(tests)
>  all: directories $(tests-all)
> diff --git a/arm/micro-test.c b/arm/micro-test.c
> new file mode 100644
> index 0000000..407ce8b
> --- /dev/null
> +++ b/arm/micro-test.c
> @@ -0,0 +1,252 @@
> +/*
> + * Measure the cost of micro level operations.
> + *
> + * Copyright Columbia University
> + * Author: Shih-Wei Li <shihwei@cs.columbia.edu>
> + * Author: Christoffer Dall <cdall@cs.columbia.edu>
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2.
> + */
> +#include <asm/gic.h>
> +#include "libcflat.h"
> +#include <util.h>
> +#include <limits.h>
> +
> +static volatile bool ipi_received;
> +static volatile bool ipi_ready;
> +static volatile unsigned int cntfrq;
> +static volatile void *vgic_dist_addr;

Volatiles considered harmful for kernel code. I think it is also correct
for your case:
https://lwn.net/Articles/234017/

I would prefer use readl/writel accessors for interprocessor
communications which guaranties cache synchronization and proper
ordering.

Also, ipi_received and ipi_ready are looking like synchronization
primitives. Did you consider using spinlocks instead?

Also-also, cntfrq is written only once at init. What for you make it
volatile?

> +void (*write_eoir)(u32 irqstat);
> +
> +#define IPI_IRQ		1
> +
> +#define TRIES	(1U << 28)
> +
> +#define MAX_FAILURES	1000
> +
> +/*
> + * The counter may not always start with zero, which means it could
> + * overflow after some period of time.
> + */
> +#define COUNT(c1, c2) \
> +	((c1) > (c2) ? 0 : (c2) - (c1))
> +
> +static uint64_t read_cc(void)
> +{
> +	isb();
> +	return read_sysreg(cntpct_el0);
> +}
> +
> +static void ipi_irq_handler(struct pt_regs *regs __unused)
> +{
> +	u32 ack;
> +	ipi_ready = false;
> +	ipi_received = true;
> +	ack = gic_read_iar();
> +	gic_write_eoir(ack);
> +	ipi_ready = true;
> +}
> +
> +static void ipi_test_secondary_entry(void *data __unused)
> +{
> +	enum vector v = EL1H_IRQ;
> +	install_irq_handler(v, ipi_irq_handler);
> +
> +	gic_enable_defaults();
> +
> +	local_irq_enable(); /* Enter small wait-loop */
> +	ipi_ready = true;
> +	while (true);
> +}
> +
> +static int test_init(void)
> +{
> +	int v;
> +
> +	v = gic_init();
> +	if (v == 2) {
> +		vgic_dist_addr = gicv2_dist_base();
> +		write_eoir = gicv2_write_eoir;
> +	} else if (v == 3) {
> +		vgic_dist_addr = gicv3_dist_base();
> +		write_eoir = gicv3_write_eoir;
> +	} else {
> +		printf("No supported gic present, skipping tests...\n");
> +		return 0;
> +	}
> +
> +	ipi_ready = false;
> +
> +	gic_enable_defaults();
> +	on_cpu_async(1, ipi_test_secondary_entry, 0);
> +
> +	cntfrq = get_cntfrq();
> +	printf("Timer Frequency %d Hz (Output in microseconds)\n", cntfrq);
> +
> +	return 1;
> +}
> +
> +static unsigned long ipi_test(void)
> +{
> +	unsigned int tries = TRIES;
> +	uint64_t c1, c2;
> +
> +	while (!ipi_ready && tries--);
> +	assert(ipi_ready);
> +
> +	ipi_received = false;
> +
> +	c1 = read_cc();
> +
> +	gic_ipi_send_single(IPI_IRQ, 1);
> +
> +	tries = TRIES;
> +	while (!ipi_received && tries--);
> +	assert(ipi_received);
> +
> +	c2 = read_cc();
> +	return COUNT(c1, c2);
> +}
> +
> +static unsigned long hvc_test(void)
> +{
> +	uint64_t c1, c2;
> +
> +	c1 = read_cc();
> +	asm volatile("mov w0, #0x4b000000; hvc #0" ::: "w0");
> +	c2 = read_cc();
> +	return COUNT(c1, c2);
> +}
> +
> +static unsigned long mmio_read_user(void)
> +{
> +	uint64_t c1, c2;
> +	/*
> +	 * FIXME: Read device-id in virtio mmio here. This address
> +	 * needs to be updated in the future if any relevent
> +	 * changes in QEMU test-dev are made.
> +	 */
> +	void *mmio_read_user_addr = (void*) 0x0a000008;

Again, could you rename it? device_id_ptr sounds much better.

> +	c1 = read_cc();
> +	readl(mmio_read_user_addr);
> +	c2 = read_cc();
> +	return COUNT(c1, c2);
> +}
> +
> +static unsigned long mmio_read_vgic(void)
> +{
> +	uint64_t c1, c2;
> +
> +	c1 = read_cc();
> +	readl(vgic_dist_addr + GICD_IIDR);
> +	c2 = read_cc();
> +	return COUNT(c1, c2);
> +}
> +
> +static unsigned long eoi_test(void)
> +{
> +	uint64_t c1, c2;
> +
> +	u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
> +
> +	/* Avoid measuring assert(..) in gic_write_eoir */
> +	c1 = read_cc();
> +	write_eoir(val);
> +	c2 = read_cc();
> +
> +	return COUNT(c1, c2);
> +}
> +
> +struct exit_test {
> +	const char *name;
> +	unsigned long (*test_fn)(void);
> +	bool run;
> +};
> +
> +static struct exit_test tests[] = {
> +	{"hvc",                hvc_test,           true},
> +	{"mmio_read_user",     mmio_read_user,     true},
> +	{"mmio_read_vgic",     mmio_read_vgic,     true},
> +	{"eoi",                eoi_test,           true},
> +	{"ipi",                ipi_test,           true},
> +};
> +
> +static void get_us_output(const char *name,
> +			  unsigned long cycles)
> +{
> +	unsigned int ns_per_cycle = 10^9U / cntfrq;
> +	unsigned int ns, us, us_frac;

'^' is 'xor', you probably mean 1000*1000*1000. And anyway,
ThunderX2 works at ~2GHz, so 10E9/get_cntfrq() is 0 for it.

For CPUs working at 500...1000 MHz, ns_per_cycle is 1,
For CPUs working at 1GHz+ it is 0. What's point in it?

> +	ns =  cycles * ns_per_cycle;
> +	us = ns / 1000;
> +	us_frac = (ns % 1000) / 100;
> +
> +	printf("%s %10d.%d\t", name, us, us_frac);
> +}
> +
> +static void output_result(const char *name,
> +			  unsigned long avg_cycle,
> +			  unsigned long min_cycle,
> +			  unsigned long max_cycle)
> +{
> +	printf("%10s:\t", name);
> +	get_us_output("avg", avg_cycle);
> +	get_us_output("min", min_cycle);
> +	get_us_output("max", max_cycle);
> +	printf("\n");
> +}
> +
> +static void loop_test(struct exit_test *test)
> +{
> +	unsigned long i, iterations = 32;
> +	unsigned long sample, cycles;
> +	unsigned long min = ULONG_MAX, max = 0;
> +	const unsigned long goal = (1ULL << 26);
> +	int failures = 0;
> +
> +	do {
> +		iterations *= 2;
> +		cycles = 0;
> +		i = 0;
> +		while (i < iterations) {
> +			sample = test->test_fn();
> +			if (sample == 0) {
> +				if (failures++ > MAX_FAILURES) {
> +				/*
> +				 * If the cost is smaller than a cycle count for
> +				 * over MAX_FAILURES of times, we simply ignore the test.
> +				 */
> +					printf("%s: Too many cycle count overflows\n",
> +						test->name);

Again, this is not overflow. Printed message does not correspond the
comment. 

Also, why do you think that sample == 0 means fail? I'd rather suspect my
measurement procedure, or if everything is correct, simply take this
result.

Now sample == 0 indicates 2 different cases - too small cost and
overflow. Could you introduce special error code to distinguish
between them?

> +					return;
> +				}
> +				continue;
> +			}
> +			cycles += sample;
> +			if (min > sample)
> +				min = sample;
> +			if (max < sample)
> +				max = sample;
> +			++i;
> +		}
> +	} while (cycles < goal);
> +
> +	output_result(test->name, (cycles / iterations), min, max);	
> +}
> +
> +int main(int argc, char **argv)
> +{
> +	int i, ret;
> +
> +	ret = test_init();
> +	assert(ret);
> +
> +	for (i = 0; i < ARRAY_SIZE(tests); i++) {
> +		if (!tests[i].run)
> +			continue;
> +		loop_test(&tests[i]);
> +	}
> +
> +	return 0;
> +}
> diff --git a/arm/unittests.cfg b/arm/unittests.cfg
> index 44b98cf..5759fa8 100644
> --- a/arm/unittests.cfg
> +++ b/arm/unittests.cfg
> @@ -116,3 +116,10 @@ file = timer.flat
>  groups = timer
>  timeout = 2s
>  arch = arm64
> +
> +# Exit tests
> +[micro-test]
> +file = micro-test.flat
> +smp = 2
> +groups = nodefault,micro-test
> +accel = kvm
> -- 
> 1.9.1
>

Andrew Jones Jan. 18, 2018, 10:37 a.m. UTC | #3

On Thu, Jan 18, 2018 at 01:09:58PM +0300, Yury Norov wrote:
> On Wed, Jan 17, 2018 at 04:46:36PM -0500, Shih-Wei Li wrote:
> > Here we provide the support for measuring various micro level
> > operations on arm64. We iterate each of the tests and output
> > their average, minimum and maximum cost in microseconds.
> > Instruction barriers were used before taking timestamps to
> > avoid out-of-order execution or pipelining from skewing our
> > measurements.
> > 
> > The operations we currently supported and measured are mostly
> > self-explanatory by their function name. For IPI, we measured the
> > cost of sending IPI from a source VCPU to a target VCPU, until the
> > target VCPU receives the IPI.
> > 
> > Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
> > Signed-off-by: Christoffer Dall <cdall@cs.columbia.edu>
> > ---
> >  arm/Makefile.common |   1 +
> >  arm/micro-test.c    | 252 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  arm/unittests.cfg   |   7 ++
> >  3 files changed, 260 insertions(+)
> >  create mode 100644 arm/micro-test.c
> > 
> > diff --git a/arm/Makefile.common b/arm/Makefile.common
> > index 0a039cf..c7d5c27 100644
> > --- a/arm/Makefile.common
> > +++ b/arm/Makefile.common
> > @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
> >  tests-common += $(TEST_DIR)/gic.flat
> >  tests-common += $(TEST_DIR)/psci.flat
> >  tests-common += $(TEST_DIR)/sieve.flat
> > +tests-common += $(TEST_DIR)/micro-test.flat
> >  
> >  tests-all = $(tests-common) $(tests)
> >  all: directories $(tests-all)
> > diff --git a/arm/micro-test.c b/arm/micro-test.c
> > new file mode 100644
> > index 0000000..407ce8b
> > --- /dev/null
> > +++ b/arm/micro-test.c
> > @@ -0,0 +1,252 @@
> > +/*
> > + * Measure the cost of micro level operations.
> > + *
> > + * Copyright Columbia University
> > + * Author: Shih-Wei Li <shihwei@cs.columbia.edu>
> > + * Author: Christoffer Dall <cdall@cs.columbia.edu>
> > + *
> > + * This work is licensed under the terms of the GNU LGPL, version 2.
> > + */
> > +#include <asm/gic.h>
> > +#include "libcflat.h"
> > +#include <util.h>
> > +#include <limits.h>
> > +
> > +static volatile bool ipi_received;
> > +static volatile bool ipi_ready;
> > +static volatile unsigned int cntfrq;
> > +static volatile void *vgic_dist_addr;
> 
> Volatiles considered harmful for kernel code. I think it is also correct
> for your case:
> https://lwn.net/Articles/234017/
> 
> I would prefer use readl/writel accessors for interprocessor
> communications which guaranties cache synchronization and proper
> ordering.
> 
> Also, ipi_received and ipi_ready are looking like synchronization
> primitives. Did you consider using spinlocks instead?
> 
> Also-also, cntfrq is written only once at init. What for you make it
> volatile?

We frequently take some shortcuts in kvm-unit-tests, using busy loops and
volatile globals for synchronization, trying to keep things as simple as
possible. I'm ok with the ipi_* synchronization variables being volatile,
but agree the other two don't need to be.

> 
> > +void (*write_eoir)(u32 irqstat);
> > +
> > +#define IPI_IRQ		1
> > +
> > +#define TRIES	(1U << 28)
> > +
> > +#define MAX_FAILURES	1000
> > +
> > +/*
> > + * The counter may not always start with zero, which means it could
> > + * overflow after some period of time.
> > + */
> > +#define COUNT(c1, c2) \
> > +	((c1) > (c2) ? 0 : (c2) - (c1))
> > +
> > +static uint64_t read_cc(void)
> > +{
> > +	isb();
> > +	return read_sysreg(cntpct_el0);
> > +}
> > +
> > +static void ipi_irq_handler(struct pt_regs *regs __unused)
> > +{
> > +	u32 ack;
> > +	ipi_ready = false;
> > +	ipi_received = true;
> > +	ack = gic_read_iar();
> > +	gic_write_eoir(ack);
> > +	ipi_ready = true;
> > +}
> > +
> > +static void ipi_test_secondary_entry(void *data __unused)
> > +{
> > +	enum vector v = EL1H_IRQ;
> > +	install_irq_handler(v, ipi_irq_handler);
> > +
> > +	gic_enable_defaults();
> > +
> > +	local_irq_enable(); /* Enter small wait-loop */
> > +	ipi_ready = true;
> > +	while (true);
> > +}
> > +
> > +static int test_init(void)
> > +{
> > +	int v;
> > +
> > +	v = gic_init();
> > +	if (v == 2) {
> > +		vgic_dist_addr = gicv2_dist_base();
> > +		write_eoir = gicv2_write_eoir;
> > +	} else if (v == 3) {
> > +		vgic_dist_addr = gicv3_dist_base();
> > +		write_eoir = gicv3_write_eoir;
> > +	} else {
> > +		printf("No supported gic present, skipping tests...\n");
> > +		return 0;
> > +	}
> > +
> > +	ipi_ready = false;
> > +
> > +	gic_enable_defaults();
> > +	on_cpu_async(1, ipi_test_secondary_entry, 0);
> > +
> > +	cntfrq = get_cntfrq();
> > +	printf("Timer Frequency %d Hz (Output in microseconds)\n", cntfrq);
> > +
> > +	return 1;
> > +}
> > +
> > +static unsigned long ipi_test(void)
> > +{
> > +	unsigned int tries = TRIES;
> > +	uint64_t c1, c2;
> > +
> > +	while (!ipi_ready && tries--);
> > +	assert(ipi_ready);
> > +
> > +	ipi_received = false;
> > +
> > +	c1 = read_cc();
> > +
> > +	gic_ipi_send_single(IPI_IRQ, 1);
> > +
> > +	tries = TRIES;
> > +	while (!ipi_received && tries--);
> > +	assert(ipi_received);
> > +
> > +	c2 = read_cc();
> > +	return COUNT(c1, c2);
> > +}
> > +
> > +static unsigned long hvc_test(void)
> > +{
> > +	uint64_t c1, c2;
> > +
> > +	c1 = read_cc();
> > +	asm volatile("mov w0, #0x4b000000; hvc #0" ::: "w0");
> > +	c2 = read_cc();
> > +	return COUNT(c1, c2);
> > +}
> > +
> > +static unsigned long mmio_read_user(void)
> > +{
> > +	uint64_t c1, c2;
> > +	/*
> > +	 * FIXME: Read device-id in virtio mmio here. This address
> > +	 * needs to be updated in the future if any relevent
> > +	 * changes in QEMU test-dev are made.
> > +	 */
> > +	void *mmio_read_user_addr = (void*) 0x0a000008;
> 
> Again, could you rename it? device_id_ptr sounds much better.
> 
> > +	c1 = read_cc();
> > +	readl(mmio_read_user_addr);
> > +	c2 = read_cc();
> > +	return COUNT(c1, c2);
> > +}
> > +
> > +static unsigned long mmio_read_vgic(void)
> > +{
> > +	uint64_t c1, c2;
> > +
> > +	c1 = read_cc();
> > +	readl(vgic_dist_addr + GICD_IIDR);
> > +	c2 = read_cc();
> > +	return COUNT(c1, c2);
> > +}
> > +
> > +static unsigned long eoi_test(void)
> > +{
> > +	uint64_t c1, c2;
> > +
> > +	u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
> > +
> > +	/* Avoid measuring assert(..) in gic_write_eoir */
> > +	c1 = read_cc();
> > +	write_eoir(val);
> > +	c2 = read_cc();
> > +
> > +	return COUNT(c1, c2);
> > +}
> > +
> > +struct exit_test {
> > +	const char *name;
> > +	unsigned long (*test_fn)(void);
> > +	bool run;
> > +};
> > +
> > +static struct exit_test tests[] = {
> > +	{"hvc",                hvc_test,           true},
> > +	{"mmio_read_user",     mmio_read_user,     true},
> > +	{"mmio_read_vgic",     mmio_read_vgic,     true},
> > +	{"eoi",                eoi_test,           true},
> > +	{"ipi",                ipi_test,           true},
> > +};
> > +
> > +static void get_us_output(const char *name,
> > +			  unsigned long cycles)
> > +{
> > +	unsigned int ns_per_cycle = 10^9U / cntfrq;
> > +	unsigned int ns, us, us_frac;
> 
> '^' is 'xor', you probably mean 1000*1000*1000. And anyway,
> ThunderX2 works at ~2GHz, so 10E9/get_cntfrq() is 0 for it.

The counter (cntpct_el0) ticks that fast? We could start with picoseconds
if necessary.

> 
> For CPUs working at 500...1000 MHz, ns_per_cycle is 1,
> For CPUs working at 1GHz+ it is 0. What's point in it?

We're not looking at the CPU cycles, we're looking at counter ticks.
Indeed we could rename cycles to ticks everywhere to make that more
explicit.

> 
> > +	ns =  cycles * ns_per_cycle;
> > +	us = ns / 1000;
> > +	us_frac = (ns % 1000) / 100;
> > +
> > +	printf("%s %10d.%d\t", name, us, us_frac);
> > +}
> > +
> > +static void output_result(const char *name,
> > +			  unsigned long avg_cycle,
> > +			  unsigned long min_cycle,
> > +			  unsigned long max_cycle)
> > +{
> > +	printf("%10s:\t", name);
> > +	get_us_output("avg", avg_cycle);
> > +	get_us_output("min", min_cycle);
> > +	get_us_output("max", max_cycle);
> > +	printf("\n");
> > +}
> > +
> > +static void loop_test(struct exit_test *test)
> > +{
> > +	unsigned long i, iterations = 32;
> > +	unsigned long sample, cycles;
> > +	unsigned long min = ULONG_MAX, max = 0;
> > +	const unsigned long goal = (1ULL << 26);
> > +	int failures = 0;
> > +
> > +	do {
> > +		iterations *= 2;
> > +		cycles = 0;
> > +		i = 0;
> > +		while (i < iterations) {
> > +			sample = test->test_fn();
> > +			if (sample == 0) {
> > +				if (failures++ > MAX_FAILURES) {
> > +				/*
> > +				 * If the cost is smaller than a cycle count for
> > +				 * over MAX_FAILURES of times, we simply ignore the test.
> > +				 */
> > +					printf("%s: Too many cycle count overflows\n",
> > +						test->name);
> 
> Again, this is not overflow. Printed message does not correspond the
> comment. 
> 
> Also, why do you think that sample == 0 means fail? I'd rather suspect my
> measurement procedure, or if everything is correct, simply take this
> result.

I agree that we don't need to skip the test, giving no feedback to the
user at all about the time the test takes. The '== 0' result should
output something indicating the test executed in less time than 1/cntfrq.

> 
> Now sample == 0 indicates 2 different cases - too small cost and
> overflow. Could you introduce special error code to distinguish
> between them?

I agree we should do that as well in order to output the test was too
fast to get non-zero samples vs. skipping the sample due to overflow.

> 
> > +					return;
> > +				}
> > +				continue;
> > +			}
> > +			cycles += sample;
> > +			if (min > sample)
> > +				min = sample;
> > +			if (max < sample)
> > +				max = sample;
> > +			++i;
> > +		}
> > +	} while (cycles < goal);
> > +
> > +	output_result(test->name, (cycles / iterations), min, max);	
> > +}
> > +
> > +int main(int argc, char **argv)
> > +{
> > +	int i, ret;
> > +
> > +	ret = test_init();
> > +	assert(ret);
> > +
> > +	for (i = 0; i < ARRAY_SIZE(tests); i++) {
> > +		if (!tests[i].run)
> > +			continue;
> > +		loop_test(&tests[i]);
> > +	}
> > +
> > +	return 0;
> > +}
> > diff --git a/arm/unittests.cfg b/arm/unittests.cfg
> > index 44b98cf..5759fa8 100644
> > --- a/arm/unittests.cfg
> > +++ b/arm/unittests.cfg
> > @@ -116,3 +116,10 @@ file = timer.flat
> >  groups = timer
> >  timeout = 2s
> >  arch = arm64
> > +
> > +# Exit tests
> > +[micro-test]
> > +file = micro-test.flat
> > +smp = 2
> > +groups = nodefault,micro-test
> > +accel = kvm
> > -- 
> > 1.9.1
> > 

Thanks,
drew

Yury Norov Jan. 18, 2018, 11:39 a.m. UTC | #4

On Thu, Jan 18, 2018 at 11:37:37AM +0100, Andrew Jones wrote:
> On Thu, Jan 18, 2018 at 01:09:58PM +0300, Yury Norov wrote:
> > On Wed, Jan 17, 2018 at 04:46:36PM -0500, Shih-Wei Li wrote:
> > > Here we provide the support for measuring various micro level
> > > operations on arm64. We iterate each of the tests and output
> > > their average, minimum and maximum cost in microseconds.
> > > Instruction barriers were used before taking timestamps to
> > > avoid out-of-order execution or pipelining from skewing our
> > > measurements.
> > > 
> > > The operations we currently supported and measured are mostly
> > > self-explanatory by their function name. For IPI, we measured the
> > > cost of sending IPI from a source VCPU to a target VCPU, until the
> > > target VCPU receives the IPI.
> > > 
> > > Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
> > > Signed-off-by: Christoffer Dall <cdall@cs.columbia.edu>
> > > ---
> > >  arm/Makefile.common |   1 +
> > >  arm/micro-test.c    | 252 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  arm/unittests.cfg   |   7 ++
> > >  3 files changed, 260 insertions(+)
> > >  create mode 100644 arm/micro-test.c
> > > 
> > > diff --git a/arm/Makefile.common b/arm/Makefile.common
> > > index 0a039cf..c7d5c27 100644
> > > --- a/arm/Makefile.common
> > > +++ b/arm/Makefile.common
> > > @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
> > >  tests-common += $(TEST_DIR)/gic.flat
> > >  tests-common += $(TEST_DIR)/psci.flat
> > >  tests-common += $(TEST_DIR)/sieve.flat
> > > +tests-common += $(TEST_DIR)/micro-test.flat
> > >  
> > >  tests-all = $(tests-common) $(tests)
> > >  all: directories $(tests-all)
> > > diff --git a/arm/micro-test.c b/arm/micro-test.c
> > > new file mode 100644
> > > index 0000000..407ce8b
> > > --- /dev/null
> > > +++ b/arm/micro-test.c
> > > @@ -0,0 +1,252 @@
> > > +/*
> > > + * Measure the cost of micro level operations.
> > > + *
> > > + * Copyright Columbia University
> > > + * Author: Shih-Wei Li <shihwei@cs.columbia.edu>
> > > + * Author: Christoffer Dall <cdall@cs.columbia.edu>
> > > + *
> > > + * This work is licensed under the terms of the GNU LGPL, version 2.
> > > + */
> > > +#include <asm/gic.h>
> > > +#include "libcflat.h"
> > > +#include <util.h>
> > > +#include <limits.h>
> > > +
> > > +static volatile bool ipi_received;
> > > +static volatile bool ipi_ready;
> > > +static volatile unsigned int cntfrq;
> > > +static volatile void *vgic_dist_addr;
> > 
> > Volatiles considered harmful for kernel code. I think it is also correct
> > for your case:
> > https://lwn.net/Articles/234017/
> > 
> > I would prefer use readl/writel accessors for interprocessor
> > communications which guaranties cache synchronization and proper
> > ordering.
> > 
> > Also, ipi_received and ipi_ready are looking like synchronization
> > primitives. Did you consider using spinlocks instead?
> > 
> > Also-also, cntfrq is written only once at init. What for you make it
> > volatile?
> 
> We frequently take some shortcuts in kvm-unit-tests, using busy loops and
> volatile globals for synchronization, trying to keep things as simple as
> possible. I'm ok with the ipi_* synchronization variables being volatile,
> but agree the other two don't need to be.
> 
> > 
> > > +void (*write_eoir)(u32 irqstat);
> > > +
> > > +#define IPI_IRQ		1
> > > +
> > > +#define TRIES	(1U << 28)
> > > +
> > > +#define MAX_FAILURES	1000
> > > +
> > > +/*
> > > + * The counter may not always start with zero, which means it could
> > > + * overflow after some period of time.
> > > + */
> > > +#define COUNT(c1, c2) \
> > > +	((c1) > (c2) ? 0 : (c2) - (c1))
> > > +
> > > +static uint64_t read_cc(void)
> > > +{
> > > +	isb();
> > > +	return read_sysreg(cntpct_el0);
> > > +}
> > > +
> > > +static void ipi_irq_handler(struct pt_regs *regs __unused)
> > > +{
> > > +	u32 ack;
> > > +	ipi_ready = false;
> > > +	ipi_received = true;
> > > +	ack = gic_read_iar();
> > > +	gic_write_eoir(ack);
> > > +	ipi_ready = true;
> > > +}
> > > +
> > > +static void ipi_test_secondary_entry(void *data __unused)
> > > +{
> > > +	enum vector v = EL1H_IRQ;
> > > +	install_irq_handler(v, ipi_irq_handler);
> > > +
> > > +	gic_enable_defaults();
> > > +
> > > +	local_irq_enable(); /* Enter small wait-loop */
> > > +	ipi_ready = true;
> > > +	while (true);
> > > +}
> > > +
> > > +static int test_init(void)
> > > +{
> > > +	int v;
> > > +
> > > +	v = gic_init();
> > > +	if (v == 2) {
> > > +		vgic_dist_addr = gicv2_dist_base();
> > > +		write_eoir = gicv2_write_eoir;
> > > +	} else if (v == 3) {
> > > +		vgic_dist_addr = gicv3_dist_base();
> > > +		write_eoir = gicv3_write_eoir;
> > > +	} else {
> > > +		printf("No supported gic present, skipping tests...\n");
> > > +		return 0;
> > > +	}
> > > +
> > > +	ipi_ready = false;
> > > +
> > > +	gic_enable_defaults();
> > > +	on_cpu_async(1, ipi_test_secondary_entry, 0);
> > > +
> > > +	cntfrq = get_cntfrq();
> > > +	printf("Timer Frequency %d Hz (Output in microseconds)\n", cntfrq);
> > > +
> > > +	return 1;
> > > +}
> > > +
> > > +static unsigned long ipi_test(void)
> > > +{
> > > +	unsigned int tries = TRIES;
> > > +	uint64_t c1, c2;
> > > +
> > > +	while (!ipi_ready && tries--);
> > > +	assert(ipi_ready);
> > > +
> > > +	ipi_received = false;
> > > +
> > > +	c1 = read_cc();
> > > +
> > > +	gic_ipi_send_single(IPI_IRQ, 1);
> > > +
> > > +	tries = TRIES;
> > > +	while (!ipi_received && tries--);
> > > +	assert(ipi_received);
> > > +
> > > +	c2 = read_cc();
> > > +	return COUNT(c1, c2);
> > > +}
> > > +
> > > +static unsigned long hvc_test(void)
> > > +{
> > > +	uint64_t c1, c2;
> > > +
> > > +	c1 = read_cc();
> > > +	asm volatile("mov w0, #0x4b000000; hvc #0" ::: "w0");
> > > +	c2 = read_cc();
> > > +	return COUNT(c1, c2);
> > > +}
> > > +
> > > +static unsigned long mmio_read_user(void)
> > > +{
> > > +	uint64_t c1, c2;
> > > +	/*
> > > +	 * FIXME: Read device-id in virtio mmio here. This address
> > > +	 * needs to be updated in the future if any relevent
> > > +	 * changes in QEMU test-dev are made.
> > > +	 */
> > > +	void *mmio_read_user_addr = (void*) 0x0a000008;
> > 
> > Again, could you rename it? device_id_ptr sounds much better.
> > 
> > > +	c1 = read_cc();
> > > +	readl(mmio_read_user_addr);
> > > +	c2 = read_cc();
> > > +	return COUNT(c1, c2);
> > > +}
> > > +
> > > +static unsigned long mmio_read_vgic(void)
> > > +{
> > > +	uint64_t c1, c2;
> > > +
> > > +	c1 = read_cc();
> > > +	readl(vgic_dist_addr + GICD_IIDR);
> > > +	c2 = read_cc();
> > > +	return COUNT(c1, c2);
> > > +}
> > > +
> > > +static unsigned long eoi_test(void)
> > > +{
> > > +	uint64_t c1, c2;
> > > +
> > > +	u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
> > > +
> > > +	/* Avoid measuring assert(..) in gic_write_eoir */
> > > +	c1 = read_cc();
> > > +	write_eoir(val);
> > > +	c2 = read_cc();
> > > +
> > > +	return COUNT(c1, c2);
> > > +}
> > > +
> > > +struct exit_test {
> > > +	const char *name;
> > > +	unsigned long (*test_fn)(void);
> > > +	bool run;
> > > +};
> > > +
> > > +static struct exit_test tests[] = {
> > > +	{"hvc",                hvc_test,           true},
> > > +	{"mmio_read_user",     mmio_read_user,     true},
> > > +	{"mmio_read_vgic",     mmio_read_vgic,     true},
> > > +	{"eoi",                eoi_test,           true},
> > > +	{"ipi",                ipi_test,           true},
> > > +};
> > > +
> > > +static void get_us_output(const char *name,
> > > +			  unsigned long cycles)
> > > +{
> > > +	unsigned int ns_per_cycle = 10^9U / cntfrq;
> > > +	unsigned int ns, us, us_frac;
> > 
> > '^' is 'xor', you probably mean 1000*1000*1000. And anyway,
> > ThunderX2 works at ~2GHz, so 10E9/get_cntfrq() is 0 for it.
> 
> The counter (cntpct_el0) ticks that fast? We could start with picoseconds
> if necessary.
> 
> > 
> > For CPUs working at 500...1000 MHz, ns_per_cycle is 1,
> > For CPUs working at 1GHz+ it is 0. What's point in it?
> 
> We're not looking at the CPU cycles, we're looking at counter ticks.
> Indeed we could rename cycles to ticks everywhere to make that more
> explicit.

Ah, I don't say that counter ticks at CPU frequency on TX2, or
something like that. I was probably confused by name. But switching to
picoseconds, or even femtoseconds sounds reasonable anyway.

Yury

Shih-Wei Li Jan. 19, 2018, 9:57 p.m. UTC | #5

Thanks for the feedback about the mistakes in math and some issues in
naming, print msg, and coding style. I'll be careful and try to avoid
the same problems the next patch set. Sorry for all of the confusion.

So we now skip the test when "sample == 0" happens over 1000 times.
This is only due to the case that "cost is < 1/cntfrq" since it's not
possible for the tick to overflow for that many times. Did I miss
something here? I do agree that we should output better msgs to tell
users that the cost of a certain test is constantly smaller than a
tick.

Also, because ticks overflow should rarely happen, I wonder if we can
simply ignore the data set if it happens? If not, any thoughts on how
to efficiently distinguish one case from the other?

Thinking of the time granularity used in the output, given there's
likely difference in hardware implementation as previously mentioned,
should we output the cost in ticks (along with frequency) like I did
in v1 & v2 instead, allowing the users to do the translation to the exact
time based on the output?

Thanks again,
Shih-Wei

On Thu, Jan 18, 2018 at 6:39 AM, Yury Norov <ynorov@caviumnetworks.com> wrote:
> On Thu, Jan 18, 2018 at 11:37:37AM +0100, Andrew Jones wrote:
>> On Thu, Jan 18, 2018 at 01:09:58PM +0300, Yury Norov wrote:
>> > On Wed, Jan 17, 2018 at 04:46:36PM -0500, Shih-Wei Li wrote:
>> > > Here we provide the support for measuring various micro level
>> > > operations on arm64. We iterate each of the tests and output
>> > > their average, minimum and maximum cost in microseconds.
>> > > Instruction barriers were used before taking timestamps to
>> > > avoid out-of-order execution or pipelining from skewing our
>> > > measurements.
>> > >
>> > > The operations we currently supported and measured are mostly
>> > > self-explanatory by their function name. For IPI, we measured the
>> > > cost of sending IPI from a source VCPU to a target VCPU, until the
>> > > target VCPU receives the IPI.
>> > >
>> > > Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
>> > > Signed-off-by: Christoffer Dall <cdall@cs.columbia.edu>
>> > > ---
>> > >  arm/Makefile.common |   1 +
>> > >  arm/micro-test.c    | 252 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>> > >  arm/unittests.cfg   |   7 ++
>> > >  3 files changed, 260 insertions(+)
>> > >  create mode 100644 arm/micro-test.c
>> > >
>> > > diff --git a/arm/Makefile.common b/arm/Makefile.common
>> > > index 0a039cf..c7d5c27 100644
>> > > --- a/arm/Makefile.common
>> > > +++ b/arm/Makefile.common
>> > > @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
>> > >  tests-common += $(TEST_DIR)/gic.flat
>> > >  tests-common += $(TEST_DIR)/psci.flat
>> > >  tests-common += $(TEST_DIR)/sieve.flat
>> > > +tests-common += $(TEST_DIR)/micro-test.flat
>> > >
>> > >  tests-all = $(tests-common) $(tests)
>> > >  all: directories $(tests-all)
>> > > diff --git a/arm/micro-test.c b/arm/micro-test.c
>> > > new file mode 100644
>> > > index 0000000..407ce8b
>> > > --- /dev/null
>> > > +++ b/arm/micro-test.c
>> > > @@ -0,0 +1,252 @@
>> > > +/*
>> > > + * Measure the cost of micro level operations.
>> > > + *
>> > > + * Copyright Columbia University
>> > > + * Author: Shih-Wei Li <shihwei@cs.columbia.edu>
>> > > + * Author: Christoffer Dall <cdall@cs.columbia.edu>
>> > > + *
>> > > + * This work is licensed under the terms of the GNU LGPL, version 2.
>> > > + */
>> > > +#include <asm/gic.h>
>> > > +#include "libcflat.h"
>> > > +#include <util.h>
>> > > +#include <limits.h>
>> > > +
>> > > +static volatile bool ipi_received;
>> > > +static volatile bool ipi_ready;
>> > > +static volatile unsigned int cntfrq;
>> > > +static volatile void *vgic_dist_addr;
>> >
>> > Volatiles considered harmful for kernel code. I think it is also correct
>> > for your case:
>> > https://lwn.net/Articles/234017/
>> >
>> > I would prefer use readl/writel accessors for interprocessor
>> > communications which guaranties cache synchronization and proper
>> > ordering.
>> >
>> > Also, ipi_received and ipi_ready are looking like synchronization
>> > primitives. Did you consider using spinlocks instead?
>> >
>> > Also-also, cntfrq is written only once at init. What for you make it
>> > volatile?
>>
>> We frequently take some shortcuts in kvm-unit-tests, using busy loops and
>> volatile globals for synchronization, trying to keep things as simple as
>> possible. I'm ok with the ipi_* synchronization variables being volatile,
>> but agree the other two don't need to be.
>>
>> >
>> > > +void (*write_eoir)(u32 irqstat);
>> > > +
>> > > +#define IPI_IRQ          1
>> > > +
>> > > +#define TRIES    (1U << 28)
>> > > +
>> > > +#define MAX_FAILURES     1000
>> > > +
>> > > +/*
>> > > + * The counter may not always start with zero, which means it could
>> > > + * overflow after some period of time.
>> > > + */
>> > > +#define COUNT(c1, c2) \
>> > > + ((c1) > (c2) ? 0 : (c2) - (c1))
>> > > +
>> > > +static uint64_t read_cc(void)
>> > > +{
>> > > + isb();
>> > > + return read_sysreg(cntpct_el0);
>> > > +}
>> > > +
>> > > +static void ipi_irq_handler(struct pt_regs *regs __unused)
>> > > +{
>> > > + u32 ack;
>> > > + ipi_ready = false;
>> > > + ipi_received = true;
>> > > + ack = gic_read_iar();
>> > > + gic_write_eoir(ack);
>> > > + ipi_ready = true;
>> > > +}
>> > > +
>> > > +static void ipi_test_secondary_entry(void *data __unused)
>> > > +{
>> > > + enum vector v = EL1H_IRQ;
>> > > + install_irq_handler(v, ipi_irq_handler);
>> > > +
>> > > + gic_enable_defaults();
>> > > +
>> > > + local_irq_enable(); /* Enter small wait-loop */
>> > > + ipi_ready = true;
>> > > + while (true);
>> > > +}
>> > > +
>> > > +static int test_init(void)
>> > > +{
>> > > + int v;
>> > > +
>> > > + v = gic_init();
>> > > + if (v == 2) {
>> > > +         vgic_dist_addr = gicv2_dist_base();
>> > > +         write_eoir = gicv2_write_eoir;
>> > > + } else if (v == 3) {
>> > > +         vgic_dist_addr = gicv3_dist_base();
>> > > +         write_eoir = gicv3_write_eoir;
>> > > + } else {
>> > > +         printf("No supported gic present, skipping tests...\n");
>> > > +         return 0;
>> > > + }
>> > > +
>> > > + ipi_ready = false;
>> > > +
>> > > + gic_enable_defaults();
>> > > + on_cpu_async(1, ipi_test_secondary_entry, 0);
>> > > +
>> > > + cntfrq = get_cntfrq();
>> > > + printf("Timer Frequency %d Hz (Output in microseconds)\n", cntfrq);
>> > > +
>> > > + return 1;
>> > > +}
>> > > +
>> > > +static unsigned long ipi_test(void)
>> > > +{
>> > > + unsigned int tries = TRIES;
>> > > + uint64_t c1, c2;
>> > > +
>> > > + while (!ipi_ready && tries--);
>> > > + assert(ipi_ready);
>> > > +
>> > > + ipi_received = false;
>> > > +
>> > > + c1 = read_cc();
>> > > +
>> > > + gic_ipi_send_single(IPI_IRQ, 1);
>> > > +
>> > > + tries = TRIES;
>> > > + while (!ipi_received && tries--);
>> > > + assert(ipi_received);
>> > > +
>> > > + c2 = read_cc();
>> > > + return COUNT(c1, c2);
>> > > +}
>> > > +
>> > > +static unsigned long hvc_test(void)
>> > > +{
>> > > + uint64_t c1, c2;
>> > > +
>> > > + c1 = read_cc();
>> > > + asm volatile("mov w0, #0x4b000000; hvc #0" ::: "w0");
>> > > + c2 = read_cc();
>> > > + return COUNT(c1, c2);
>> > > +}
>> > > +
>> > > +static unsigned long mmio_read_user(void)
>> > > +{
>> > > + uint64_t c1, c2;
>> > > + /*
>> > > +  * FIXME: Read device-id in virtio mmio here. This address
>> > > +  * needs to be updated in the future if any relevent
>> > > +  * changes in QEMU test-dev are made.
>> > > +  */
>> > > + void *mmio_read_user_addr = (void*) 0x0a000008;
>> >
>> > Again, could you rename it? device_id_ptr sounds much better.
>> >
>> > > + c1 = read_cc();
>> > > + readl(mmio_read_user_addr);
>> > > + c2 = read_cc();
>> > > + return COUNT(c1, c2);
>> > > +}
>> > > +
>> > > +static unsigned long mmio_read_vgic(void)
>> > > +{
>> > > + uint64_t c1, c2;
>> > > +
>> > > + c1 = read_cc();
>> > > + readl(vgic_dist_addr + GICD_IIDR);
>> > > + c2 = read_cc();
>> > > + return COUNT(c1, c2);
>> > > +}
>> > > +
>> > > +static unsigned long eoi_test(void)
>> > > +{
>> > > + uint64_t c1, c2;
>> > > +
>> > > + u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
>> > > +
>> > > + /* Avoid measuring assert(..) in gic_write_eoir */
>> > > + c1 = read_cc();
>> > > + write_eoir(val);
>> > > + c2 = read_cc();
>> > > +
>> > > + return COUNT(c1, c2);
>> > > +}
>> > > +
>> > > +struct exit_test {
>> > > + const char *name;
>> > > + unsigned long (*test_fn)(void);
>> > > + bool run;
>> > > +};
>> > > +
>> > > +static struct exit_test tests[] = {
>> > > + {"hvc",                hvc_test,           true},
>> > > + {"mmio_read_user",     mmio_read_user,     true},
>> > > + {"mmio_read_vgic",     mmio_read_vgic,     true},
>> > > + {"eoi",                eoi_test,           true},
>> > > + {"ipi",                ipi_test,           true},
>> > > +};
>> > > +
>> > > +static void get_us_output(const char *name,
>> > > +                   unsigned long cycles)
>> > > +{
>> > > + unsigned int ns_per_cycle = 10^9U / cntfrq;
>> > > + unsigned int ns, us, us_frac;
>> >
>> > '^' is 'xor', you probably mean 1000*1000*1000. And anyway,
>> > ThunderX2 works at ~2GHz, so 10E9/get_cntfrq() is 0 for it.
>>
>> The counter (cntpct_el0) ticks that fast? We could start with picoseconds
>> if necessary.
>>
>> >
>> > For CPUs working at 500...1000 MHz, ns_per_cycle is 1,
>> > For CPUs working at 1GHz+ it is 0. What's point in it?
>>
>> We're not looking at the CPU cycles, we're looking at counter ticks.
>> Indeed we could rename cycles to ticks everywhere to make that more
>> explicit.
>
> Ah, I don't say that counter ticks at CPU frequency on TX2, or
> something like that. I was probably confused by name. But switching to
> picoseconds, or even femtoseconds sounds reasonable anyway.
>
> Yury

Yury Norov Jan. 20, 2018, 11:26 a.m. UTC | #6

On Fri, Jan 19, 2018 at 04:57:55PM -0500, Shih-Wei Li wrote:
> Thanks for the feedback about the mistakes in math and some issues in
> naming, print msg, and coding style. I'll be careful and try to avoid
> the same problems the next patch set. Sorry for all of the confusion.
> 
> So we now skip the test when "sample == 0" happens over 1000 times.
> This is only due to the case that "cost is < 1/cntfrq" since it's not
> possible for the tick to overflow for that many times. Did I miss
> something here? I do agree that we should output better msgs to tell
> users that the cost of a certain test is constantly smaller than a
> tick.
> 
> Also, because ticks overflow should rarely happen, I wonder if we can
> simply ignore the data set if it happens? If not, any thoughts on how
> to efficiently distinguish one case from the other?

It was my very first suggestion for you - don't drop 0-cost samples 
and ignore counter overflow.

0-cost samples most probably indicate error in the test code.
Numerous (more than 1) overflows are most probably caused by broken
management of timer-counters on VM side. In both cases results of
the test are not trustworthy.

In case of broken timer-counter, too much things will get crazy, and
if you suspect it, I think it's much-much more important to write
separated test for timer counters.
 
> Thinking of the time granularity used in the output, given there's
> likely difference in hardware implementation as previously mentioned,
> should we output the cost in ticks (along with frequency) like I did
> in v1 & v2 instead, allowing the users to do the translation to the exact
> time based on the output?

Whatever you prefer.

Yury

> Thanks again,
> Shih-Wei
> 
> On Thu, Jan 18, 2018 at 6:39 AM, Yury Norov <ynorov@caviumnetworks.com> wrote:
> > On Thu, Jan 18, 2018 at 11:37:37AM +0100, Andrew Jones wrote:
> >> On Thu, Jan 18, 2018 at 01:09:58PM +0300, Yury Norov wrote:
> >> > On Wed, Jan 17, 2018 at 04:46:36PM -0500, Shih-Wei Li wrote:
> >> > > Here we provide the support for measuring various micro level
> >> > > operations on arm64. We iterate each of the tests and output
> >> > > their average, minimum and maximum cost in microseconds.
> >> > > Instruction barriers were used before taking timestamps to
> >> > > avoid out-of-order execution or pipelining from skewing our
> >> > > measurements.
> >> > >
> >> > > The operations we currently supported and measured are mostly
> >> > > self-explanatory by their function name. For IPI, we measured the
> >> > > cost of sending IPI from a source VCPU to a target VCPU, until the
> >> > > target VCPU receives the IPI.
> >> > >
> >> > > Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
> >> > > Signed-off-by: Christoffer Dall <cdall@cs.columbia.edu>
> >> > > ---
> >> > >  arm/Makefile.common |   1 +
> >> > >  arm/micro-test.c    | 252 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> > >  arm/unittests.cfg   |   7 ++
> >> > >  3 files changed, 260 insertions(+)
> >> > >  create mode 100644 arm/micro-test.c
> >> > >
> >> > > diff --git a/arm/Makefile.common b/arm/Makefile.common
> >> > > index 0a039cf..c7d5c27 100644
> >> > > --- a/arm/Makefile.common
> >> > > +++ b/arm/Makefile.common
> >> > > @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
> >> > >  tests-common += $(TEST_DIR)/gic.flat
> >> > >  tests-common += $(TEST_DIR)/psci.flat
> >> > >  tests-common += $(TEST_DIR)/sieve.flat
> >> > > +tests-common += $(TEST_DIR)/micro-test.flat
> >> > >
> >> > >  tests-all = $(tests-common) $(tests)
> >> > >  all: directories $(tests-all)
> >> > > diff --git a/arm/micro-test.c b/arm/micro-test.c
> >> > > new file mode 100644
> >> > > index 0000000..407ce8b
> >> > > --- /dev/null
> >> > > +++ b/arm/micro-test.c
> >> > > @@ -0,0 +1,252 @@
> >> > > +/*
> >> > > + * Measure the cost of micro level operations.
> >> > > + *
> >> > > + * Copyright Columbia University
> >> > > + * Author: Shih-Wei Li <shihwei@cs.columbia.edu>
> >> > > + * Author: Christoffer Dall <cdall@cs.columbia.edu>
> >> > > + *
> >> > > + * This work is licensed under the terms of the GNU LGPL, version 2.
> >> > > + */
> >> > > +#include <asm/gic.h>
> >> > > +#include "libcflat.h"
> >> > > +#include <util.h>
> >> > > +#include <limits.h>
> >> > > +
> >> > > +static volatile bool ipi_received;
> >> > > +static volatile bool ipi_ready;
> >> > > +static volatile unsigned int cntfrq;
> >> > > +static volatile void *vgic_dist_addr;
> >> >
> >> > Volatiles considered harmful for kernel code. I think it is also correct
> >> > for your case:
> >> > https://lwn.net/Articles/234017/
> >> >
> >> > I would prefer use readl/writel accessors for interprocessor
> >> > communications which guaranties cache synchronization and proper
> >> > ordering.
> >> >
> >> > Also, ipi_received and ipi_ready are looking like synchronization
> >> > primitives. Did you consider using spinlocks instead?
> >> >
> >> > Also-also, cntfrq is written only once at init. What for you make it
> >> > volatile?
> >>
> >> We frequently take some shortcuts in kvm-unit-tests, using busy loops and
> >> volatile globals for synchronization, trying to keep things as simple as
> >> possible. I'm ok with the ipi_* synchronization variables being volatile,
> >> but agree the other two don't need to be.
> >>
> >> >
> >> > > +void (*write_eoir)(u32 irqstat);
> >> > > +
> >> > > +#define IPI_IRQ          1
> >> > > +
> >> > > +#define TRIES    (1U << 28)
> >> > > +
> >> > > +#define MAX_FAILURES     1000
> >> > > +
> >> > > +/*
> >> > > + * The counter may not always start with zero, which means it could
> >> > > + * overflow after some period of time.
> >> > > + */
> >> > > +#define COUNT(c1, c2) \
> >> > > + ((c1) > (c2) ? 0 : (c2) - (c1))
> >> > > +
> >> > > +static uint64_t read_cc(void)
> >> > > +{
> >> > > + isb();
> >> > > + return read_sysreg(cntpct_el0);
> >> > > +}
> >> > > +
> >> > > +static void ipi_irq_handler(struct pt_regs *regs __unused)
> >> > > +{
> >> > > + u32 ack;
> >> > > + ipi_ready = false;
> >> > > + ipi_received = true;
> >> > > + ack = gic_read_iar();
> >> > > + gic_write_eoir(ack);
> >> > > + ipi_ready = true;
> >> > > +}
> >> > > +
> >> > > +static void ipi_test_secondary_entry(void *data __unused)
> >> > > +{
> >> > > + enum vector v = EL1H_IRQ;
> >> > > + install_irq_handler(v, ipi_irq_handler);
> >> > > +
> >> > > + gic_enable_defaults();
> >> > > +
> >> > > + local_irq_enable(); /* Enter small wait-loop */
> >> > > + ipi_ready = true;
> >> > > + while (true);
> >> > > +}
> >> > > +
> >> > > +static int test_init(void)
> >> > > +{
> >> > > + int v;
> >> > > +
> >> > > + v = gic_init();
> >> > > + if (v == 2) {
> >> > > +         vgic_dist_addr = gicv2_dist_base();
> >> > > +         write_eoir = gicv2_write_eoir;
> >> > > + } else if (v == 3) {
> >> > > +         vgic_dist_addr = gicv3_dist_base();
> >> > > +         write_eoir = gicv3_write_eoir;
> >> > > + } else {
> >> > > +         printf("No supported gic present, skipping tests...\n");
> >> > > +         return 0;
> >> > > + }
> >> > > +
> >> > > + ipi_ready = false;
> >> > > +
> >> > > + gic_enable_defaults();
> >> > > + on_cpu_async(1, ipi_test_secondary_entry, 0);
> >> > > +
> >> > > + cntfrq = get_cntfrq();
> >> > > + printf("Timer Frequency %d Hz (Output in microseconds)\n", cntfrq);
> >> > > +
> >> > > + return 1;
> >> > > +}
> >> > > +
> >> > > +static unsigned long ipi_test(void)
> >> > > +{
> >> > > + unsigned int tries = TRIES;
> >> > > + uint64_t c1, c2;
> >> > > +
> >> > > + while (!ipi_ready && tries--);
> >> > > + assert(ipi_ready);
> >> > > +
> >> > > + ipi_received = false;
> >> > > +
> >> > > + c1 = read_cc();
> >> > > +
> >> > > + gic_ipi_send_single(IPI_IRQ, 1);
> >> > > +
> >> > > + tries = TRIES;
> >> > > + while (!ipi_received && tries--);
> >> > > + assert(ipi_received);
> >> > > +
> >> > > + c2 = read_cc();
> >> > > + return COUNT(c1, c2);
> >> > > +}
> >> > > +
> >> > > +static unsigned long hvc_test(void)
> >> > > +{
> >> > > + uint64_t c1, c2;
> >> > > +
> >> > > + c1 = read_cc();
> >> > > + asm volatile("mov w0, #0x4b000000; hvc #0" ::: "w0");
> >> > > + c2 = read_cc();
> >> > > + return COUNT(c1, c2);
> >> > > +}
> >> > > +
> >> > > +static unsigned long mmio_read_user(void)
> >> > > +{
> >> > > + uint64_t c1, c2;
> >> > > + /*
> >> > > +  * FIXME: Read device-id in virtio mmio here. This address
> >> > > +  * needs to be updated in the future if any relevent
> >> > > +  * changes in QEMU test-dev are made.
> >> > > +  */
> >> > > + void *mmio_read_user_addr = (void*) 0x0a000008;
> >> >
> >> > Again, could you rename it? device_id_ptr sounds much better.
> >> >
> >> > > + c1 = read_cc();
> >> > > + readl(mmio_read_user_addr);
> >> > > + c2 = read_cc();
> >> > > + return COUNT(c1, c2);
> >> > > +}
> >> > > +
> >> > > +static unsigned long mmio_read_vgic(void)
> >> > > +{
> >> > > + uint64_t c1, c2;
> >> > > +
> >> > > + c1 = read_cc();
> >> > > + readl(vgic_dist_addr + GICD_IIDR);
> >> > > + c2 = read_cc();
> >> > > + return COUNT(c1, c2);
> >> > > +}
> >> > > +
> >> > > +static unsigned long eoi_test(void)
> >> > > +{
> >> > > + uint64_t c1, c2;
> >> > > +
> >> > > + u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
> >> > > +
> >> > > + /* Avoid measuring assert(..) in gic_write_eoir */
> >> > > + c1 = read_cc();
> >> > > + write_eoir(val);
> >> > > + c2 = read_cc();
> >> > > +
> >> > > + return COUNT(c1, c2);
> >> > > +}
> >> > > +
> >> > > +struct exit_test {
> >> > > + const char *name;
> >> > > + unsigned long (*test_fn)(void);
> >> > > + bool run;
> >> > > +};
> >> > > +
> >> > > +static struct exit_test tests[] = {
> >> > > + {"hvc",                hvc_test,           true},
> >> > > + {"mmio_read_user",     mmio_read_user,     true},
> >> > > + {"mmio_read_vgic",     mmio_read_vgic,     true},
> >> > > + {"eoi",                eoi_test,           true},
> >> > > + {"ipi",                ipi_test,           true},
> >> > > +};
> >> > > +
> >> > > +static void get_us_output(const char *name,
> >> > > +                   unsigned long cycles)
> >> > > +{
> >> > > + unsigned int ns_per_cycle = 10^9U / cntfrq;
> >> > > + unsigned int ns, us, us_frac;
> >> >
> >> > '^' is 'xor', you probably mean 1000*1000*1000. And anyway,
> >> > ThunderX2 works at ~2GHz, so 10E9/get_cntfrq() is 0 for it.
> >>
> >> The counter (cntpct_el0) ticks that fast? We could start with picoseconds
> >> if necessary.
> >>
> >> >
> >> > For CPUs working at 500...1000 MHz, ns_per_cycle is 1,
> >> > For CPUs working at 1GHz+ it is 0. What's point in it?
> >>
> >> We're not looking at the CPU cycles, we're looking at counter ticks.
> >> Indeed we could rename cycles to ticks everywhere to make that more
> >> explicit.
> >
> > Ah, I don't say that counter ticks at CPU frequency on TX2, or
> > something like that. I was probably confused by name. But switching to
> > picoseconds, or even femtoseconds sounds reasonable anyway.
> >
> > Yury

Andrew Jones Jan. 22, 2018, 8:48 a.m. UTC | #7

On Fri, Jan 19, 2018 at 04:57:55PM -0500, Shih-Wei Li wrote:
> Thanks for the feedback about the mistakes in math and some issues in
> naming, print msg, and coding style. I'll be careful and try to avoid
> the same problems the next patch set. Sorry for all of the confusion.
> 
> So we now skip the test when "sample == 0" happens over 1000 times.
> This is only due to the case that "cost is < 1/cntfrq" since it's not

Since we know the reason it happens, then why don't we just print that
message so the user can know it too?

> possible for the tick to overflow for that many times. Did I miss
> something here? I do agree that we should output better msgs to tell
> users that the cost of a certain test is constantly smaller than a
> tick.

Yes

> 
> Also, because ticks overflow should rarely happen, I wonder if we can
> simply ignore the data set if it happens? If not, any thoughts on how
> to efficiently distinguish one case from the other?

If we get tons of zeros, then the test is faster than 1/cntfrq, print it.
If we just get one, possibly two zeros, then it was overflow, skip it.

> 
> Thinking of the time granularity used in the output, given there's
> likely difference in hardware implementation as previously mentioned,
> should we output the cost in ticks (along with frequency) like I did
> in v1 & v2 instead, allowing the users to do the translation to the exact
> time based on the output?

Why? If it's possible for the user to do the math, then it's possible for
the test to do the math. It's easy to do, so why put the burden on the
user? People don't think in terms of timer ticks, they think in terms of
time.

Thanks,
drew

> 
> Thanks again,
> Shih-Wei
> 
> On Thu, Jan 18, 2018 at 6:39 AM, Yury Norov <ynorov@caviumnetworks.com> wrote:
> > On Thu, Jan 18, 2018 at 11:37:37AM +0100, Andrew Jones wrote:
> >> On Thu, Jan 18, 2018 at 01:09:58PM +0300, Yury Norov wrote:
> >> > On Wed, Jan 17, 2018 at 04:46:36PM -0500, Shih-Wei Li wrote:
> >> > > Here we provide the support for measuring various micro level
> >> > > operations on arm64. We iterate each of the tests and output
> >> > > their average, minimum and maximum cost in microseconds.
> >> > > Instruction barriers were used before taking timestamps to
> >> > > avoid out-of-order execution or pipelining from skewing our
> >> > > measurements.
> >> > >
> >> > > The operations we currently supported and measured are mostly
> >> > > self-explanatory by their function name. For IPI, we measured the
> >> > > cost of sending IPI from a source VCPU to a target VCPU, until the
> >> > > target VCPU receives the IPI.
> >> > >
> >> > > Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
> >> > > Signed-off-by: Christoffer Dall <cdall@cs.columbia.edu>
> >> > > ---
> >> > >  arm/Makefile.common |   1 +
> >> > >  arm/micro-test.c    | 252 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> > >  arm/unittests.cfg   |   7 ++
> >> > >  3 files changed, 260 insertions(+)
> >> > >  create mode 100644 arm/micro-test.c
> >> > >
> >> > > diff --git a/arm/Makefile.common b/arm/Makefile.common
> >> > > index 0a039cf..c7d5c27 100644
> >> > > --- a/arm/Makefile.common
> >> > > +++ b/arm/Makefile.common
> >> > > @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
> >> > >  tests-common += $(TEST_DIR)/gic.flat
> >> > >  tests-common += $(TEST_DIR)/psci.flat
> >> > >  tests-common += $(TEST_DIR)/sieve.flat
> >> > > +tests-common += $(TEST_DIR)/micro-test.flat
> >> > >
> >> > >  tests-all = $(tests-common) $(tests)
> >> > >  all: directories $(tests-all)
> >> > > diff --git a/arm/micro-test.c b/arm/micro-test.c
> >> > > new file mode 100644
> >> > > index 0000000..407ce8b
> >> > > --- /dev/null
> >> > > +++ b/arm/micro-test.c
> >> > > @@ -0,0 +1,252 @@
> >> > > +/*
> >> > > + * Measure the cost of micro level operations.
> >> > > + *
> >> > > + * Copyright Columbia University
> >> > > + * Author: Shih-Wei Li <shihwei@cs.columbia.edu>
> >> > > + * Author: Christoffer Dall <cdall@cs.columbia.edu>
> >> > > + *
> >> > > + * This work is licensed under the terms of the GNU LGPL, version 2.
> >> > > + */
> >> > > +#include <asm/gic.h>
> >> > > +#include "libcflat.h"
> >> > > +#include <util.h>
> >> > > +#include <limits.h>
> >> > > +
> >> > > +static volatile bool ipi_received;
> >> > > +static volatile bool ipi_ready;
> >> > > +static volatile unsigned int cntfrq;
> >> > > +static volatile void *vgic_dist_addr;
> >> >
> >> > Volatiles considered harmful for kernel code. I think it is also correct
> >> > for your case:
> >> > https://lwn.net/Articles/234017/
> >> >
> >> > I would prefer use readl/writel accessors for interprocessor
> >> > communications which guaranties cache synchronization and proper
> >> > ordering.
> >> >
> >> > Also, ipi_received and ipi_ready are looking like synchronization
> >> > primitives. Did you consider using spinlocks instead?
> >> >
> >> > Also-also, cntfrq is written only once at init. What for you make it
> >> > volatile?
> >>
> >> We frequently take some shortcuts in kvm-unit-tests, using busy loops and
> >> volatile globals for synchronization, trying to keep things as simple as
> >> possible. I'm ok with the ipi_* synchronization variables being volatile,
> >> but agree the other two don't need to be.
> >>
> >> >
> >> > > +void (*write_eoir)(u32 irqstat);
> >> > > +
> >> > > +#define IPI_IRQ          1
> >> > > +
> >> > > +#define TRIES    (1U << 28)
> >> > > +
> >> > > +#define MAX_FAILURES     1000
> >> > > +
> >> > > +/*
> >> > > + * The counter may not always start with zero, which means it could
> >> > > + * overflow after some period of time.
> >> > > + */
> >> > > +#define COUNT(c1, c2) \
> >> > > + ((c1) > (c2) ? 0 : (c2) - (c1))
> >> > > +
> >> > > +static uint64_t read_cc(void)
> >> > > +{
> >> > > + isb();
> >> > > + return read_sysreg(cntpct_el0);
> >> > > +}
> >> > > +
> >> > > +static void ipi_irq_handler(struct pt_regs *regs __unused)
> >> > > +{
> >> > > + u32 ack;
> >> > > + ipi_ready = false;
> >> > > + ipi_received = true;
> >> > > + ack = gic_read_iar();
> >> > > + gic_write_eoir(ack);
> >> > > + ipi_ready = true;
> >> > > +}
> >> > > +
> >> > > +static void ipi_test_secondary_entry(void *data __unused)
> >> > > +{
> >> > > + enum vector v = EL1H_IRQ;
> >> > > + install_irq_handler(v, ipi_irq_handler);
> >> > > +
> >> > > + gic_enable_defaults();
> >> > > +
> >> > > + local_irq_enable(); /* Enter small wait-loop */
> >> > > + ipi_ready = true;
> >> > > + while (true);
> >> > > +}
> >> > > +
> >> > > +static int test_init(void)
> >> > > +{
> >> > > + int v;
> >> > > +
> >> > > + v = gic_init();
> >> > > + if (v == 2) {
> >> > > +         vgic_dist_addr = gicv2_dist_base();
> >> > > +         write_eoir = gicv2_write_eoir;
> >> > > + } else if (v == 3) {
> >> > > +         vgic_dist_addr = gicv3_dist_base();
> >> > > +         write_eoir = gicv3_write_eoir;
> >> > > + } else {
> >> > > +         printf("No supported gic present, skipping tests...\n");
> >> > > +         return 0;
> >> > > + }
> >> > > +
> >> > > + ipi_ready = false;
> >> > > +
> >> > > + gic_enable_defaults();
> >> > > + on_cpu_async(1, ipi_test_secondary_entry, 0);
> >> > > +
> >> > > + cntfrq = get_cntfrq();
> >> > > + printf("Timer Frequency %d Hz (Output in microseconds)\n", cntfrq);
> >> > > +
> >> > > + return 1;
> >> > > +}
> >> > > +
> >> > > +static unsigned long ipi_test(void)
> >> > > +{
> >> > > + unsigned int tries = TRIES;
> >> > > + uint64_t c1, c2;
> >> > > +
> >> > > + while (!ipi_ready && tries--);
> >> > > + assert(ipi_ready);
> >> > > +
> >> > > + ipi_received = false;
> >> > > +
> >> > > + c1 = read_cc();
> >> > > +
> >> > > + gic_ipi_send_single(IPI_IRQ, 1);
> >> > > +
> >> > > + tries = TRIES;
> >> > > + while (!ipi_received && tries--);
> >> > > + assert(ipi_received);
> >> > > +
> >> > > + c2 = read_cc();
> >> > > + return COUNT(c1, c2);
> >> > > +}
> >> > > +
> >> > > +static unsigned long hvc_test(void)
> >> > > +{
> >> > > + uint64_t c1, c2;
> >> > > +
> >> > > + c1 = read_cc();
> >> > > + asm volatile("mov w0, #0x4b000000; hvc #0" ::: "w0");
> >> > > + c2 = read_cc();
> >> > > + return COUNT(c1, c2);
> >> > > +}
> >> > > +
> >> > > +static unsigned long mmio_read_user(void)
> >> > > +{
> >> > > + uint64_t c1, c2;
> >> > > + /*
> >> > > +  * FIXME: Read device-id in virtio mmio here. This address
> >> > > +  * needs to be updated in the future if any relevent
> >> > > +  * changes in QEMU test-dev are made.
> >> > > +  */
> >> > > + void *mmio_read_user_addr = (void*) 0x0a000008;
> >> >
> >> > Again, could you rename it? device_id_ptr sounds much better.
> >> >
> >> > > + c1 = read_cc();
> >> > > + readl(mmio_read_user_addr);
> >> > > + c2 = read_cc();
> >> > > + return COUNT(c1, c2);
> >> > > +}
> >> > > +
> >> > > +static unsigned long mmio_read_vgic(void)
> >> > > +{
> >> > > + uint64_t c1, c2;
> >> > > +
> >> > > + c1 = read_cc();
> >> > > + readl(vgic_dist_addr + GICD_IIDR);
> >> > > + c2 = read_cc();
> >> > > + return COUNT(c1, c2);
> >> > > +}
> >> > > +
> >> > > +static unsigned long eoi_test(void)
> >> > > +{
> >> > > + uint64_t c1, c2;
> >> > > +
> >> > > + u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
> >> > > +
> >> > > + /* Avoid measuring assert(..) in gic_write_eoir */
> >> > > + c1 = read_cc();
> >> > > + write_eoir(val);
> >> > > + c2 = read_cc();
> >> > > +
> >> > > + return COUNT(c1, c2);
> >> > > +}
> >> > > +
> >> > > +struct exit_test {
> >> > > + const char *name;
> >> > > + unsigned long (*test_fn)(void);
> >> > > + bool run;
> >> > > +};
> >> > > +
> >> > > +static struct exit_test tests[] = {
> >> > > + {"hvc",                hvc_test,           true},
> >> > > + {"mmio_read_user",     mmio_read_user,     true},
> >> > > + {"mmio_read_vgic",     mmio_read_vgic,     true},
> >> > > + {"eoi",                eoi_test,           true},
> >> > > + {"ipi",                ipi_test,           true},
> >> > > +};
> >> > > +
> >> > > +static void get_us_output(const char *name,
> >> > > +                   unsigned long cycles)
> >> > > +{
> >> > > + unsigned int ns_per_cycle = 10^9U / cntfrq;
> >> > > + unsigned int ns, us, us_frac;
> >> >
> >> > '^' is 'xor', you probably mean 1000*1000*1000. And anyway,
> >> > ThunderX2 works at ~2GHz, so 10E9/get_cntfrq() is 0 for it.
> >>
> >> The counter (cntpct_el0) ticks that fast? We could start with picoseconds
> >> if necessary.
> >>
> >> >
> >> > For CPUs working at 500...1000 MHz, ns_per_cycle is 1,
> >> > For CPUs working at 1GHz+ it is 0. What's point in it?
> >>
> >> We're not looking at the CPU cycles, we're looking at counter ticks.
> >> Indeed we could rename cycles to ticks everywhere to make that more
> >> explicit.
> >
> > Ah, I don't say that counter ticks at CPU frequency on TX2, or
> > something like that. I was probably confused by name. But switching to
> > picoseconds, or even femtoseconds sounds reasonable anyway.
> >
> > Yury
>

Shih-Wei Li Jan. 23, 2018, 6:48 p.m. UTC | #8

On Mon, Jan 22, 2018 at 3:48 AM, Andrew Jones <drjones@redhat.com> wrote:
> On Fri, Jan 19, 2018 at 04:57:55PM -0500, Shih-Wei Li wrote:
>> Thanks for the feedback about the mistakes in math and some issues in
>> naming, print msg, and coding style. I'll be careful and try to avoid
>> the same problems the next patch set. Sorry for all of the confusion.
>>
>> So we now skip the test when "sample == 0" happens over 1000 times.
>> This is only due to the case that "cost is < 1/cntfrq" since it's not
>
> Since we know the reason it happens, then why don't we just print that
> message so the user can know it too?

Yes, we can print out some info when the counter overflows.

>
>> possible for the tick to overflow for that many times. Did I miss
>> something here? I do agree that we should output better msgs to tell
>> users that the cost of a certain test is constantly smaller than a
>> tick.
>
> Yes
>
>>
>> Also, because ticks overflow should rarely happen, I wonder if we can
>> simply ignore the data set if it happens? If not, any thoughts on how
>> to efficiently distinguish one case from the other?
>
> If we get tons of zeros, then the test is faster than 1/cntfrq, print it.
> If we just get one, possibly two zeros, then it was overflow, skip it.

Yes.

>
>>
>> Thinking of the time granularity used in the output, given there's
>> likely difference in hardware implementation as previously mentioned,
>> should we output the cost in ticks (along with frequency) like I did
>> in v1 & v2 instead, allowing the users to do the translation to the exact
>> time based on the output?
>
> Why? If it's possible for the user to do the math, then it's possible for
> the test to do the math. It's easy to do, so why put the burden on the
> user? People don't think in terms of timer ticks, they think in terms of
> time.
>

I was unsure about the right granularity (microseconds vs picoseconds)
to use. I can fix the math and output in microseconds in v4 if it
sounds good to you guys?

Thanks,
Shih-Wei

> Thanks,
> drew
>
>>
>> Thanks again,
>> Shih-Wei
>>
>> On Thu, Jan 18, 2018 at 6:39 AM, Yury Norov <ynorov@caviumnetworks.com> wrote:
>> > On Thu, Jan 18, 2018 at 11:37:37AM +0100, Andrew Jones wrote:
>> >> On Thu, Jan 18, 2018 at 01:09:58PM +0300, Yury Norov wrote:
>> >> > On Wed, Jan 17, 2018 at 04:46:36PM -0500, Shih-Wei Li wrote:
>> >> > > Here we provide the support for measuring various micro level
>> >> > > operations on arm64. We iterate each of the tests and output
>> >> > > their average, minimum and maximum cost in microseconds.
>> >> > > Instruction barriers were used before taking timestamps to
>> >> > > avoid out-of-order execution or pipelining from skewing our
>> >> > > measurements.
>> >> > >
>> >> > > The operations we currently supported and measured are mostly
>> >> > > self-explanatory by their function name. For IPI, we measured the
>> >> > > cost of sending IPI from a source VCPU to a target VCPU, until the
>> >> > > target VCPU receives the IPI.
>> >> > >
>> >> > > Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
>> >> > > Signed-off-by: Christoffer Dall <cdall@cs.columbia.edu>
>> >> > > ---
>> >> > >  arm/Makefile.common |   1 +
>> >> > >  arm/micro-test.c    | 252 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >> > >  arm/unittests.cfg   |   7 ++
>> >> > >  3 files changed, 260 insertions(+)
>> >> > >  create mode 100644 arm/micro-test.c
>> >> > >
>> >> > > diff --git a/arm/Makefile.common b/arm/Makefile.common
>> >> > > index 0a039cf..c7d5c27 100644
>> >> > > --- a/arm/Makefile.common
>> >> > > +++ b/arm/Makefile.common
>> >> > > @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
>> >> > >  tests-common += $(TEST_DIR)/gic.flat
>> >> > >  tests-common += $(TEST_DIR)/psci.flat
>> >> > >  tests-common += $(TEST_DIR)/sieve.flat
>> >> > > +tests-common += $(TEST_DIR)/micro-test.flat
>> >> > >
>> >> > >  tests-all = $(tests-common) $(tests)
>> >> > >  all: directories $(tests-all)
>> >> > > diff --git a/arm/micro-test.c b/arm/micro-test.c
>> >> > > new file mode 100644
>> >> > > index 0000000..407ce8b
>> >> > > --- /dev/null
>> >> > > +++ b/arm/micro-test.c
>> >> > > @@ -0,0 +1,252 @@
>> >> > > +/*
>> >> > > + * Measure the cost of micro level operations.
>> >> > > + *
>> >> > > + * Copyright Columbia University
>> >> > > + * Author: Shih-Wei Li <shihwei@cs.columbia.edu>
>> >> > > + * Author: Christoffer Dall <cdall@cs.columbia.edu>
>> >> > > + *
>> >> > > + * This work is licensed under the terms of the GNU LGPL, version 2.
>> >> > > + */
>> >> > > +#include <asm/gic.h>
>> >> > > +#include "libcflat.h"
>> >> > > +#include <util.h>
>> >> > > +#include <limits.h>
>> >> > > +
>> >> > > +static volatile bool ipi_received;
>> >> > > +static volatile bool ipi_ready;
>> >> > > +static volatile unsigned int cntfrq;
>> >> > > +static volatile void *vgic_dist_addr;
>> >> >
>> >> > Volatiles considered harmful for kernel code. I think it is also correct
>> >> > for your case:
>> >> > https://lwn.net/Articles/234017/
>> >> >
>> >> > I would prefer use readl/writel accessors for interprocessor
>> >> > communications which guaranties cache synchronization and proper
>> >> > ordering.
>> >> >
>> >> > Also, ipi_received and ipi_ready are looking like synchronization
>> >> > primitives. Did you consider using spinlocks instead?
>> >> >
>> >> > Also-also, cntfrq is written only once at init. What for you make it
>> >> > volatile?
>> >>
>> >> We frequently take some shortcuts in kvm-unit-tests, using busy loops and
>> >> volatile globals for synchronization, trying to keep things as simple as
>> >> possible. I'm ok with the ipi_* synchronization variables being volatile,
>> >> but agree the other two don't need to be.
>> >>
>> >> >
>> >> > > +void (*write_eoir)(u32 irqstat);
>> >> > > +
>> >> > > +#define IPI_IRQ          1
>> >> > > +
>> >> > > +#define TRIES    (1U << 28)
>> >> > > +
>> >> > > +#define MAX_FAILURES     1000
>> >> > > +
>> >> > > +/*
>> >> > > + * The counter may not always start with zero, which means it could
>> >> > > + * overflow after some period of time.
>> >> > > + */
>> >> > > +#define COUNT(c1, c2) \
>> >> > > + ((c1) > (c2) ? 0 : (c2) - (c1))
>> >> > > +
>> >> > > +static uint64_t read_cc(void)
>> >> > > +{
>> >> > > + isb();
>> >> > > + return read_sysreg(cntpct_el0);
>> >> > > +}
>> >> > > +
>> >> > > +static void ipi_irq_handler(struct pt_regs *regs __unused)
>> >> > > +{
>> >> > > + u32 ack;
>> >> > > + ipi_ready = false;
>> >> > > + ipi_received = true;
>> >> > > + ack = gic_read_iar();
>> >> > > + gic_write_eoir(ack);
>> >> > > + ipi_ready = true;
>> >> > > +}
>> >> > > +
>> >> > > +static void ipi_test_secondary_entry(void *data __unused)
>> >> > > +{
>> >> > > + enum vector v = EL1H_IRQ;
>> >> > > + install_irq_handler(v, ipi_irq_handler);
>> >> > > +
>> >> > > + gic_enable_defaults();
>> >> > > +
>> >> > > + local_irq_enable(); /* Enter small wait-loop */
>> >> > > + ipi_ready = true;
>> >> > > + while (true);
>> >> > > +}
>> >> > > +
>> >> > > +static int test_init(void)
>> >> > > +{
>> >> > > + int v;
>> >> > > +
>> >> > > + v = gic_init();
>> >> > > + if (v == 2) {
>> >> > > +         vgic_dist_addr = gicv2_dist_base();
>> >> > > +         write_eoir = gicv2_write_eoir;
>> >> > > + } else if (v == 3) {
>> >> > > +         vgic_dist_addr = gicv3_dist_base();
>> >> > > +         write_eoir = gicv3_write_eoir;
>> >> > > + } else {
>> >> > > +         printf("No supported gic present, skipping tests...\n");
>> >> > > +         return 0;
>> >> > > + }
>> >> > > +
>> >> > > + ipi_ready = false;
>> >> > > +
>> >> > > + gic_enable_defaults();
>> >> > > + on_cpu_async(1, ipi_test_secondary_entry, 0);
>> >> > > +
>> >> > > + cntfrq = get_cntfrq();
>> >> > > + printf("Timer Frequency %d Hz (Output in microseconds)\n", cntfrq);
>> >> > > +
>> >> > > + return 1;
>> >> > > +}
>> >> > > +
>> >> > > +static unsigned long ipi_test(void)
>> >> > > +{
>> >> > > + unsigned int tries = TRIES;
>> >> > > + uint64_t c1, c2;
>> >> > > +
>> >> > > + while (!ipi_ready && tries--);
>> >> > > + assert(ipi_ready);
>> >> > > +
>> >> > > + ipi_received = false;
>> >> > > +
>> >> > > + c1 = read_cc();
>> >> > > +
>> >> > > + gic_ipi_send_single(IPI_IRQ, 1);
>> >> > > +
>> >> > > + tries = TRIES;
>> >> > > + while (!ipi_received && tries--);
>> >> > > + assert(ipi_received);
>> >> > > +
>> >> > > + c2 = read_cc();
>> >> > > + return COUNT(c1, c2);
>> >> > > +}
>> >> > > +
>> >> > > +static unsigned long hvc_test(void)
>> >> > > +{
>> >> > > + uint64_t c1, c2;
>> >> > > +
>> >> > > + c1 = read_cc();
>> >> > > + asm volatile("mov w0, #0x4b000000; hvc #0" ::: "w0");
>> >> > > + c2 = read_cc();
>> >> > > + return COUNT(c1, c2);
>> >> > > +}
>> >> > > +
>> >> > > +static unsigned long mmio_read_user(void)
>> >> > > +{
>> >> > > + uint64_t c1, c2;
>> >> > > + /*
>> >> > > +  * FIXME: Read device-id in virtio mmio here. This address
>> >> > > +  * needs to be updated in the future if any relevent
>> >> > > +  * changes in QEMU test-dev are made.
>> >> > > +  */
>> >> > > + void *mmio_read_user_addr = (void*) 0x0a000008;
>> >> >
>> >> > Again, could you rename it? device_id_ptr sounds much better.
>> >> >
>> >> > > + c1 = read_cc();
>> >> > > + readl(mmio_read_user_addr);
>> >> > > + c2 = read_cc();
>> >> > > + return COUNT(c1, c2);
>> >> > > +}
>> >> > > +
>> >> > > +static unsigned long mmio_read_vgic(void)
>> >> > > +{
>> >> > > + uint64_t c1, c2;
>> >> > > +
>> >> > > + c1 = read_cc();
>> >> > > + readl(vgic_dist_addr + GICD_IIDR);
>> >> > > + c2 = read_cc();
>> >> > > + return COUNT(c1, c2);
>> >> > > +}
>> >> > > +
>> >> > > +static unsigned long eoi_test(void)
>> >> > > +{
>> >> > > + uint64_t c1, c2;
>> >> > > +
>> >> > > + u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
>> >> > > +
>> >> > > + /* Avoid measuring assert(..) in gic_write_eoir */
>> >> > > + c1 = read_cc();
>> >> > > + write_eoir(val);
>> >> > > + c2 = read_cc();
>> >> > > +
>> >> > > + return COUNT(c1, c2);
>> >> > > +}
>> >> > > +
>> >> > > +struct exit_test {
>> >> > > + const char *name;
>> >> > > + unsigned long (*test_fn)(void);
>> >> > > + bool run;
>> >> > > +};
>> >> > > +
>> >> > > +static struct exit_test tests[] = {
>> >> > > + {"hvc",                hvc_test,           true},
>> >> > > + {"mmio_read_user",     mmio_read_user,     true},
>> >> > > + {"mmio_read_vgic",     mmio_read_vgic,     true},
>> >> > > + {"eoi",                eoi_test,           true},
>> >> > > + {"ipi",                ipi_test,           true},
>> >> > > +};
>> >> > > +
>> >> > > +static void get_us_output(const char *name,
>> >> > > +                   unsigned long cycles)
>> >> > > +{
>> >> > > + unsigned int ns_per_cycle = 10^9U / cntfrq;
>> >> > > + unsigned int ns, us, us_frac;
>> >> >
>> >> > '^' is 'xor', you probably mean 1000*1000*1000. And anyway,
>> >> > ThunderX2 works at ~2GHz, so 10E9/get_cntfrq() is 0 for it.
>> >>
>> >> The counter (cntpct_el0) ticks that fast? We could start with picoseconds
>> >> if necessary.
>> >>
>> >> >
>> >> > For CPUs working at 500...1000 MHz, ns_per_cycle is 1,
>> >> > For CPUs working at 1GHz+ it is 0. What's point in it?
>> >>
>> >> We're not looking at the CPU cycles, we're looking at counter ticks.
>> >> Indeed we could rename cycles to ticks everywhere to make that more
>> >> explicit.
>> >
>> > Ah, I don't say that counter ticks at CPU frequency on TX2, or
>> > something like that. I was probably confused by name. But switching to
>> > picoseconds, or even femtoseconds sounds reasonable anyway.
>> >
>> > Yury
>>

Andrew Jones Jan. 23, 2018, 7:54 p.m. UTC | #9

On Tue, Jan 23, 2018 at 01:48:43PM -0500, Shih-Wei Li wrote:
> On Mon, Jan 22, 2018 at 3:48 AM, Andrew Jones <drjones@redhat.com> wrote:
> > On Fri, Jan 19, 2018 at 04:57:55PM -0500, Shih-Wei Li wrote:
> >> Thanks for the feedback about the mistakes in math and some issues in
> >> naming, print msg, and coding style. I'll be careful and try to avoid
> >> the same problems the next patch set. Sorry for all of the confusion.
> >>
> >> So we now skip the test when "sample == 0" happens over 1000 times.
> >> This is only due to the case that "cost is < 1/cntfrq" since it's not
> >
> > Since we know the reason it happens, then why don't we just print that
> > message so the user can know it too?
> 
> Yes, we can print out some info when the counter overflows.

It's not overflow we need to worry about printing as much as when the
sample is consistently zero due to running in less time than 1/cntfrq.

> 
> >
> >> possible for the tick to overflow for that many times. Did I miss
> >> something here? I do agree that we should output better msgs to tell
> >> users that the cost of a certain test is constantly smaller than a
> >> tick.
> >
> > Yes
> >
> >>
> >> Also, because ticks overflow should rarely happen, I wonder if we can
> >> simply ignore the data set if it happens? If not, any thoughts on how
> >> to efficiently distinguish one case from the other?
> >
> > If we get tons of zeros, then the test is faster than 1/cntfrq, print it.
> > If we just get one, possibly two zeros, then it was overflow, skip it.
> 
> Yes.
> 
> >
> >>
> >> Thinking of the time granularity used in the output, given there's
> >> likely difference in hardware implementation as previously mentioned,
> >> should we output the cost in ticks (along with frequency) like I did
> >> in v1 & v2 instead, allowing the users to do the translation to the exact
> >> time based on the output?
> >
> > Why? If it's possible for the user to do the math, then it's possible for
> > the test to do the math. It's easy to do, so why put the burden on the
> > user? People don't think in terms of timer ticks, they think in terms of
> > time.
> >
> 
> I was unsure about the right granularity (microseconds vs picoseconds)
> to use. I can fix the math and output in microseconds in v4 if it
> sounds good to you guys?

Maybe computing in pico and outputting in nano is better for the long
term, but when I last tested your series micro was good enough for the
hardware I have access to.

Thanks,
drew

Christoffer Dall Jan. 24, 2018, 8:17 a.m. UTC | #10

Hi Shih-Wei,

On Fri, Jan 19, 2018 at 04:57:55PM -0500, Shih-Wei Li wrote:
> Thanks for the feedback about the mistakes in math and some issues in
> naming, print msg, and coding style. I'll be careful and try to avoid
> the same problems the next patch set. Sorry for all of the confusion.
> 
> So we now skip the test when "sample == 0" happens over 1000 times.
> This is only due to the case that "cost is < 1/cntfrq" since it's not
> possible for the tick to overflow for that many times. Did I miss
> something here? I do agree that we should output better msgs to tell
> users that the cost of a certain test is constantly smaller than a
> tick.
> 

I think for things like vmexit counts, it's very likely that all the
samples will result in 0 ticks on many systems (fast CPU and slow arch
counter; the architecture doesn't give us any guarantees here).  For
example, a system with a 2 GHz CPU and a 1 MHz counter will give you a
granularity of 2000 cycles for each counter tick, which is not that
useful for low-level tuning of KVM.

So what I thought we were going to do was:

main_test_function()
{
	long ntimes = NTIMES;
	long cost, total_cost;

	cnt1 = read_cnt();
	do {
		run_test();
	} while(ntimes--);
	cnt2 = read_cnt();

	if (verify_sane_counter(cnt1, cnt2))
		return;

	total_cost = to_nanoseconds(cnt2 - cnt1);
	cost = total_cost / NTIMES;
	printf("testname: %l (%l)\n", cost, total_cost);
}

And in that way amortize the potential lack of precision over all the
iterations.  Did I miss some prior discussion about why that was a bad
idea?

It would also be possible to have two functions, one that does the above
and one that does a per-run measurement, in case the user wants to know
min/max/stddev and is running on a system with sufficient precision.
The method could be chosen via an argument.

Thanks,
-Christoffer

Andrew Jones Jan. 24, 2018, 8:30 a.m. UTC | #11

On Wed, Jan 24, 2018 at 09:17:55AM +0100, Christoffer Dall wrote:
> Hi Shih-Wei,
> 
> On Fri, Jan 19, 2018 at 04:57:55PM -0500, Shih-Wei Li wrote:
> > Thanks for the feedback about the mistakes in math and some issues in
> > naming, print msg, and coding style. I'll be careful and try to avoid
> > the same problems the next patch set. Sorry for all of the confusion.
> > 
> > So we now skip the test when "sample == 0" happens over 1000 times.
> > This is only due to the case that "cost is < 1/cntfrq" since it's not
> > possible for the tick to overflow for that many times. Did I miss
> > something here? I do agree that we should output better msgs to tell
> > users that the cost of a certain test is constantly smaller than a
> > tick.
> > 
> 
> I think for things like vmexit counts, it's very likely that all the
> samples will result in 0 ticks on many systems (fast CPU and slow arch
> counter; the architecture doesn't give us any guarantees here).  For
> example, a system with a 2 GHz CPU and a 1 MHz counter will give you a
> granularity of 2000 cycles for each counter tick, which is not that
> useful for low-level tuning of KVM.
> 
> So what I thought we were going to do was:
> 
> main_test_function()
> {
> 	long ntimes = NTIMES;
> 	long cost, total_cost;
> 
> 	cnt1 = read_cnt();
> 	do {
> 		run_test();
> 	} while(ntimes--);
> 	cnt2 = read_cnt();
> 
> 	if (verify_sane_counter(cnt1, cnt2))
> 		return;
> 
> 	total_cost = to_nanoseconds(cnt2 - cnt1);
> 	cost = total_cost / NTIMES;
> 	printf("testname: %l (%l)\n", cost, total_cost);
> }
> 
> And in that way amortize the potential lack of precision over all the
> iterations.  Did I miss some prior discussion about why that was a bad
> idea?

Not that I know of, but I missed the above proposal, which I completely
agree with.

> 
> It would also be possible to have two functions, one that does the above
> and one that does a per-run measurement, in case the user wants to know
> min/max/stddev and is running on a system with sufficient precision.
> The method could be chosen via an argument.

We should be able to just make NTIMES variable, setting it to one when a
per-run measurement is desired, right? Anyway, I'm fine with having set
to a reasonable value, not variable / no argument, for initial merge.

Thanks,
drew

Christoffer Dall Jan. 24, 2018, 9:23 a.m. UTC | #12

On Wed, Jan 24, 2018 at 09:30:28AM +0100, Andrew Jones wrote:
> On Wed, Jan 24, 2018 at 09:17:55AM +0100, Christoffer Dall wrote:
> > Hi Shih-Wei,
> > 
> > On Fri, Jan 19, 2018 at 04:57:55PM -0500, Shih-Wei Li wrote:
> > > Thanks for the feedback about the mistakes in math and some issues in
> > > naming, print msg, and coding style. I'll be careful and try to avoid
> > > the same problems the next patch set. Sorry for all of the confusion.
> > > 
> > > So we now skip the test when "sample == 0" happens over 1000 times.
> > > This is only due to the case that "cost is < 1/cntfrq" since it's not
> > > possible for the tick to overflow for that many times. Did I miss
> > > something here? I do agree that we should output better msgs to tell
> > > users that the cost of a certain test is constantly smaller than a
> > > tick.
> > > 
> > 
> > I think for things like vmexit counts, it's very likely that all the
> > samples will result in 0 ticks on many systems (fast CPU and slow arch
> > counter; the architecture doesn't give us any guarantees here).  For
> > example, a system with a 2 GHz CPU and a 1 MHz counter will give you a
> > granularity of 2000 cycles for each counter tick, which is not that
> > useful for low-level tuning of KVM.
> > 
> > So what I thought we were going to do was:
> > 
> > main_test_function()
> > {
> > 	long ntimes = NTIMES;
> > 	long cost, total_cost;
> > 
> > 	cnt1 = read_cnt();
> > 	do {
> > 		run_test();
> > 	} while(ntimes--);
> > 	cnt2 = read_cnt();
> > 
> > 	if (verify_sane_counter(cnt1, cnt2))
> > 		return;
> > 
> > 	total_cost = to_nanoseconds(cnt2 - cnt1);
> > 	cost = total_cost / NTIMES;
> > 	printf("testname: %l (%l)\n", cost, total_cost);
> > }
> > 
> > And in that way amortize the potential lack of precision over all the
> > iterations.  Did I miss some prior discussion about why that was a bad
> > idea?
> 
> Not that I know of, but I missed the above proposal, which I completely
> agree with.
> 
> > 
> > It would also be possible to have two functions, one that does the above
> > and one that does a per-run measurement, in case the user wants to know
> > min/max/stddev and is running on a system with sufficient precision.
> > The method could be chosen via an argument.
> 
> We should be able to just make NTIMES variable, setting it to one when a
> per-run measurement is desired, right? 

Not quite.  I think you'll get different results from starting up the
whole test suite, doing a single measurement, and then shutting
everything down again, versus a single startup, and then running
everything for 100,000 times, measurig each time, and reporting
min/max/avg/mean/etc.

We found the latter metrics useful when analyzing things like key/value
store workloads and benchmark results, where individual deviations can
really skew overall results.


> Anyway, I'm fine with having set
> to a reasonable value, not variable / no argument, for initial merge.
> 

I'd be absolutely fine with that as well, we can add the additional
functionality in a separate patch, or later on, depending on how much
time Shih-Wei has :)

Thanks,
-Christoffer

Shih-Wei Li Jan. 25, 2018, 2:55 a.m. UTC | #13

Hi Christoffer,

In the previous patch, we did the avg and min/max altogether in a
single function, in which we can skip the test (and output msg to tell
the user about it) if its cost is smaller than 1 cnnfrq for more than
X times.

I think we can use two functions, one for "avg" and one for
"minx_max_std" as you mentioned if it looks more straightforward. We
can first run the avg() and check if the result can be valid, if yes,
then we can proceed to do the min_max_std() test; otherwise, we just
tell the users that we cannot get valid results from the targeted
test. Is this close what you proposed earlier?

And yes, I can set NTIMES in avg() (and min_max_std()) as a constant
value for now.

Thanks,
Shih-Wei

On Wed, Jan 24, 2018 at 4:23 AM, Christoffer Dall
<christoffer.dall@linaro.org> wrote:
> On Wed, Jan 24, 2018 at 09:30:28AM +0100, Andrew Jones wrote:
>> On Wed, Jan 24, 2018 at 09:17:55AM +0100, Christoffer Dall wrote:
>> > Hi Shih-Wei,
>> >
>> > On Fri, Jan 19, 2018 at 04:57:55PM -0500, Shih-Wei Li wrote:
>> > > Thanks for the feedback about the mistakes in math and some issues in
>> > > naming, print msg, and coding style. I'll be careful and try to avoid
>> > > the same problems the next patch set. Sorry for all of the confusion.
>> > >
>> > > So we now skip the test when "sample == 0" happens over 1000 times.
>> > > This is only due to the case that "cost is < 1/cntfrq" since it's not
>> > > possible for the tick to overflow for that many times. Did I miss
>> > > something here? I do agree that we should output better msgs to tell
>> > > users that the cost of a certain test is constantly smaller than a
>> > > tick.
>> > >
>> >
>> > I think for things like vmexit counts, it's very likely that all the
>> > samples will result in 0 ticks on many systems (fast CPU and slow arch
>> > counter; the architecture doesn't give us any guarantees here).  For
>> > example, a system with a 2 GHz CPU and a 1 MHz counter will give you a
>> > granularity of 2000 cycles for each counter tick, which is not that
>> > useful for low-level tuning of KVM.
>> >
>> > So what I thought we were going to do was:
>> >
>> > main_test_function()
>> > {
>> >     long ntimes = NTIMES;
>> >     long cost, total_cost;
>> >
>> >     cnt1 = read_cnt();
>> >     do {
>> >             run_test();
>> >     } while(ntimes--);
>> >     cnt2 = read_cnt();
>> >
>> >     if (verify_sane_counter(cnt1, cnt2))
>> >             return;
>> >
>> >     total_cost = to_nanoseconds(cnt2 - cnt1);
>> >     cost = total_cost / NTIMES;
>> >     printf("testname: %l (%l)\n", cost, total_cost);
>> > }
>> >
>> > And in that way amortize the potential lack of precision over all the
>> > iterations.  Did I miss some prior discussion about why that was a bad
>> > idea?
>>
>> Not that I know of, but I missed the above proposal, which I completely
>> agree with.
>>
>> >
>> > It would also be possible to have two functions, one that does the above
>> > and one that does a per-run measurement, in case the user wants to know
>> > min/max/stddev and is running on a system with sufficient precision.
>> > The method could be chosen via an argument.
>>
>> We should be able to just make NTIMES variable, setting it to one when a
>> per-run measurement is desired, right?
>
> Not quite.  I think you'll get different results from starting up the
> whole test suite, doing a single measurement, and then shutting
> everything down again, versus a single startup, and then running
> everything for 100,000 times, measurig each time, and reporting
> min/max/avg/mean/etc.
>
> We found the latter metrics useful when analyzing things like key/value
> store workloads and benchmark results, where individual deviations can
> really skew overall results.
>
>
>> Anyway, I'm fine with having set
>> to a reasonable value, not variable / no argument, for initial merge.
>>
>
> I'd be absolutely fine with that as well, we can add the additional
> functionality in a separate patch, or later on, depending on how much
> time Shih-Wei has :)
>
> Thanks,
> -Christoffer

Christoffer Dall Jan. 25, 2018, 8:52 a.m. UTC | #14

Hi Shih-Wei,

(Please try to refrain from top-posting on these lists.)

On Wed, Jan 24, 2018 at 09:55:49PM -0500, Shih-Wei Li wrote:
> Hi Christoffer,
> 
> In the previous patch, we did the avg and min/max altogether in a
> single function, in which we can skip the test (and output msg to tell
> the user about it) if its cost is smaller than 1 cnnfrq for more than
> X times.

The problem is not just about the cost being smaller than a single tick,
it's also about granularity.  Even if I get one tick, on some platform
all that tells me is "this took between 2000 and 4000 cycles".  That's
not very precise...

> 
> I think we can use two functions, one for "avg" and one for
> "minx_max_std" as you mentioned if it looks more straightforward. We
> can first run the avg() and check if the result can be valid, if yes,
> then we can proceed to do the min_max_std() test; otherwise, we just
> tell the users that we cannot get valid results from the targeted
> test. Is this close what you proposed earlier?

I think what we're most commonly interested in is the average cost of an
operation when it is repeated many times.  I think the best way you can
achieve a reasonable and accurate measurement of that is using my
proposal below, where you don't read the counter for every single tiny
operation, but you average it out over NTIMES.

I think you should focus on doing that first, and getting this clean and
correct, which should help with some of the concerns about checking for
overflows etc., which really only applied to the cycle counter and not
the arch counter.

Then, in a separate follow-up patch, I think it would be helpful to
introduce the framework that lets you do more fine-grained measurements,
including min/max/std/mean, which people like me can use when tweaking
the system to actually use the cycle counter from the PMU instead.  This
patch would then need to validate the cycle counter reads to some
degree.

> 
> And yes, I can set NTIMES in avg() (and min_max_std()) as a constant
> value for now.
> 

It should be trivial to add an additional patch in the end that lets the
user override NTIMES, but again, just introduce it as a separate patch
if it makes things easier.

Thanks,
-Christoffer

> 
> On Wed, Jan 24, 2018 at 4:23 AM, Christoffer Dall
> <christoffer.dall@linaro.org> wrote:
> > On Wed, Jan 24, 2018 at 09:30:28AM +0100, Andrew Jones wrote:
> >> On Wed, Jan 24, 2018 at 09:17:55AM +0100, Christoffer Dall wrote:
> >> > Hi Shih-Wei,
> >> >
> >> > On Fri, Jan 19, 2018 at 04:57:55PM -0500, Shih-Wei Li wrote:
> >> > > Thanks for the feedback about the mistakes in math and some issues in
> >> > > naming, print msg, and coding style. I'll be careful and try to avoid
> >> > > the same problems the next patch set. Sorry for all of the confusion.
> >> > >
> >> > > So we now skip the test when "sample == 0" happens over 1000 times.
> >> > > This is only due to the case that "cost is < 1/cntfrq" since it's not
> >> > > possible for the tick to overflow for that many times. Did I miss
> >> > > something here? I do agree that we should output better msgs to tell
> >> > > users that the cost of a certain test is constantly smaller than a
> >> > > tick.
> >> > >
> >> >
> >> > I think for things like vmexit counts, it's very likely that all the
> >> > samples will result in 0 ticks on many systems (fast CPU and slow arch
> >> > counter; the architecture doesn't give us any guarantees here).  For
> >> > example, a system with a 2 GHz CPU and a 1 MHz counter will give you a
> >> > granularity of 2000 cycles for each counter tick, which is not that
> >> > useful for low-level tuning of KVM.
> >> >
> >> > So what I thought we were going to do was:
> >> >
> >> > main_test_function()
> >> > {
> >> >     long ntimes = NTIMES;
> >> >     long cost, total_cost;
> >> >
> >> >     cnt1 = read_cnt();
> >> >     do {
> >> >             run_test();
> >> >     } while(ntimes--);
> >> >     cnt2 = read_cnt();
> >> >
> >> >     if (verify_sane_counter(cnt1, cnt2))
> >> >             return;
> >> >
> >> >     total_cost = to_nanoseconds(cnt2 - cnt1);
> >> >     cost = total_cost / NTIMES;
> >> >     printf("testname: %l (%l)\n", cost, total_cost);
> >> > }
> >> >
> >> > And in that way amortize the potential lack of precision over all the
> >> > iterations.  Did I miss some prior discussion about why that was a bad
> >> > idea?
> >>
> >> Not that I know of, but I missed the above proposal, which I completely
> >> agree with.
> >>
> >> >
> >> > It would also be possible to have two functions, one that does the above
> >> > and one that does a per-run measurement, in case the user wants to know
> >> > min/max/stddev and is running on a system with sufficient precision.
> >> > The method could be chosen via an argument.
> >>
> >> We should be able to just make NTIMES variable, setting it to one when a
> >> per-run measurement is desired, right?
> >
> > Not quite.  I think you'll get different results from starting up the
> > whole test suite, doing a single measurement, and then shutting
> > everything down again, versus a single startup, and then running
> > everything for 100,000 times, measurig each time, and reporting
> > min/max/avg/mean/etc.
> >
> > We found the latter metrics useful when analyzing things like key/value
> > store workloads and benchmark results, where individual deviations can
> > really skew overall results.
> >
> >
> >> Anyway, I'm fine with having set
> >> to a reasonable value, not variable / no argument, for initial merge.
> >>
> >
> > I'd be absolutely fine with that as well, we can add the additional
> > functionality in a separate patch, or later on, depending on how much
> > time Shih-Wei has :)
> >
> > Thanks,
> > -Christoffer
>

Shih-Wei Li Jan. 25, 2018, 1:45 p.m. UTC | #15

Hi Christoffer,

On Thu, Jan 25, 2018 at 3:52 AM, Christoffer Dall
<christoffer.dall@linaro.org> wrote:
> Hi Shih-Wei,
>
> (Please try to refrain from top-posting on these lists.)
>
> On Wed, Jan 24, 2018 at 09:55:49PM -0500, Shih-Wei Li wrote:
>> Hi Christoffer,
>>
>> In the previous patch, we did the avg and min/max altogether in a
>> single function, in which we can skip the test (and output msg to tell
>> the user about it) if its cost is smaller than 1 cnnfrq for more than
>> X times.
>
> The problem is not just about the cost being smaller than a single tick,
> it's also about granularity.  Even if I get one tick, on some platform
> all that tells me is "this took between 2000 and 4000 cycles".  That's
> not very precise...

Yes, I absolutely agree with you on this :)

>
>>
>> I think we can use two functions, one for "avg" and one for
>> "minx_max_std" as you mentioned if it looks more straightforward. We
>> can first run the avg() and check if the result can be valid, if yes,
>> then we can proceed to do the min_max_std() test; otherwise, we just
>> tell the users that we cannot get valid results from the targeted
>> test. Is this close what you proposed earlier?
>
> I think what we're most commonly interested in is the average cost of an
> operation when it is repeated many times.  I think the best way you can
> achieve a reasonable and accurate measurement of that is using my
> proposal below, where you don't read the counter for every single tiny
> operation, but you average it out over NTIMES.
>
> I think you should focus on doing that first, and getting this clean and
> correct, which should help with some of the concerns about checking for
> overflows etc., which really only applied to the cycle counter and not
> the arch counter.
>
> Then, in a separate follow-up patch, I think it would be helpful to
> introduce the framework that lets you do more fine-grained measurements,
> including min/max/std/mean, which people like me can use when tweaking
> the system to actually use the cycle counter from the PMU instead.  This
> patch would then need to validate the cycle counter reads to some
> degree.
>

Yes.

>>
>> And yes, I can set NTIMES in avg() (and min_max_std()) as a constant
>> value for now.
>>
>
> It should be trivial to add an additional patch in the end that lets the
> user override NTIMES, but again, just introduce it as a separate patch
> if it makes things easier.

Yes, I will first clean up the code and support measuring the
amortized cost using arch counter, then provide support for the
fine-grain measurements and setups for NTIMES in the follow-up
patches.

Thanks,
Shih-Wei

>
> Thanks,
> -Christoffer
>
>>
>> On Wed, Jan 24, 2018 at 4:23 AM, Christoffer Dall
>> <christoffer.dall@linaro.org> wrote:
>> > On Wed, Jan 24, 2018 at 09:30:28AM +0100, Andrew Jones wrote:
>> >> On Wed, Jan 24, 2018 at 09:17:55AM +0100, Christoffer Dall wrote:
>> >> > Hi Shih-Wei,
>> >> >
>> >> > On Fri, Jan 19, 2018 at 04:57:55PM -0500, Shih-Wei Li wrote:
>> >> > > Thanks for the feedback about the mistakes in math and some issues in
>> >> > > naming, print msg, and coding style. I'll be careful and try to avoid
>> >> > > the same problems the next patch set. Sorry for all of the confusion.
>> >> > >
>> >> > > So we now skip the test when "sample == 0" happens over 1000 times.
>> >> > > This is only due to the case that "cost is < 1/cntfrq" since it's not
>> >> > > possible for the tick to overflow for that many times. Did I miss
>> >> > > something here? I do agree that we should output better msgs to tell
>> >> > > users that the cost of a certain test is constantly smaller than a
>> >> > > tick.
>> >> > >
>> >> >
>> >> > I think for things like vmexit counts, it's very likely that all the
>> >> > samples will result in 0 ticks on many systems (fast CPU and slow arch
>> >> > counter; the architecture doesn't give us any guarantees here).  For
>> >> > example, a system with a 2 GHz CPU and a 1 MHz counter will give you a
>> >> > granularity of 2000 cycles for each counter tick, which is not that
>> >> > useful for low-level tuning of KVM.
>> >> >
>> >> > So what I thought we were going to do was:
>> >> >
>> >> > main_test_function()
>> >> > {
>> >> >     long ntimes = NTIMES;
>> >> >     long cost, total_cost;
>> >> >
>> >> >     cnt1 = read_cnt();
>> >> >     do {
>> >> >             run_test();
>> >> >     } while(ntimes--);
>> >> >     cnt2 = read_cnt();
>> >> >
>> >> >     if (verify_sane_counter(cnt1, cnt2))
>> >> >             return;
>> >> >
>> >> >     total_cost = to_nanoseconds(cnt2 - cnt1);
>> >> >     cost = total_cost / NTIMES;
>> >> >     printf("testname: %l (%l)\n", cost, total_cost);
>> >> > }
>> >> >
>> >> > And in that way amortize the potential lack of precision over all the
>> >> > iterations.  Did I miss some prior discussion about why that was a bad
>> >> > idea?
>> >>
>> >> Not that I know of, but I missed the above proposal, which I completely
>> >> agree with.
>> >>
>> >> >
>> >> > It would also be possible to have two functions, one that does the above
>> >> > and one that does a per-run measurement, in case the user wants to know
>> >> > min/max/stddev and is running on a system with sufficient precision.
>> >> > The method could be chosen via an argument.
>> >>
>> >> We should be able to just make NTIMES variable, setting it to one when a
>> >> per-run measurement is desired, right?
>> >
>> > Not quite.  I think you'll get different results from starting up the
>> > whole test suite, doing a single measurement, and then shutting
>> > everything down again, versus a single startup, and then running
>> > everything for 100,000 times, measurig each time, and reporting
>> > min/max/avg/mean/etc.
>> >
>> > We found the latter metrics useful when analyzing things like key/value
>> > store workloads and benchmark results, where individual deviations can
>> > really skew overall results.
>> >
>> >
>> >> Anyway, I'm fine with having set
>> >> to a reasonable value, not variable / no argument, for initial merge.
>> >>
>> >
>> > I'd be absolutely fine with that as well, we can add the additional
>> > functionality in a separate patch, or later on, depending on how much
>> > time Shih-Wei has :)
>> >
>> > Thanks,
>> > -Christoffer
>>

[v3,2/2] arm64: add micro test

Commit Message

Comments

Patch