[kvm-unit-tests] arm64: add micro test

Message ID	1513372539-10180-2-git-send-email-shihwei@cs.columbia.edu (mailing list archive)
State	New, archived
Headers	show Return-Path: <kvm-owner@kernel.org> From: Shih-Wei Li <shihwei@cs.columbia.edu> To: kvm@vger.kernel.org, kvmarm@lists.cs.columbia.edu Cc: christoffer.dall@linaro.org, marc.zyngier@arm.com, drjones@redhat.com, pbonzini@redhat.com, Shih-Wei Li <shihwei@cs.columbia.edu> Subject: [kvm-unit-tests PATCH] arm64: add micro test Date: Fri, 15 Dec 2017 16:15:39 -0500 Message-Id: <1513372539-10180-2-git-send-email-shihwei@cs.columbia.edu> In-Reply-To: <1513372539-10180-1-git-send-email-shihwei@cs.columbia.edu> References: <1513372539-10180-1-git-send-email-shihwei@cs.columbia.edu> Sender: kvm-owner@vger.kernel.org Precedence: bulk

Shih-Wei Li Dec. 15, 2017, 9:15 p.m. UTC

Here we provide the support for measuring various micro level
operations on arm64. We iterate each of the tests for millions of
times and output their average, minimum and maximum cost in timer
counts. Instruction barriers are used before and after taking
timestamps to avoid out-of-order execution or pipelining from
skewing our measurements.

The tests we currently support and measure are mostly
straightforward by the function names and the respective comments.
For IPI test, we measure the cost of sending IPI from a source
VCPU to a target VCPU, until the target VCPU receives the IPI.

Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
---
 arm/Makefile.common |   1 +
 arm/micro-test.c    | 289 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 arm/unittests.cfg   |   6 ++
 3 files changed, 296 insertions(+)
 create mode 100644 arm/micro-test.c

Yury Norov Dec. 18, 2017, 5:31 p.m. UTC | #1

Hi Shih-Wei,

(CC Cavium folks)

Thank you for the test! I was going to write something like
this, and instead found your patch submitted - good luck to me. 

Soon I'll test my hardware with it and share you results.
Comments inline.

Yury

On Fri, Dec 15, 2017 at 04:15:39PM -0500, Shih-Wei Li wrote:
> Here we provide the support for measuring various micro level
> operations on arm64. We iterate each of the tests for millions of
> times and output their average, minimum and maximum cost in timer
> counts. Instruction barriers are used before and after taking
> timestamps to avoid out-of-order execution or pipelining from
> skewing our measurements.
> 
> The tests we currently support and measure are mostly
> straightforward by the function names and the respective comments.
> For IPI test, we measure the cost of sending IPI from a source
> VCPU to a target VCPU, until the target VCPU receives the IPI.
> 
> Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
> ---
>  arm/Makefile.common |   1 +
>  arm/micro-test.c    | 289 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  arm/unittests.cfg   |   6 ++
>  3 files changed, 296 insertions(+)
>  create mode 100644 arm/micro-test.c
> 
> diff --git a/arm/Makefile.common b/arm/Makefile.common
> index 0a039cf..c7d5c27 100644
> --- a/arm/Makefile.common
> +++ b/arm/Makefile.common
> @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
>  tests-common += $(TEST_DIR)/gic.flat
>  tests-common += $(TEST_DIR)/psci.flat
>  tests-common += $(TEST_DIR)/sieve.flat
> +tests-common += $(TEST_DIR)/micro-test.flat
>  
>  tests-all = $(tests-common) $(tests)
>  all: directories $(tests-all)
> diff --git a/arm/micro-test.c b/arm/micro-test.c
> new file mode 100644
> index 0000000..7df2272
> --- /dev/null
> +++ b/arm/micro-test.c
> @@ -0,0 +1,289 @@
> +#include <util.h>
> +#include <asm/gic.h>
> +
> +static volatile bool second_cpu_up;
> +static volatile bool first_cpu_ack;
> +static volatile bool ipi_acked;
> +static volatile bool ipi_received;
> +static volatile bool ipi_ready;
> +#define IPI_IRQ		1
> +
> +#define TIMEOUT (1U << 28)
> +
> +#define ARR_SIZE(_x) ((int)(sizeof(_x) / sizeof(_x[0])))
> +#define for_each_test(_iter, _tests, _tmp) \
> +	for (_tmp = 0, _iter = _tests; \
> +			_tmp < ARR_SIZE(_tests); \
> +			_tmp++, _iter++)
> +
> +#define CYCLE_COUNT(c1, c2) \
> +	(((c1) > (c2) || ((c1) == (c2))) ? 0 : (c2) - (c1))

Is my understanding correct that this is overflow protection?
c1 and c2 are 64-bit values. To overflow them you need 58 years
at 1G CPU freq.

And anyway,
#define CYCLE_COUNT(c1, c2)  ((c1) >= (c2) ? 0 : (c2) - (c1))

> +
> +#define IPI_DEBUG 0
> +

If you assume to pass IPI_DEBUG as GCC parameter, like -DIPI_DEBUG=1,
it should be 

#ifndef IPI_DEBUG
#define IPI_DEBUG 0
#endif

But I wonder if there's already existing switch for debug info?

> +#if IPI_DEBUG == 1
> +#define debug(fmt, ...) \
> +	printf("[cpu %d]: " fmt, smp_processor_id(),  ## __VA_ARGS__)
> +#else
> +#define debug(fmt, ...) {}
> +#endif

There are some fancy printing functions in lib/report.c. I didn't
look deeply into thet, but at first glance you can use it here.

> +
> +static uint64_t read_cc(void)
> +{
> +	uint64_t cc;
> +	asm volatile(
> +		"isb\n"
> +		"mrs %0, CNTPCT_EL0\n"
> +		"isb\n"
> +		: [reg] "=r" (cc)
> +		::
> +	);
> +	return cc;
> +}

Flushing pipeline before mrs is enough. Also, you can use read_sysreg()
instead of inline assembly. Refer arch_counter_get_cntpct() in kernel:

151 static inline u64 arch_counter_get_cntpct(void)
152 {
153         isb();   
154         return arch_timer_reg_read_stable(cntpct_el0);
155 }

> +
> +static void ipi_irq_handler(struct pt_regs *regs __unused)
> +{
> +	u32 ack;
> +	ipi_ready = false;
> +	ipi_received = true;
> +	ack = gic_read_iar();
> +	ipi_acked = true;
> +	gic_write_eoir(ack);
> +	ipi_ready = true;
> +}
> +
> +static void ipi_test_secondary_entry(void)
> +{
> +	unsigned int timeout = TIMEOUT;
> +
> +	debug("secondary core up\n");
> +
> +	enum vector v = EL1H_IRQ;
> +	install_irq_handler(v, ipi_irq_handler);
> +
> +	gic_enable_defaults();
> +
> +	second_cpu_up = true;
> +
> +	debug("secondary initialized vgic\n");
> +
> +	while (!first_cpu_ack && timeout--);
> +	if (!first_cpu_ack) {
> +		debug("ipi_test: First CPU did not ack wake-up\n");
> +		exit(1);
> +	}

Nit: here and later, timeout is not actually timeout, but something like
counter of attempts. Maybe it worth to use read_cc() here if you need
time intervals? 

> +
> +	debug("detected first cpu ack\n");
> +
> +	local_irq_enable(); /* Enter small wait-loop */
> +	ipi_ready = true;
> +	while (true);
> +}
> +
> +static int test_init(void)
> +{
> +	int ret;
> +	unsigned int timeout = TIMEOUT;
> +
> +	ret = gic_init();
> +	if (!ret) {
> +		debug("No supported gic present, skipping tests...\n");
> +		goto out;
> +	}
> +
> +	ipi_ready = false;
> +
> +	gic_enable_defaults();
> +
> +	debug("starting second CPU\n");
> +	smp_boot_secondary(1, ipi_test_secondary_entry);
> +
> +	while (!second_cpu_up && timeout--); /* Wait for second CPU! */
> +
> +	if (!second_cpu_up) {
> +		debug("ipi_test: timeout waiting for secondary CPU\n");
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	debug("detected secondary core up\n");
> +
> +	first_cpu_ack = true;
> +
> +	printf("Timer Frequency %d Hz (Output in timer count)\n", get_cntfrq());
> +
> +out:

Nit: it's simpler to return error at place and drop the 'out' label,
like this:
if (!gic_init()) {
        debug(...);
        return 0;
}

> +	return ret;
> +}
> +
> +static unsigned long ipi_test(void)
> +{
> +	unsigned int timeout = TIMEOUT;
> +	unsigned long c1, c2;

read_cc() returns uint64_t, so c1 and c2 should also be fixed-size
variables. Maybe one day we'll port test to arm32 where unsigned long is
32-bit...

> +
> +	while (!ipi_ready && timeout--);
> +	if (!ipi_ready) {
> +		debug("ipi_test: second core not ready for IPIs\n");
> +		exit(1);
> +	}
> +
> +	ipi_received = false;
> +
> +	c1 = read_cc();
> +
> +	gic_ipi_send_single(IPI_IRQ, 1);
> +
> +	timeout = TIMEOUT;
> +	while (!ipi_received && timeout--);
> +	if (!ipi_received) {
> +		debug("ipi_test: secondary core never received ipi\n");
> +		exit(1);
> +	}
> +
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +

Nit: odd line.

> +static unsigned long hvc_test(void)
> +{
> +	unsigned long c1, c2;
> +
> +	c1 = read_cc();
> +	asm volatile("mov w0, #0x4b000000; hvc #0");
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}

This will be broken if compiler decide to assign r0 for variable c1,
or I miss something?

> +
> +static void __noop(void)
> +{
> +}

This would be completely optimized out. Is it for demonstrative
purpose?

> +
> +static unsigned long noop_guest(void)
> +{
> +	unsigned long c1, c2;
> +
> +	c1 = read_cc();
> +	__noop();
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +static unsigned long mmio_read_user(void)
> +{
> +	unsigned long c1, c2;
> +	void *mmio_read_user_addr = (void*) 0x0a000008;
> +
> +	/* Measure MMIO exit to QEMU in userspace */
> +	c1 = read_cc();
> +	readl(mmio_read_user_addr);
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +static unsigned long mmio_read_vgic(void)
> +{
> +	unsigned long c1, c2;
> +	int v = gic_version();
> +	void *vgic_dist_addr = NULL;
> +
> +	if (v == 2)
> +		vgic_dist_addr = gicv2_dist_base();
> +	else if (v == 3)
> +		vgic_dist_addr = gicv3_dist_base();
> +
> +	/* Measure MMIO exit to host kernel */
> +	c1 = read_cc();
> +	readl(vgic_dist_addr + 0x8); /* Read GICD_IIDR */
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}

So if gic version is not 2 or 3, it will readl() at broken
address. I think it's better to return error and print relevant
message to system log.

> +
> +static unsigned long eoi_test(void)
> +{
> +	unsigned long c1, c2;
> +	int v = gic_version();
> +	void (*write_eoir)(u32 irqstat) = NULL;
> +
> +	u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
> +
> +	if (v == 2)
> +		write_eoir = gicv2_write_eoir;
> +	else if (v == 3)
> +		write_eoir = gicv3_write_eoir;
> +
> +	c1 = read_cc();
> +	write_eoir(val);
> +	c2 = read_cc();
> +
> +	return CYCLE_COUNT(c1, c2);
> +}

Similar here.

> +
> +struct exit_test {
> +	const char *name;
> +	unsigned long (*test_fn)(void);
> +	bool run;
> +};
> +
> +static struct exit_test available_tests[] = {
> +	{"hvc",                hvc_test,           true},
> +	{"noop_guest",         noop_guest,         true},
> +	{"mmio_read_user",     mmio_read_user,     true},
> +	{"mmio_read_vgic",     mmio_read_vgic,     true},
> +	{"eoi",                eoi_test,           true},
> +	{"ipi",                ipi_test,           true},
> +};
> +
> +static void loop_test(struct exit_test *test)
> +{
> +	unsigned long i, iterations = 32;
> +	unsigned long sample, cycles;
> +	unsigned long long min = 0, max = 0;
> +	const unsigned long long goal = (1ULL << 29);
> +
> +	do {
> +		iterations *= 2;
> +		cycles = 0;
> +		for (i = 0; i < iterations; i++) {
> +			sample = test->test_fn();
> +			if (sample == 0) {
> +				/*
> +				 * If something went wrong or we had an
> +				 * overflow, don't count that sample.
> +				 */
> +				iterations--;
> +				i--;
> +				debug("cycle count overflow: %lu\n", sample);
> +				continue;
> +			}
> +			cycles += sample;
> +			if (min == 0 || min > sample)
> +				min = sample;
> +			if (max < sample)
> +				max = sample;
> +		}
> +	} while (cycles < goal);
> +	printf("%s:\t avg %lu\t min %llu\t max %llu\n",
> +		test->name, cycles / (iterations), min, max);
> +}
> +
> +void kvm_unit_test(void)
> +{
> +	int i=0;
> +	struct exit_test *test;
> +	for_each_test(test, available_tests, i) {
> +		if (!test->run)
> +			continue;
> +		loop_test(test);
> +	}
> +
> +	return;
> +}
> +
> +int main(int argc, char **argv)
> +{
> +	if (!test_init())
> +		exit(1);
> +	kvm_unit_test();
> +	return 0;
> +}
> diff --git a/arm/unittests.cfg b/arm/unittests.cfg
> index 44b98cf..1d0c4ca 100644
> --- a/arm/unittests.cfg
> +++ b/arm/unittests.cfg
> @@ -116,3 +116,9 @@ file = timer.flat
>  groups = timer
>  timeout = 2s
>  arch = arm64
> +
> +# Exit tests
> +[micro-test]
> +file = micro-test.flat
> +smp = 2
> +groups = micro-test
> -- 
> 1.9.1
>

Andrew Jones Dec. 18, 2017, 7:10 p.m. UTC | #2

On Fri, Dec 15, 2017 at 04:15:39PM -0500, Shih-Wei Li wrote:
> Here we provide the support for measuring various micro level
> operations on arm64. We iterate each of the tests for millions of
> times and output their average, minimum and maximum cost in timer
> counts. Instruction barriers are used before and after taking
> timestamps to avoid out-of-order execution or pipelining from
> skewing our measurements.
> 
> The tests we currently support and measure are mostly
> straightforward by the function names and the respective comments.
> For IPI test, we measure the cost of sending IPI from a source
> VCPU to a target VCPU, until the target VCPU receives the IPI.
> 
> Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
> ---
>  arm/Makefile.common |   1 +
>  arm/micro-test.c    | 289 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  arm/unittests.cfg   |   6 ++
>  3 files changed, 296 insertions(+)
>  create mode 100644 arm/micro-test.c
> 
> diff --git a/arm/Makefile.common b/arm/Makefile.common
> index 0a039cf..c7d5c27 100644
> --- a/arm/Makefile.common
> +++ b/arm/Makefile.common
> @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
>  tests-common += $(TEST_DIR)/gic.flat
>  tests-common += $(TEST_DIR)/psci.flat
>  tests-common += $(TEST_DIR)/sieve.flat
> +tests-common += $(TEST_DIR)/micro-test.flat
>  
>  tests-all = $(tests-common) $(tests)
>  all: directories $(tests-all)
> diff --git a/arm/micro-test.c b/arm/micro-test.c
> new file mode 100644
> index 0000000..7df2272
> --- /dev/null
> +++ b/arm/micro-test.c
> @@ -0,0 +1,289 @@

Please add the copyright and license header like we have in other files.

> +#include <util.h>
> +#include <asm/gic.h>
> +
> +static volatile bool second_cpu_up;
> +static volatile bool first_cpu_ack;
> +static volatile bool ipi_acked;
> +static volatile bool ipi_received;
> +static volatile bool ipi_ready;
> +#define IPI_IRQ		1
> +
> +#define TIMEOUT (1U << 28)
> +
> +#define ARR_SIZE(_x) ((int)(sizeof(_x) / sizeof(_x[0])))

We have this in libcflat already, ARRAY_SIZE()

> +#define for_each_test(_iter, _tests, _tmp) \
> +	for (_tmp = 0, _iter = _tests; \
> +			_tmp < ARR_SIZE(_tests); \
> +			_tmp++, _iter++)

One time used macro is probably not necessary and the tmp
var wouldn't be necessary if you make the last entry of
available_tests a null entry.

> +
> +#define CYCLE_COUNT(c1, c2) \
> +	(((c1) > (c2) || ((c1) == (c2))) ? 0 : (c2) - (c1))
> +
> +#define IPI_DEBUG 0
> +
> +#if IPI_DEBUG == 1
> +#define debug(fmt, ...) \
> +	printf("[cpu %d]: " fmt, smp_processor_id(),  ## __VA_ARGS__)
> +#else
> +#define debug(fmt, ...) {}
> +#endif

We don't generally turn off print statements in kvm-unit-tests.
If the statement is really a temporary debug statement, then
maybe just remove it before posting the patch. If it's even just a
little helpful, then just leave it in and always print it. To promote
messages to a higher level of importance than printf use report_*
functions.

> +
> +static uint64_t read_cc(void)
> +{
> +	uint64_t cc;
> +	asm volatile(
> +		"isb\n"
> +		"mrs %0, CNTPCT_EL0\n"
> +		"isb\n"
> +		: [reg] "=r" (cc)
> +		::
> +	);
> +	return cc;
> +}
> +
> +static void ipi_irq_handler(struct pt_regs *regs __unused)
> +{
> +	u32 ack;
> +	ipi_ready = false;
> +	ipi_received = true;
> +	ack = gic_read_iar();
> +	ipi_acked = true;
> +	gic_write_eoir(ack);
> +	ipi_ready = true;
> +}
> +
> +static void ipi_test_secondary_entry(void)
> +{
> +	unsigned int timeout = TIMEOUT;
> +
> +	debug("secondary core up\n");
> +
> +	enum vector v = EL1H_IRQ;
> +	install_irq_handler(v, ipi_irq_handler);
> +
> +	gic_enable_defaults();
> +
> +	second_cpu_up = true;
> +
> +	debug("secondary initialized vgic\n");
> +
> +	while (!first_cpu_ack && timeout--);

cpu_relax() here will help get that first-cpu ack on TCG faster. That
said, this test doesn't make sense on TCG anyway, other than debugging
it. So you could just add 'accel = kvm' to it's unittests.cfg parameter
list.

I just tried on TCG now. It doesn't run. It gets

Timer Frequency 62500000 Hz (Output in timer count)
Unhandled exception ec=0 (UNKNOWN)
Vector: 4 (el1h_sync)
ESR_EL1:         02000000, ec=0 (UNKNOWN)
FAR_EL1: 0000000000000000 (not valid)
Exception frame registers:
pc : [<0000000040080088>] lr : [<00000000400803e8>] pstate: 800003c5
sp : 00000000400aff90
x29: 0000000000000000 x28: 0000000000000000 
x27: 0000000040090000 x26: 0000000040090c60 
x25: 0000000040090000 x24: 000000001fffffff 
x23: 0000000000000000 x22: 0000000000000000 
x21: 0000000000000040 x20: 0000000000000000 
x19: 0000000000000000 x18: 00000000400b0000 
x17: 0000000000000000 x16: 0000000000000000 
x15: 00000000400afe8c x14: 00000000400b0000 
x13: 00000000400afecc x12: 0000000000001680 
x11: 0000000000000000 x10: 6666666666666667 
x9 : 0000000000000030 x8 : 0000000000000030 
x7 : 00000000400af670 x6 : 00000000400af673 
x5 : 00000000400af678 x4 : 00000000000007b7 
x3 : 00000000400af6ec x2 : 0000000040090000 
x1 : 000000000015909e x0 : 000000004b000000

You don't need to get it to work on TCG if you add the 'accel = kvm'
parameter, but it's sometimes indicative of a bug in the unit test
when it doesn't run, so you might want to take a look.

Also need to do s/timeout/tries/ or something, as it's not time related.

> +	if (!first_cpu_ack) {
> +		debug("ipi_test: First CPU did not ack wake-up\n");

I think you should drop the debug() macro completely, but in here
it should certainly be changed. Erroring out shouldn't be silent,
as it would be when out debug turned on.

> +		exit(1);
> +	}
> +
> +	debug("detected first cpu ack\n");
> +
> +	local_irq_enable(); /* Enter small wait-loop */
> +	ipi_ready = true;
> +	while (true);
> +}
> +
> +static int test_init(void)
> +{
> +	int ret;
> +	unsigned int timeout = TIMEOUT;
> +
> +	ret = gic_init();
> +	if (!ret) {
> +		debug("No supported gic present, skipping tests...\n");
> +		goto out;
> +	}
> +
> +	ipi_ready = false;
> +
> +	gic_enable_defaults();
> +
> +	debug("starting second CPU\n");
> +	smp_boot_secondary(1, ipi_test_secondary_entry);
> +
> +	while (!second_cpu_up && timeout--); /* Wait for second CPU! */
> +
> +	if (!second_cpu_up) {
> +		debug("ipi_test: timeout waiting for secondary CPU\n");
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	debug("detected secondary core up\n");
> +
> +	first_cpu_ack = true;

I think you should be able to avoid most of this manual cpu
synchronization by using the on_cpu() call, which is the
recommended way to start secondaries anyway.

> +
> +	printf("Timer Frequency %d Hz (Output in timer count)\n", get_cntfrq());
> +
> +out:
> +	return ret;
> +}
> +
> +static unsigned long ipi_test(void)
> +{
> +	unsigned int timeout = TIMEOUT;
> +	unsigned long c1, c2;
> +
> +	while (!ipi_ready && timeout--);
> +	if (!ipi_ready) {
> +		debug("ipi_test: second core not ready for IPIs\n");
> +		exit(1);
> +	}
> +
> +	ipi_received = false;
> +
> +	c1 = read_cc();
> +
> +	gic_ipi_send_single(IPI_IRQ, 1);
> +
> +	timeout = TIMEOUT;
> +	while (!ipi_received && timeout--);
> +	if (!ipi_received) {
> +		debug("ipi_test: secondary core never received ipi\n");
> +		exit(1);
> +	}
> +
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +
> +static unsigned long hvc_test(void)
> +{
> +	unsigned long c1, c2;
> +
> +	c1 = read_cc();
> +	asm volatile("mov w0, #0x4b000000; hvc #0");
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +static void __noop(void)
> +{
> +}
> +
> +static unsigned long noop_guest(void)
> +{
> +	unsigned long c1, c2;
> +
> +	c1 = read_cc();
> +	__noop();
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +static unsigned long mmio_read_user(void)
> +{
> +	unsigned long c1, c2;
> +	void *mmio_read_user_addr = (void*) 0x0a000008;

This is a virtio-mmio transport device-id address. It's harmless to
read it, as long as we don't change it. I've had plans to provide
mmio addresses with different read, write permissions and sizes through
extending a test-dev for quite a while. I should do that. For this patch
I guess this is fine, but please comment it with a big FIXME or TODO to
make sure we change it when we can someday.

> +
> +	/* Measure MMIO exit to QEMU in userspace */
> +	c1 = read_cc();
> +	readl(mmio_read_user_addr);
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +static unsigned long mmio_read_vgic(void)
> +{
> +	unsigned long c1, c2;
> +	int v = gic_version();
> +	void *vgic_dist_addr = NULL;
> +
> +	if (v == 2)
> +		vgic_dist_addr = gicv2_dist_base();
> +	else if (v == 3)
> +		vgic_dist_addr = gicv3_dist_base();
> +
> +	/* Measure MMIO exit to host kernel */
> +	c1 = read_cc();
> +	readl(vgic_dist_addr + 0x8); /* Read GICD_IIDR */

You can add GICD_IIDR to lib/arm/asm/gic.h

> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +static unsigned long eoi_test(void)
> +{
> +	unsigned long c1, c2;
> +	int v = gic_version();
> +	void (*write_eoir)(u32 irqstat) = NULL;
> +
> +	u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
> +
> +	if (v == 2)
> +		write_eoir = gicv2_write_eoir;
> +	else if (v == 3)
> +		write_eoir = gicv3_write_eoir;

You don't need this, we have gic_write_eoir(), which you used above.

> +
> +	c1 = read_cc();
> +	write_eoir(val);
> +	c2 = read_cc();
> +
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +struct exit_test {
> +	const char *name;
> +	unsigned long (*test_fn)(void);
> +	bool run;
> +};
> +
> +static struct exit_test available_tests[] = {
> +	{"hvc",                hvc_test,           true},
> +	{"noop_guest",         noop_guest,         true},
> +	{"mmio_read_user",     mmio_read_user,     true},
> +	{"mmio_read_vgic",     mmio_read_vgic,     true},
> +	{"eoi",                eoi_test,           true},
> +	{"ipi",                ipi_test,           true},
> +};
> +
> +static void loop_test(struct exit_test *test)
> +{
> +	unsigned long i, iterations = 32;
> +	unsigned long sample, cycles;
> +	unsigned long long min = 0, max = 0;
> +	const unsigned long long goal = (1ULL << 29);
> +
> +	do {
> +		iterations *= 2;
> +		cycles = 0;
> +		for (i = 0; i < iterations; i++) {
> +			sample = test->test_fn();
> +			if (sample == 0) {
> +				/*
> +				 * If something went wrong or we had an
> +				 * overflow, don't count that sample.
> +				 */
> +				iterations--;
> +				i--;

I'd prefer incrementing a different counter after the sample==0 test /
continue, than to decrement both i and iterations.

> +				debug("cycle count overflow: %lu\n", sample);
> +				continue;
> +			}
> +			cycles += sample;
> +			if (min == 0 || min > sample)
> +				min = sample;
> +			if (max < sample)
> +				max = sample;
> +		}
> +	} while (cycles < goal);
> +	printf("%s:\t avg %lu\t min %llu\t max %llu\n",
> +		test->name, cycles / (iterations), min, max);

No need for () around iterations.

> +}
> +
> +void kvm_unit_test(void)
> +{
> +	int i=0;
> +	struct exit_test *test;
> +	for_each_test(test, available_tests, i) {
> +		if (!test->run)
> +			continue;
> +		loop_test(test);
> +	}
> +
> +	return;

pointless return

> +}

nit: no need for the above function, just inline it in main().

> +
> +int main(int argc, char **argv)
> +{
> +	if (!test_init())
> +		exit(1);

I think this and all exit(1)'s in this unit test could/should be
replaced with asserts.

> +	kvm_unit_test();
> +	return 0;
> +}
> diff --git a/arm/unittests.cfg b/arm/unittests.cfg
> index 44b98cf..1d0c4ca 100644
> --- a/arm/unittests.cfg
> +++ b/arm/unittests.cfg
> @@ -116,3 +116,9 @@ file = timer.flat
>  groups = timer
>  timeout = 2s
>  arch = arm64
> +
> +# Exit tests
> +[micro-test]
> +file = micro-test.flat
> +smp = 2
> +groups = micro-test
> -- 
> 1.9.1
> 
>

Thanks again for contributing to kvm-unit-tests. I look forward to v2.

drew

Shih-Wei Li Dec. 18, 2017, 9:32 p.m. UTC | #3

On Mon, Dec 18, 2017 at 12:31 PM, Yury Norov <ynorov@caviumnetworks.com> wrote:
> Hi Shih-Wei,
>
> (CC Cavium folks)
>
> Thank you for the test! I was going to write something like
> this, and instead found your patch submitted - good luck to me.
>
> Soon I'll test my hardware with it and share you results.
> Comments inline.
>
> Yury
>
> On Fri, Dec 15, 2017 at 04:15:39PM -0500, Shih-Wei Li wrote:
>> Here we provide the support for measuring various micro level
>> operations on arm64. We iterate each of the tests for millions of
>> times and output their average, minimum and maximum cost in timer
>> counts. Instruction barriers are used before and after taking
>> timestamps to avoid out-of-order execution or pipelining from
>> skewing our measurements.
>>
>> The tests we currently support and measure are mostly
>> straightforward by the function names and the respective comments.
>> For IPI test, we measure the cost of sending IPI from a source
>> VCPU to a target VCPU, until the target VCPU receives the IPI.
>>
>> Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
>> ---
>>  arm/Makefile.common |   1 +
>>  arm/micro-test.c    | 289 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  arm/unittests.cfg   |   6 ++
>>  3 files changed, 296 insertions(+)
>>  create mode 100644 arm/micro-test.c
>>
>> diff --git a/arm/Makefile.common b/arm/Makefile.common
>> index 0a039cf..c7d5c27 100644
>> --- a/arm/Makefile.common
>> +++ b/arm/Makefile.common
>> @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
>>  tests-common += $(TEST_DIR)/gic.flat
>>  tests-common += $(TEST_DIR)/psci.flat
>>  tests-common += $(TEST_DIR)/sieve.flat
>> +tests-common += $(TEST_DIR)/micro-test.flat
>>
>>  tests-all = $(tests-common) $(tests)
>>  all: directories $(tests-all)
>> diff --git a/arm/micro-test.c b/arm/micro-test.c
>> new file mode 100644
>> index 0000000..7df2272
>> --- /dev/null
>> +++ b/arm/micro-test.c
>> @@ -0,0 +1,289 @@
>> +#include <util.h>
>> +#include <asm/gic.h>
>> +
>> +static volatile bool second_cpu_up;
>> +static volatile bool first_cpu_ack;
>> +static volatile bool ipi_acked;
>> +static volatile bool ipi_received;
>> +static volatile bool ipi_ready;
>> +#define IPI_IRQ              1
>> +
>> +#define TIMEOUT (1U << 28)
>> +
>> +#define ARR_SIZE(_x) ((int)(sizeof(_x) / sizeof(_x[0])))
>> +#define for_each_test(_iter, _tests, _tmp) \
>> +     for (_tmp = 0, _iter = _tests; \
>> +                     _tmp < ARR_SIZE(_tests); \
>> +                     _tmp++, _iter++)
>> +
>> +#define CYCLE_COUNT(c1, c2) \
>> +     (((c1) > (c2) || ((c1) == (c2))) ? 0 : (c2) - (c1))
>
> Is my understanding correct that this is overflow protection?
> c1 and c2 are 64-bit values. To overflow them you need 58 years
> at 1G CPU freq.
>
> And anyway,
> #define CYCLE_COUNT(c1, c2)  ((c1) >= (c2) ? 0 : (c2) - (c1))

Yes, it's for overflow protection.

>
>> +
>> +#define IPI_DEBUG 0
>> +
>
> If you assume to pass IPI_DEBUG as GCC parameter, like -DIPI_DEBUG=1,
> it should be
>
> #ifndef IPI_DEBUG
> #define IPI_DEBUG 0
> #endif
>
> But I wonder if there's already existing switch for debug info?
>
>> +#if IPI_DEBUG == 1
>> +#define debug(fmt, ...) \
>> +     printf("[cpu %d]: " fmt, smp_processor_id(),  ## __VA_ARGS__)
>> +#else
>> +#define debug(fmt, ...) {}
>> +#endif
>
> There are some fancy printing functions in lib/report.c. I didn't
> look deeply into thet, but at first glance you can use it here.
>
>> +
>> +static uint64_t read_cc(void)
>> +{
>> +     uint64_t cc;
>> +     asm volatile(
>> +             "isb\n"
>> +             "mrs %0, CNTPCT_EL0\n"
>> +             "isb\n"
>> +             : [reg] "=r" (cc)
>> +             ::
>> +     );
>> +     return cc;
>> +}
>
> Flushing pipeline before mrs is enough. Also, you can use read_sysreg()
> instead of inline assembly. Refer arch_counter_get_cntpct() in kernel:
>
> 151 static inline u64 arch_counter_get_cntpct(void)
> 152 {
> 153         isb();
> 154         return arch_timer_reg_read_stable(cntpct_el0);
> 155 }
>
>> +
>> +static void ipi_irq_handler(struct pt_regs *regs __unused)
>> +{
>> +     u32 ack;
>> +     ipi_ready = false;
>> +     ipi_received = true;
>> +     ack = gic_read_iar();
>> +     ipi_acked = true;
>> +     gic_write_eoir(ack);
>> +     ipi_ready = true;
>> +}
>> +
>> +static void ipi_test_secondary_entry(void)
>> +{
>> +     unsigned int timeout = TIMEOUT;
>> +
>> +     debug("secondary core up\n");
>> +
>> +     enum vector v = EL1H_IRQ;
>> +     install_irq_handler(v, ipi_irq_handler);
>> +
>> +     gic_enable_defaults();
>> +
>> +     second_cpu_up = true;
>> +
>> +     debug("secondary initialized vgic\n");
>> +
>> +     while (!first_cpu_ack && timeout--);
>> +     if (!first_cpu_ack) {
>> +             debug("ipi_test: First CPU did not ack wake-up\n");
>> +             exit(1);
>> +     }
>
> Nit: here and later, timeout is not actually timeout, but something like
> counter of attempts. Maybe it worth to use read_cc() here if you need
> time intervals?
>
>> +
>> +     debug("detected first cpu ack\n");
>> +
>> +     local_irq_enable(); /* Enter small wait-loop */
>> +     ipi_ready = true;
>> +     while (true);
>> +}
>> +
>> +static int test_init(void)
>> +{
>> +     int ret;
>> +     unsigned int timeout = TIMEOUT;
>> +
>> +     ret = gic_init();
>> +     if (!ret) {
>> +             debug("No supported gic present, skipping tests...\n");
>> +             goto out;
>> +     }
>> +
>> +     ipi_ready = false;
>> +
>> +     gic_enable_defaults();
>> +
>> +     debug("starting second CPU\n");
>> +     smp_boot_secondary(1, ipi_test_secondary_entry);
>> +
>> +     while (!second_cpu_up && timeout--); /* Wait for second CPU! */
>> +
>> +     if (!second_cpu_up) {
>> +             debug("ipi_test: timeout waiting for secondary CPU\n");
>> +             ret = 0;
>> +             goto out;
>> +     }
>> +
>> +     debug("detected secondary core up\n");
>> +
>> +     first_cpu_ack = true;
>> +
>> +     printf("Timer Frequency %d Hz (Output in timer count)\n", get_cntfrq());
>> +
>> +out:
>
> Nit: it's simpler to return error at place and drop the 'out' label,
> like this:
> if (!gic_init()) {
>         debug(...);
>         return 0;
> }

Yes.

>
>> +     return ret;
>> +}
>> +
>> +static unsigned long ipi_test(void)
>> +{
>> +     unsigned int timeout = TIMEOUT;
>> +     unsigned long c1, c2;
>
> read_cc() returns uint64_t, so c1 and c2 should also be fixed-size
> variables. Maybe one day we'll port test to arm32 where unsigned long is
> 32-bit...

Right, I'll look at it.

>
>> +
>> +     while (!ipi_ready && timeout--);
>> +     if (!ipi_ready) {
>> +             debug("ipi_test: second core not ready for IPIs\n");
>> +             exit(1);
>> +     }
>> +
>> +     ipi_received = false;
>> +
>> +     c1 = read_cc();
>> +
>> +     gic_ipi_send_single(IPI_IRQ, 1);
>> +
>> +     timeout = TIMEOUT;
>> +     while (!ipi_received && timeout--);
>> +     if (!ipi_received) {
>> +             debug("ipi_test: secondary core never received ipi\n");
>> +             exit(1);
>> +     }
>> +
>> +     c2 = read_cc();
>> +     return CYCLE_COUNT(c1, c2);
>> +}
>> +
>> +
>
> Nit: odd line.
>
>> +static unsigned long hvc_test(void)
>> +{
>> +     unsigned long c1, c2;
>> +
>> +     c1 = read_cc();
>> +     asm volatile("mov w0, #0x4b000000; hvc #0");
>> +     c2 = read_cc();
>> +     return CYCLE_COUNT(c1, c2);
>> +}
>
> This will be broken if compiler decide to assign r0 for variable c1,
> or I miss something?
>
>> +
>> +static void __noop(void)
>> +{
>> +}
>
> This would be completely optimized out. Is it for demonstrative
> purpose?

Yes, this is to show the cost using isb().

>
>> +
>> +static unsigned long noop_guest(void)
>> +{
>> +     unsigned long c1, c2;
>> +
>> +     c1 = read_cc();
>> +     __noop();
>> +     c2 = read_cc();
>> +     return CYCLE_COUNT(c1, c2);
>> +}
>> +
>> +static unsigned long mmio_read_user(void)
>> +{
>> +     unsigned long c1, c2;
>> +     void *mmio_read_user_addr = (void*) 0x0a000008;
>> +
>> +     /* Measure MMIO exit to QEMU in userspace */
>> +     c1 = read_cc();
>> +     readl(mmio_read_user_addr);
>> +     c2 = read_cc();
>> +     return CYCLE_COUNT(c1, c2);
>> +}
>> +
>> +static unsigned long mmio_read_vgic(void)
>> +{
>> +     unsigned long c1, c2;
>> +     int v = gic_version();
>> +     void *vgic_dist_addr = NULL;
>> +
>> +     if (v == 2)
>> +             vgic_dist_addr = gicv2_dist_base();
>> +     else if (v == 3)
>> +             vgic_dist_addr = gicv3_dist_base();
>> +
>> +     /* Measure MMIO exit to host kernel */
>> +     c1 = read_cc();
>> +     readl(vgic_dist_addr + 0x8); /* Read GICD_IIDR */
>> +     c2 = read_cc();
>> +     return CYCLE_COUNT(c1, c2);
>> +}
>
> So if gic version is not 2 or 3, it will readl() at broken
> address. I think it's better to return error and print relevant
> message to system log.
>
>> +
>> +static unsigned long eoi_test(void)
>> +{
>> +     unsigned long c1, c2;
>> +     int v = gic_version();
>> +     void (*write_eoir)(u32 irqstat) = NULL;
>> +
>> +     u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
>> +
>> +     if (v == 2)
>> +             write_eoir = gicv2_write_eoir;
>> +     else if (v == 3)
>> +             write_eoir = gicv3_write_eoir;
>> +
>> +     c1 = read_cc();
>> +     write_eoir(val);
>> +     c2 = read_cc();
>> +
>> +     return CYCLE_COUNT(c1, c2);
>> +}
>
> Similar here.
>
>> +
>> +struct exit_test {
>> +     const char *name;
>> +     unsigned long (*test_fn)(void);
>> +     bool run;
>> +};
>> +
>> +static struct exit_test available_tests[] = {
>> +     {"hvc",                hvc_test,           true},
>> +     {"noop_guest",         noop_guest,         true},
>> +     {"mmio_read_user",     mmio_read_user,     true},
>> +     {"mmio_read_vgic",     mmio_read_vgic,     true},
>> +     {"eoi",                eoi_test,           true},
>> +     {"ipi",                ipi_test,           true},
>> +};
>> +
>> +static void loop_test(struct exit_test *test)
>> +{
>> +     unsigned long i, iterations = 32;
>> +     unsigned long sample, cycles;
>> +     unsigned long long min = 0, max = 0;
>> +     const unsigned long long goal = (1ULL << 29);
>> +
>> +     do {
>> +             iterations *= 2;
>> +             cycles = 0;
>> +             for (i = 0; i < iterations; i++) {
>> +                     sample = test->test_fn();
>> +                     if (sample == 0) {
>> +                             /*
>> +                              * If something went wrong or we had an
>> +                              * overflow, don't count that sample.
>> +                              */
>> +                             iterations--;
>> +                             i--;
>> +                             debug("cycle count overflow: %lu\n", sample);
>> +                             continue;
>> +                     }
>> +                     cycles += sample;
>> +                     if (min == 0 || min > sample)
>> +                             min = sample;
>> +                     if (max < sample)
>> +                             max = sample;
>> +             }
>> +     } while (cycles < goal);
>> +     printf("%s:\t avg %lu\t min %llu\t max %llu\n",
>> +             test->name, cycles / (iterations), min, max);
>> +}
>> +
>> +void kvm_unit_test(void)
>> +{
>> +     int i=0;
>> +     struct exit_test *test;
>> +     for_each_test(test, available_tests, i) {
>> +             if (!test->run)
>> +                     continue;
>> +             loop_test(test);
>> +     }
>> +
>> +     return;
>> +}
>> +
>> +int main(int argc, char **argv)
>> +{
>> +     if (!test_init())
>> +             exit(1);
>> +     kvm_unit_test();
>> +     return 0;
>> +}
>> diff --git a/arm/unittests.cfg b/arm/unittests.cfg
>> index 44b98cf..1d0c4ca 100644
>> --- a/arm/unittests.cfg
>> +++ b/arm/unittests.cfg
>> @@ -116,3 +116,9 @@ file = timer.flat
>>  groups = timer
>>  timeout = 2s
>>  arch = arm64
>> +
>> +# Exit tests
>> +[micro-test]
>> +file = micro-test.flat
>> +smp = 2
>> +groups = micro-test
>> --
>> 1.9.1
>>

Shih-Wei Li Dec. 18, 2017, 9:58 p.m. UTC | #4

On Mon, Dec 18, 2017 at 2:10 PM, Andrew Jones <drjones@redhat.com> wrote:
> On Fri, Dec 15, 2017 at 04:15:39PM -0500, Shih-Wei Li wrote:
>> Here we provide the support for measuring various micro level
>> operations on arm64. We iterate each of the tests for millions of
>> times and output their average, minimum and maximum cost in timer
>> counts. Instruction barriers are used before and after taking
>> timestamps to avoid out-of-order execution or pipelining from
>> skewing our measurements.
>>
>> The tests we currently support and measure are mostly
>> straightforward by the function names and the respective comments.
>> For IPI test, we measure the cost of sending IPI from a source
>> VCPU to a target VCPU, until the target VCPU receives the IPI.
>>
>> Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
>> ---
>>  arm/Makefile.common |   1 +
>>  arm/micro-test.c    | 289 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  arm/unittests.cfg   |   6 ++
>>  3 files changed, 296 insertions(+)
>>  create mode 100644 arm/micro-test.c
>>
>> diff --git a/arm/Makefile.common b/arm/Makefile.common
>> index 0a039cf..c7d5c27 100644
>> --- a/arm/Makefile.common
>> +++ b/arm/Makefile.common
>> @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
>>  tests-common += $(TEST_DIR)/gic.flat
>>  tests-common += $(TEST_DIR)/psci.flat
>>  tests-common += $(TEST_DIR)/sieve.flat
>> +tests-common += $(TEST_DIR)/micro-test.flat
>>
>>  tests-all = $(tests-common) $(tests)
>>  all: directories $(tests-all)
>> diff --git a/arm/micro-test.c b/arm/micro-test.c
>> new file mode 100644
>> index 0000000..7df2272
>> --- /dev/null
>> +++ b/arm/micro-test.c
>> @@ -0,0 +1,289 @@
>
> Please add the copyright and license header like we have in other files.
>
>> +#include <util.h>
>> +#include <asm/gic.h>
>> +
>> +static volatile bool second_cpu_up;
>> +static volatile bool first_cpu_ack;
>> +static volatile bool ipi_acked;
>> +static volatile bool ipi_received;
>> +static volatile bool ipi_ready;
>> +#define IPI_IRQ              1
>> +
>> +#define TIMEOUT (1U << 28)
>> +
>> +#define ARR_SIZE(_x) ((int)(sizeof(_x) / sizeof(_x[0])))
>
> We have this in libcflat already, ARRAY_SIZE()
>
>> +#define for_each_test(_iter, _tests, _tmp) \
>> +     for (_tmp = 0, _iter = _tests; \
>> +                     _tmp < ARR_SIZE(_tests); \
>> +                     _tmp++, _iter++)
>
> One time used macro is probably not necessary and the tmp
> var wouldn't be necessary if you make the last entry of
> available_tests a null entry.

You're right. I'll look into this.

>
>> +
>> +#define CYCLE_COUNT(c1, c2) \
>> +     (((c1) > (c2) || ((c1) == (c2))) ? 0 : (c2) - (c1))
>> +
>> +#define IPI_DEBUG 0
>> +
>> +#if IPI_DEBUG == 1
>> +#define debug(fmt, ...) \
>> +     printf("[cpu %d]: " fmt, smp_processor_id(),  ## __VA_ARGS__)
>> +#else
>> +#define debug(fmt, ...) {}
>> +#endif
>
> We don't generally turn off print statements in kvm-unit-tests.
> If the statement is really a temporary debug statement, then
> maybe just remove it before posting the patch. If it's even just a
> little helpful, then just leave it in and always print it. To promote
> messages to a higher level of importance than printf use report_*
> functions.
>

Thanks. I'll look at it.

>> +
>> +static uint64_t read_cc(void)
>> +{
>> +     uint64_t cc;
>> +     asm volatile(
>> +             "isb\n"
>> +             "mrs %0, CNTPCT_EL0\n"
>> +             "isb\n"
>> +             : [reg] "=r" (cc)
>> +             ::
>> +     );
>> +     return cc;
>> +}
>> +
>> +static void ipi_irq_handler(struct pt_regs *regs __unused)
>> +{
>> +     u32 ack;
>> +     ipi_ready = false;
>> +     ipi_received = true;
>> +     ack = gic_read_iar();
>> +     ipi_acked = true;
>> +     gic_write_eoir(ack);
>> +     ipi_ready = true;
>> +}
>> +
>> +static void ipi_test_secondary_entry(void)
>> +{
>> +     unsigned int timeout = TIMEOUT;
>> +
>> +     debug("secondary core up\n");
>> +
>> +     enum vector v = EL1H_IRQ;
>> +     install_irq_handler(v, ipi_irq_handler);
>> +
>> +     gic_enable_defaults();
>> +
>> +     second_cpu_up = true;
>> +
>> +     debug("secondary initialized vgic\n");
>> +
>> +     while (!first_cpu_ack && timeout--);
>
> cpu_relax() here will help get that first-cpu ack on TCG faster. That
> said, this test doesn't make sense on TCG anyway, other than debugging
> it. So you could just add 'accel = kvm' to it's unittests.cfg parameter
> list.

ok, thanks.

>
> I just tried on TCG now. It doesn't run. It gets
>
> Timer Frequency 62500000 Hz (Output in timer count)
> Unhandled exception ec=0 (UNKNOWN)
> Vector: 4 (el1h_sync)
> ESR_EL1:         02000000, ec=0 (UNKNOWN)
> FAR_EL1: 0000000000000000 (not valid)
> Exception frame registers:
> pc : [<0000000040080088>] lr : [<00000000400803e8>] pstate: 800003c5
> sp : 00000000400aff90
> x29: 0000000000000000 x28: 0000000000000000
> x27: 0000000040090000 x26: 0000000040090c60
> x25: 0000000040090000 x24: 000000001fffffff
> x23: 0000000000000000 x22: 0000000000000000
> x21: 0000000000000040 x20: 0000000000000000
> x19: 0000000000000000 x18: 00000000400b0000
> x17: 0000000000000000 x16: 0000000000000000
> x15: 00000000400afe8c x14: 00000000400b0000
> x13: 00000000400afecc x12: 0000000000001680
> x11: 0000000000000000 x10: 6666666666666667
> x9 : 0000000000000030 x8 : 0000000000000030
> x7 : 00000000400af670 x6 : 00000000400af673
> x5 : 00000000400af678 x4 : 00000000000007b7
> x3 : 00000000400af6ec x2 : 0000000040090000
> x1 : 000000000015909e x0 : 000000004b000000
>
> You don't need to get it to work on TCG if you add the 'accel = kvm'
> parameter, but it's sometimes indicative of a bug in the unit test
> when it doesn't run, so you might want to take a look.
>
> Also need to do s/timeout/tries/ or something, as it's not time related.
>
>> +     if (!first_cpu_ack) {
>> +             debug("ipi_test: First CPU did not ack wake-up\n");
>
> I think you should drop the debug() macro completely, but in here
> it should certainly be changed. Erroring out shouldn't be silent,
> as it would be when out debug turned on.
>
>> +             exit(1);
>> +     }
>> +
>> +     debug("detected first cpu ack\n");
>> +
>> +     local_irq_enable(); /* Enter small wait-loop */
>> +     ipi_ready = true;
>> +     while (true);
>> +}
>> +
>> +static int test_init(void)
>> +{
>> +     int ret;
>> +     unsigned int timeout = TIMEOUT;
>> +
>> +     ret = gic_init();
>> +     if (!ret) {
>> +             debug("No supported gic present, skipping tests...\n");
>> +             goto out;
>> +     }
>> +
>> +     ipi_ready = false;
>> +
>> +     gic_enable_defaults();
>> +
>> +     debug("starting second CPU\n");
>> +     smp_boot_secondary(1, ipi_test_secondary_entry);
>> +
>> +     while (!second_cpu_up && timeout--); /* Wait for second CPU! */
>> +
>> +     if (!second_cpu_up) {
>> +             debug("ipi_test: timeout waiting for secondary CPU\n");
>> +             ret = 0;
>> +             goto out;
>> +     }
>> +
>> +     debug("detected secondary core up\n");
>> +
>> +     first_cpu_ack = true;
>
> I think you should be able to avoid most of this manual cpu
> synchronization by using the on_cpu() call, which is the
> recommended way to start secondaries anyway.

I'll take a look.

>
>> +
>> +     printf("Timer Frequency %d Hz (Output in timer count)\n", get_cntfrq());
>> +
>> +out:
>> +     return ret;
>> +}
>> +
>> +static unsigned long ipi_test(void)
>> +{
>> +     unsigned int timeout = TIMEOUT;
>> +     unsigned long c1, c2;
>> +
>> +     while (!ipi_ready && timeout--);
>> +     if (!ipi_ready) {
>> +             debug("ipi_test: second core not ready for IPIs\n");
>> +             exit(1);
>> +     }
>> +
>> +     ipi_received = false;
>> +
>> +     c1 = read_cc();
>> +
>> +     gic_ipi_send_single(IPI_IRQ, 1);
>> +
>> +     timeout = TIMEOUT;
>> +     while (!ipi_received && timeout--);
>> +     if (!ipi_received) {
>> +             debug("ipi_test: secondary core never received ipi\n");
>> +             exit(1);
>> +     }
>> +
>> +     c2 = read_cc();
>> +     return CYCLE_COUNT(c1, c2);
>> +}
>> +
>> +
>> +static unsigned long hvc_test(void)
>> +{
>> +     unsigned long c1, c2;
>> +
>> +     c1 = read_cc();
>> +     asm volatile("mov w0, #0x4b000000; hvc #0");
>> +     c2 = read_cc();
>> +     return CYCLE_COUNT(c1, c2);
>> +}
>> +
>> +static void __noop(void)
>> +{
>> +}
>> +
>> +static unsigned long noop_guest(void)
>> +{
>> +     unsigned long c1, c2;
>> +
>> +     c1 = read_cc();
>> +     __noop();
>> +     c2 = read_cc();
>> +     return CYCLE_COUNT(c1, c2);
>> +}
>> +
>> +static unsigned long mmio_read_user(void)
>> +{
>> +     unsigned long c1, c2;
>> +     void *mmio_read_user_addr = (void*) 0x0a000008;
>
> This is a virtio-mmio transport device-id address. It's harmless to
> read it, as long as we don't change it. I've had plans to provide
> mmio addresses with different read, write permissions and sizes through
> extending a test-dev for quite a while. I should do that. For this patch
> I guess this is fine, but please comment it with a big FIXME or TODO to
> make sure we change it when we can someday.

Thanks for pointing out. I'll take a look.

>
>> +
>> +     /* Measure MMIO exit to QEMU in userspace */
>> +     c1 = read_cc();
>> +     readl(mmio_read_user_addr);
>> +     c2 = read_cc();
>> +     return CYCLE_COUNT(c1, c2);
>> +}
>> +
>> +static unsigned long mmio_read_vgic(void)
>> +{
>> +     unsigned long c1, c2;
>> +     int v = gic_version();
>> +     void *vgic_dist_addr = NULL;
>> +
>> +     if (v == 2)
>> +             vgic_dist_addr = gicv2_dist_base();
>> +     else if (v == 3)
>> +             vgic_dist_addr = gicv3_dist_base();
>> +
>> +     /* Measure MMIO exit to host kernel */
>> +     c1 = read_cc();
>> +     readl(vgic_dist_addr + 0x8); /* Read GICD_IIDR */
>
> You can add GICD_IIDR to lib/arm/asm/gic.h
>
>> +     c2 = read_cc();
>> +     return CYCLE_COUNT(c1, c2);
>> +}
>> +
>> +static unsigned long eoi_test(void)
>> +{
>> +     unsigned long c1, c2;
>> +     int v = gic_version();
>> +     void (*write_eoir)(u32 irqstat) = NULL;
>> +
>> +     u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
>> +
>> +     if (v == 2)
>> +             write_eoir = gicv2_write_eoir;
>> +     else if (v == 3)
>> +             write_eoir = gicv3_write_eoir;
>
> You don't need this, we have gic_write_eoir(), which you used above.

Yes. I was thinking of calling EOI function directly to avoid the assert.

>
>> +
>> +     c1 = read_cc();
>> +     write_eoir(val);
>> +     c2 = read_cc();
>> +
>> +     return CYCLE_COUNT(c1, c2);
>> +}
>> +
>> +struct exit_test {
>> +     const char *name;
>> +     unsigned long (*test_fn)(void);
>> +     bool run;
>> +};
>> +
>> +static struct exit_test available_tests[] = {
>> +     {"hvc",                hvc_test,           true},
>> +     {"noop_guest",         noop_guest,         true},
>> +     {"mmio_read_user",     mmio_read_user,     true},
>> +     {"mmio_read_vgic",     mmio_read_vgic,     true},
>> +     {"eoi",                eoi_test,           true},
>> +     {"ipi",                ipi_test,           true},
>> +};
>> +
>> +static void loop_test(struct exit_test *test)
>> +{
>> +     unsigned long i, iterations = 32;
>> +     unsigned long sample, cycles;
>> +     unsigned long long min = 0, max = 0;
>> +     const unsigned long long goal = (1ULL << 29);
>> +
>> +     do {
>> +             iterations *= 2;
>> +             cycles = 0;
>> +             for (i = 0; i < iterations; i++) {
>> +                     sample = test->test_fn();
>> +                     if (sample == 0) {
>> +                             /*
>> +                              * If something went wrong or we had an
>> +                              * overflow, don't count that sample.
>> +                              */
>> +                             iterations--;
>> +                             i--;
>
> I'd prefer incrementing a different counter after the sample==0 test /
> continue, than to decrement both i and iterations.

I'm not sure about the comment above. Could you please substantiate
how a new counter should work here?

>
>> +                             debug("cycle count overflow: %lu\n", sample);
>> +                             continue;
>> +                     }
>> +                     cycles += sample;
>> +                     if (min == 0 || min > sample)
>> +                             min = sample;
>> +                     if (max < sample)
>> +                             max = sample;
>> +             }
>> +     } while (cycles < goal);
>> +     printf("%s:\t avg %lu\t min %llu\t max %llu\n",
>> +             test->name, cycles / (iterations), min, max);
>
> No need for () around iterations.
>
>> +}
>> +
>> +void kvm_unit_test(void)
>> +{
>> +     int i=0;
>> +     struct exit_test *test;
>> +     for_each_test(test, available_tests, i) {
>> +             if (!test->run)
>> +                     continue;
>> +             loop_test(test);
>> +     }
>> +
>> +     return;
>
> pointless return
>
>> +}
>
> nit: no need for the above function, just inline it in main().

Yes.

>
>> +
>> +int main(int argc, char **argv)
>> +{
>> +     if (!test_init())
>> +             exit(1);
>
> I think this and all exit(1)'s in this unit test could/should be
> replaced with asserts.
>
>> +     kvm_unit_test();
>> +     return 0;
>> +}
>> diff --git a/arm/unittests.cfg b/arm/unittests.cfg
>> index 44b98cf..1d0c4ca 100644
>> --- a/arm/unittests.cfg
>> +++ b/arm/unittests.cfg
>> @@ -116,3 +116,9 @@ file = timer.flat
>>  groups = timer
>>  timeout = 2s
>>  arch = arm64
>> +
>> +# Exit tests
>> +[micro-test]
>> +file = micro-test.flat
>> +smp = 2
>> +groups = micro-test
>> --
>> 1.9.1
>>
>>
>
> Thanks again for contributing to kvm-unit-tests. I look forward to v2.

Sure. Thank you for the feedback as well! I'll send out the patch asap.

Shih-Wei

>
> drew

Kalra, Ashish Dec. 18, 2017, 10:17 p.m. UTC | #5

Hello Shih-Wei,

This "micro" test is really useful for micro-architectural performance 
measurement of KVM/ARM.

But, to get slightly more accurate cycle counts, it will be preferable 
to compute counter read (and isb) overhead and deduct it from the final 
cycle count, something like this :

+static unsigned long counter_read_overhead;

+static void compute_counter_read_overhead(void)
+{
+	unsigned long c1, c2;
+
+	c1 = read_cc();
+	c2 = read_cc();
+	counter_read_overhead = c2 - c1;
+}

+#define CYCLE_COUNT(c1, c2) \
+	(((c1) > (c2) || ((c1) == (c2))) ? 0 : ((c2) - (c1)) - 
counter_read_overhead )
+

static int test_init(void)
{
	int ret;
	unsigned int timeout = TIMEOUT;

+	compute_counter_read_overhead();
..
..

Thanks,
Ashish

On 12/16/2017 2:45 AM, Shih-Wei Li wrote:
> Here we provide the support for measuring various micro level
> operations on arm64. We iterate each of the tests for millions of
> times and output their average, minimum and maximum cost in timer
> counts. Instruction barriers are used before and after taking
> timestamps to avoid out-of-order execution or pipelining from
> skewing our measurements.
> 
> The tests we currently support and measure are mostly
> straightforward by the function names and the respective comments.
> For IPI test, we measure the cost of sending IPI from a source
> VCPU to a target VCPU, until the target VCPU receives the IPI.
> 
> Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
> ---
>   arm/Makefile.common |   1 +
>   arm/micro-test.c    | 289 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>   arm/unittests.cfg   |   6 ++
>   3 files changed, 296 insertions(+)
>   create mode 100644 arm/micro-test.c
> 
> diff --git a/arm/Makefile.common b/arm/Makefile.common
> index 0a039cf..c7d5c27 100644
> --- a/arm/Makefile.common
> +++ b/arm/Makefile.common
> @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
>   tests-common += $(TEST_DIR)/gic.flat
>   tests-common += $(TEST_DIR)/psci.flat
>   tests-common += $(TEST_DIR)/sieve.flat
> +tests-common += $(TEST_DIR)/micro-test.flat
>   
>   tests-all = $(tests-common) $(tests)
>   all: directories $(tests-all)
> diff --git a/arm/micro-test.c b/arm/micro-test.c
> new file mode 100644
> index 0000000..7df2272
> --- /dev/null
> +++ b/arm/micro-test.c
> @@ -0,0 +1,289 @@
> +#include <util.h>
> +#include <asm/gic.h>
> +
> +static volatile bool second_cpu_up;
> +static volatile bool first_cpu_ack;
> +static volatile bool ipi_acked;
> +static volatile bool ipi_received;
> +static volatile bool ipi_ready;
> +#define IPI_IRQ		1
> +
> +#define TIMEOUT (1U << 28)
> +
> +#define ARR_SIZE(_x) ((int)(sizeof(_x) / sizeof(_x[0])))
> +#define for_each_test(_iter, _tests, _tmp) \
> +	for (_tmp = 0, _iter = _tests; \
> +			_tmp < ARR_SIZE(_tests); \
> +			_tmp++, _iter++)
> +
> +#define CYCLE_COUNT(c1, c2) \
> +	(((c1) > (c2) || ((c1) == (c2))) ? 0 : (c2) - (c1))
> +
> +#define IPI_DEBUG 0
> +
> +#if IPI_DEBUG == 1
> +#define debug(fmt, ...) \
> +	printf("[cpu %d]: " fmt, smp_processor_id(),  ## __VA_ARGS__)
> +#else
> +#define debug(fmt, ...) {}
> +#endif
> +
> +static uint64_t read_cc(void)
> +{
> +	uint64_t cc;
> +	asm volatile(
> +		"isb\n"
> +		"mrs %0, CNTPCT_EL0\n"
> +		"isb\n"
> +		: [reg] "=r" (cc)
> +		::
> +	);
> +	return cc;
> +}
> +
> +static void ipi_irq_handler(struct pt_regs *regs __unused)
> +{
> +	u32 ack;
> +	ipi_ready = false;
> +	ipi_received = true;
> +	ack = gic_read_iar();
> +	ipi_acked = true;
> +	gic_write_eoir(ack);
> +	ipi_ready = true;
> +}
> +
> +static void ipi_test_secondary_entry(void)
> +{
> +	unsigned int timeout = TIMEOUT;
> +
> +	debug("secondary core up\n");
> +
> +	enum vector v = EL1H_IRQ;
> +	install_irq_handler(v, ipi_irq_handler);
> +
> +	gic_enable_defaults();
> +
> +	second_cpu_up = true;
> +
> +	debug("secondary initialized vgic\n");
> +
> +	while (!first_cpu_ack && timeout--);
> +	if (!first_cpu_ack) {
> +		debug("ipi_test: First CPU did not ack wake-up\n");
> +		exit(1);
> +	}
> +
> +	debug("detected first cpu ack\n");
> +
> +	local_irq_enable(); /* Enter small wait-loop */
> +	ipi_ready = true;
> +	while (true);
> +}
> +
> +static int test_init(void)
> +{
> +	int ret;
> +	unsigned int timeout = TIMEOUT;
> +
> +	ret = gic_init();
> +	if (!ret) {
> +		debug("No supported gic present, skipping tests...\n");
> +		goto out;
> +	}
> +
> +	ipi_ready = false;
> +
> +	gic_enable_defaults();
> +
> +	debug("starting second CPU\n");
> +	smp_boot_secondary(1, ipi_test_secondary_entry);
> +
> +	while (!second_cpu_up && timeout--); /* Wait for second CPU! */
> +
> +	if (!second_cpu_up) {
> +		debug("ipi_test: timeout waiting for secondary CPU\n");
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	debug("detected secondary core up\n");
> +
> +	first_cpu_ack = true;
> +
> +	printf("Timer Frequency %d Hz (Output in timer count)\n", get_cntfrq());
> +
> +out:
> +	return ret;
> +}
> +
> +static unsigned long ipi_test(void)
> +{
> +	unsigned int timeout = TIMEOUT;
> +	unsigned long c1, c2;
> +
> +	while (!ipi_ready && timeout--);
> +	if (!ipi_ready) {
> +		debug("ipi_test: second core not ready for IPIs\n");
> +		exit(1);
> +	}
> +
> +	ipi_received = false;
> +
> +	c1 = read_cc();
> +
> +	gic_ipi_send_single(IPI_IRQ, 1);
> +
> +	timeout = TIMEOUT;
> +	while (!ipi_received && timeout--);
> +	if (!ipi_received) {
> +		debug("ipi_test: secondary core never received ipi\n");
> +		exit(1);
> +	}
> +
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +
> +static unsigned long hvc_test(void)
> +{
> +	unsigned long c1, c2;
> +
> +	c1 = read_cc();
> +	asm volatile("mov w0, #0x4b000000; hvc #0");
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +static void __noop(void)
> +{
> +}
> +
> +static unsigned long noop_guest(void)
> +{
> +	unsigned long c1, c2;
> +
> +	c1 = read_cc();
> +	__noop();
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +static unsigned long mmio_read_user(void)
> +{
> +	unsigned long c1, c2;
> +	void *mmio_read_user_addr = (void*) 0x0a000008;
> +
> +	/* Measure MMIO exit to QEMU in userspace */
> +	c1 = read_cc();
> +	readl(mmio_read_user_addr);
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +static unsigned long mmio_read_vgic(void)
> +{
> +	unsigned long c1, c2;
> +	int v = gic_version();
> +	void *vgic_dist_addr = NULL;
> +
> +	if (v == 2)
> +		vgic_dist_addr = gicv2_dist_base();
> +	else if (v == 3)
> +		vgic_dist_addr = gicv3_dist_base();
> +
> +	/* Measure MMIO exit to host kernel */
> +	c1 = read_cc();
> +	readl(vgic_dist_addr + 0x8); /* Read GICD_IIDR */
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +static unsigned long eoi_test(void)
> +{
> +	unsigned long c1, c2;
> +	int v = gic_version();
> +	void (*write_eoir)(u32 irqstat) = NULL;
> +
> +	u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
> +
> +	if (v == 2)
> +		write_eoir = gicv2_write_eoir;
> +	else if (v == 3)
> +		write_eoir = gicv3_write_eoir;
> +
> +	c1 = read_cc();
> +	write_eoir(val);
> +	c2 = read_cc();
> +
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +struct exit_test {
> +	const char *name;
> +	unsigned long (*test_fn)(void);
> +	bool run;
> +};
> +
> +static struct exit_test available_tests[] = {
> +	{"hvc",                hvc_test,           true},
> +	{"noop_guest",         noop_guest,         true},
> +	{"mmio_read_user",     mmio_read_user,     true},
> +	{"mmio_read_vgic",     mmio_read_vgic,     true},
> +	{"eoi",                eoi_test,           true},
> +	{"ipi",                ipi_test,           true},
> +};
> +
> +static void loop_test(struct exit_test *test)
> +{
> +	unsigned long i, iterations = 32;
> +	unsigned long sample, cycles;
> +	unsigned long long min = 0, max = 0;
> +	const unsigned long long goal = (1ULL << 29);
> +
> +	do {
> +		iterations *= 2;
> +		cycles = 0;
> +		for (i = 0; i < iterations; i++) {
> +			sample = test->test_fn();
> +			if (sample == 0) {
> +				/*
> +				 * If something went wrong or we had an
> +				 * overflow, don't count that sample.
> +				 */
> +				iterations--;
> +				i--;
> +				debug("cycle count overflow: %lu\n", sample);
> +				continue;
> +			}
> +			cycles += sample;
> +			if (min == 0 || min > sample)
> +				min = sample;
> +			if (max < sample)
> +				max = sample;
> +		}
> +	} while (cycles < goal);
> +	printf("%s:\t avg %lu\t min %llu\t max %llu\n",
> +		test->name, cycles / (iterations), min, max);
> +}
> +
> +void kvm_unit_test(void)
> +{
> +	int i=0;
> +	struct exit_test *test;
> +	for_each_test(test, available_tests, i) {
> +		if (!test->run)
> +			continue;
> +		loop_test(test);
> +	}
> +
> +	return;
> +}
> +
> +int main(int argc, char **argv)
> +{
> +	if (!test_init())
> +		exit(1);
> +	kvm_unit_test();
> +	return 0;
> +}
> diff --git a/arm/unittests.cfg b/arm/unittests.cfg
> index 44b98cf..1d0c4ca 100644
> --- a/arm/unittests.cfg
> +++ b/arm/unittests.cfg
> @@ -116,3 +116,9 @@ file = timer.flat
>   groups = timer
>   timeout = 2s
>   arch = arm64
> +
> +# Exit tests
> +[micro-test]
> +file = micro-test.flat
> +smp = 2
> +groups = micro-test
>

Kalra, Ashish Dec. 18, 2017, 10:31 p.m. UTC | #6

Hello Shih-Wei,

This "micro" test is really useful for micro-architectural performance
measurement of KVM/ARM.

But, to get more accurate cycle counts, it will be preferable to compute
counter read (and isb) overhead and deduct it from the final cycle
count, something like this :

+static unsigned long counter_read_overhead;

+static void compute_counter_read_overhead(void)
+{
+       unsigned long c1, c2;
+
+       c1 = read_cc();
+       c2 = read_cc();
+       counter_read_overhead = c2 - c1;
+}

+#define CYCLE_COUNT(c1, c2) \
+       (((c1) > (c2) || ((c1) == (c2))) ? 0 : ((c2) - (c1)) -
counter_read_overhead )
+

static int test_init(void)
{
         int ret;
         unsigned int timeout = TIMEOUT;

+       compute_counter_read_overhead();
..
..

Thanks,
Ashish

On 12/16/2017 2:45 AM, Shih-Wei Li wrote:
> Here we provide the support for measuring various micro level
> operations on arm64. We iterate each of the tests for millions of
> times and output their average, minimum and maximum cost in timer
> counts. Instruction barriers are used before and after taking
> timestamps to avoid out-of-order execution or pipelining from
> skewing our measurements.
> 
> The tests we currently support and measure are mostly
> straightforward by the function names and the respective comments.
> For IPI test, we measure the cost of sending IPI from a source
> VCPU to a target VCPU, until the target VCPU receives the IPI.
> 
> Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
> ---
>   arm/Makefile.common |   1 +
>   arm/micro-test.c    | 289 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>   arm/unittests.cfg   |   6 ++
>   3 files changed, 296 insertions(+)
>   create mode 100644 arm/micro-test.c
> 
> diff --git a/arm/Makefile.common b/arm/Makefile.common
> index 0a039cf..c7d5c27 100644
> --- a/arm/Makefile.common
> +++ b/arm/Makefile.common
> @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
>   tests-common += $(TEST_DIR)/gic.flat
>   tests-common += $(TEST_DIR)/psci.flat
>   tests-common += $(TEST_DIR)/sieve.flat
> +tests-common += $(TEST_DIR)/micro-test.flat
>   
>   tests-all = $(tests-common) $(tests)
>   all: directories $(tests-all)
> diff --git a/arm/micro-test.c b/arm/micro-test.c
> new file mode 100644
> index 0000000..7df2272
> --- /dev/null
> +++ b/arm/micro-test.c
> @@ -0,0 +1,289 @@
> +#include <util.h>
> +#include <asm/gic.h>
> +
> +static volatile bool second_cpu_up;
> +static volatile bool first_cpu_ack;
> +static volatile bool ipi_acked;
> +static volatile bool ipi_received;
> +static volatile bool ipi_ready;
> +#define IPI_IRQ		1
> +
> +#define TIMEOUT (1U << 28)
> +
> +#define ARR_SIZE(_x) ((int)(sizeof(_x) / sizeof(_x[0])))
> +#define for_each_test(_iter, _tests, _tmp) \
> +	for (_tmp = 0, _iter = _tests; \
> +			_tmp < ARR_SIZE(_tests); \
> +			_tmp++, _iter++)
> +
> +#define CYCLE_COUNT(c1, c2) \
> +	(((c1) > (c2) || ((c1) == (c2))) ? 0 : (c2) - (c1))
> +
> +#define IPI_DEBUG 0
> +
> +#if IPI_DEBUG == 1
> +#define debug(fmt, ...) \
> +	printf("[cpu %d]: " fmt, smp_processor_id(),  ## __VA_ARGS__)
> +#else
> +#define debug(fmt, ...) {}
> +#endif
> +
> +static uint64_t read_cc(void)
> +{
> +	uint64_t cc;
> +	asm volatile(
> +		"isb\n"
> +		"mrs %0, CNTPCT_EL0\n"
> +		"isb\n"
> +		: [reg] "=r" (cc)
> +		::
> +	);
> +	return cc;
> +}
> +
> +static void ipi_irq_handler(struct pt_regs *regs __unused)
> +{
> +	u32 ack;
> +	ipi_ready = false;
> +	ipi_received = true;
> +	ack = gic_read_iar();
> +	ipi_acked = true;
> +	gic_write_eoir(ack);
> +	ipi_ready = true;
> +}
> +
> +static void ipi_test_secondary_entry(void)
> +{
> +	unsigned int timeout = TIMEOUT;
> +
> +	debug("secondary core up\n");
> +
> +	enum vector v = EL1H_IRQ;
> +	install_irq_handler(v, ipi_irq_handler);
> +
> +	gic_enable_defaults();
> +
> +	second_cpu_up = true;
> +
> +	debug("secondary initialized vgic\n");
> +
> +	while (!first_cpu_ack && timeout--);
> +	if (!first_cpu_ack) {
> +		debug("ipi_test: First CPU did not ack wake-up\n");
> +		exit(1);
> +	}
> +
> +	debug("detected first cpu ack\n");
> +
> +	local_irq_enable(); /* Enter small wait-loop */
> +	ipi_ready = true;
> +	while (true);
> +}
> +
> +static int test_init(void)
> +{
> +	int ret;
> +	unsigned int timeout = TIMEOUT;
> +
> +	ret = gic_init();
> +	if (!ret) {
> +		debug("No supported gic present, skipping tests...\n");
> +		goto out;
> +	}
> +
> +	ipi_ready = false;
> +
> +	gic_enable_defaults();
> +
> +	debug("starting second CPU\n");
> +	smp_boot_secondary(1, ipi_test_secondary_entry);
> +
> +	while (!second_cpu_up && timeout--); /* Wait for second CPU! */
> +
> +	if (!second_cpu_up) {
> +		debug("ipi_test: timeout waiting for secondary CPU\n");
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	debug("detected secondary core up\n");
> +
> +	first_cpu_ack = true;
> +
> +	printf("Timer Frequency %d Hz (Output in timer count)\n", get_cntfrq());
> +
> +out:
> +	return ret;
> +}
> +
> +static unsigned long ipi_test(void)
> +{
> +	unsigned int timeout = TIMEOUT;
> +	unsigned long c1, c2;
> +
> +	while (!ipi_ready && timeout--);
> +	if (!ipi_ready) {
> +		debug("ipi_test: second core not ready for IPIs\n");
> +		exit(1);
> +	}
> +
> +	ipi_received = false;
> +
> +	c1 = read_cc();
> +
> +	gic_ipi_send_single(IPI_IRQ, 1);
> +
> +	timeout = TIMEOUT;
> +	while (!ipi_received && timeout--);
> +	if (!ipi_received) {
> +		debug("ipi_test: secondary core never received ipi\n");
> +		exit(1);
> +	}
> +
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +
> +static unsigned long hvc_test(void)
> +{
> +	unsigned long c1, c2;
> +
> +	c1 = read_cc();
> +	asm volatile("mov w0, #0x4b000000; hvc #0");
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +static void __noop(void)
> +{
> +}
> +
> +static unsigned long noop_guest(void)
> +{
> +	unsigned long c1, c2;
> +
> +	c1 = read_cc();
> +	__noop();
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +static unsigned long mmio_read_user(void)
> +{
> +	unsigned long c1, c2;
> +	void *mmio_read_user_addr = (void*) 0x0a000008;
> +
> +	/* Measure MMIO exit to QEMU in userspace */
> +	c1 = read_cc();
> +	readl(mmio_read_user_addr);
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +static unsigned long mmio_read_vgic(void)
> +{
> +	unsigned long c1, c2;
> +	int v = gic_version();
> +	void *vgic_dist_addr = NULL;
> +
> +	if (v == 2)
> +		vgic_dist_addr = gicv2_dist_base();
> +	else if (v == 3)
> +		vgic_dist_addr = gicv3_dist_base();
> +
> +	/* Measure MMIO exit to host kernel */
> +	c1 = read_cc();
> +	readl(vgic_dist_addr + 0x8); /* Read GICD_IIDR */
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +static unsigned long eoi_test(void)
> +{
> +	unsigned long c1, c2;
> +	int v = gic_version();
> +	void (*write_eoir)(u32 irqstat) = NULL;
> +
> +	u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
> +
> +	if (v == 2)
> +		write_eoir = gicv2_write_eoir;
> +	else if (v == 3)
> +		write_eoir = gicv3_write_eoir;
> +
> +	c1 = read_cc();
> +	write_eoir(val);
> +	c2 = read_cc();
> +
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +struct exit_test {
> +	const char *name;
> +	unsigned long (*test_fn)(void);
> +	bool run;
> +};
> +
> +static struct exit_test available_tests[] = {
> +	{"hvc",                hvc_test,           true},
> +	{"noop_guest",         noop_guest,         true},
> +	{"mmio_read_user",     mmio_read_user,     true},
> +	{"mmio_read_vgic",     mmio_read_vgic,     true},
> +	{"eoi",                eoi_test,           true},
> +	{"ipi",                ipi_test,           true},
> +};
> +
> +static void loop_test(struct exit_test *test)
> +{
> +	unsigned long i, iterations = 32;
> +	unsigned long sample, cycles;
> +	unsigned long long min = 0, max = 0;
> +	const unsigned long long goal = (1ULL << 29);
> +
> +	do {
> +		iterations *= 2;
> +		cycles = 0;
> +		for (i = 0; i < iterations; i++) {
> +			sample = test->test_fn();
> +			if (sample == 0) {
> +				/*
> +				 * If something went wrong or we had an
> +				 * overflow, don't count that sample.
> +				 */
> +				iterations--;
> +				i--;
> +				debug("cycle count overflow: %lu\n", sample);
> +				continue;
> +			}
> +			cycles += sample;
> +			if (min == 0 || min > sample)
> +				min = sample;
> +			if (max < sample)
> +				max = sample;
> +		}
> +	} while (cycles < goal);
> +	printf("%s:\t avg %lu\t min %llu\t max %llu\n",
> +		test->name, cycles / (iterations), min, max);
> +}
> +
> +void kvm_unit_test(void)
> +{
> +	int i=0;
> +	struct exit_test *test;
> +	for_each_test(test, available_tests, i) {
> +		if (!test->run)
> +			continue;
> +		loop_test(test);
> +	}
> +
> +	return;
> +}
> +
> +int main(int argc, char **argv)
> +{
> +	if (!test_init())
> +		exit(1);
> +	kvm_unit_test();
> +	return 0;
> +}
> diff --git a/arm/unittests.cfg b/arm/unittests.cfg
> index 44b98cf..1d0c4ca 100644
> --- a/arm/unittests.cfg
> +++ b/arm/unittests.cfg
> @@ -116,3 +116,9 @@ file = timer.flat
>   groups = timer
>   timeout = 2s
>   arch = arm64
> +
> +# Exit tests
> +[micro-test]
> +file = micro-test.flat
> +smp = 2
> +groups = micro-test
>

Christoffer Dall Dec. 19, 2017, 9:12 a.m. UTC | #7

On Mon, Dec 18, 2017 at 08:31:21PM +0300, Yury Norov wrote:
> On Fri, Dec 15, 2017 at 04:15:39PM -0500, Shih-Wei Li wrote:
> > Here we provide the support for measuring various micro level
> > operations on arm64. We iterate each of the tests for millions of
> > times and output their average, minimum and maximum cost in timer
> > counts. Instruction barriers are used before and after taking
> > timestamps to avoid out-of-order execution or pipelining from
> > skewing our measurements.
> > 
> > The tests we currently support and measure are mostly
> > straightforward by the function names and the respective comments.
> > For IPI test, we measure the cost of sending IPI from a source
> > VCPU to a target VCPU, until the target VCPU receives the IPI.
> > 
> > Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
> > ---
> >  arm/Makefile.common |   1 +
> >  arm/micro-test.c    | 289 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  arm/unittests.cfg   |   6 ++
> >  3 files changed, 296 insertions(+)
> >  create mode 100644 arm/micro-test.c
> > 
> > diff --git a/arm/Makefile.common b/arm/Makefile.common
> > index 0a039cf..c7d5c27 100644
> > --- a/arm/Makefile.common
> > +++ b/arm/Makefile.common
> > @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
> >  tests-common += $(TEST_DIR)/gic.flat
> >  tests-common += $(TEST_DIR)/psci.flat
> >  tests-common += $(TEST_DIR)/sieve.flat
> > +tests-common += $(TEST_DIR)/micro-test.flat
> >  
> >  tests-all = $(tests-common) $(tests)
> >  all: directories $(tests-all)
> > diff --git a/arm/micro-test.c b/arm/micro-test.c
> > new file mode 100644
> > index 0000000..7df2272
> > --- /dev/null
> > +++ b/arm/micro-test.c
> > @@ -0,0 +1,289 @@
> > +#include <util.h>
> > +#include <asm/gic.h>
> > +
> > +static volatile bool second_cpu_up;
> > +static volatile bool first_cpu_ack;
> > +static volatile bool ipi_acked;
> > +static volatile bool ipi_received;
> > +static volatile bool ipi_ready;
> > +#define IPI_IRQ		1
> > +
> > +#define TIMEOUT (1U << 28)
> > +
> > +#define ARR_SIZE(_x) ((int)(sizeof(_x) / sizeof(_x[0])))
> > +#define for_each_test(_iter, _tests, _tmp) \
> > +	for (_tmp = 0, _iter = _tests; \
> > +			_tmp < ARR_SIZE(_tests); \
> > +			_tmp++, _iter++)
> > +
> > +#define CYCLE_COUNT(c1, c2) \
> > +	(((c1) > (c2) || ((c1) == (c2))) ? 0 : (c2) - (c1))
> 
> Is my understanding correct that this is overflow protection?
> c1 and c2 are 64-bit values. To overflow them you need 58 years
> at 1G CPU freq.
> 

That's assuming your cycle counter starts at 0, and that nobody
programmed it near the overflow value to get an overflow interrupt.

So if you get rid of this, you have to make sure the host never plays
with the cycle counter behind your back, and that you've initialized it
to zero.

FWIW, when we first wrote some version of this code 5+ years ago in a
different world, we actually saw overflow as a result of the way we ran
this code.  That may have changed, but it's not as clear cut.

Thanks,
-Christoffer

Yury Norov Dec. 19, 2017, 10:05 a.m. UTC | #8

On Tue, Dec 19, 2017 at 10:12:00AM +0100, Christoffer Dall wrote:
> On Mon, Dec 18, 2017 at 08:31:21PM +0300, Yury Norov wrote:
> > On Fri, Dec 15, 2017 at 04:15:39PM -0500, Shih-Wei Li wrote:
> > > Here we provide the support for measuring various micro level
> > > operations on arm64. We iterate each of the tests for millions of
> > > times and output their average, minimum and maximum cost in timer
> > > counts. Instruction barriers are used before and after taking
> > > timestamps to avoid out-of-order execution or pipelining from
> > > skewing our measurements.
> > > 
> > > The tests we currently support and measure are mostly
> > > straightforward by the function names and the respective comments.
> > > For IPI test, we measure the cost of sending IPI from a source
> > > VCPU to a target VCPU, until the target VCPU receives the IPI.
> > > 
> > > Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
> > > ---
> > >  arm/Makefile.common |   1 +
> > >  arm/micro-test.c    | 289 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  arm/unittests.cfg   |   6 ++
> > >  3 files changed, 296 insertions(+)
> > >  create mode 100644 arm/micro-test.c
> > > 
> > > diff --git a/arm/Makefile.common b/arm/Makefile.common
> > > index 0a039cf..c7d5c27 100644
> > > --- a/arm/Makefile.common
> > > +++ b/arm/Makefile.common
> > > @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
> > >  tests-common += $(TEST_DIR)/gic.flat
> > >  tests-common += $(TEST_DIR)/psci.flat
> > >  tests-common += $(TEST_DIR)/sieve.flat
> > > +tests-common += $(TEST_DIR)/micro-test.flat
> > >  
> > >  tests-all = $(tests-common) $(tests)
> > >  all: directories $(tests-all)
> > > diff --git a/arm/micro-test.c b/arm/micro-test.c
> > > new file mode 100644
> > > index 0000000..7df2272
> > > --- /dev/null
> > > +++ b/arm/micro-test.c
> > > @@ -0,0 +1,289 @@
> > > +#include <util.h>
> > > +#include <asm/gic.h>
> > > +
> > > +static volatile bool second_cpu_up;
> > > +static volatile bool first_cpu_ack;
> > > +static volatile bool ipi_acked;
> > > +static volatile bool ipi_received;
> > > +static volatile bool ipi_ready;
> > > +#define IPI_IRQ		1
> > > +
> > > +#define TIMEOUT (1U << 28)
> > > +
> > > +#define ARR_SIZE(_x) ((int)(sizeof(_x) / sizeof(_x[0])))
> > > +#define for_each_test(_iter, _tests, _tmp) \
> > > +	for (_tmp = 0, _iter = _tests; \
> > > +			_tmp < ARR_SIZE(_tests); \
> > > +			_tmp++, _iter++)
> > > +
> > > +#define CYCLE_COUNT(c1, c2) \
> > > +	(((c1) > (c2) || ((c1) == (c2))) ? 0 : (c2) - (c1))
> > 
> > Is my understanding correct that this is overflow protection?
> > c1 and c2 are 64-bit values. To overflow them you need 58 years
> > at 1G CPU freq.
> > 
> 
> That's assuming your cycle counter starts at 0, and that nobody
> programmed it near the overflow value to get an overflow interrupt.
> 
> So if you get rid of this, you have to make sure the host never plays
> with the cycle counter behind your back, and that you've initialized it
> to zero.

Ah, now I understand it. But if there's no evil intention to break the
test, there's single chance to overflow in 584 years. And if it happened,
isn't it simpler to run test again? 

If host shifts cycle counter during the test, the whole test becomes useless,
I think.

> FWIW, when we first wrote some version of this code 5+ years ago in a
> different world, we actually saw overflow as a result of the way we ran
> this code.  That may have changed, but it's not as clear cut.

Maybe 5+ years ago it was ARM32, and overflow was the result of
erroneous cast to unsigned long? (I guess it because the cast exists
in this patch.)

Anyway, this is minor. I'll happily take the test with or without this
hunk. But if you / Shih-Wei decide to save it, the comment would be
appreciated.

Yury

Andrew Jones Dec. 19, 2017, noon UTC | #9

On Mon, Dec 18, 2017 at 04:58:53PM -0500, Shih-Wei Li wrote:
> On Mon, Dec 18, 2017 at 2:10 PM, Andrew Jones <drjones@redhat.com> wrote:
> >> +static void loop_test(struct exit_test *test)
> >> +{
> >> +     unsigned long i, iterations = 32;
> >> +     unsigned long sample, cycles;
> >> +     unsigned long long min = 0, max = 0;
> >> +     const unsigned long long goal = (1ULL << 29);
> >> +
> >> +     do {
> >> +             iterations *= 2;
> >> +             cycles = 0;
> >> +             for (i = 0; i < iterations; i++) {
> >> +                     sample = test->test_fn();
> >> +                     if (sample == 0) {
> >> +                             /*
> >> +                              * If something went wrong or we had an
> >> +                              * overflow, don't count that sample.
> >> +                              */
> >> +                             iterations--;
> >> +                             i--;
> >
> > I'd prefer incrementing a different counter after the sample==0 test /
> > continue, than to decrement both i and iterations.
> 
> I'm not sure about the comment above. Could you please substantiate
> how a new counter should work here?

It doesn't really matter, but how about

 do {
     iterations *= 2;
     cycles = 0;
     i = 0;
     while (i < iterations) {
         sample = test->test_fn();
         if (sample == 0)
             continue;
         cycles += sample;
         ...
         ++i;
     }
 } while (cycles < goal);


> 
> >
> >> +                             debug("cycle count overflow: %lu\n", sample);
> >> +                             continue;
> >> +                     }
> >> +                     cycles += sample;
> >> +                     if (min == 0 || min > sample)
> >> +                             min = sample;
> >> +                     if (max < sample)
> >> +                             max = sample;
> >> +             }
> >> +     } while (cycles < goal);
> >> +     printf("%s:\t avg %lu\t min %llu\t max %llu\n",
> >> +             test->name, cycles / (iterations), min, max);
> >

Thanks,
drew

Christoffer Dall Dec. 19, 2017, 1:04 p.m. UTC | #10

On Tue, Dec 19, 2017 at 01:05:21PM +0300, Yury Norov wrote:
> On Tue, Dec 19, 2017 at 10:12:00AM +0100, Christoffer Dall wrote:
> > On Mon, Dec 18, 2017 at 08:31:21PM +0300, Yury Norov wrote:
> > > On Fri, Dec 15, 2017 at 04:15:39PM -0500, Shih-Wei Li wrote:
> > > > Here we provide the support for measuring various micro level
> > > > operations on arm64. We iterate each of the tests for millions of
> > > > times and output their average, minimum and maximum cost in timer
> > > > counts. Instruction barriers are used before and after taking
> > > > timestamps to avoid out-of-order execution or pipelining from
> > > > skewing our measurements.
> > > > 
> > > > The tests we currently support and measure are mostly
> > > > straightforward by the function names and the respective comments.
> > > > For IPI test, we measure the cost of sending IPI from a source
> > > > VCPU to a target VCPU, until the target VCPU receives the IPI.
> > > > 
> > > > Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
> > > > ---
> > > >  arm/Makefile.common |   1 +
> > > >  arm/micro-test.c    | 289 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >  arm/unittests.cfg   |   6 ++
> > > >  3 files changed, 296 insertions(+)
> > > >  create mode 100644 arm/micro-test.c
> > > > 
> > > > diff --git a/arm/Makefile.common b/arm/Makefile.common
> > > > index 0a039cf..c7d5c27 100644
> > > > --- a/arm/Makefile.common
> > > > +++ b/arm/Makefile.common
> > > > @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
> > > >  tests-common += $(TEST_DIR)/gic.flat
> > > >  tests-common += $(TEST_DIR)/psci.flat
> > > >  tests-common += $(TEST_DIR)/sieve.flat
> > > > +tests-common += $(TEST_DIR)/micro-test.flat
> > > >  
> > > >  tests-all = $(tests-common) $(tests)
> > > >  all: directories $(tests-all)
> > > > diff --git a/arm/micro-test.c b/arm/micro-test.c
> > > > new file mode 100644
> > > > index 0000000..7df2272
> > > > --- /dev/null
> > > > +++ b/arm/micro-test.c
> > > > @@ -0,0 +1,289 @@
> > > > +#include <util.h>
> > > > +#include <asm/gic.h>
> > > > +
> > > > +static volatile bool second_cpu_up;
> > > > +static volatile bool first_cpu_ack;
> > > > +static volatile bool ipi_acked;
> > > > +static volatile bool ipi_received;
> > > > +static volatile bool ipi_ready;
> > > > +#define IPI_IRQ		1
> > > > +
> > > > +#define TIMEOUT (1U << 28)
> > > > +
> > > > +#define ARR_SIZE(_x) ((int)(sizeof(_x) / sizeof(_x[0])))
> > > > +#define for_each_test(_iter, _tests, _tmp) \
> > > > +	for (_tmp = 0, _iter = _tests; \
> > > > +			_tmp < ARR_SIZE(_tests); \
> > > > +			_tmp++, _iter++)
> > > > +
> > > > +#define CYCLE_COUNT(c1, c2) \
> > > > +	(((c1) > (c2) || ((c1) == (c2))) ? 0 : (c2) - (c1))
> > > 
> > > Is my understanding correct that this is overflow protection?
> > > c1 and c2 are 64-bit values. To overflow them you need 58 years
> > > at 1G CPU freq.
> > > 
> > 
> > That's assuming your cycle counter starts at 0, and that nobody
> > programmed it near the overflow value to get an overflow interrupt.
> > 
> > So if you get rid of this, you have to make sure the host never plays
> > with the cycle counter behind your back, and that you've initialized it
> > to zero.
> 
> Ah, now I understand it. But if there's no evil intention to break the
> test, there's single chance to overflow in 584 years. And if it happened,
> isn't it simpler to run test again? 

I believe we used this to throw away results where there was an
overflow, which is nicer to the user than leaving them guessing 'this is
a very high number, I better run the test again'.

-Christoffer

Andrew Jones Dec. 20, 2017, 5 p.m. UTC | #11

On Mon, Dec 18, 2017 at 08:10:37PM +0100, Andrew Jones wrote:
> I just tried on TCG now. It doesn't run. It gets
> 
> Timer Frequency 62500000 Hz (Output in timer count)
> Unhandled exception ec=0 (UNKNOWN)
> Vector: 4 (el1h_sync)
> ESR_EL1:         02000000, ec=0 (UNKNOWN)
> FAR_EL1: 0000000000000000 (not valid)
> Exception frame registers:
> pc : [<0000000040080088>] lr : [<00000000400803e8>] pstate: 800003c5
> sp : 00000000400aff90
> x29: 0000000000000000 x28: 0000000000000000 
> x27: 0000000040090000 x26: 0000000040090c60 
> x25: 0000000040090000 x24: 000000001fffffff 
> x23: 0000000000000000 x22: 0000000000000000 
> x21: 0000000000000040 x20: 0000000000000000 
> x19: 0000000000000000 x18: 00000000400b0000 
> x17: 0000000000000000 x16: 0000000000000000 
> x15: 00000000400afe8c x14: 00000000400b0000 
> x13: 00000000400afecc x12: 0000000000001680 
> x11: 0000000000000000 x10: 6666666666666667 
> x9 : 0000000000000030 x8 : 0000000000000030 
> x7 : 00000000400af670 x6 : 00000000400af673 
> x5 : 00000000400af678 x4 : 00000000000007b7 
> x3 : 00000000400af6ec x2 : 0000000040090000 
> x1 : 000000000015909e x0 : 000000004b000000
>

I looked into this. It's due to the hvc call. The exception goes away if
you add '-machine virtualization=yes' to the qemu command line. But as
there's no point in running it, and it's way too slow, then the
'accel = kvm' added in v2 of the patch is the right thing to do.

Thanks,
drew

Kalra, Ashish May 1, 2018, 2:57 p.m. UTC | #12

Hello Shih-Wei,

This exception is probably due to a bug in the hvc_test (hvc call) code,
and there is a generic bug in the hvc test code.
In the function hvc_test(), local variables c1, c2 get mapped to 
registers (typically x1 & x2) and those are "clobbered" across the hvc 
call, hence, c1 gets clobbered and we get a large cycle count for 
hvc_test (c2 - 0), which messes-up the hvc test measurements.

As per SMC64/HVC64 ABI definitions, x0-x18 can be clobbered across
HVC/SMC calls.

Currently, i am using your micro-benchmark code with the following
hack applied ...

+register unsigned long c1_hvc asm ("x19"); 
 

static unsigned long hvc_test(void) 
 

{
+     unsigned long c2;
-     unsigned long c1, c2;
 
 
 
 
 

+     c1_hvc = read_cc();
-     c1 = read_cc();
      asm volatile("mov w0, #0x4b000000; hvc #0" ::: "w0"); 
 

      c2 = read_cc();
+     return CYCLE_COUNT(c1_hvc, c2);
-     return CYCLE_COUNT(c1, c2); 
 

}

Thanks,
Ashish

On 12/20/2017 10:30 PM, Andrew Jones wrote:
> On Mon, Dec 18, 2017 at 08:10:37PM +0100, Andrew Jones wrote:
>> I just tried on TCG now. It doesn't run. It gets
>>
>> Timer Frequency 62500000 Hz (Output in timer count)
>> Unhandled exception ec=0 (UNKNOWN)
>> Vector: 4 (el1h_sync)
>> ESR_EL1:         02000000, ec=0 (UNKNOWN)
>> FAR_EL1: 0000000000000000 (not valid)
>> Exception frame registers:
>> pc : [<0000000040080088>] lr : [<00000000400803e8>] pstate: 800003c5
>> sp : 00000000400aff90
>> x29: 0000000000000000 x28: 0000000000000000
>> x27: 0000000040090000 x26: 0000000040090c60
>> x25: 0000000040090000 x24: 000000001fffffff
>> x23: 0000000000000000 x22: 0000000000000000
>> x21: 0000000000000040 x20: 0000000000000000
>> x19: 0000000000000000 x18: 00000000400b0000
>> x17: 0000000000000000 x16: 0000000000000000
>> x15: 00000000400afe8c x14: 00000000400b0000
>> x13: 00000000400afecc x12: 0000000000001680
>> x11: 0000000000000000 x10: 6666666666666667
>> x9 : 0000000000000030 x8 : 0000000000000030
>> x7 : 00000000400af670 x6 : 00000000400af673
>> x5 : 00000000400af678 x4 : 00000000000007b7
>> x3 : 00000000400af6ec x2 : 0000000040090000
>> x1 : 000000000015909e x0 : 000000004b000000
>>
> 
> I looked into this. It's due to the hvc call. The exception goes away if
> you add '-machine virtualization=yes' to the qemu command line. But as
> there's no point in running it, and it's way too slow, then the
> 'accel = kvm' added in v2 of the patch is the right thing to do.
> 
> Thanks,
> drew
>

Marc Zyngier May 2, 2018, 9:23 a.m. UTC | #13

On 01/05/18 15:57, Kalra, Ashish wrote:
> Hello Shih-Wei,
> 
> This exception is probably due to a bug in the hvc_test (hvc call) code,
> and there is a generic bug in the hvc test code.
> In the function hvc_test(), local variables c1, c2 get mapped to 
> registers (typically x1 & x2) and those are "clobbered" across the hvc 
> call, hence, c1 gets clobbered and we get a large cycle count for 
> hvc_test (c2 - 0), which messes-up the hvc test measurements.
> 
> As per SMC64/HVC64 ABI definitions, x0-x18 can be clobbered across
> HVC/SMC calls.

I have two issues with this statement:

- Version 1.1 of the SMCCC actually says that only x0-x3 will be
potentially clobbered

- KVM makes a point in not clobbering any register, except for those
functions implemented by SMCCC 1.1.

That being said...

> 
> Currently, i am using your micro-benchmark code with the following
> hack applied ...
> 
> +register unsigned long c1_hvc asm ("x19"); 
>  
> 
> static unsigned long hvc_test(void) 
>  
> 
> {
> +     unsigned long c2;
> -     unsigned long c1, c2;
>  
>  
>  
>  
>  
> 
> +     c1_hvc = read_cc();
> -     c1 = read_cc();
>       asm volatile("mov w0, #0x4b000000; hvc #0" ::: "w0"); 
>  
> 
>       c2 = read_cc();

The reason for the problem you're seeing is probably that the
constraints are not quite right. Here, x0 is not simply clobbered, but
is actively written to in the middle of the sequence (it is at least an
early clobber). It is also, I assume, a result from the hypercall, so it
cannot simply be discarded.

It would help to get a disassembly of the function, but I'd tend to
rewrite the code as such:

extern int bar(void);

int foo(void)
{
	register int w0 asm("w0");
	int a, b;

	a = bar();
	w0 = 0x4b000000;
	asm volatile("hvc #0" : "+r" (w0) :: );
	b = bar();

	return a - b;
}

Thanks,

	M.

Kalra, Ashish May 3, 2018, 11:12 a.m. UTC | #14

>> 
>> The reason for the problem you're seeing is probably that the
>> constraints are not quite right. Here, x0 is not simply clobbered, but
>> is actively written to in the middle of the sequence (it is at least an
>> early clobber). It is also, I assume, a result from the hypercall, so it
>> cannot simply be discarded.
>> 
>> It would help to get a disassembly of the function, but I'd tend to
>> rewrite the code as such:
>> 
>> extern int bar(void);
>> 
>> int foo(void)
>> {
>> 	register int w0 asm("w0");
>> 	int a, b;
>> 
>> 	a = bar();
>> 	w0 = 0x4b000000;
>> 	asm volatile("hvc #0" : "+r" (w0) :: );
>> 	b = bar();
>> 
>> 	return a - b;
>> }

There still is the issue of x1-x3 registers being used for local 
variables and clobbering of the same across the "hvc" call.

Comparatively, the fix below works more reliably :

asm volatile("hvc #0" : "+r" (w0) :: "x1","x2","x3");

Thanks,
Ashish

Marc Zyngier May 3, 2018, 4:24 p.m. UTC | #15

On 03/05/18 12:12, Kalra, Ashish wrote:
> 
>>>
>>> The reason for the problem you're seeing is probably that the
>>> constraints are not quite right. Here, x0 is not simply clobbered, but
>>> is actively written to in the middle of the sequence (it is at least an
>>> early clobber). It is also, I assume, a result from the hypercall, so it
>>> cannot simply be discarded.
>>>
>>> It would help to get a disassembly of the function, but I'd tend to
>>> rewrite the code as such:
>>>
>>> extern int bar(void);
>>>
>>> int foo(void)
>>> {
>>> 	register int w0 asm("w0");
>>> 	int a, b;
>>>
>>> 	a = bar();
>>> 	w0 = 0x4b000000;
>>> 	asm volatile("hvc #0" : "+r" (w0) :: );
>>> 	b = bar();
>>>
>>> 	return a - b;
>>> }
> 
> There still is the issue of x1-x3 registers being used for local 
> variables and clobbering of the same across the "hvc" call.
> 
> Comparatively, the fix below works more reliably :
> 
> asm volatile("hvc #0" : "+r" (w0) :: "x1","x2","x3");

Is that because this hypercall is explicitly clobbering these registers?
or because KVM is corrupting them? If the former, fine. If the latter,
that'd be a KVM bug.

So which one is it?

	M.

Kalra, Ashish May 3, 2018, 6:08 p.m. UTC | #16

>>On 5/3/2018 9:54 PM, Marc Zyngier wrote:
>> On 03/05/18 12:12, Kalra, Ashish wrote:
>>>
>>>>>
>>>> The reason for the problem you're seeing is probably that the
>>>>> constraints are not quite right. Here, x0 is not simply clobbered, but
>>>>> is actively written to in the middle of the sequence (it is at least an
>>>>> early clobber). It is also, I assume, a result from the hypercall, so it
>>>>> cannot simply be discarded.
>>>>>
>>>>> It would help to get a disassembly of the function, but I'd tend to
>>>>> rewrite the code as such:
>>>>>
>>>>> extern int bar(void);
>>>>>
>>>>> int foo(void)
>>>>> {
>>>>> 	register int w0 asm("w0");
>>>>> 	int a, b;
>>>>>
>>>>> 	a = bar();
>>>>> 	w0 = 0x4b000000;
>>>>> 	asm volatile("hvc #0" : "+r" (w0) :: );
>>>>> 	b = bar();
>>>>>
>>>>> 	return a - b;
>>>>> }
>>>
>>> There still is the issue of x1-x3 registers being used for local
>>> variables and clobbering of the same across the "hvc" call.
>>>
>>> Comparatively, the fix below works more reliably :
>>>
>>>> asm volatile("hvc #0" : "+r" (w0) :: "x1","x2","x3");
>> 
>> Is that because this hypercall is explicitly clobbering these registers?
>> or because KVM is corrupting them? If the former, fine. If the latter,
>> that'd be a KVM bug.
>> 
>> So which one is it?
>> 

It is being explicitly done by the hypercall handler:

int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
{
	u32 func_id = smccc_get_function(vcpu);
         u32 val = PSCI_RET_NOT_SUPPORTED;
         ..
         ..
         smccc_set_retval(vcpu, val, 0, 0, 0);
}

Thanks,
Ashish

[kvm-unit-tests] arm64: add micro test

Commit Message

Comments

Patch