
[v2] IPI performance benchmark

Message ID 20171219085010.4081-1-ynorov@caviumnetworks.com (mailing list archive)
State New, archived

Commit Message

Yury Norov Dec. 19, 2017, 8:50 a.m. UTC
This benchmark sends many IPIs in different modes and measures the
time for IPI delivery (first column) and the total time, i.e. including
the time for the sender to have the delivery acknowledged (second column).

The scenarios are:
Dry-run:	Do everything except actually sending the IPI. Useful
		for estimating system overhead.
Self-IPI:	Send an IPI to the local CPU.
Normal IPI:	Send an IPI to some other CPU.
Broadcast IPI:	Send a broadcast IPI to all online CPUs.
Broadcast lock:	Send a broadcast IPI to all online CPUs and force them
		to acquire/release a spinlock.

The raw output looks like this:
[  155.363374] Dry-run:                         0,            2999696 ns
[  155.429162] Self-IPI:                 30385328,           65589392 ns
[  156.060821] Normal IPI:              566914128,          631453008 ns
[  158.384427] Broadcast IPI:                   0,         2323368720 ns
[  160.831850] Broadcast lock:                  0,         2447000544 ns
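
How the two columns are produced, in a trimmed excerpt from the patch below:
the IPI handler stamps the delivery time, while bench_ipi() times the whole
loop, so the second column also includes the sender waiting for the
acknowledgement. The broadcast tests pass a NULL timestamp to the handler,
which is why their first column reads 0.

static void __init handle_ipi(void *t)
{
	ktime_t *time = (ktime_t *) t;

	if (time)				/* first column: delivery time */
		*time = ktime_get() - *time;
}

	/* per IPI, in send_ipi(): start the stamp, then wait for completion */
	time = ktime_get();
	smp_call_function_single(cpu, handle_ipi, &time, 1);

	/* in bench_ipi(): the second column wraps the whole run */
	*total = ktime_get();
	ret = __bench_ipi(times, ipi, flags);
	*total = ktime_get() - *total;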

For virtualized guests, sending and receiving IPIs causes guest exits.
I used this test to measure the performance impact of Christoffer Dall's
series "Optimize KVM/ARM for VHE systems" [1] on the KVM subsystem.

The test machine is a ThunderX2 with 112 online CPUs. Below are the results
normalized to the host dry-run time (i.e. each value is the measured time
divided by the host dry-run total); broadcast lock results are omitted.
Smaller is better.

Host, v4.14:
Dry-run:	  0	    1
Self-IPI:         9	   18
Normal IPI:      81	  110
Broadcast IPI:    0	 2106

Guest, v4.14:
Dry-run:          0	    1
Self-IPI:        10	   18
Normal IPI:     305	  525
Broadcast IPI:    0    	 9729

Guest, v4.14 + [1]:
Dry-run:          0	    1
Self-IPI:         9	   18
Normal IPI:     176	  343
Broadcast IPI:    0	 9885

[1] https://www.spinics.net/lists/kvm/msg156755.html

v2:
  added broadcast lock test;
  added example raw output in patch description;

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ashish Kalra <Ashish.Kalra@cavium.com>
CC: Christoffer Dall <christoffer.dall@linaro.org>
CC: Geert Uytterhoeven <geert@linux-m68k.org>
CC: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
CC: Linu Cherian <Linu.Cherian@cavium.com>
CC: Shih-Wei Li <shihwei@cs.columbia.edu>
CC: Sunil Goutham <Sunil.Goutham@cavium.com>
Signed-off-by: Yury Norov <ynorov@caviumnetworks.com>
---
 arch/Kconfig           |  10 ++++
 kernel/Makefile        |   1 +
 kernel/ipi_benchmark.c | 153 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 164 insertions(+)
 create mode 100644 kernel/ipi_benchmark.c

Comments

Philippe Ombredanne Dec. 19, 2017, 9:26 a.m. UTC | #1
Dear Yury,

On Tue, Dec 19, 2017 at 9:50 AM, Yury Norov <ynorov@caviumnetworks.com> wrote:
> This benchmark sends many IPIs in different modes and measures
> time for IPI delivery (first column), and total time, ie including
> time to acknowledge the receive by sender (second column).

<snip>

> --- /dev/null
> +++ b/kernel/ipi_benchmark.c
> @@ -0,0 +1,153 @@
> +/*
> + * Performance test for IPI on SMP machines.
> + *
> + * Copyright (c) 2017 Cavium Networks.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * General Public License for more details.
> + */

Would you mind using the new SPDX tags documented in Thomas' patch set
[1] rather than this fine but longer legalese?  Each time long
legalese is added as a comment to a kernel file, there is a whole star
system that dies somewhere in the universe, which is not a good thing.

SPDX tags eschew this problem by using a simple one line comment and
this has been proven to be mostly harmless. And if you could spread
the word to others in your team this would be very nice. I recently
nudged Aleksey who nicely updated his patches a short while ago.

> +MODULE_LICENSE("GPL");

There is a problem here: your MODULE_LICENSE tag means GPL-2.0 or
later versions as documented in module.h. This is not consistent with
your top level license notice. You should make this consistent IMHO
.... and use SPDX tags for the top level notice of course!
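
For instance, one self-consistent GPL-2.0-only pairing (only an illustration;
going the other way, a GPL-2.0+ notice plus MODULE_LICENSE("GPL"), is equally
fine as long as the two agree) would be:

// SPDX-License-Identifier: GPL-2.0
/*
 * Performance test for IPI on SMP machines.
 *
 * Copyright (c) 2017 Cavium Networks.
 */

/* ... */

MODULE_LICENSE("GPL v2");	/* "GPL v2" means v2 only, per module.h */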

Thank you!

[1] https://lkml.org/lkml/2017/12/4/934

CC: Aleksey Makarov <aleksey.makarov@cavium.com>
Yury Norov Dec. 19, 2017, 10:28 a.m. UTC | #2
On Tue, Dec 19, 2017 at 10:26:02AM +0100, Philippe Ombredanne wrote:
> Dear Yury,
> 
> On Tue, Dec 19, 2017 at 9:50 AM, Yury Norov <ynorov@caviumnetworks.com> wrote:
> > This benchmark sends many IPIs in different modes and measures
> > time for IPI delivery (first column), and total time, ie including
> > time to acknowledge the receive by sender (second column).
> 
> <snip>
> 
> > --- /dev/null
> > +++ b/kernel/ipi_benchmark.c
> > @@ -0,0 +1,153 @@
> > +/*
> > + * Performance test for IPI on SMP machines.
> > + *
> > + * Copyright (c) 2017 Cavium Networks.
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of version 2 of the GNU General Public
> > + * License as published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope that it will be useful, but
> > + * WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> > + * General Public License for more details.
> > + */
> 
> Would you mind using the new SPDX tags documented in Thomas patch set
> [1] rather than this fine but longer legalese? 

Of course. I'll collect more comments, if any, and send v3 soon.

> Each time long
> legalese is added as a comment to a kernel file, there is a whole star
> system that dies somewhere in the universe, which is not a good thing.

You can save all those stars, and hours of your time, if you add a
corresponding rule to checkpatch. ;)

> SPDX tags eschew this problem by using a simple one line comment and
> this has been proven to be mostly harmless. And if you could spread
> the word to others in your team this would be very nice. I recently
> nudged Aleksey who nicely updated his patches a short while ago.
> 
> > +MODULE_LICENSE("GPL");
> 
> There is a problem here: your MODULE_LICENSE tag means GPL-2.0 or
> later versions as documented in module.h. This is not consistent with
> your top level license notice. You should make this consistent IMHO
> .... and use SPDX tags for the top level notice of course!
> 
> Thank you!
> 
> [1] https://lkml.org/lkml/2017/12/4/934
> 
> CC: Aleksey Makarov <aleksey.makarov@cavium.com>
> -- 
> Cordially
> Philippe Ombredanne
Andrew Morton Dec. 19, 2017, 11:51 p.m. UTC | #3
On Tue, 19 Dec 2017 11:50:10 +0300 Yury Norov <ynorov@caviumnetworks.com> wrote:

> This benchmark sends many IPIs in different modes and measures
> time for IPI delivery (first column), and total time, ie including
> time to acknowledge the receive by sender (second column).
> 
> The scenarios are:
> Dry-run:	do everything except actually sending IPI. Useful
> 		to estimate system overhead.
> Self-IPI:	Send IPI to self CPU.
> Normal IPI:	Send IPI to some other CPU.
> Broadcast IPI:	Send broadcast IPI to all online CPUs.
> Broadcast lock:	Send broadcast IPI to all online CPUs and force them
>                 acquire/release spinlock.
> 
> The raw output looks like this:
> [  155.363374] Dry-run:                         0,            2999696 ns
> [  155.429162] Self-IPI:                 30385328,           65589392 ns
> [  156.060821] Normal IPI:              566914128,          631453008 ns
> [  158.384427] Broadcast IPI:                   0,         2323368720 ns
> [  160.831850] Broadcast lock:                  0,         2447000544 ns
> 
> For virtualized guests, sending and reveiving IPIs causes guest exit.
> I used this test to measure performance impact on KVM subsystem of
> Christoffer Dall's series "Optimize KVM/ARM for VHE systems" [1].
> 
> Test machine is ThunderX2, 112 online CPUs. Below the results normalized
> to host dry-run time, broadcast lock results omitted. Smaller - better.
> 
> Host, v4.14:
> Dry-run:	  0	    1
> Self-IPI:         9	   18
> Normal IPI:      81	  110
> Broadcast IPI:    0	 2106
> 
> Guest, v4.14:
> Dry-run:          0	    1
> Self-IPI:        10	   18
> Normal IPI:     305	  525
> Broadcast IPI:    0    	 9729
> 
> Guest, v4.14 + [1]:
> Dry-run:          0	    1
> Self-IPI:         9	   18
> Normal IPI:     176	  343
> Broadcast IPI:    0	 9885
> 

That looks handy.  Peter and Ingo might be interested.

I wonder if it should be in kernel/.  Perhaps it's better to accumulate
these things in lib/test_*.c, rather than cluttering up other top-level
directories.

> +static ktime_t __init send_ipi(int flags)
> +{
> +	ktime_t time = 0;
> +	DEFINE_SPINLOCK(lock);

I have some vague historical memory that an on-stack spinlock can cause
problems, perhaps with debugging code.  Can't remember, maybe I dreamed it.
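
If that turns out to be a real issue, one possible tweak (just a sketch, not
tested) is to give the lock static storage, so the remote CPUs never chase a
pointer into the sender's stack frame:

/* file-scope lock instead of DEFINE_SPINLOCK() inside send_ipi() */
static DEFINE_SPINLOCK(bench_lock);

	/* ... and in send_ipi(): */
	case POKE_ALL_LOCK:
		smp_call_function_many(cpu_online_mask,
				handle_ipi_spinlock, &bench_lock, 1);
		break;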
Wanpeng Li Dec. 20, 2017, 6:44 a.m. UTC | #4
Hi Yury,
2017-12-19 16:50 GMT+08:00 Yury Norov <ynorov@caviumnetworks.com>:
> This benchmark sends many IPIs in different modes and measures
> time for IPI delivery (first column), and total time, ie including
> time to acknowledge the receive by sender (second column).
>
> The scenarios are:
> Dry-run:        do everything except actually sending IPI. Useful
>                 to estimate system overhead.
> Self-IPI:       Send IPI to self CPU.
> Normal IPI:     Send IPI to some other CPU.
> Broadcast IPI:  Send broadcast IPI to all online CPUs.
> Broadcast lock: Send broadcast IPI to all online CPUs and force them
>                 acquire/release spinlock.
>
> The raw output looks like this:
> [  155.363374] Dry-run:                         0,            2999696 ns
> [  155.429162] Self-IPI:                 30385328,           65589392 ns
> [  156.060821] Normal IPI:              566914128,          631453008 ns
> [  158.384427] Broadcast IPI:                   0,         2323368720 ns
> [  160.831850] Broadcast lock:                  0,         2447000544 ns
>
> For virtualized guests, sending and reveiving IPIs causes guest exit.
> I used this test to measure performance impact on KVM subsystem of
> Christoffer Dall's series "Optimize KVM/ARM for VHE systems" [1].
>
> Test machine is ThunderX2, 112 online CPUs. Below the results normalized
> to host dry-run time, broadcast lock results omitted. Smaller - better.

Could you test on an x86 box? I see a lot of call traces on my Haswell
client host. There is no call trace in the guest; however, I can still
observe an "Invalid parameters" warning when I insmod this module. In
addition, the x86 box fails to boot when ipi_benchmark is built in.

Regards,
Wanpeng Li
Yury Norov Dec. 21, 2017, 7:02 p.m. UTC | #5
On Wed, Dec 20, 2017 at 02:44:25PM +0800, Wanpeng Li wrote:
> Hi Yury,
> 2017-12-19 16:50 GMT+08:00 Yury Norov <ynorov@caviumnetworks.com>:
> > This benchmark sends many IPIs in different modes and measures
> > time for IPI delivery (first column), and total time, ie including
> > time to acknowledge the receive by sender (second column).
> >
> > The scenarios are:
> > Dry-run:        do everything except actually sending IPI. Useful
> >                 to estimate system overhead.
> > Self-IPI:       Send IPI to self CPU.
> > Normal IPI:     Send IPI to some other CPU.
> > Broadcast IPI:  Send broadcast IPI to all online CPUs.
> > Broadcast lock: Send broadcast IPI to all online CPUs and force them
> >                 acquire/release spinlock.
> >
> > The raw output looks like this:
> > [  155.363374] Dry-run:                         0,            2999696 ns
> > [  155.429162] Self-IPI:                 30385328,           65589392 ns
> > [  156.060821] Normal IPI:              566914128,          631453008 ns
> > [  158.384427] Broadcast IPI:                   0,         2323368720 ns
> > [  160.831850] Broadcast lock:                  0,         2447000544 ns
> >
> > For virtualized guests, sending and reveiving IPIs causes guest exit.
> > I used this test to measure performance impact on KVM subsystem of
> > Christoffer Dall's series "Optimize KVM/ARM for VHE systems" [1].
> >
> > Test machine is ThunderX2, 112 online CPUs. Below the results normalized
> > to host dry-run time, broadcast lock results omitted. Smaller - better.
> 
> Could you test on a x86 box? I see a lot of calltraces on my haswell
> client host, there is no calltrace in the guest, however, I can still
> observe "Invalid parameters" warning when insmod this module. In
> addition, the x86 box fails to boot when ipi_benchmark is buildin.

I tried to boot the kernel with the test built in, both on real hardware
and under qemu+kvm - no call traces or other problems.

The CPU is an Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz.

The kernel is 4.14; the host config is attached, but it's the default
Ubuntu config. Results and the qemu command are below. Could you share
more details about your configuration?

Yury

qemu-system-x86_64 -hda debian_squeeze_amd64_standard.qcow2	\
	-smp 1 -curses --nographic --enable-kvm

Host, 4 cores:
[    0.237279] Dry-run:                         0,             170292 ns
[    0.643269] Self-IPI:                458516336,          922256372 ns
[    0.902545] Self-IPI:                508518362,          972130665 ns
[    0.646500] Broadcast IPI:                   0,           97301545 ns
[    0.649712] Broadcast lock:                  0,          102364755 ns

KVM, single core:
[    0.237279] Dry-run:                         0,             124500 ns
[    0.643269] Self-IPI:                202518310,          405444790 ns
[    0.643694] Normal IPI FAILED: -2
[    0.646500] Broadcast IPI:                   0,            2524370 ns
[    0.649712] Broadcast lock:                  0,            2642270 ns

KVM, 4 cores:
[    0.492676] Dry-run:                         0,             126380 ns
[    0.902545] Self-IPI:                204085450,          409863800 ns
[    2.179676] Normal IPI:             1058014940,         1276742820 ns
[    3.396132] Broadcast IPI:                   0,         1215934730 ns
[    4.610719] Broadcast lock:                  0,         1213945500 ns
Yury Norov Dec. 22, 2017, 6:09 a.m. UTC | #6
On Wed, Dec 20, 2017 at 02:44:25PM +0800, Wanpeng Li wrote:
> Hi Yury,
> 2017-12-19 16:50 GMT+08:00 Yury Norov <ynorov@caviumnetworks.com>:
> > This benchmark sends many IPIs in different modes and measures
> > time for IPI delivery (first column), and total time, ie including
> > time to acknowledge the receive by sender (second column).
> >
> > The scenarios are:
> > Dry-run:        do everything except actually sending IPI. Useful
> >                 to estimate system overhead.
> > Self-IPI:       Send IPI to self CPU.
> > Normal IPI:     Send IPI to some other CPU.
> > Broadcast IPI:  Send broadcast IPI to all online CPUs.
> > Broadcast lock: Send broadcast IPI to all online CPUs and force them
> >                 acquire/release spinlock.
> >
> > The raw output looks like this:
> > [  155.363374] Dry-run:                         0,            2999696 ns
> > [  155.429162] Self-IPI:                 30385328,           65589392 ns
> > [  156.060821] Normal IPI:              566914128,          631453008 ns
> > [  158.384427] Broadcast IPI:                   0,         2323368720 ns
> > [  160.831850] Broadcast lock:                  0,         2447000544 ns
> >
> > For virtualized guests, sending and reveiving IPIs causes guest exit.
> > I used this test to measure performance impact on KVM subsystem of
> > Christoffer Dall's series "Optimize KVM/ARM for VHE systems" [1].
> >
> > Test machine is ThunderX2, 112 online CPUs. Below the results normalized
> > to host dry-run time, broadcast lock results omitted. Smaller - better.
> 
> Could you test on a x86 box? I see a lot of calltraces on my haswell
> client host, there is no calltrace in the guest, however, I can still
> observe "Invalid parameters" warning when insmod this module. In
> addition, the x86 box fails to boot when ipi_benchmark is buildin.

EINVAL is returned intentionally, so the test can simply be run again with
another insmod, without an annoying rmmod in between. The "Invalid
parameters" warning you observe on insmod is that intentional error, not a
failure of the test itself.

Patch

diff --git a/arch/Kconfig b/arch/Kconfig
index 400b9e1b2f27..1b216eb15642 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -82,6 +82,16 @@  config JUMP_LABEL
 	 ( On 32-bit x86, the necessary options added to the compiler
 	   flags may increase the size of the kernel slightly. )
 
+config IPI_BENCHMARK
+	tristate "Test IPI performance on SMP systems"
+	depends on SMP
+	help
+	  Test IPI performance on SMP systems. If the system has only one
+	  online CPU, sending an IPI to another CPU is obviously not possible,
+	  and ENOENT is returned for the corresponding test.
+
+	  If unsure, say N.
+
 config STATIC_KEYS_SELFTEST
 	bool "Static key selftest"
 	depends on JUMP_LABEL
diff --git a/kernel/Makefile b/kernel/Makefile
index 172d151d429c..04e550e1990c 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -101,6 +101,7 @@  obj-$(CONFIG_TRACEPOINTS) += trace/
 obj-$(CONFIG_IRQ_WORK) += irq_work.o
 obj-$(CONFIG_CPU_PM) += cpu_pm.o
 obj-$(CONFIG_BPF) += bpf/
+obj-$(CONFIG_IPI_BENCHMARK) += ipi_benchmark.o
 
 obj-$(CONFIG_PERF_EVENTS) += events/
 
diff --git a/kernel/ipi_benchmark.c b/kernel/ipi_benchmark.c
new file mode 100644
index 000000000000..1dfa15e5ef70
--- /dev/null
+++ b/kernel/ipi_benchmark.c
@@ -0,0 +1,153 @@ 
+/*
+ * Performance test for IPI on SMP machines.
+ *
+ * Copyright (c) 2017 Cavium Networks.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/ktime.h>
+
+#define NTIMES 100000
+
+#define POKE_ANY	0
+#define DRY_RUN		1
+#define POKE_SELF	2
+#define POKE_ALL	3
+#define POKE_ALL_LOCK	4
+
+static void __init handle_ipi_spinlock(void *t)
+{
+	spinlock_t *lock = (spinlock_t *) t;
+
+	spin_lock(lock);
+	spin_unlock(lock);
+}
+
+static void __init handle_ipi(void *t)
+{
+	ktime_t *time = (ktime_t *) t;
+
+	if (time)
+		*time = ktime_get() - *time;
+}
+
+static ktime_t __init send_ipi(int flags)
+{
+	ktime_t time = 0;
+	DEFINE_SPINLOCK(lock);
+	unsigned int cpu = get_cpu();
+
+	switch (flags) {
+	case DRY_RUN:
+		/* Do everything except actually sending IPI. */
+		break;
+	case POKE_ALL:
+		/* If broadcasting, don't force all CPUs to update time. */
+		smp_call_function_many(cpu_online_mask, handle_ipi, NULL, 1);
+		break;
+	case POKE_ALL_LOCK:
+		smp_call_function_many(cpu_online_mask,
+				handle_ipi_spinlock, &lock, 1);
+		break;
+	case POKE_ANY:
+		cpu = cpumask_any_but(cpu_online_mask, cpu);
+		if (cpu >= nr_cpu_ids) {
+			time = -ENOENT;
+			break;
+		}
+		/* Fall thru */
+	case POKE_SELF:
+		time = ktime_get();
+		smp_call_function_single(cpu, handle_ipi, &time, 1);
+		break;
+	default:
+		time = -EINVAL;
+	}
+
+	put_cpu();
+	return time;
+}
+
+static int __init __bench_ipi(unsigned long i, ktime_t *time, int flags)
+{
+	ktime_t t;
+
+	*time = 0;
+	while (i--) {
+		t = send_ipi(flags);
+		if ((int) t < 0)
+			return (int) t;
+
+		*time += t;
+	}
+
+	return 0;
+}
+
+static int __init bench_ipi(unsigned long times, int flags,
+				ktime_t *ipi, ktime_t *total)
+{
+	int ret;
+
+	*total = ktime_get();
+	ret = __bench_ipi(times, ipi, flags);
+	if (unlikely(ret))
+		return ret;
+
+	*total = ktime_get() - *total;
+
+	return 0;
+}
+
+static int __init init_bench_ipi(void)
+{
+	ktime_t ipi, total;
+	int ret;
+
+	ret = bench_ipi(NTIMES, DRY_RUN, &ipi, &total);
+	if (ret)
+		pr_err("Dry-run FAILED: %d\n", ret);
+	else
+		pr_err("Dry-run:        %18llu, %18llu ns\n", ipi, total);
+
+	ret = bench_ipi(NTIMES, POKE_SELF, &ipi, &total);
+	if (ret)
+		pr_err("Self-IPI FAILED: %d\n", ret);
+	else
+		pr_err("Self-IPI:       %18llu, %18llu ns\n", ipi, total);
+
+	ret = bench_ipi(NTIMES, POKE_ANY, &ipi, &total);
+	if (ret)
+		pr_err("Normal IPI FAILED: %d\n", ret);
+	else
+		pr_err("Normal IPI:     %18llu, %18llu ns\n", ipi, total);
+
+	ret = bench_ipi(NTIMES, POKE_ALL, &ipi, &total);
+	if (ret)
+		pr_err("Broadcast IPI FAILED: %d\n", ret);
+	else
+		pr_err("Broadcast IPI:  %18llu, %18llu ns\n", ipi, total);
+
+	ret = bench_ipi(NTIMES, POKE_ALL_LOCK, &ipi, &total);
+	if (ret)
+		pr_err("Broadcast lock FAILED: %d\n", ret);
+	else
+		pr_err("Broadcast lock: %18llu, %18llu ns\n", ipi, total);
+
+	/* Return error to avoid annoying rmmod. */
+	return -EINVAL;
+}
+module_init(init_bench_ipi);
+
+MODULE_LICENSE("GPL");