crypto: xor - fix template benchmarking

Message ID	ZnMWDdKJHfYQLDzS@p100 (mailing list archive)
State	Superseded, archived
Headers	show Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E663F155C8E; Wed, 19 Jun 2024 17:32:01 +0000 (UTC) Date: Wed, 19 Jun 2024 19:31:57 +0200 From: Helge Deller <deller@kernel.org> To: Herbert Xu <herbert@gondor.apana.org.au>, linux-crypto@vger.kernel.org, Ard Biesheuvel <ardb@kernel.org>, linux-parisc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Subject: [PATCH] crypto: xor - fix template benchmarking Message-ID: <ZnMWDdKJHfYQLDzS@p100> Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline
Series	crypto: xor - fix template benchmarking \| expand crypto: xor - fix template benchmarking

Message ID

ZnMWDdKJHfYQLDzS@p100 (mailing list archive)

State

Superseded, archived

Headers

Date: Wed, 19 Jun 2024 19:31:57 +0200
From: Helge Deller <deller@kernel.org>
To: Herbert Xu <herbert@gondor.apana.org.au>, linux-crypto@vger.kernel.org,
	Ard Biesheuvel <ardb@kernel.org>, linux-parisc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Subject: [PATCH] crypto: xor - fix template benchmarking
Message-ID: <ZnMWDdKJHfYQLDzS@p100>
Precedence: bulk
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Series

crypto: xor - fix template benchmarking | expand

Commit Message

Helge Deller June 19, 2024, 5:31 p.m. UTC

Commit c055e3eae0f1 ("crypto: xor - use ktime for template benchmarking")
switched from using jiffies to ktime-based performance benchmarking.

This works nicely on machines which have a fine-grained ktime()
clocksource as e.g. x86 machoines with TSC.
But other machines, e.g. my 4-way HP PARISC server, don't have such
fine-grained clocksources, which is why it seems that 800 xor loops
take zero seconds, which then calculates in the logs as:

 xor: measuring software checksum speed
    8regs           : -1018167296 MB/sec
    8regs_prefetch  : -1018167296 MB/sec
    32regs          : -1018167296 MB/sec
    32regs_prefetch : -1018167296 MB/sec

Fix this with some small modifications to the existing code to improve
the algorithm to always produce correct results without introducing
major delays for architectures with a fine-grained ktime()
clocksource:
a) Delay start of the timing until ktime() just advanced. On machines
with a fast ktime() this should be just one additional ktime() call.
b) Count the number of loops. Run at minimum 800 loops and finish
earliest when the ktime() counter has progressed.

With that the throughput can now be calculated more accurately under all
conditions.

Fixes: c055e3eae0f1 ("crypto: xor - use ktime for template benchmarking")
Signed-off-by: Helge Deller <deller@gmx.de>

Comments

John David Anglin June 20, 2024, 12:34 p.m. UTC | #1

On 2024-06-19 1:31 p.m., Helge Deller wrote:
> Commit c055e3eae0f1 ("crypto: xor - use ktime for template benchmarking")
> switched from using jiffies to ktime-based performance benchmarking.
>
> This works nicely on machines which have a fine-grained ktime()
> clocksource as e.g. x86 machoines with TSC.
> But other machines, e.g. my 4-way HP PARISC server, don't have such
> fine-grained clocksources, which is why it seems that 800 xor loops
> take zero seconds, which then calculates in the logs as:
>
>   xor: measuring software checksum speed
>      8regs           : -1018167296 MB/sec
>      8regs_prefetch  : -1018167296 MB/sec
>      32regs          : -1018167296 MB/sec
>      32regs_prefetch : -1018167296 MB/sec
>
> Fix this with some small modifications to the existing code to improve
> the algorithm to always produce correct results without introducing
> major delays for architectures with a fine-grained ktime()
> clocksource:
> a) Delay start of the timing until ktime() just advanced. On machines
> with a fast ktime() this should be just one additional ktime() call.
> b) Count the number of loops. Run at minimum 800 loops and finish
> earliest when the ktime() counter has progressed.
>
> With that the throughput can now be calculated more accurately under all
> conditions.
>
> Fixes: c055e3eae0f1 ("crypto: xor - use ktime for template benchmarking")
> Signed-off-by: Helge Deller <deller@gmx.de>
You can add my "Tested-by".

I wonder if prefetch versions are implemented correctly on parisc:

[   29.353868] xor: measuring software checksum speed
[   29.360030]    8regs           :  2266 MB/sec
[   29.368031]    8regs_prefetch  :  2076 MB/sec
[   29.376031]    32regs          :  2259 MB/sec
[   29.384031]    32regs_prefetch :  2075 MB/sec
[   29.384080] xor: using function: 8regs (2266 MB/sec)

>
> diff --git a/crypto/xor.c b/crypto/xor.c
> index 8e72e5d5db0d..29b4c0fd89d7 100644
> --- a/crypto/xor.c
> +++ b/crypto/xor.c
> @@ -83,33 +83,29 @@ static void __init
>   do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2)
>   {
>   	int speed;
> -	int i, j;
> -	ktime_t min, start, diff;
> +	unsigned long reps;
> +	ktime_t min, start, t0;
>   
>   	tmpl->next = template_list;
>   	template_list = tmpl;
>   
>   	preempt_disable();
>   
> -	min = (ktime_t)S64_MAX;
> -	for (i = 0; i < 3; i++) {
> -		start = ktime_get();
> -		for (j = 0; j < REPS; j++) {
> -			mb(); /* prevent loop optimization */
> -			tmpl->do_2(BENCH_SIZE, b1, b2);
> -			mb();
> -		}
> -		diff = ktime_sub(ktime_get(), start);
> -		if (diff < min)
> -			min = diff;
> -	}
> +	t0 = ktime_get();
> +	/* delay start until time has advanced */
> +	do { start = ktime_get(); } while (start == t0);
> +	reps = 0;
> +	do {
> +		mb(); /* prevent loop optimization */
> +		tmpl->do_2(BENCH_SIZE, b1, b2);
> +		mb();
> +	} while (reps++ < REPS || (t0 = ktime_get()) == start);
> +	min = ktime_sub(t0, start);
>   
>   	preempt_enable();
>   
>   	// bytes/ns == GB/s, multiply by 1000 to get MB/s [not MiB/s]
> -	if (!min)
> -		min = 1;
> -	speed = (1000 * REPS * BENCH_SIZE) / (unsigned int)ktime_to_ns(min);
> +	speed = (1000 * reps * BENCH_SIZE) / (unsigned int)ktime_to_ns(min);
>   	tmpl->speed = speed;
>   
>   	pr_info("   %-16s: %5d MB/sec\n", tmpl->name, speed);
>

Herbert Xu June 28, 2024, 1:03 a.m. UTC | #2

On Wed, Jun 19, 2024 at 07:31:57PM +0200, Helge Deller wrote:
> Commit c055e3eae0f1 ("crypto: xor - use ktime for template benchmarking")
> switched from using jiffies to ktime-based performance benchmarking.
> 
> This works nicely on machines which have a fine-grained ktime()
> clocksource as e.g. x86 machoines with TSC.
> But other machines, e.g. my 4-way HP PARISC server, don't have such
> fine-grained clocksources, which is why it seems that 800 xor loops
> take zero seconds, which then calculates in the logs as:
> 
>  xor: measuring software checksum speed
>     8regs           : -1018167296 MB/sec
>     8regs_prefetch  : -1018167296 MB/sec
>     32regs          : -1018167296 MB/sec
>     32regs_prefetch : -1018167296 MB/sec

Ard, could you please take a look at this patch when you get a
chance? Thanks!

Herbert Xu July 6, 2024, 12:04 a.m. UTC | #3

On Wed, Jun 19, 2024 at 07:31:57PM +0200, Helge Deller wrote:
>
> +	t0 = ktime_get();
> +	/* delay start until time has advanced */
> +	do { start = ktime_get(); } while (start == t0);

Please rewrite this loop in the usual kernel coding style.

What about adding a cpu_relax() in there if ktime_get doesn't
advance? Something like

	while ((start = ktime_get()) == t0)
		cpu_relax();

Cheers,

Helge Deller July 8, 2024, 10:28 a.m. UTC | #4

On 7/6/24 02:04, Herbert Xu wrote:
> On Wed, Jun 19, 2024 at 07:31:57PM +0200, Helge Deller wrote:
>>
>> +	t0 = ktime_get();
>> +	/* delay start until time has advanced */
>> +	do { start = ktime_get(); } while (start == t0);
>
> Please rewrite this loop in the usual kernel coding style.

Ok.

> What about adding a cpu_relax() in there if ktime_get doesn't
> advance? Something like
>
> 	while ((start = ktime_get()) == t0)
> 		cpu_relax();

Yes, looks better.
Will send updated patch.

Helge

diff --git a/crypto/xor.c b/crypto/xor.c
index 8e72e5d5db0d..29b4c0fd89d7 100644
--- a/crypto/xor.c
+++ b/crypto/xor.c
@@ -83,33 +83,29 @@  static void __init
 do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2)
 {
 	int speed;
-	int i, j;
-	ktime_t min, start, diff;
+	unsigned long reps;
+	ktime_t min, start, t0;
 
 	tmpl->next = template_list;
 	template_list = tmpl;
 
 	preempt_disable();
 
-	min = (ktime_t)S64_MAX;
-	for (i = 0; i < 3; i++) {
-		start = ktime_get();
-		for (j = 0; j < REPS; j++) {
-			mb(); /* prevent loop optimization */
-			tmpl->do_2(BENCH_SIZE, b1, b2);
-			mb();
-		}
-		diff = ktime_sub(ktime_get(), start);
-		if (diff < min)
-			min = diff;
-	}
+	t0 = ktime_get();
+	/* delay start until time has advanced */
+	do { start = ktime_get(); } while (start == t0);
+	reps = 0;
+	do {
+		mb(); /* prevent loop optimization */
+		tmpl->do_2(BENCH_SIZE, b1, b2);
+		mb();
+	} while (reps++ < REPS || (t0 = ktime_get()) == start);
+	min = ktime_sub(t0, start);
 
 	preempt_enable();
 
 	// bytes/ns == GB/s, multiply by 1000 to get MB/s [not MiB/s]
-	if (!min)
-		min = 1;
-	speed = (1000 * REPS * BENCH_SIZE) / (unsigned int)ktime_to_ns(min);
+	speed = (1000 * reps * BENCH_SIZE) / (unsigned int)ktime_to_ns(min);
 	tmpl->speed = speed;
 
 	pr_info("   %-16s: %5d MB/sec\n", tmpl->name, speed);

crypto: xor - fix template benchmarking

Commit Message

Comments

Patch