From patchwork Tue Feb 6 20:48:04 2024
X-Patchwork-Submitter: Alexander Monakov
X-Patchwork-Id: 13547783
From: Alexander Monakov
To: qemu-devel@nongnu.org
Cc: Mikhail Romanov, Richard Henderson, Paolo Bonzini, Alexander Monakov
Subject: [PATCH v3 1/6] util/bufferiszero: remove SSE4.1 variant
Date: Tue, 6 Feb 2024 23:48:04 +0300
Message-Id: <20240206204809.9859-2-amonakov@ispras.ru>
In-Reply-To: <20240206204809.9859-1-amonakov@ispras.ru>
References: <20240206204809.9859-1-amonakov@ispras.ru>

The SSE4.1 variant is virtually identical to the SSE2 variant, except
for using 'PTEST+JNZ' in place of 'PCMPEQB+PMOVMSKB+CMP+JNE' for
testing whether an SSE register is all zeroes. The PTEST instruction
decodes to two uops, so it can be handled only by the complex decoder,
and since CMP+JNE are macro-fused, both sequences decode to three uops.
The uops comprising the PTEST instruction dispatch to p0 and p5 on
Intel CPUs, so PCMPEQB+PMOVMSKB is comparatively more flexible from a
dispatch standpoint. Hence, the use of PTEST brings no benefit from a
throughput standpoint. Its latency is not important, since it feeds
only a conditional jump, which terminates the dependency chain. I never
observed PTEST variants to be faster on real hardware.
Signed-off-by: Alexander Monakov
Signed-off-by: Mikhail Romanov
Reviewed-by: Richard Henderson
---
 util/bufferiszero.c | 29 -----------------------------
 1 file changed, 29 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index 3e6a5dfd63..f5a3634f9a 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -100,34 +100,6 @@ buffer_zero_sse2(const void *buf, size_t len)
 }
 
 #ifdef CONFIG_AVX2_OPT
-static bool __attribute__((target("sse4")))
-buffer_zero_sse4(const void *buf, size_t len)
-{
-    __m128i t = _mm_loadu_si128(buf);
-    __m128i *p = (__m128i *)(((uintptr_t)buf + 5 * 16) & -16);
-    __m128i *e = (__m128i *)(((uintptr_t)buf + len) & -16);
-
-    /* Loop over 16-byte aligned blocks of 64. */
-    while (likely(p <= e)) {
-        __builtin_prefetch(p);
-        if (unlikely(!_mm_testz_si128(t, t))) {
-            return false;
-        }
-        t = p[-4] | p[-3] | p[-2] | p[-1];
-        p += 4;
-    }
-
-    /* Finish the aligned tail. */
-    t |= e[-3];
-    t |= e[-2];
-    t |= e[-1];
-
-    /* Finish the unaligned tail. */
-    t |= _mm_loadu_si128(buf + len - 16);
-
-    return _mm_testz_si128(t, t);
-}
-
 static bool __attribute__((target("avx2")))
 buffer_zero_avx2(const void *buf, size_t len)
 {
@@ -221,7 +193,6 @@ select_accel_cpuinfo(unsigned info)
 #endif
 #ifdef CONFIG_AVX2_OPT
     { CPUINFO_AVX2, 128, buffer_zero_avx2 },
-    { CPUINFO_SSE4, 64, buffer_zero_sse4 },
 #endif
     { CPUINFO_SSE2, 64, buffer_zero_sse2 },
     { CPUINFO_ALWAYS, 0, buffer_zero_int },

From patchwork Tue Feb 6 20:48:05 2024
X-Patchwork-Submitter: Alexander Monakov
X-Patchwork-Id: 13547785
From: Alexander Monakov
To: qemu-devel@nongnu.org
Cc: Mikhail Romanov, Richard Henderson, Paolo Bonzini, Alexander Monakov
Subject: [PATCH v3 2/6] util/bufferiszero: introduce an inline wrapper
Date: Tue, 6 Feb 2024 23:48:05 +0300
Message-Id: <20240206204809.9859-3-amonakov@ispras.ru>
In-Reply-To: <20240206204809.9859-1-amonakov@ispras.ru>
References: <20240206204809.9859-1-amonakov@ispras.ru>

Make buffer_is_zero a 'static inline' function that tests up to three
bytes from the buffer before handing off to an unrolled loop. This
eliminates call overhead for most non-zero buffers, and allows the
compiler to optimize out length checks when the length is known at
compile time (which is often the case in QEMU).

Signed-off-by: Alexander Monakov
Signed-off-by: Mikhail Romanov
---
 include/qemu/cutils.h | 28 +++++++++++++++-
 util/bufferiszero.c   | 76 ++++++++++++-------------------------------
 2 files changed, 47 insertions(+), 57 deletions(-)

diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h
index 92c927a6a3..62b153e603 100644
--- a/include/qemu/cutils.h
+++ b/include/qemu/cutils.h
@@ -187,9 +187,35 @@ char *freq_to_str(uint64_t freq_hz);
 /* used to print char* safely */
 #define STR_OR_NULL(str) ((str) ? (str) : "null")
 
-bool buffer_is_zero(const void *buf, size_t len);
+bool buffer_is_zero_len_4_plus(const void *, size_t);
+extern bool (*buffer_is_zero_len_256_plus)(const void *, size_t);
 bool test_buffer_is_zero_next_accel(void);
 
+/*
+ * Check if a buffer is all zeroes.
+ */
+static inline bool buffer_is_zero(const void *vbuf, size_t len)
+{
+    const char *buf = vbuf;
+
+    if (len == 0) {
+        return true;
+    }
+    if (buf[0] || buf[len - 1] || buf[len / 2]) {
+        return false;
+    }
+    /* All bytes are covered for any len <= 3. */
+    if (len <= 3) {
+        return true;
+    }
+
+    if (len >= 256) {
+        return buffer_is_zero_len_256_plus(vbuf, len);
+    } else {
+        return buffer_is_zero_len_4_plus(vbuf, len);
+    }
+}
+
 /*
  * Implementation of ULEB128 (http://en.wikipedia.org/wiki/LEB128)
  * Input is limited to 14-bit numbers
diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index f5a3634f9a..01050694a6 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -26,8 +26,8 @@
 #include "qemu/bswap.h"
 #include "host/cpuinfo.h"
 
-static bool
-buffer_zero_int(const void *buf, size_t len)
+bool
+buffer_is_zero_len_4_plus(const void *buf, size_t len)
 {
     if (unlikely(len < 8)) {
         /* For a very small buffer, simply accumulate all the bytes. */
@@ -157,57 +157,40 @@ buffer_zero_avx512(const void *buf, size_t len)
 }
 #endif /* CONFIG_AVX512F_OPT */
 
-/*
- * Make sure that these variables are appropriately initialized when
- * SSE2 is enabled on the compiler command-line, but the compiler is
- * too old to support CONFIG_AVX2_OPT.
- */
-#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT)
-# define INIT_USED     0
-# define INIT_LENGTH   0
-# define INIT_ACCEL    buffer_zero_int
-#else
-# ifndef __SSE2__
-#  error "ISA selection confusion"
-# endif
-# define INIT_USED     CPUINFO_SSE2
-# define INIT_LENGTH   64
-# define INIT_ACCEL    buffer_zero_sse2
-#endif
-
-static unsigned used_accel = INIT_USED;
-static unsigned length_to_accel = INIT_LENGTH;
-static bool (*buffer_accel)(const void *, size_t) = INIT_ACCEL;
-
 static unsigned __attribute__((noinline))
 select_accel_cpuinfo(unsigned info)
 {
     /* Array is sorted in order of algorithm preference. */
     static const struct {
         unsigned bit;
-        unsigned len;
         bool (*fn)(const void *, size_t);
     } all[] = {
 #ifdef CONFIG_AVX512F_OPT
-        { CPUINFO_AVX512F, 256, buffer_zero_avx512 },
+        { CPUINFO_AVX512F, buffer_zero_avx512 },
 #endif
 #ifdef CONFIG_AVX2_OPT
-        { CPUINFO_AVX2, 128, buffer_zero_avx2 },
+        { CPUINFO_AVX2, buffer_zero_avx2 },
 #endif
-        { CPUINFO_SSE2, 64, buffer_zero_sse2 },
-        { CPUINFO_ALWAYS, 0, buffer_zero_int },
+        { CPUINFO_SSE2, buffer_zero_sse2 },
+        { CPUINFO_ALWAYS, buffer_is_zero_len_4_plus },
     };
 
     for (unsigned i = 0; i < ARRAY_SIZE(all); ++i) {
         if (info & all[i].bit) {
-            length_to_accel = all[i].len;
-            buffer_accel = all[i].fn;
+            buffer_is_zero_len_256_plus = all[i].fn;
             return all[i].bit;
         }
     }
     return 0;
 }
 
+static unsigned used_accel
+#if defined(__SSE2__)
+    = CPUINFO_SSE2;
+#else
+    = 0;
+#endif
+
 #if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT)
 static void __attribute__((constructor)) init_accel(void)
 {
@@ -227,35 +210,16 @@ bool test_buffer_is_zero_next_accel(void)
     return used;
 }
 
-static bool select_accel_fn(const void *buf, size_t len)
-{
-    if (likely(len >= length_to_accel)) {
-        return buffer_accel(buf, len);
-    }
-    return buffer_zero_int(buf, len);
-}
-
 #else
-#define select_accel_fn  buffer_zero_int
 bool test_buffer_is_zero_next_accel(void)
 {
     return false;
 }
 #endif
 
-/*
- * Checks if a buffer is all zeroes
- */
-bool buffer_is_zero(const void *buf, size_t len)
-{
-    if (unlikely(len == 0)) {
-        return true;
-    }
-
-    /* Fetch the beginning of the buffer while we select the accelerator. */
-    __builtin_prefetch(buf);
-
-    /* Use an optimized zero check if possible. Note that this also
-       includes a check for an unrolled loop over 64-bit integers. */
-    return select_accel_fn(buf, len);
-}
+bool (*buffer_is_zero_len_256_plus)(const void *, size_t)
+#if defined(__SSE2__)
+    = buffer_zero_sse2;
+#else
+    = buffer_is_zero_len_4_plus;
+#endif

From patchwork Tue Feb 6 20:48:06 2024
X-Patchwork-Submitter: Alexander Monakov
X-Patchwork-Id: 13547787
From: Alexander Monakov
To: qemu-devel@nongnu.org
Cc: Mikhail Romanov, Richard Henderson, Paolo Bonzini, Alexander Monakov
Subject: [PATCH v3 3/6] util/bufferiszero: remove AVX512 variant
Date: Tue, 6 Feb 2024 23:48:06 +0300
Message-Id: <20240206204809.9859-4-amonakov@ispras.ru>
In-Reply-To: <20240206204809.9859-1-amonakov@ispras.ru>
References: <20240206204809.9859-1-amonakov@ispras.ru>

Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD
routines are invoked much more rarely in normal use when most buffers
are non-zero.
This makes use of AVX512 unprofitable, as it incurs extra frequency and
voltage transition periods during which the CPU operates at reduced
performance, as described in
https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html

Signed-off-by: Mikhail Romanov
Signed-off-by: Alexander Monakov
Reviewed-by: Richard Henderson
---
 util/bufferiszero.c | 36 ++----------------------------------
 1 file changed, 2 insertions(+), 34 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index 01050694a6..c037d11d04 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -64,7 +64,7 @@ buffer_is_zero_len_4_plus(const void *buf, size_t len)
     }
 }
 
-#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
+#if defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
 #include <immintrin.h>
 
 /* Note that each of these vectorized functions require len >= 64. */
@@ -128,35 +128,6 @@ buffer_zero_avx2(const void *buf, size_t len)
 }
 #endif /* CONFIG_AVX2_OPT */
 
-#ifdef CONFIG_AVX512F_OPT
-static bool __attribute__((target("avx512f")))
-buffer_zero_avx512(const void *buf, size_t len)
-{
-    /* Begin with an unaligned head of 64 bytes. */
-    __m512i t = _mm512_loadu_si512(buf);
-    __m512i *p = (__m512i *)(((uintptr_t)buf + 5 * 64) & -64);
-    __m512i *e = (__m512i *)(((uintptr_t)buf + len) & -64);
-
-    /* Loop over 64-byte aligned blocks of 256. */
-    while (p <= e) {
-        __builtin_prefetch(p);
-        if (unlikely(_mm512_test_epi64_mask(t, t))) {
-            return false;
-        }
-        t = p[-4] | p[-3] | p[-2] | p[-1];
-        p += 4;
-    }
-
-    t |= _mm512_loadu_si512(buf + len - 4 * 64);
-    t |= _mm512_loadu_si512(buf + len - 3 * 64);
-    t |= _mm512_loadu_si512(buf + len - 2 * 64);
-    t |= _mm512_loadu_si512(buf + len - 1 * 64);
-
-    return !_mm512_test_epi64_mask(t, t);
-
-}
-#endif /* CONFIG_AVX512F_OPT */
-
 static unsigned __attribute__((noinline))
 select_accel_cpuinfo(unsigned info)
 {
@@ -165,9 +136,6 @@ select_accel_cpuinfo(unsigned info)
         unsigned bit;
         bool (*fn)(const void *, size_t);
     } all[] = {
-#ifdef CONFIG_AVX512F_OPT
-        { CPUINFO_AVX512F, buffer_zero_avx512 },
-#endif
 #ifdef CONFIG_AVX2_OPT
         { CPUINFO_AVX2, buffer_zero_avx2 },
 #endif
@@ -191,7 +159,7 @@ static unsigned used_accel
     = 0;
 #endif
 
-#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT)
+#if defined(CONFIG_AVX2_OPT)
 static void __attribute__((constructor)) init_accel(void)
 {
     used_accel = select_accel_cpuinfo(cpuinfo_init());

From patchwork Tue Feb 6 20:48:07 2024
X-Patchwork-Submitter: Alexander Monakov
X-Patchwork-Id: 13547784
From: Alexander Monakov
To: qemu-devel@nongnu.org
Cc: Mikhail Romanov, Richard Henderson, Paolo Bonzini, Alexander Monakov
Subject: [PATCH v3 4/6] util/bufferiszero: remove useless prefetches
Date: Tue, 6 Feb 2024 23:48:07 +0300
Message-Id: <20240206204809.9859-5-amonakov@ispras.ru>
In-Reply-To: <20240206204809.9859-1-amonakov@ispras.ru>
References: <20240206204809.9859-1-amonakov@ispras.ru>

Use of prefetching in bufferiszero.c is quite questionable:

- prefetches are issued just a few CPU cycles before the corresponding
  line would be hit by demand loads;

- they are done for simple access patterns, i.e. where hardware
  prefetchers can perform better;

- they compete for load ports in loops that should be limited by load
  port throughput rather than ALU throughput.

Signed-off-by: Alexander Monakov
Signed-off-by: Mikhail Romanov
Reviewed-by: Richard Henderson
---
 util/bufferiszero.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index c037d11d04..cb3eb2543f 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -49,7 +49,6 @@ buffer_is_zero_len_4_plus(const void *buf, size_t len)
         const uint64_t *e = (uint64_t *)(((uintptr_t)buf + len) & -8);
 
         for (; p + 8 <= e; p += 8) {
-            __builtin_prefetch(p + 8);
             if (t) {
                 return false;
             }
@@ -79,7 +78,6 @@ buffer_zero_sse2(const void *buf, size_t len)
 
     /* Loop over 16-byte aligned blocks of 64. */
     while (likely(p <= e)) {
-        __builtin_prefetch(p);
         t = _mm_cmpeq_epi8(t, zero);
         if (unlikely(_mm_movemask_epi8(t) != 0xFFFF)) {
             return false;
         }
@@ -110,7 +108,6 @@ buffer_zero_avx2(const void *buf, size_t len)
 
     /* Loop over 32-byte aligned blocks of 128. */
     while (p <= e) {
-        __builtin_prefetch(p);
         if (unlikely(!_mm256_testz_si256(t, t))) {
             return false;
         }

From patchwork Tue Feb 6 20:48:08 2024
X-Patchwork-Submitter: Alexander Monakov
X-Patchwork-Id: 13547781
From: Alexander Monakov
To: qemu-devel@nongnu.org
Cc: Mikhail Romanov, Richard Henderson, Paolo Bonzini, Alexander Monakov
Subject: [PATCH v3 5/6] util/bufferiszero: optimize SSE2 and AVX2 variants
Date: Tue, 6 Feb 2024 23:48:08 +0300
Message-Id: <20240206204809.9859-6-amonakov@ispras.ru>
In-Reply-To: <20240206204809.9859-1-amonakov@ispras.ru>
References: <20240206204809.9859-1-amonakov@ispras.ru>

Increase the unroll factor in SIMD loops from 4x to 8x in order to move
their bottlenecks from ALU port contention to load issue rate (two
loads per cycle on popular x86 implementations).

Avoid using out-of-bounds pointers in loop boundary conditions.

Follow the SSE2 implementation strategy in the AVX2 variant. Avoid use
of PTEST, which is not profitable there (like in the removed SSE4
variant).
Signed-off-by: Alexander Monakov
Signed-off-by: Mikhail Romanov
Reviewed-by: Richard Henderson
---
 util/bufferiszero.c | 108 ++++++++++++++++++++++++++++----------------
 1 file changed, 69 insertions(+), 39 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index cb3eb2543f..d752edd8cc 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -66,62 +66,92 @@ buffer_is_zero_len_4_plus(const void *buf, size_t len)
 #if defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
 #include <immintrin.h>
 
-/* Note that each of these vectorized functions require len >= 64. */
+/* Helper for preventing the compiler from reassociating
+   chains of binary vector operations. */
+#define SSE_REASSOC_BARRIER(vec0, vec1) asm("" : "+x"(vec0), "+x"(vec1))
+
+/* Note that these vectorized functions may assume len >= 256. */
 
 static bool __attribute__((target("sse2")))
 buffer_zero_sse2(const void *buf, size_t len)
 {
-    __m128i t = _mm_loadu_si128(buf);
-    __m128i *p = (__m128i *)(((uintptr_t)buf + 5 * 16) & -16);
-    __m128i *e = (__m128i *)(((uintptr_t)buf + len) & -16);
-    __m128i zero = _mm_setzero_si128();
-
-    /* Loop over 16-byte aligned blocks of 64. */
-    while (likely(p <= e)) {
-        t = _mm_cmpeq_epi8(t, zero);
-        if (unlikely(_mm_movemask_epi8(t) != 0xFFFF)) {
+    /* Unaligned loads at head/tail. */
+    __m128i v = *(__m128i_u *)(buf);
+    __m128i w = *(__m128i_u *)(buf + len - 16);
+    /* Align head/tail to 16-byte boundaries. */
+    __m128i *p = (void *)(((uintptr_t)buf + 16) & -16);
+    __m128i *e = (void *)(((uintptr_t)buf + len - 1) & -16);
+    __m128i zero = { 0 };
+
+    /* Collect a partial block at tail end. */
+    v |= e[-1]; w |= e[-2];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-3]; w |= e[-4];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-5]; w |= e[-6];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-7]; v |= w;
+
+    /* Loop over complete 128-byte blocks. */
+    for (; p < e - 7; p += 8) {
+        v = _mm_cmpeq_epi8(v, zero);
+        if (unlikely(_mm_movemask_epi8(v) != 0xFFFF)) {
             return false;
         }
-        t = p[-4] | p[-3] | p[-2] | p[-1];
-        p += 4;
+        v = p[0]; w = p[1];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[2]; w |= p[3];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[4]; w |= p[5];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[6]; w |= p[7];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= w;
     }
 
-    /* Finish the aligned tail. */
-    t |= e[-3];
-    t |= e[-2];
-    t |= e[-1];
-
-    /* Finish the unaligned tail. */
-    t |= _mm_loadu_si128(buf + len - 16);
-
-    return _mm_movemask_epi8(_mm_cmpeq_epi8(t, zero)) == 0xFFFF;
+    return _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero)) == 0xFFFF;
 }
 
 #ifdef CONFIG_AVX2_OPT
 static bool __attribute__((target("avx2")))
 buffer_zero_avx2(const void *buf, size_t len)
 {
-    /* Begin with an unaligned head of 32 bytes. */
-    __m256i t = _mm256_loadu_si256(buf);
-    __m256i *p = (__m256i *)(((uintptr_t)buf + 5 * 32) & -32);
-    __m256i *e = (__m256i *)(((uintptr_t)buf + len) & -32);
-
-    /* Loop over 32-byte aligned blocks of 128. */
-    while (p <= e) {
-        if (unlikely(!_mm256_testz_si256(t, t))) {
+    /* Unaligned loads at head/tail. */
+    __m256i v = *(__m256i_u *)(buf);
+    __m256i w = *(__m256i_u *)(buf + len - 32);
+    /* Align head/tail to 32-byte boundaries. */
+    __m256i *p = (void *)(((uintptr_t)buf + 32) & -32);
+    __m256i *e = (void *)(((uintptr_t)buf + len - 1) & -32);
+    __m256i zero = { 0 };
+
+    /* Collect a partial block at tail end. */
+    v |= e[-1]; w |= e[-2];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-3]; w |= e[-4];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-5]; w |= e[-6];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-7]; v |= w;
+
+    /* Loop over complete 256-byte blocks. */
+    for (; p < e - 7; p += 8) {
+        /* PTEST is not profitable here. */
+        v = _mm256_cmpeq_epi8(v, zero);
+        if (unlikely(_mm256_movemask_epi8(v) != 0xFFFFFFFF)) {
             return false;
         }
-        t = p[-4] | p[-3] | p[-2] | p[-1];
-        p += 4;
-    }
-
-    /* Finish the last block of 128 unaligned. */
-    t |= _mm256_loadu_si256(buf + len - 4 * 32);
-    t |= _mm256_loadu_si256(buf + len - 3 * 32);
-    t |= _mm256_loadu_si256(buf + len - 2 * 32);
-    t |= _mm256_loadu_si256(buf + len - 1 * 32);
+        v = p[0]; w = p[1];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[2]; w |= p[3];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[4]; w |= p[5];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[6]; w |= p[7];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= w;
+    }
 
-    return _mm256_testz_si256(t, t);
+    return _mm256_movemask_epi8(_mm256_cmpeq_epi8(v, zero)) == 0xFFFFFFFF;
 }
 #endif /* CONFIG_AVX2_OPT */

From patchwork Tue Feb 6 20:48:09 2024
X-Patchwork-Submitter: Alexander Monakov
X-Patchwork-Id: 13547786
From: Alexander Monakov
To: qemu-devel@nongnu.org
Cc: Mikhail Romanov, Richard Henderson, Paolo Bonzini, Alexander Monakov
Subject: [PATCH v3 6/6] util/bufferiszero: improve scalar variant
Date: Tue, 6 Feb 2024 23:48:09 +0300
Message-Id: <20240206204809.9859-7-amonakov@ispras.ru>
In-Reply-To: <20240206204809.9859-1-amonakov@ispras.ru>
References: <20240206204809.9859-1-amonakov@ispras.ru>

Take into account that the inline wrapper ensures len >= 4. Use
__attribute__((may_alias)) for accesses via non-char pointers. Avoid
using out-of-bounds pointers in loop boundary conditions by
reformulating the 'for' loop as 'if (...) do { ... } while (...)'.
Signed-off-by: Alexander Monakov
Signed-off-by: Mikhail Romanov
Reviewed-by: Richard Henderson
---
 util/bufferiszero.c | 30 +++++++++++-------------------
 1 file changed, 11 insertions(+), 19 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index d752edd8cc..1f4cbfaea4 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -29,35 +29,27 @@
 bool
 buffer_is_zero_len_4_plus(const void *buf, size_t len)
 {
-    if (unlikely(len < 8)) {
-        /* For a very small buffer, simply accumulate all the bytes. */
-        const unsigned char *p = buf;
-        const unsigned char *e = buf + len;
-        unsigned char t = 0;
-
-        do {
-            t |= *p++;
-        } while (p < e);
-
-        return t == 0;
+    if (unlikely(len <= 8)) {
+        /* Our caller ensures len >= 4. */
+        return (ldl_he_p(buf) | ldl_he_p(buf + len - 4)) == 0;
     } else {
-        /* Otherwise, use the unaligned memory access functions to
-           handle the beginning and end of the buffer, with a couple
+        /* Use unaligned memory access functions to handle
+           the beginning and end of the buffer, with a couple
            of loops handling the middle aligned section. */
-        uint64_t t = ldq_he_p(buf);
-        const uint64_t *p = (uint64_t *)(((uintptr_t)buf + 8) & -8);
-        const uint64_t *e = (uint64_t *)(((uintptr_t)buf + len) & -8);
+        uint64_t t = ldq_he_p(buf) | ldq_he_p(buf + len - 8);
+        typedef uint64_t uint64_a __attribute__((may_alias));
+        const uint64_a *p = (void *)(((uintptr_t)buf + 8) & -8);
+        const uint64_a *e = (void *)(((uintptr_t)buf + len - 1) & -8);
 
-        for (; p + 8 <= e; p += 8) {
+        if (e - p >= 8) do {
             if (t) {
                 return false;
             }
             t = p[0] | p[1] | p[2] | p[3] | p[4] | p[5] | p[6] | p[7];
-        }
+        } while ((p += 8) <= e - 8);
 
         while (p < e) {
             t |= *p++;
         }
 
-        t |= ldq_he_p(buf + len - 8);
         return t == 0;
     }