[v6,07/13] fpu: introduce hardfloat

Message ID	20181124235553.17371-8-cota@braap.org (mailing list archive)
State	New, archived
Headers
Return-Path: <qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org>
From: "Emilio G. Cota" <cota@braap.org>
To: qemu-devel@nongnu.org
Date: Sat, 24 Nov 2018 18:55:47 -0500
Message-Id: <20181124235553.17371-8-cota@braap.org>
In-Reply-To: <20181124235553.17371-1-cota@braap.org>
References: <20181124235553.17371-1-cota@braap.org>
Subject: [Qemu-devel] [PATCH v6 07/13] fpu: introduce hardfloat
Precedence: list
Cc: Richard Henderson <richard.henderson@linaro.org>, Alex Bennée <alex.bennee@linaro.org>
Errors-To: qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org
Sender: "Qemu-devel" <qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org>
Series	hardfloat
[v6,00/13] hardfloat
[v6,01/13] fp-test: pick TARGET_ARM to get its specialization
[v6,02/13] softfloat: add float{32, 64}_is_{de, }normal
[v6,03/13] target/tricore: use float32_is_denormal
[v6,04/13] softfloat: rename canonicalize to sf_canonicalize
[v6,05/13] softfloat: add float{32, 64}_is_zero_or_normal
[v6,06/13] tests/fp: add fp-bench
[v6,07/13] fpu: introduce hardfloat
[v6,08/13] hardfloat: implement float32/64 addition and subtraction
[v6,09/13] hardfloat: implement float32/64 multiplication
[v6,10/13] hardfloat: implement float32/64 division
[v6,11/13] hardfloat: implement float32/64 fused multiply-add
[v6,12/13] hardfloat: implement float32/64 square root
[v6,13/13] hardfloat: implement float32/64 comparison

Hi, Emilio.

> Note: some architectures (at least PPC, there might be others) clear
> the status flags passed to softfloat before most FP operations. This
> precludes the use of hardfloat, so to avoid introducing a performance
> regression for those targets, we add a flag to disable hardfloat.
> In the long run though it would be good to fix the targets so that
> at least the inexact flag passed to softfloat is indeed sticky.

Can you elaborate more on this paragraph?

Thanks,
Aleksandar Markovic

On Nov 25, 2018 1:08 AM, "Emilio G. Cota" <cota@braap.org> wrote:

> The appended paves the way for leveraging the host FPU for a subset
> of guest FP operations. For most guest workloads (e.g. FP flags
> aren't ever cleared, inexact occurs often and rounding is set to the
> default [to nearest]) this will yield sizable performance speedups.
>
> The approach followed here avoids checking the FP exception flags register.
> See the added comment for details.
>
> This assumes that QEMU is running on an IEEE754-compliant FPU and
> that the rounding is set to the default (to nearest). The
> implementation-dependent specifics of the FPU should not matter; things
> like tininess detection and snan representation are still dealt with in
> soft-fp. However, this approach will break on most hosts if we compile
> QEMU with flags such as -ffast-math. We control the flags so this should
> be easy to enforce though.
>
> This patch just adds common code. Some operations will be migrated
> to hardfloat in subsequent patches to ease bisection.
>
> Note: some architectures (at least PPC, there might be others) clear
> the status flags passed to softfloat before most FP operations. This
> precludes the use of hardfloat, so to avoid introducing a performance
> regression for those targets, we add a flag to disable hardfloat.
> In the long run though it would be good to fix the targets so that
> at least the inexact flag passed to softfloat is indeed sticky.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  fpu/softfloat.c | 315 ++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 315 insertions(+)
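One way to picture the paragraph being asked about: the fast path is gated on the inexact flag already being set, so a target that clears the flags before every FP operation never gets to use the host FPU. The following is a minimal sketch of that gating only, under simplified assumptions. The names `fp_status`, `FLAG_INEXACT`, `host_fpu_allowed` and `count_fast_ops` are hypothetical stand-ins, not QEMU's, and the slow path is assumed to always raise inexact.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for QEMU's float_status and its flag bits. */
enum { FLAG_INEXACT = 1 };
enum { ROUND_NEAREST_EVEN = 0, ROUND_TO_ZERO = 1 };

typedef struct {
    uint8_t exception_flags;
    uint8_t rounding_mode;
} fp_status;

/*
 * Same shape as can_use_fpu() in the patch: the host FPU is only used
 * when inexact is already set and rounding is to nearest even.
 */
static bool host_fpu_allowed(const fp_status *s)
{
    return (s->exception_flags & FLAG_INEXACT) &&
           s->rounding_mode == ROUND_NEAREST_EVEN;
}

/*
 * Count how many of n_ops operations may take the fast path.
 * Assumption for this sketch: every operation that goes through the
 * slow path produces an inexact result and therefore sets the flag.
 */
static int count_fast_ops(bool clear_before_each_op, int n_ops)
{
    fp_status s = { 0, ROUND_NEAREST_EVEN };
    int fast = 0;

    assert(n_ops >= 0);
    for (int i = 0; i < n_ops; i++) {
        if (clear_before_each_op) {
            s.exception_flags = 0;      /* what the note says PPC does */
        }
        if (host_fpu_allowed(&s)) {
            fast++;
        } else {
            s.exception_flags |= FLAG_INEXACT;  /* slow path sets it */
        }
    }
    return fast;
}
```

With sticky flags, only the first operation pays for the slow path. With flags cleared before each operation, every call pays for the failed check and then the slow path, which would be consistent with the regression the note wants to avoid.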

diff --git a/fpu/softfloat.c b/fpu/softfloat.c
index ecdc00c633..306a12fa8d 100644
--- a/fpu/softfloat.c
+++ b/fpu/softfloat.c
@@ -83,6 +83,7 @@ this code that are retained.
  * target-dependent and needs the TARGET_* macros.
  */
 #include "qemu/osdep.h"
+#include <math.h>
 #include "qemu/bitops.h"
 #include "fpu/softfloat.h"
 
@@ -95,6 +96,320 @@ this code that are retained.
 *----------------------------------------------------------------------------*/
 #include "fpu/softfloat-macros.h"
 
+/*
+ * Hardfloat
+ *
+ * Fast emulation of guest FP instructions is challenging for two reasons.
+ * First, FP instruction semantics are similar but not identical, particularly
+ * when handling NaNs. Second, emulating at reasonable speed the guest FP
+ * exception flags is not trivial: reading the host's flags register with a
+ * feclearexcept & fetestexcept pair is slow [slightly slower than soft-fp],
+ * and trapping on every FP exception is not fast nor pleasant to work with.
+ *
+ * We address these challenges by leveraging the host FPU for a subset of the
+ * operations. To do this we expand on the idea presented in this paper:
+ *
+ * Guo, Yu-Chuan, et al. "Translating the ARM Neon and VFP instructions in a
+ * binary translator." Software: Practice and Experience 46.12 (2016):1591-1615.
+ *
+ * The idea is thus to leverage the host FPU to (1) compute FP operations
+ * and (2) identify whether FP exceptions occurred while avoiding
+ * expensive exception flag register accesses.
+ *
+ * An important optimization shown in the paper is that given that exception
+ * flags are rarely cleared by the guest, we can avoid recomputing some flags.
+ * This is particularly useful for the inexact flag, which is very frequently
+ * raised in floating-point workloads.
+ *
+ * We optimize the code further by deferring to soft-fp whenever FP exception
+ * detection might get hairy. Two examples: (1) when at least one operand is
+ * denormal/inf/NaN; (2) when operands are not guaranteed to lead to a 0 result
+ * and the result is < the minimum normal.
+ */
+#define GEN_INPUT_FLUSH__NOCHECK(name, soft_t) \
+    static inline void name(soft_t *a, float_status *s) \
+    { \
+        if (unlikely(soft_t ## _is_denormal(*a))) { \
+            *a = soft_t ## _set_sign(soft_t ## _zero, \
+                                     soft_t ## _is_neg(*a)); \
+            s->float_exception_flags |= float_flag_input_denormal; \
+        } \
+    }
+
+GEN_INPUT_FLUSH__NOCHECK(float32_input_flush__nocheck, float32)
+GEN_INPUT_FLUSH__NOCHECK(float64_input_flush__nocheck, float64)
+#undef GEN_INPUT_FLUSH__NOCHECK
+
+#define GEN_INPUT_FLUSH1(name, soft_t) \
+    static inline void name(soft_t *a, float_status *s) \
+    { \
+        if (likely(!s->flush_inputs_to_zero)) { \
+            return; \
+        } \
+        soft_t ## _input_flush__nocheck(a, s); \
+    }
+
+GEN_INPUT_FLUSH1(float32_input_flush1, float32)
+GEN_INPUT_FLUSH1(float64_input_flush1, float64)
+#undef GEN_INPUT_FLUSH1
+
+#define GEN_INPUT_FLUSH2(name, soft_t) \
+    static inline void name(soft_t *a, soft_t *b, float_status *s) \
+    { \
+        if (likely(!s->flush_inputs_to_zero)) { \
+            return; \
+        } \
+        soft_t ## _input_flush__nocheck(a, s); \
+        soft_t ## _input_flush__nocheck(b, s); \
+    }
+
+GEN_INPUT_FLUSH2(float32_input_flush2, float32)
+GEN_INPUT_FLUSH2(float64_input_flush2, float64)
+#undef GEN_INPUT_FLUSH2
+
+#define GEN_INPUT_FLUSH3(name, soft_t) \
+    static inline void name(soft_t *a, soft_t *b, soft_t *c, float_status *s) \
+    { \
+        if (likely(!s->flush_inputs_to_zero)) { \
+            return; \
+        } \
+        soft_t ## _input_flush__nocheck(a, s); \
+        soft_t ## _input_flush__nocheck(b, s); \
+        soft_t ## _input_flush__nocheck(c, s); \
+    }
+
+GEN_INPUT_FLUSH3(float32_input_flush3, float32)
+GEN_INPUT_FLUSH3(float64_input_flush3, float64)
+#undef GEN_INPUT_FLUSH3
+
+/*
+ * Choose whether to use fpclassify or float32/64_* primitives in the generated
+ * hardfloat functions. Each combination of number of inputs and float size
+ * gets its own value.
+ */
+#if defined(__x86_64__)
+# define QEMU_HARDFLOAT_1F32_USE_FP 0
+# define QEMU_HARDFLOAT_1F64_USE_FP 1
+# define QEMU_HARDFLOAT_2F32_USE_FP 0
+# define QEMU_HARDFLOAT_2F64_USE_FP 1
+# define QEMU_HARDFLOAT_3F32_USE_FP 0
+# define QEMU_HARDFLOAT_3F64_USE_FP 1
+#else
+# define QEMU_HARDFLOAT_1F32_USE_FP 0
+# define QEMU_HARDFLOAT_1F64_USE_FP 0
+# define QEMU_HARDFLOAT_2F32_USE_FP 0
+# define QEMU_HARDFLOAT_2F64_USE_FP 0
+# define QEMU_HARDFLOAT_3F32_USE_FP 0
+# define QEMU_HARDFLOAT_3F64_USE_FP 0
+#endif
+
+/*
+ * QEMU_HARDFLOAT_USE_ISINF chooses whether to use isinf() over
+ * float{32,64}_is_infinity when !USE_FP.
+ * On x86_64/aarch64, using the former over the latter can yield a ~6% speedup.
+ * On power64 however, using isinf() reduces fp-bench performance by up to 50%.
+ */
+#if defined(__x86_64__) || defined(__aarch64__)
+# define QEMU_HARDFLOAT_USE_ISINF 1
+#else
+# define QEMU_HARDFLOAT_USE_ISINF 0
+#endif
+
+/*
+ * Some targets clear the FP flags before most FP operations. This prevents
+ * the use of hardfloat, since hardfloat relies on the inexact flag being
+ * already set.
+ */
+#if defined(TARGET_PPC)
+# define QEMU_NO_HARDFLOAT 1
+# define QEMU_SOFTFLOAT_ATTR QEMU_FLATTEN
+#else
+# define QEMU_NO_HARDFLOAT 0
+# define QEMU_SOFTFLOAT_ATTR QEMU_FLATTEN __attribute__((noinline))
+#endif
+
+static inline bool can_use_fpu(const float_status *s)
+{
+    if (QEMU_NO_HARDFLOAT) {
+        return false;
+    }
+    return likely(s->float_exception_flags & float_flag_inexact &&
+                  s->float_rounding_mode == float_round_nearest_even);
+}
+
+/*
+ * Hardfloat generation functions. Each operation can have two flavors:
+ * either using softfloat primitives (e.g. float32_is_zero_or_normal) for
+ * most condition checks, or native ones (e.g. fpclassify).
+ *
+ * The flavor is chosen by the callers. Instead of using macros, we rely on the
+ * compiler to propagate constants and inline everything into the callers.
+ *
+ * We only generate functions for operations with two inputs, since only
+ * these are common enough to justify consolidating them into common code.
+ */
+
+typedef union {
+    float32 s;
+    float h;
+} union_float32;
+
+typedef union {
+    float64 s;
+    double h;
+} union_float64;
+
+typedef bool (*f32_check_fn)(union_float32 a, union_float32 b);
+typedef bool (*f64_check_fn)(union_float64 a, union_float64 b);
+
+typedef float32 (*soft_f32_op2_fn)(float32 a, float32 b, float_status *s);
+typedef float64 (*soft_f64_op2_fn)(float64 a, float64 b, float_status *s);
+typedef float (*hard_f32_op2_fn)(float a, float b);
+typedef double (*hard_f64_op2_fn)(double a, double b);
+
+/* 2-input is-zero-or-normal */
+static inline bool f32_is_zon2(union_float32 a, union_float32 b)
+{
+    if (QEMU_HARDFLOAT_2F32_USE_FP) {
+        /*
+         * Not using a temp variable for consecutive fpclassify calls ends up
+         * generating faster code.
+         */
+        return (fpclassify(a.h) == FP_NORMAL || fpclassify(a.h) == FP_ZERO) &&
+               (fpclassify(b.h) == FP_NORMAL || fpclassify(b.h) == FP_ZERO);
+    }
+    return float32_is_zero_or_normal(a.s) &&
+           float32_is_zero_or_normal(b.s);
+}
+
+static inline bool f64_is_zon2(union_float64 a, union_float64 b)
+{
+    if (QEMU_HARDFLOAT_2F64_USE_FP) {
+        return (fpclassify(a.h) == FP_NORMAL || fpclassify(a.h) == FP_ZERO) &&
+               (fpclassify(b.h) == FP_NORMAL || fpclassify(b.h) == FP_ZERO);
+    }
+    return float64_is_zero_or_normal(a.s) &&
+           float64_is_zero_or_normal(b.s);
+}
+
+/* 3-input is-zero-or-normal */
+static inline
+bool f32_is_zon3(union_float32 a, union_float32 b, union_float32 c)
+{
+    if (QEMU_HARDFLOAT_3F32_USE_FP) {
+        return (fpclassify(a.h) == FP_NORMAL || fpclassify(a.h) == FP_ZERO) &&
+               (fpclassify(b.h) == FP_NORMAL || fpclassify(b.h) == FP_ZERO) &&
+               (fpclassify(c.h) == FP_NORMAL || fpclassify(c.h) == FP_ZERO);
+    }
+    return float32_is_zero_or_normal(a.s) &&
+           float32_is_zero_or_normal(b.s) &&
+           float32_is_zero_or_normal(c.s);
+}
+
+static inline
+bool f64_is_zon3(union_float64 a, union_float64 b, union_float64 c)
+{
+    if (QEMU_HARDFLOAT_3F64_USE_FP) {
+        return (fpclassify(a.h) == FP_NORMAL || fpclassify(a.h) == FP_ZERO) &&
+               (fpclassify(b.h) == FP_NORMAL || fpclassify(b.h) == FP_ZERO) &&
+               (fpclassify(c.h) == FP_NORMAL || fpclassify(c.h) == FP_ZERO);
+    }
+    return float64_is_zero_or_normal(a.s) &&
+           float64_is_zero_or_normal(b.s) &&
+           float64_is_zero_or_normal(c.s);
+}
+
+static inline bool f32_is_inf(union_float32 a)
+{
+    if (QEMU_HARDFLOAT_USE_ISINF) {
+        return isinff(a.h);
+    }
+    return float32_is_infinity(a.s);
+}
+
+static inline bool f64_is_inf(union_float64 a)
+{
+    if (QEMU_HARDFLOAT_USE_ISINF) {
+        return isinf(a.h);
+    }
+    return float64_is_infinity(a.s);
+}
+
+/* Note: @fast_test and @post can be NULL */
+static inline float32
+float32_gen2(float32 xa, float32 xb, float_status *s,
+             hard_f32_op2_fn hard, soft_f32_op2_fn soft,
+             f32_check_fn pre, f32_check_fn post,
+             f32_check_fn fast_test, soft_f32_op2_fn fast_op)
+{
+    union_float32 ua, ub, ur;
+
+    ua.s = xa;
+    ub.s = xb;
+
+    if (unlikely(!can_use_fpu(s))) {
+        goto soft;
+    }
+
+    float32_input_flush2(&ua.s, &ub.s, s);
+    if (unlikely(!pre(ua, ub))) {
+        goto soft;
+    }
+    if (fast_test && fast_test(ua, ub)) {
+        return fast_op(ua.s, ub.s, s);
+    }
+
+    ur.h = hard(ua.h, ub.h);
+    if (unlikely(f32_is_inf(ur))) {
+        s->float_exception_flags |= float_flag_overflow;
+    } else if (unlikely(fabsf(ur.h) <= FLT_MIN)) {
+        if (post == NULL || post(ua, ub)) {
+            goto soft;
+        }
+    }
+    return ur.s;
+
+ soft:
+    return soft(ua.s, ub.s, s);
+}
+
+static inline float64
+float64_gen2(float64 xa, float64 xb, float_status *s,
+             hard_f64_op2_fn hard, soft_f64_op2_fn soft,
+             f64_check_fn pre, f64_check_fn post,
+             f64_check_fn fast_test, soft_f64_op2_fn fast_op)
+{
+    union_float64 ua, ub, ur;
+
+    ua.s = xa;
+    ub.s = xb;
+
+    if (unlikely(!can_use_fpu(s))) {
+        goto soft;
+    }
+
+    float64_input_flush2(&ua.s, &ub.s, s);
+    if (unlikely(!pre(ua, ub))) {
+        goto soft;
+    }
+    if (fast_test && fast_test(ua, ub)) {
+        return fast_op(ua.s, ub.s, s);
+    }
+
+    ur.h = hard(ua.h, ub.h);
+    if (unlikely(f64_is_inf(ur))) {
+        s->float_exception_flags |= float_flag_overflow;
+    } else if (unlikely(fabs(ur.h) <= DBL_MIN)) {
+        if (post == NULL || post(ua, ub)) {
+            goto soft;
+        }
+    }
+    return ur.s;
+
+ soft:
+    return soft(ua.s, ub.s, s);
+}
+
 /*----------------------------------------------------------------------------
 | Returns the fraction bits of the half-precision floating-point value `a'.
 *----------------------------------------------------------------------------*/
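Read as straight-line code, the control flow of the two-input helper in the patch can be sketched for a single operation. The following standalone sketch covers float addition only and makes several assumptions that are not part of the patch: `sketch_status`, `soft_add` and `sketch_add` are hypothetical names, `soft_add` is a placeholder that merely counts calls instead of running softfloat, input flushing is omitted, and the "both inputs are zero" test stands in for whatever `post` check a caller would supply.

```c
#include <assert.h>
#include <float.h>
#include <math.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for QEMU's float_status and flags. */
enum { F_INEXACT = 1, F_OVERFLOW = 2 };

typedef struct {
    unsigned flags;
    bool round_nearest_even;
    unsigned soft_calls;        /* instrumentation for this sketch only */
} sketch_status;

static bool is_zero_or_normal(float x)
{
    int c = fpclassify(x);
    return c == FP_NORMAL || c == FP_ZERO;
}

/*
 * Placeholder for the softfloat slow path. Real softfloat would compute
 * the result and the exact exception flags in software; here we only
 * record that the slow path ran and conservatively set inexact.
 */
static float soft_add(float a, float b, sketch_status *s)
{
    s->soft_calls++;
    s->flags |= F_INEXACT;
    return a + b;
}

static float sketch_add(float a, float b, sketch_status *s)
{
    float r;

    assert(s != NULL);

    /* Gate: inexact must already be set, rounding must be nearest-even. */
    if (!((s->flags & F_INEXACT) && s->round_nearest_even)) {
        return soft_add(a, b, s);
    }
    /* Pre-check: denormal/inf/NaN operands are left to the slow path. */
    if (!is_zero_or_normal(a) || !is_zero_or_normal(b)) {
        return soft_add(a, b, s);
    }

    r = a + b;                          /* host FPU does the work */
    if (isinf(r)) {
        /* Finite inputs, infinite result: overflow, no flag register read. */
        s->flags |= F_OVERFLOW;
    } else if (fabsf(r) <= FLT_MIN) {
        /*
         * Tiny or zero result. Assumed post-check: unless both inputs
         * are zero, underflow detection is deferred to the slow path.
         */
        if (!(a == 0.0f && b == 0.0f)) {
            return soft_add(a, b, s);
        }
    }
    return r;
}
```

Note that the sketch never reads the host's exception flags: overflow is inferred from the result, inexact is assumed to be already set, and anything harder is handed to the slow path.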
