[06/72] softfloat: Move the binary point to the msb

Message ID	20210508014802.892561-7-richard.henderson@linaro.org (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=6eJd=KD=nongnu.org=qemu-devel-bounces+qemu-devel=archiver.kernel.org@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 650A56140E From: Richard Henderson <richard.henderson@linaro.org> To: qemu-devel@nongnu.org Subject: [PATCH 06/72] softfloat: Move the binary point to the msb Date: Fri, 7 May 2021 18:46:56 -0700 Message-Id: <20210508014802.892561-7-richard.henderson@linaro.org> In-Reply-To: <20210508014802.892561-1-richard.henderson@linaro.org> References: <20210508014802.892561-1-richard.henderson@linaro.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Received-SPF: pass client-ip=2607:f8b0:4864:20::1032; envelope-from=richard.henderson@linaro.org; helo=mail-pj1-x1032.google.com X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action Precedence: list Cc: alex.bennee@linaro.org, david@redhat.com Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org>
Series	Convert floatx80 and float128 to FloatParts \| expand [00/72] Convert floatx80 and float128 to FloatParts [01/72] qemu/host-utils: Use __builtin_bitreverseN [02/72] qemu/host-utils: Add wrappers for overflow builtins [03/72] qemu/host-utils: Add wrappers for carry builtins [04/72] accel/tcg: Use add/sub overflow routines in tcg-runtime-gvec.c [05/72] tests/fp: add quad support to the benchmark utility [06/72] softfloat: Move the binary point to the msb [07/72] softfloat: Inline float_raise [08/72] softfloat: Use float_raise in more places [09/72] softfloat: Tidy a * b + inf return [10/72] softfloat: Add float_cmask and constants [11/72] softfloat: Use return_nan in float_to_float [12/72] softfloat: fix return_nan vs default_nan_mode [13/72] target/mips: Set set_default_nan_mode with set_snan_bit_is_one [14/72] softfloat: Do not produce a default_nan from parts_silence_nan [15/72] softfloat: Rename FloatParts to FloatParts64 [16/72] softfloat: Move type-specific pack/unpack routines [17/72] softfloat: Use pointers with parts_default_nan [18/72] softfloat: Use pointers with unpack_raw [19/72] softfloat: Use pointers with ftype_unpack_raw [20/72] softfloat: Use pointers with pack_raw [21/72] softfloat: Use pointers with ftype_pack_raw [22/72] softfloat: Use pointers with ftype_unpack_canonical [23/72] softfloat: Use pointers with ftype_round_pack_canonical [24/72] softfloat: Use pointers with parts_silence_nan [25/72] softfloat: Rearrange FloatParts64 [26/72] softfloat: Convert float128_silence_nan to parts [27/72] softfloat: Convert float128_default_nan to parts [28/72] softfloat: Move return_nan to softfloat-parts.c.inc [29/72] softfloat: Move pick_nan to softfloat-parts.c.inc [30/72] softfloat: Move pick_nan_muladd to softfloat-parts.c.inc [31/72] softfloat: Move sf_canonicalize to softfloat-parts.c.inc [32/72] softfloat: Move round_canonical to softfloat-parts.c.inc [33/72] softfloat: Use uadd64_carry, usub64_borrow in softfloat-macros.h [34/72] softfloat: Move addsub_floats to softfloat-parts.c.inc [35/72] softfloat: Implement float128_add/sub via parts [36/72] softfloat: Move mul_floats to softfloat-parts.c.inc [37/72] softfloat: Move muladd_floats to softfloat-parts.c.inc [38/72] softfloat: Use mulu64 for mul64To128 [39/72] softfloat: Use add192 in mul128To256 [40/72] softfloat: Tidy mul128By64To192 [41/72] softfloat: Introduce sh[lr]_double primitives [42/72] softfloat: Move div_floats to softfloat-parts.c.inc [43/72] softfloat: Split float_to_float [44/72] softfloat: Convert float-to-float conversions with float128 [45/72] softfloat: Move round_to_int to softfloat-parts.c.inc [46/72] softfloat: Move rount_to_int_and_pack to softfloat-parts.c.inc [47/72] softfloat: Move rount_to_uint_and_pack to softfloat-parts.c.inc [48/72] softfloat: Move int_to_float to softfloat-parts.c.inc [49/72] softfloat: Move uint_to_float to softfloat-parts.c.inc [50/72] softfloat: Move minmax_flags to softfloat-parts.c.inc [51/72] softfloat: Move compare_floats to softfloat-parts.c.inc [52/72] softfloat: Move scalbn_decomposed to softfloat-parts.c.inc [53/72] softfloat: Move sqrt_float to softfloat-parts.c.inc [54/72] softfloat: Split out parts_uncanon_normal [55/72] softfloat: Reduce FloatFmt [56/72] softfloat: Introduce Floatx80RoundPrec [57/72] softfloat: Adjust parts_uncanon_normal for floatx80 [58/72] tests/fp/fp-test: Reverse order of floatx80 precision tests [59/72] softfloat: Convert floatx80_add/sub to FloatParts [60/72] softfloat: Convert floatx80_mul to FloatParts [61/72] softfloat: Convert floatx80_div to FloatParts [62/72] softfloat: Convert floatx80_sqrt to FloatParts [63/72] softfloat: Convert floatx80_round to FloatParts [64/72] softfloat: Convert floatx80_round_to_int to FloatParts [65/72] softfloat: Convert integer to floatx80 to FloatParts [66/72] softfloat: Convert floatx80 float conversions to FloatParts [67/72] softfloat: Convert floatx80 to integer to FloatParts [68/72] softfloat: Convert floatx80_scalbn to FloatParts [69/72] softfloat: Convert floatx80 compare to FloatParts [70/72] softfloat: Convert float32_exp2 to FloatParts [71/72] softfloat: Move floatN_log2 to softfloat-parts.c.inc [72/72] softfloat: Convert modrem operations to FloatParts

Message ID

20210508014802.892561-7-richard.henderson@linaro.org (mailing list archive)

State

New, archived

Headers

DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 650A56140E
From: Richard Henderson <richard.henderson@linaro.org>
To: qemu-devel@nongnu.org
Subject: [PATCH 06/72] softfloat: Move the binary point to the msb
Date: Fri,  7 May 2021 18:46:56 -0700
Message-Id: <20210508014802.892561-7-richard.henderson@linaro.org>
In-Reply-To: <20210508014802.892561-1-richard.henderson@linaro.org>
References: <20210508014802.892561-1-richard.henderson@linaro.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Received-SPF: pass client-ip=2607:f8b0:4864:20::1032;
 envelope-from=richard.henderson@linaro.org; helo=mail-pj1-x1032.google.com
X-Spam_score_int: -20
X-Spam_score: -2.1
X-Spam_bar: --
X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1,
 DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1,
 RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001,
 SPF_PASS=-0.001 autolearn=ham autolearn_force=no
X-Spam_action: no action
X-BeenThere: qemu-devel@nongnu.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
 <mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <https://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
 <mailto:qemu-devel-request@nongnu.org?subject=subscribe>
Cc: alex.bennee@linaro.org, david@redhat.com
Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org
Sender: "Qemu-devel"
 <qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org>

Series

Convert floatx80 and float128 to FloatParts | expand

Commit Message

Richard Henderson May 8, 2021, 1:46 a.m. UTC

Rather than point the binary point at msb-1, put it at the msb.
Use uadd64_overflow to detect when addition overflows instead
of DECOMPOSED_OVERFLOW_BIT.

This reduces the number of special cases within the code, such
as shifting an int64_t either left or right during conversion.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 fpu/softfloat.c | 169 +++++++++++++++++++-----------------------------
 1 file changed, 66 insertions(+), 103 deletions(-)

Comments

Alex Bennée May 10, 2021, 1:36 p.m. UTC | #1

Richard Henderson <richard.henderson@linaro.org> writes:

> Rather than point the binary point at msb-1, put it at the msb.
> Use uadd64_overflow to detect when addition overflows instead
> of DECOMPOSED_OVERFLOW_BIT.
>
> This reduces the number of special cases within the code, such
> as shifting an int64_t either left or right during conversion.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

diff --git a/fpu/softfloat.c b/fpu/softfloat.c
index 67cfa0fd82..cd777743f1 100644
--- a/fpu/softfloat.c
+++ b/fpu/softfloat.c
@@ -503,9 +503,8 @@  typedef struct {
     bool sign;
 } FloatParts;
 
-#define DECOMPOSED_BINARY_POINT    (64 - 2)
+#define DECOMPOSED_BINARY_POINT    63
 #define DECOMPOSED_IMPLICIT_BIT    (1ull << DECOMPOSED_BINARY_POINT)
-#define DECOMPOSED_OVERFLOW_BIT    (DECOMPOSED_IMPLICIT_BIT << 1)
 
 /* Structure holding all of the relevant parameters for a format.
  *   exp_size: the size of the exponent field
@@ -657,7 +656,7 @@  static FloatParts sf_canonicalize(FloatParts part, const FloatFmt *parm,
             part.cls = float_class_zero;
             part.frac = 0;
         } else {
-            int shift = clz64(part.frac) - 1;
+            int shift = clz64(part.frac);
             part.cls = float_class_normal;
             part.exp = parm->frac_shift - parm->exp_bias - shift + 1;
             part.frac <<= shift;
@@ -727,9 +726,8 @@  static FloatParts round_canonical(FloatParts p, float_status *s,
         if (likely(exp > 0)) {
             if (frac & round_mask) {
                 flags |= float_flag_inexact;
-                frac += inc;
-                if (frac & DECOMPOSED_OVERFLOW_BIT) {
-                    frac >>= 1;
+                if (uadd64_overflow(frac, inc, &frac)) {
+                    frac = (frac >> 1) | DECOMPOSED_IMPLICIT_BIT;
                     exp++;
                 }
             }
@@ -758,9 +756,12 @@  static FloatParts round_canonical(FloatParts p, float_status *s,
             p.cls = float_class_zero;
             goto do_zero;
         } else {
-            bool is_tiny = s->tininess_before_rounding
-                        || (exp < 0)
-                        || !((frac + inc) & DECOMPOSED_OVERFLOW_BIT);
+            bool is_tiny = s->tininess_before_rounding || (exp < 0);
+
+            if (!is_tiny) {
+                uint64_t discard;
+                is_tiny = !uadd64_overflow(frac, inc, &discard);
+            }
 
             shift64RightJamming(frac, 1 - exp, &frac);
             if (frac & round_mask) {
@@ -985,7 +986,7 @@  static FloatParts addsub_floats(FloatParts a, FloatParts b, bool subtract,
                 a.cls = float_class_zero;
                 a.sign = s->float_rounding_mode == float_round_down;
             } else {
-                int shift = clz64(a.frac) - 1;
+                int shift = clz64(a.frac);
                 a.frac = a.frac << shift;
                 a.exp = a.exp - shift;
                 a.sign = a_sign;
@@ -1022,9 +1023,10 @@  static FloatParts addsub_floats(FloatParts a, FloatParts b, bool subtract,
                 shift64RightJamming(a.frac, b.exp - a.exp, &a.frac);
                 a.exp = b.exp;
             }
-            a.frac += b.frac;
-            if (a.frac & DECOMPOSED_OVERFLOW_BIT) {
+
+            if (uadd64_overflow(a.frac, b.frac, &a.frac)) {
                 shift64RightJamming(a.frac, 1, &a.frac);
+                a.frac |= DECOMPOSED_IMPLICIT_BIT;
                 a.exp += 1;
             }
             return a;
@@ -1219,16 +1221,17 @@  static FloatParts mul_floats(FloatParts a, FloatParts b, float_status *s)
         int exp = a.exp + b.exp;
 
         mul64To128(a.frac, b.frac, &hi, &lo);
-        shift128RightJamming(hi, lo, DECOMPOSED_BINARY_POINT, &hi, &lo);
-        if (lo & DECOMPOSED_OVERFLOW_BIT) {
-            shift64RightJamming(lo, 1, &lo);
+        if (hi & DECOMPOSED_IMPLICIT_BIT) {
             exp += 1;
+        } else {
+            hi <<= 1;
         }
+        hi |= (lo != 0);
 
         /* Re-use a */
         a.exp = exp;
         a.sign = sign;
-        a.frac = lo;
+        a.frac = hi;
         return a;
     }
     /* handle all the NaN cases */
@@ -1411,56 +1414,41 @@  static FloatParts muladd_floats(FloatParts a, FloatParts b, FloatParts c,
 
     p_exp = a.exp + b.exp;
 
-    /* Multiply of 2 62-bit numbers produces a (2*62) == 124-bit
-     * result.
-     */
     mul64To128(a.frac, b.frac, &hi, &lo);
-    /* binary point now at bit 124 */
 
-    /* check for overflow */
-    if (hi & (1ULL << (DECOMPOSED_BINARY_POINT * 2 + 1 - 64))) {
-        shift128RightJamming(hi, lo, 1, &hi, &lo);
+    /* Renormalize to the msb. */
+    if (hi & DECOMPOSED_IMPLICIT_BIT) {
         p_exp += 1;
+    } else {
+        shortShift128Left(hi, lo, 1, &hi, &lo);
     }
 
     /* + add/sub */
-    if (c.cls == float_class_zero) {
-        /* move binary point back to 62 */
-        shift128RightJamming(hi, lo, DECOMPOSED_BINARY_POINT, &hi, &lo);
-    } else {
+    if (c.cls != float_class_zero) {
         int exp_diff = p_exp - c.exp;
         if (p_sign == c.sign) {
             /* Addition */
             if (exp_diff <= 0) {
-                shift128RightJamming(hi, lo,
-                                     DECOMPOSED_BINARY_POINT - exp_diff,
-                                     &hi, &lo);
-                lo += c.frac;
+                shift64RightJamming(hi, -exp_diff, &hi);
                 p_exp = c.exp;
+                if (uadd64_overflow(hi, c.frac, &hi)) {
+                    shift64RightJamming(hi, 1, &hi);
+                    hi |= DECOMPOSED_IMPLICIT_BIT;
+                    p_exp += 1;
+                }
             } else {
-                uint64_t c_hi, c_lo;
-                /* shift c to the same binary point as the product (124) */
-                c_hi = c.frac >> 2;
-                c_lo = 0;
-                shift128RightJamming(c_hi, c_lo,
-                                     exp_diff,
-                                     &c_hi, &c_lo);
-                add128(hi, lo, c_hi, c_lo, &hi, &lo);
-                /* move binary point back to 62 */
-                shift128RightJamming(hi, lo, DECOMPOSED_BINARY_POINT, &hi, &lo);
+                uint64_t c_hi, c_lo, over;
+                shift128RightJamming(c.frac, 0, exp_diff, &c_hi, &c_lo);
+                add192(0, hi, lo, 0, c_hi, c_lo, &over, &hi, &lo);
+                if (over) {
+                    shift64RightJamming(hi, 1, &hi);
+                    hi |= DECOMPOSED_IMPLICIT_BIT;
+                    p_exp += 1;
+                }
             }
-
-            if (lo & DECOMPOSED_OVERFLOW_BIT) {
-                shift64RightJamming(lo, 1, &lo);
-                p_exp += 1;
-            }
-
         } else {
             /* Subtraction */
-            uint64_t c_hi, c_lo;
-            /* make C binary point match product at bit 124 */
-            c_hi = c.frac >> 2;
-            c_lo = 0;
+            uint64_t c_hi = c.frac, c_lo = 0;
 
             if (exp_diff <= 0) {
                 shift128RightJamming(hi, lo, -exp_diff, &hi, &lo);
@@ -1495,20 +1483,15 @@  static FloatParts muladd_floats(FloatParts a, FloatParts b, FloatParts c,
                 /* Normalizing to a binary point of 124 is the
                    correct adjust for the exponent.  However since we're
                    shifting, we might as well put the binary point back
-                   at 62 where we really want it.  Therefore shift as
+                   at 63 where we really want it.  Therefore shift as
                    if we're leaving 1 bit at the top of the word, but
                    adjust the exponent as if we're leaving 3 bits.  */
-                shift -= 1;
-                if (shift >= 64) {
-                    lo = lo << (shift - 64);
-                } else {
-                    hi = (hi << shift) | (lo >> (64 - shift));
-                    lo = hi | ((lo << shift) != 0);
-                }
-                p_exp -= shift - 2;
+                shift128Left(hi, lo, shift, &hi, &lo);
+                p_exp -= shift;
             }
         }
     }
+    hi |= (lo != 0);
 
     if (flags & float_muladd_halve_result) {
         p_exp -= 1;
@@ -1518,7 +1501,7 @@  static FloatParts muladd_floats(FloatParts a, FloatParts b, FloatParts c,
     a.cls = float_class_normal;
     a.sign = p_sign ^ sign_flip;
     a.exp = p_exp;
-    a.frac = lo;
+    a.frac = hi;
 
     return a;
 }
@@ -1742,25 +1725,17 @@  static FloatParts div_floats(FloatParts a, FloatParts b, float_status *s)
          * exponent to match.
          *
          * The udiv_qrnnd algorithm that we're using requires normalization,
-         * i.e. the msb of the denominator must be set.  Since we know that
-         * DECOMPOSED_BINARY_POINT is msb-1, the inputs must be shifted left
-         * by one (more), and the remainder must be shifted right by one.
+         * i.e. the msb of the denominator must be set, which is already true.
          */
         if (a.frac < b.frac) {
             exp -= 1;
-            shift128Left(0, a.frac, DECOMPOSED_BINARY_POINT + 2, &n1, &n0);
-        } else {
             shift128Left(0, a.frac, DECOMPOSED_BINARY_POINT + 1, &n1, &n0);
+        } else {
+            shift128Left(0, a.frac, DECOMPOSED_BINARY_POINT, &n1, &n0);
         }
-        q = udiv_qrnnd(&r, n1, n0, b.frac << 1);
+        q = udiv_qrnnd(&r, n1, n0, b.frac);
 
-        /*
-         * Set lsb if there is a remainder, to set inexact.
-         * As mentioned above, to find the actual value of the remainder we
-         * would need to shift right, but (1) we are only concerned about
-         * non-zero-ness, and (2) the remainder will always be even because
-         * both inputs to the division primitive are even.
-         */
+        /* Set lsb if there is a remainder, to set inexact. */
         a.frac = q | (r != 0);
         a.sign = sign;
         a.exp = exp;
@@ -2135,12 +2110,12 @@  static FloatParts round_to_int(FloatParts a, FloatRoundMode rmode,
 
             if (a.frac & rnd_mask) {
                 s->float_exception_flags |= float_flag_inexact;
-                a.frac += inc;
-                a.frac &= ~rnd_mask;
-                if (a.frac & DECOMPOSED_OVERFLOW_BIT) {
+                if (uadd64_overflow(a.frac, inc, &a.frac)) {
                     a.frac >>= 1;
+                    a.frac |= DECOMPOSED_IMPLICIT_BIT;
                     a.exp++;
                 }
+                a.frac &= ~rnd_mask;
             }
         }
         break;
@@ -2213,10 +2188,8 @@  static int64_t round_to_int_and_pack(FloatParts in, FloatRoundMode rmode,
     case float_class_zero:
         return 0;
     case float_class_normal:
-        if (p.exp < DECOMPOSED_BINARY_POINT) {
+        if (p.exp <= DECOMPOSED_BINARY_POINT) {
             r = p.frac >> (DECOMPOSED_BINARY_POINT - p.exp);
-        } else if (p.exp - DECOMPOSED_BINARY_POINT < 2) {
-            r = p.frac << (p.exp - DECOMPOSED_BINARY_POINT);
         } else {
             r = UINT64_MAX;
         }
@@ -2498,10 +2471,8 @@  static uint64_t round_to_uint_and_pack(FloatParts in, FloatRoundMode rmode,
             return 0;
         }
 
-        if (p.exp < DECOMPOSED_BINARY_POINT) {
+        if (p.exp <= DECOMPOSED_BINARY_POINT) {
             r = p.frac >> (DECOMPOSED_BINARY_POINT - p.exp);
-        } else if (p.exp - DECOMPOSED_BINARY_POINT < 2) {
-            r = p.frac << (p.exp - DECOMPOSED_BINARY_POINT);
         } else {
             s->float_exception_flags = orig_flags | float_flag_invalid;
             return max;
@@ -2765,11 +2736,11 @@  static FloatParts int_to_float(int64_t a, int scale, float_status *status)
             f = -f;
             r.sign = true;
         }
-        shift = clz64(f) - 1;
+        shift = clz64(f);
         scale = MIN(MAX(scale, -0x10000), 0x10000);
 
         r.exp = DECOMPOSED_BINARY_POINT - shift + scale;
-        r.frac = (shift < 0 ? DECOMPOSED_IMPLICIT_BIT : f << shift);
+        r.frac = f << shift;
     }
 
     return r;
@@ -2920,21 +2891,16 @@  bfloat16 int16_to_bfloat16(int16_t a, float_status *status)
 static FloatParts uint_to_float(uint64_t a, int scale, float_status *status)
 {
     FloatParts r = { .sign = false };
+    int shift;
 
     if (a == 0) {
         r.cls = float_class_zero;
     } else {
         scale = MIN(MAX(scale, -0x10000), 0x10000);
+        shift = clz64(a);
         r.cls = float_class_normal;
-        if ((int64_t)a < 0) {
-            r.exp = DECOMPOSED_BINARY_POINT + 1 + scale;
-            shift64RightJamming(a, 1, &a);
-            r.frac = a;
-        } else {
-            int shift = clz64(a) - 1;
-            r.exp = DECOMPOSED_BINARY_POINT - shift + scale;
-            r.frac = a << shift;
-        }
+        r.exp = DECOMPOSED_BINARY_POINT - shift + scale;
+        r.frac = a << shift;
     }
 
     return r;
@@ -3475,12 +3441,9 @@  static FloatParts sqrt_float(FloatParts a, float_status *s, const FloatFmt *p)
     /* We need two overflow bits at the top. Adding room for that is a
      * right shift. If the exponent is odd, we can discard the low bit
      * by multiplying the fraction by 2; that's a left shift. Combine
-     * those and we shift right if the exponent is even.
+     * those and we shift right by 1 if the exponent is odd, otherwise 2.
      */
-    a_frac = a.frac;
-    if (!(a.exp & 1)) {
-        a_frac >>= 1;
-    }
+    a_frac = a.frac >> (2 - (a.exp & 1));
     a.exp >>= 1;
 
     /* Bit-by-bit computation of sqrt.  */
@@ -3488,10 +3451,10 @@  static FloatParts sqrt_float(FloatParts a, float_status *s, const FloatFmt *p)
     s_frac = 0;
 
     /* Iterate from implicit bit down to the 3 extra bits to compute a
-     * properly rounded result. Remember we've inserted one more bit
-     * at the top, so these positions are one less.
+     * properly rounded result. Remember we've inserted two more bits
+     * at the top, so these positions are two less.
      */
-    bit = DECOMPOSED_BINARY_POINT - 1;
+    bit = DECOMPOSED_BINARY_POINT - 2;
     last_bit = MAX(p->frac_shift - 4, 0);
     do {
         uint64_t q = 1ULL << bit;
@@ -3507,7 +3470,7 @@  static FloatParts sqrt_float(FloatParts a, float_status *s, const FloatFmt *p)
     /* Undo the right shift done above. If there is any remaining
      * fraction, the result is inexact. Set the sticky bit.
      */
-    a.frac = (r_frac << 1) + (a_frac != 0);
+    a.frac = (r_frac << 2) + (a_frac != 0);
 
     return a;
 }

[06/72] softfloat: Move the binary point to the msb

Commit Message

Comments

Patch