From patchwork Sun Jan 28 11:10:11 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jisheng Zhang X-Patchwork-Id: 13534438 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 32D31C47DDF for ; Sun, 28 Jan 2024 11:23:43 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:MIME-Version:References:In-Reply-To: Message-ID:Date:Subject:Cc:To:From:Reply-To:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=ZRogdoJhZD14awQ7+AZ7s6Qw+3KEXJpR408Dyohsp14=; b=eg9xuHh5DiwaNj 5ot7jRJKUKh07a+tW6+sJgCK0OGxonMcqMMwiVOLkpjUu7EEdDCMR8s3qIjdJRMmTb2sBVAWB+vuQ sa7i4qD69qSJpvKb+oWNpsbYeOkxvGwirmeUCZRFterTLNzpcVX+LrUeUgI9PNcE9OouvwCuLvWkh YyNGS+BQaFoBquw7dmusNOjntGraMPOxq5modlDVdXouZ3VVV6e3AALgS3iEGY0INhOHi4lCrJF3S pGGRFwQwjZPJ/3l3TfPRXzyLExeX3Sspp6ndk0CfQfTmgQF0dHPtMRfJT+/24xl/ZnkHwIpwOYiJv MEkTCVuFV3C3HFrFdU1w==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.97.1 #2 (Red Hat Linux)) id 1rU3GF-00000009JVT-1xwO; Sun, 28 Jan 2024 11:23:39 +0000 Received: from dfw.source.kernel.org ([139.178.84.217]) by bombadil.infradead.org with esmtps (Exim 4.97.1 #2 (Red Hat Linux)) id 1rU3GB-00000009JTo-3g0F for linux-riscv@lists.infradead.org; Sun, 28 Jan 2024 11:23:38 +0000 Received: from smtp.kernel.org (transwarp.subspace.kernel.org [100.75.92.58]) by dfw.source.kernel.org (Postfix) with ESMTP id 37784619F4; Sun, 28 Jan 2024 11:23:05 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 648C4C433F1; Sun, 28 Jan 2024 11:23:03 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1706440985; bh=H6ObIiC1/SrvGoNUsYPERHmflaLup99266tKz2q+n8o=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=k6NmmI47wmL2hkfLZGltatV2vMduLuKRl2JY1X+UAwUb9K8bk+Ue6SgAtdKiekNXa J03Zs6jj/+3idPhvTF3eWS2YLXhAQxVabJGuXtodv2bWIAgZUonmSkIjImkIay8Deg pu62F1I5ogsE9OxqJK+XFYIkxG7XuVUXaAhiolz4QLeA8RpotNrdgr9pt6J+wlTG0E kqERMKu1+zU9bttFQfeCh3XUKiCft6zsZ+i5Pmo4oL/tUAzHL37t6/JlgXhBPUDQ6V o4DjUq3a3cfxDa0nHYkvE10ucloFdqc7MctaoautCBVW/n0SFAHk1mKNbYpi+OENYh LmTOEOLk0Yzaw== From: Jisheng Zhang To: Paul Walmsley , Palmer Dabbelt , Albert Ou Cc: linux-riscv@lists.infradead.org, linux-kernel@vger.kernel.org, Matteo Croce , kernel test robot Subject: [PATCH 1/3] riscv: optimized memcpy Date: Sun, 28 Jan 2024 19:10:11 +0800 Message-ID: <20240128111013.2450-2-jszhang@kernel.org> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20240128111013.2450-1-jszhang@kernel.org> References: <20240128111013.2450-1-jszhang@kernel.org> MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20240128_032336_037167_445EDABA X-CRM114-Status: GOOD ( 16.37 ) X-BeenThere: linux-riscv@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-riscv" Errors-To: linux-riscv-bounces+linux-riscv=archiver.kernel.org@lists.infradead.org From: Matteo Croce Write a C version of memcpy() which uses the biggest data size allowed, without generating unaligned accesses. The procedure is made of three steps: First copy data one byte at time until the destination buffer is aligned to a long boundary. Then copy the data one long at time shifting the current and the next u8 to compose a long at every cycle. Finally, copy the remainder one byte at time. On a BeagleV, the TCP RX throughput increased by 45%: before: $ iperf3 -c beaglev Connecting to host beaglev, port 5201 [ 5] local 192.168.85.6 port 44840 connected to 192.168.85.48 port 5201 [ ID] Interval Transfer Bitrate Retr Cwnd [ 5] 0.00-1.00 sec 76.4 MBytes 641 Mbits/sec 27 624 KBytes [ 5] 1.00-2.00 sec 72.5 MBytes 608 Mbits/sec 0 708 KBytes [ 5] 2.00-3.00 sec 73.8 MBytes 619 Mbits/sec 10 451 KBytes [ 5] 3.00-4.00 sec 72.5 MBytes 608 Mbits/sec 0 564 KBytes [ 5] 4.00-5.00 sec 73.8 MBytes 619 Mbits/sec 0 658 KBytes [ 5] 5.00-6.00 sec 73.8 MBytes 619 Mbits/sec 14 522 KBytes [ 5] 6.00-7.00 sec 73.8 MBytes 619 Mbits/sec 0 621 KBytes [ 5] 7.00-8.00 sec 72.5 MBytes 608 Mbits/sec 0 706 KBytes [ 5] 8.00-9.00 sec 73.8 MBytes 619 Mbits/sec 20 580 KBytes [ 5] 9.00-10.00 sec 73.8 MBytes 619 Mbits/sec 0 672 KBytes - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bitrate Retr [ 5] 0.00-10.00 sec 736 MBytes 618 Mbits/sec 71 sender [ 5] 0.00-10.01 sec 733 MBytes 615 Mbits/sec receiver after: $ iperf3 -c beaglev Connecting to host beaglev, port 5201 [ 5] local 192.168.85.6 port 44864 connected to 192.168.85.48 port 5201 [ ID] Interval Transfer Bitrate Retr Cwnd [ 5] 0.00-1.00 sec 109 MBytes 912 Mbits/sec 48 559 KBytes [ 5] 1.00-2.00 sec 108 MBytes 902 Mbits/sec 0 690 KBytes [ 5] 2.00-3.00 sec 106 MBytes 891 Mbits/sec 36 396 KBytes [ 5] 3.00-4.00 sec 108 MBytes 902 Mbits/sec 0 567 KBytes [ 5] 4.00-5.00 sec 106 MBytes 891 Mbits/sec 0 699 KBytes [ 5] 5.00-6.00 sec 106 MBytes 891 Mbits/sec 32 414 KBytes [ 5] 6.00-7.00 sec 106 MBytes 891 Mbits/sec 0 583 KBytes [ 5] 7.00-8.00 sec 106 MBytes 891 Mbits/sec 0 708 KBytes [ 5] 8.00-9.00 sec 106 MBytes 891 Mbits/sec 28 433 KBytes [ 5] 9.00-10.00 sec 108 MBytes 902 Mbits/sec 0 591 KBytes - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bitrate Retr [ 5] 0.00-10.00 sec 1.04 GBytes 897 Mbits/sec 144 sender [ 5] 0.00-10.01 sec 1.04 GBytes 894 Mbits/sec receiver And the decreased CPU time of the memcpy() is observable with perf top. This is the `perf top -Ue task-clock` output when doing the test: before: Overhead Shared O Symbol 42.22% [kernel] [k] memcpy 35.00% [kernel] [k] __asm_copy_to_user 3.50% [kernel] [k] sifive_l2_flush64_range 2.30% [kernel] [k] stmmac_napi_poll_rx 1.11% [kernel] [k] memset after: Overhead Shared O Symbol 45.69% [kernel] [k] __asm_copy_to_user 29.06% [kernel] [k] memcpy 4.09% [kernel] [k] sifive_l2_flush64_range 2.77% [kernel] [k] stmmac_napi_poll_rx 1.24% [kernel] [k] memset Compared with Matteo's original series, Jisheng made below changes: 1. adopt Emil's change to fix boot failure when build with clang 2. add corresponding changes to purgatory 3. always build optimized string.c rather than only build when optimize for performance 4. implement unroll support when src & dst are both aligned to keep the same performance as assembly version. After disassembling, I found that the unroll version looks something like below, so it acchieves the "unroll" effect as asm version but in C programming language: ld t2,0(a5) ld t0,8(a5) ld t6,16(a5) ld t5,24(a5) ld t4,32(a5) ld t3,40(a5) ld t1,48(a5) ld a1,56(a5) sd t2,0(a6) sd t0,8(a6) sd t6,16(a6) sd t5,24(a6) sd t4,32(a6) sd t3,40(a6) sd t1,48(a6) sd a1,56(a6) And per my testing, unrolling more doesn't help performance, so the "c" version only unrolls by using 8 GP regs rather than 16 ones as asm version. 5. Add proper __pi_memcpy and __pi___memcpy alias 6. more performance numbers. Jisheng's commit msg: Use the benchmark program from [1], I got below results on TH1520, CV1800B and JH7110 platforms. *TH1520 platform (I fixed cpu freq at 750MHZ): Before the patch: Random memcpy (bytes/ns): __memcpy 32K: 0.52 64K: 0.43 128K: 0.38 256K: 0.35 512K: 0.31 1024K: 0.22 avg 0.34 memcpy_call 32K: 0.41 64K: 0.35 128K: 0.33 256K: 0.31 512K: 0.28 1024K: 0.20 avg 0.30 Aligned medium memcpy (bytes/ns): __memcpy 8B: 0.46 16B: 0.61 32B: 0.84 64B: 0.89 128B: 3.31 256B: 3.44 512B: 3.51 memcpy_call 8B: 0.18 16B: 0.26 32B: 0.50 64B: 0.90 128B: 1.57 256B: 2.31 512B: 2.92 Unaligned medium memcpy (bytes/ns): __memcpy 8B: 0.19 16B: 0.18 32B: 0.25 64B: 0.30 128B: 0.33 256B: 0.35 512B: 0.36 memcpy_call 8B: 0.16 16B: 0.22 32B: 0.39 64B: 0.70 128B: 1.11 256B: 1.46 512B: 1.81 Large memcpy (bytes/ns): __memcpy 1K: 3.57 2K: 3.85 4K: 3.75 8K: 3.98 16K: 4.03 32K: 4.06 64K: 4.40 memcpy_call 1K: 3.13 2K: 3.75 4K: 3.99 8K: 4.29 16K: 4.40 32K: 4.46 64K: 4.63 After the patch: Random memcpy (bytes/ns): __memcpy 32K: 0.32 64K: 0.28 128K: 0.26 256K: 0.24 512K: 0.22 1024K: 0.17 avg 0.24 memcpy_call 32K: 0.39 64K: 0.34 128K: 0.32 256K: 0.30 512K: 0.27 1024K: 0.20 avg 0.29 Aligned medium memcpy (bytes/ns): __memcpy 8B: 0.20 16B: 0.22 32B: 0.25 64B: 2.43 128B: 3.19 256B: 3.36 512B: 3.55 memcpy_call 8B: 0.18 16B: 0.24 32B: 0.46 64B: 0.88 128B: 1.53 256B: 2.30 512B: 2.92 Unaligned medium memcpy (bytes/ns): __memcpy 8B: 0.22 16B: 0.29 32B: 0.49 64B: 0.51 128B: 0.87 256B: 1.08 512B: 1.27 memcpy_call 8B: 0.12 16B: 0.21 32B: 0.40 64B: 0.70 128B: 1.10 256B: 1.46 512B: 1.80 Large memcpy (bytes/ns): __memcpy 1K: 3.63 2K: 3.66 4K: 3.78 8K: 3.87 16K: 3.96 32K: 4.11 64K: 4.40 memcpy_call 1K: 3.32 2K: 3.68 4K: 3.99 8K: 4.17 16K: 4.25 32K: 4.48 64K: 4.60 As can be seen, the unaligned medium memcpy performance is improved by about 252%, I.E got 3.5x speed of original's. The performance of other style mempcy is kept the same as original's. And since the TH1520 supports HAVE_EFFICIENT_UNALIGNED_ACCESS, we can optimize the memcpy futher without taking care of alignment at all. Random memcpy (bytes/ns): __memcpy 32K: 0.35 64K: 0.31 128K: 0.28 256K: 0.25 512K: 0.23 1024K: 0.17 av g 0.25 memcpy_call 32K: 0.40 64K: 0.35 128K: 0.33 256K: 0.31 512K: 0.27 1024K: 0.20 av g 0.30 Aligned medium memcpy (bytes/ns): __memcpy 8B: 0.21 16B: 0.23 32B: 0.27 64B: 3.34 128B: 3.42 256B: 3.50 512B: 3.58 memcpy_call 8B: 0.18 16B: 0.24 32B: 0.46 64B: 0.88 128B: 1.53 256B: 2.31 512B: 2.92 Unaligned medium memcpy (bytes/ns): __memcpy 8B: 0.20 16B: 0.23 32B: 0.28 64B: 3.05 128B: 2.70 256B: 2.82 512B: 2.88 memcpy_call 8B: 0.16 16B: 0.21 32B: 0.38 64B: 0.70 128B: 1.11 256B: 1.50 512B: 1.81 Large memcpy (bytes/ns): __memcpy 1K: 3.62 2K: 3.71 4K: 3.76 8K: 3.92 16K: 3.96 32K: 4.12 64K: 4.40 memcpy_call 1K: 3.11 2K: 3.66 4K: 4.02 8K: 4.16 16K: 4.34 32K: 4.47 64K: 4.62 As can be seen, the unaligned medium memcpy is improved by 700%, I.E 8x speed of original's. *CV1800B platform: Before the patch: Random memcpy (bytes/ns): __memcpy 32K: 0.21 64K: 0.10 128K: 0.08 256K: 0.07 512K: 0.06 1024K: 0.06 avg 0.08 memcpy_call 32K: 0.19 64K: 0.10 128K: 0.08 256K: 0.07 512K: 0.06 1024K: 0.06 avg 0.08 Aligned medium memcpy (bytes/ns): __memcpy 8B: 0.26 16B: 0.36 32B: 0.48 64B: 0.51 128B: 2.01 256B: 2.44 512B: 2.73 memcpy_call 8B: 0.10 16B: 0.18 32B: 0.33 64B: 0.59 128B: 0.90 256B: 1.21 512B: 1.47 Unaligned medium memcpy (bytes/ns): __memcpy 8B: 0.11 16B: 0.12 32B: 0.15 64B: 0.16 128B: 0.16 256B: 0.17 512B: 0.17 memcpy_call 8B: 0.10 16B: 0.12 32B: 0.21 64B: 0.34 128B: 0.50 256B: 0.66 512B: 0.77 Large memcpy (bytes/ns): __memcpy 1K: 2.90 2K: 2.91 4K: 3.00 8K: 3.04 16K: 3.03 32K: 2.89 64K: 2.52 memcpy_call 1K: 1.62 2K: 1.74 4K: 1.80 8K: 1.83 16K: 1.84 32K: 1.78 64K: 1.54 After the patch: Random memcpy (bytes/ns): __memcpy 32K: 0.15 64K: 0.08 128K: 0.06 256K: 0.06 512K: 0.05 1024K: 0.05 avg 0.07 memcpy_call 32K: 0.19 64K: 0.10 128K: 0.08 256K: 0.07 512K: 0.06 1024K: 0.06 avg 0.08 Aligned medium memcpy (bytes/ns): __memcpy 8B: 0.11 16B: 0.11 32B: 0.14 64B: 1.15 128B: 1.62 256B: 2.06 512B: 2.40 memcpy_call 8B: 0.10 16B: 0.18 32B: 0.33 64B: 0.59 128B: 0.90 256B: 1.21 512B: 1.47 Unaligned medium memcpy (bytes/ns): __memcpy 8B: 0.11 16B: 0.12 32B: 0.21 64B: 0.32 128B: 0.43 256B: 0.52 512B: 0.59 memcpy_call 8B: 0.10 16B: 0.12 32B: 0.21 64B: 0.34 128B: 0.50 256B: 0.66 512B: 0.77 Large memcpy (bytes/ns): __memcpy 1K: 2.56 2K: 2.71 4K: 2.78 8K: 2.81 16K: 2.80 32K: 2.68 64K: 2.51 memcpy_call 1K: 1.62 2K: 1.74 4K: 1.80 8K: 1.83 16K: 1.84 32K: 1.78 64K: 1.54 We get similar performance improvement as TH1520. And since CV1800B also supports HAVE_EFFICIENT_UNALIGNED_ACCESS, so the performance can be improved futher: Random memcpy (bytes/ns): __memcpy 32K: 0.15 64K: 0.08 128K: 0.07 256K: 0.06 512K: 0.05 1024K: 0.05 avg 0.07 memcpy_call 32K: 0.19 64K: 0.10 128K: 0.08 256K: 0.07 512K: 0.06 1024K: 0.06 avg 0.08 Aligned medium memcpy (bytes/ns): __memcpy 8B: 0.13 16B: 0.14 32B: 0.15 64B: 1.55 128B: 2.01 256B: 2.36 512B: 2.58 memcpy_call 8B: 0.10 16B: 0.18 32B: 0.33 64B: 0.59 128B: 0.90 256B: 1.21 512B: 1.47 Unaligned medium memcpy (bytes/ns): __memcpy 8B: 0.13 16B: 0.14 32B: 0.15 64B: 1.06 128B: 1.26 256B: 1.39 512B: 1.46 memcpy_call 8B: 0.10 16B: 0.12 32B: 0.21 64B: 0.34 128B: 0.50 256B: 0.66 512B: 0.77 Large memcpy (bytes/ns): __memcpy 1K: 2.65 2K: 2.76 4K: 2.80 8K: 2.82 16K: 2.81 32K: 2.68 64K: 2.51 memcpy_call 1K: 1.63 2K: 1.74 4K: 1.80 8K: 1.84 16K: 1.84 32K: 1.78 64K: 1.54 Now the unaligned medium memcpy is running at 8.6x speed of original's! *JH7110 (I fixed cpufreq at 1.5GHZ) Before the patch: Random memcpy (bytes/ns): __memcpy 32K: 0.45 64K: 0.40 128K: 0.36 256K: 0.33 512K: 0.33 1024K: 0.31 avg 0.36 memcpy_call 32K: 0.43 64K: 0.38 128K: 0.34 256K: 0.31 512K: 0.31 1024K: 0.29 avg 0.34 Aligned medium memcpy (bytes/ns): __memcpy 8B: 0.42 16B: 0.55 32B: 0.65 64B: 0.72 128B: 2.91 256B: 3.36 512B: 3.65 memcpy_call 8B: 0.16 16B: 0.36 32B: 0.67 64B: 1.14 128B: 1.70 256B: 2.26 512B: 2.71 Unaligned medium memcpy (bytes/ns): __memcpy 8B: 0.17 16B: 0.18 32B: 0.19 64B: 0.19 128B: 0.19 256B: 0.20 512B: 0.20 memcpy_call 8B: 0.16 16B: 0.20 32B: 0.36 64B: 0.62 128B: 0.94 256B: 1.28 512B: 1.52 Large memcpy (bytes/ns): __memcpy 1K: 3.62 2K: 3.82 4K: 3.90 8K: 3.95 16K: 3.97 32K: 1.33 64K: 1.33 memcpy_call 1K: 2.93 2K: 3.14 4K: 3.25 8K: 3.31 16K: 3.19 32K: 1.27 64K: 1.28 After the patch: Random memcpy (bytes/ns): __memcpy 32K: 0.26 64K: 0.24 128K: 0.23 256K: 0.22 512K: 0.22 1024K: 0.21 avg 0.23 memcpy_call 32K: 0.42 64K: 0.38 128K: 0.34 256K: 0.31 512K: 0.31 1024K: 0.29 avg 0.34 Aligned medium memcpy (bytes/ns): __memcpy 8B: 0.17 16B: 0.17 32B: 0.18 64B: 1.94 128B: 2.56 256B: 3.04 512B: 3.36 memcpy_call 8B: 0.17 16B: 0.36 32B: 0.65 64B: 1.12 128B: 1.73 256B: 2.37 512B: 2.91 Unaligned medium memcpy (bytes/ns): __memcpy 8B: 0.17 16B: 0.24 32B: 0.41 64B: 0.63 128B: 0.85 256B: 1.00 512B: 1.14 memcpy_call 8B: 0.16 16B: 0.22 32B: 0.38 64B: 0.65 128B: 0.99 256B: 1.35 512B: 1.61 Large memcpy (bytes/ns): __memcpy 1K: 3.43 2K: 3.59 4K: 3.67 8K: 3.72 16K: 3.73 32K: 1.28 64K: 1.28 memcpy_call 1K: 3.21 2K: 3.46 4K: 3.60 8K: 3.68 16K: 3.51 32K: 1.27 64K: 1.28 As can be seen, the unaligned medium memcpy performance is improved by about 470%, I.E 5.7x speed of original's. The performance of other style mempcy is kept the same as original's. Link:https://github.com/ARM-software/optimized-routines/blob/master/string/bench/memcpy.c [1] Signed-off-by: Matteo Croce Co-developed-by: Jisheng Zhang Signed-off-by: Jisheng Zhang Reported-by: kernel test robot --- arch/riscv/include/asm/string.h | 6 +- arch/riscv/kernel/riscv_ksyms.c | 2 - arch/riscv/lib/Makefile | 7 +- arch/riscv/lib/memcpy.S | 110 ----------------------------- arch/riscv/lib/string.c | 121 ++++++++++++++++++++++++++++++++ arch/riscv/purgatory/Makefile | 10 +-- 6 files changed, 136 insertions(+), 120 deletions(-) delete mode 100644 arch/riscv/lib/memcpy.S create mode 100644 arch/riscv/lib/string.c diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h index a96b1fea24fe..edf1d56e4f13 100644 --- a/arch/riscv/include/asm/string.h +++ b/arch/riscv/include/asm/string.h @@ -12,9 +12,11 @@ #define __HAVE_ARCH_MEMSET extern asmlinkage void *memset(void *, int, size_t); extern asmlinkage void *__memset(void *, int, size_t); + #define __HAVE_ARCH_MEMCPY -extern asmlinkage void *memcpy(void *, const void *, size_t); -extern asmlinkage void *__memcpy(void *, const void *, size_t); +extern void *memcpy(void *dest, const void *src, size_t count); +extern void *__memcpy(void *dest, const void *src, size_t count); + #define __HAVE_ARCH_MEMMOVE extern asmlinkage void *memmove(void *, const void *, size_t); extern asmlinkage void *__memmove(void *, const void *, size_t); diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c index a72879b4249a..c69dc74e0a27 100644 --- a/arch/riscv/kernel/riscv_ksyms.c +++ b/arch/riscv/kernel/riscv_ksyms.c @@ -10,11 +10,9 @@ * Assembly functions that may be used (directly or indirectly) by modules */ EXPORT_SYMBOL(memset); -EXPORT_SYMBOL(memcpy); EXPORT_SYMBOL(memmove); EXPORT_SYMBOL(strcmp); EXPORT_SYMBOL(strlen); EXPORT_SYMBOL(strncmp); EXPORT_SYMBOL(__memset); -EXPORT_SYMBOL(__memcpy); EXPORT_SYMBOL(__memmove); diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile index bd6e6c1b0497..5f2f94f6db17 100644 --- a/arch/riscv/lib/Makefile +++ b/arch/riscv/lib/Makefile @@ -1,10 +1,10 @@ # SPDX-License-Identifier: GPL-2.0-only lib-y += delay.o -lib-y += memcpy.o lib-y += memset.o lib-y += memmove.o lib-y += strcmp.o lib-y += strlen.o +lib-y += string.o lib-y += strncmp.o lib-y += csum.o ifeq ($(CONFIG_MMU), y) @@ -14,6 +14,11 @@ lib-$(CONFIG_MMU) += uaccess.o lib-$(CONFIG_64BIT) += tishift.o lib-$(CONFIG_RISCV_ISA_ZICBOZ) += clear_page.o +# string.o implements standard library functions like memset/memcpy etc. +# Use -ffreestanding to ensure that the compiler does not try to "optimize" +# them into calls to themselves. +CFLAGS_string.o := -ffreestanding + obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o lib-$(CONFIG_RISCV_ISA_V) += xor.o lib-$(CONFIG_RISCV_ISA_V) += riscv_v_helpers.o diff --git a/arch/riscv/lib/memcpy.S b/arch/riscv/lib/memcpy.S deleted file mode 100644 index 44e009ec5fef..000000000000 --- a/arch/riscv/lib/memcpy.S +++ /dev/null @@ -1,110 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0-only */ -/* - * Copyright (C) 2013 Regents of the University of California - */ - -#include -#include - -/* void *memcpy(void *, const void *, size_t) */ -SYM_FUNC_START(__memcpy) - move t6, a0 /* Preserve return value */ - - /* Defer to byte-oriented copy for small sizes */ - sltiu a3, a2, 128 - bnez a3, 4f - /* Use word-oriented copy only if low-order bits match */ - andi a3, t6, SZREG-1 - andi a4, a1, SZREG-1 - bne a3, a4, 4f - - beqz a3, 2f /* Skip if already aligned */ - /* - * Round to nearest double word-aligned address - * greater than or equal to start address - */ - andi a3, a1, ~(SZREG-1) - addi a3, a3, SZREG - /* Handle initial misalignment */ - sub a4, a3, a1 -1: - lb a5, 0(a1) - addi a1, a1, 1 - sb a5, 0(t6) - addi t6, t6, 1 - bltu a1, a3, 1b - sub a2, a2, a4 /* Update count */ - -2: - andi a4, a2, ~((16*SZREG)-1) - beqz a4, 4f - add a3, a1, a4 -3: - REG_L a4, 0(a1) - REG_L a5, SZREG(a1) - REG_L a6, 2*SZREG(a1) - REG_L a7, 3*SZREG(a1) - REG_L t0, 4*SZREG(a1) - REG_L t1, 5*SZREG(a1) - REG_L t2, 6*SZREG(a1) - REG_L t3, 7*SZREG(a1) - REG_L t4, 8*SZREG(a1) - REG_L t5, 9*SZREG(a1) - REG_S a4, 0(t6) - REG_S a5, SZREG(t6) - REG_S a6, 2*SZREG(t6) - REG_S a7, 3*SZREG(t6) - REG_S t0, 4*SZREG(t6) - REG_S t1, 5*SZREG(t6) - REG_S t2, 6*SZREG(t6) - REG_S t3, 7*SZREG(t6) - REG_S t4, 8*SZREG(t6) - REG_S t5, 9*SZREG(t6) - REG_L a4, 10*SZREG(a1) - REG_L a5, 11*SZREG(a1) - REG_L a6, 12*SZREG(a1) - REG_L a7, 13*SZREG(a1) - REG_L t0, 14*SZREG(a1) - REG_L t1, 15*SZREG(a1) - addi a1, a1, 16*SZREG - REG_S a4, 10*SZREG(t6) - REG_S a5, 11*SZREG(t6) - REG_S a6, 12*SZREG(t6) - REG_S a7, 13*SZREG(t6) - REG_S t0, 14*SZREG(t6) - REG_S t1, 15*SZREG(t6) - addi t6, t6, 16*SZREG - bltu a1, a3, 3b - andi a2, a2, (16*SZREG)-1 /* Update count */ - -4: - /* Handle trailing misalignment */ - beqz a2, 6f - add a3, a1, a2 - - /* Use word-oriented copy if co-aligned to word boundary */ - or a5, a1, t6 - or a5, a5, a3 - andi a5, a5, 3 - bnez a5, 5f -7: - lw a4, 0(a1) - addi a1, a1, 4 - sw a4, 0(t6) - addi t6, t6, 4 - bltu a1, a3, 7b - - ret - -5: - lb a4, 0(a1) - addi a1, a1, 1 - sb a4, 0(t6) - addi t6, t6, 1 - bltu a1, a3, 5b -6: - ret -SYM_FUNC_END(__memcpy) -SYM_FUNC_ALIAS_WEAK(memcpy, __memcpy) -SYM_FUNC_ALIAS(__pi_memcpy, __memcpy) -SYM_FUNC_ALIAS(__pi___memcpy, __memcpy) diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c new file mode 100644 index 000000000000..5f9c83ec548d --- /dev/null +++ b/arch/riscv/lib/string.c @@ -0,0 +1,121 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * String functions optimized for hardware which doesn't + * handle unaligned memory accesses efficiently. + * + * Copyright (C) 2021 Matteo Croce + */ + +#include +#include + +/* Minimum size for a word copy to be convenient */ +#define BYTES_LONG sizeof(long) +#define WORD_MASK (BYTES_LONG - 1) +#define MIN_THRESHOLD (BYTES_LONG * 2) + +/* convenience union to avoid cast between different pointer types */ +union types { + u8 *as_u8; + unsigned long *as_ulong; + uintptr_t as_uptr; +}; + +union const_types { + const u8 *as_u8; + const unsigned long *as_ulong; + const uintptr_t as_uptr; +}; + +static void __memcpy_aligned(unsigned long *dest, const unsigned long *src, size_t count) +{ + for (; count > 0; count -= BYTES_LONG * 8) { + register unsigned long d0, d1, d2, d3, d4, d5, d6, d7; + d0 = src[0]; + d1 = src[1]; + d2 = src[2]; + d3 = src[3]; + d4 = src[4]; + d5 = src[5]; + d6 = src[6]; + d7 = src[7]; + dest[0] = d0; + dest[1] = d1; + dest[2] = d2; + dest[3] = d3; + dest[4] = d4; + dest[5] = d5; + dest[6] = d6; + dest[7] = d7; + dest += 8; + src += 8; + } +} + +void *__memcpy(void *dest, const void *src, size_t count) +{ + union const_types s = { .as_u8 = src }; + union types d = { .as_u8 = dest }; + int distance = 0; + + if (count < MIN_THRESHOLD) + goto copy_remainder; + + if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) { + /* Copy a byte at time until destination is aligned. */ + for (; d.as_uptr & WORD_MASK; count--) + *d.as_u8++ = *s.as_u8++; + + distance = s.as_uptr & WORD_MASK; + } + + if (distance) { + unsigned long last, next; + + /* + * s is distance bytes ahead of d, and d just reached + * the alignment boundary. Move s backward to word align it + * and shift data to compensate for distance, in order to do + * word-by-word copy. + */ + s.as_u8 -= distance; + + next = s.as_ulong[0]; + for (; count >= BYTES_LONG; count -= BYTES_LONG) { + last = next; + next = s.as_ulong[1]; + + d.as_ulong[0] = last >> (distance * 8) | + next << ((BYTES_LONG - distance) * 8); + + d.as_ulong++; + s.as_ulong++; + } + + /* Restore s with the original offset. */ + s.as_u8 += distance; + } else { + /* + * If the source and dest lower bits are the same, do a simple + * aligned copy. + */ + size_t aligned_count = count & ~(BYTES_LONG * 8 - 1); + + __memcpy_aligned(d.as_ulong, s.as_ulong, aligned_count); + d.as_u8 += aligned_count; + s.as_u8 += aligned_count; + count &= BYTES_LONG * 8 - 1; + } + +copy_remainder: + while (count--) + *d.as_u8++ = *s.as_u8++; + + return dest; +} +EXPORT_SYMBOL(__memcpy); + +void *memcpy(void *dest, const void *src, size_t count) __weak __alias(__memcpy); +EXPORT_SYMBOL(memcpy); +void *__pi_memcpy(void *dest, const void *src, size_t count) __alias(__memcpy); +void *__pi___memcpy(void *dest, const void *src, size_t count) __alias(__memcpy); diff --git a/arch/riscv/purgatory/Makefile b/arch/riscv/purgatory/Makefile index 280b0eb352b8..8b940ff04895 100644 --- a/arch/riscv/purgatory/Makefile +++ b/arch/riscv/purgatory/Makefile @@ -1,21 +1,21 @@ # SPDX-License-Identifier: GPL-2.0 OBJECT_FILES_NON_STANDARD := y -purgatory-y := purgatory.o sha256.o entry.o string.o ctype.o memcpy.o memset.o -purgatory-y += strcmp.o strlen.o strncmp.o +purgatory-y := purgatory.o sha256.o entry.o string.o ctype.o memset.o +purgatory-y += strcmp.o strlen.o strncmp.o riscv_string.o targets += $(purgatory-y) PURGATORY_OBJS = $(addprefix $(obj)/,$(purgatory-y)) +$(obj)/riscv_string.o: $(srctree)/arch/riscv/lib/string.c FORCE + $(call if_changed_rule,cc_o_c) + $(obj)/string.o: $(srctree)/lib/string.c FORCE $(call if_changed_rule,cc_o_c) $(obj)/ctype.o: $(srctree)/lib/ctype.c FORCE $(call if_changed_rule,cc_o_c) -$(obj)/memcpy.o: $(srctree)/arch/riscv/lib/memcpy.S FORCE - $(call if_changed_rule,as_o_S) - $(obj)/memset.o: $(srctree)/arch/riscv/lib/memset.S FORCE $(call if_changed_rule,as_o_S) From patchwork Sun Jan 28 11:10:12 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jisheng Zhang X-Patchwork-Id: 13534432 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id C17AAC47258 for ; Sun, 28 Jan 2024 11:23:22 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:MIME-Version:References:In-Reply-To: Message-ID:Date:Subject:Cc:To:From:Reply-To:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=0PyNhA/z83ZSGdaNW4eeGLAq1VZv3uRr2YrGeJAMbrE=; b=m45t0555ertE1v YVrPgjJnNgjMxnnq3qLSIA07f7zm8ScdCuVvz5nlLy3FHlDOhKMl959W3Ptn41gO1KdIEqdkhau3Q eMGZ5Z0ireM37m3kbX7/PSRXRwY7y3huS+GudRv3sVIxwVg4NK5+3w6qAq26ZJqGSbKjn4iFUfCQM Piz0QYroiujZpoaabaQ+U3sG5TA+uRGuKCM9t70GT8BUcnn2+Sn7zp2M97Ah0GbnBW478SvFWIoBI SdMjX6cv8DMpuO1SoRG7RjuwAsl1PMkex9JLU83tFD0QruYzHcFd+xIoSubJmWPFL/YYKtvFU8sFX jx/o0hcfdTqqQ1kmJrfw==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.97.1 #2 (Red Hat Linux)) id 1rU3Fn-00000009JLT-3eHK; Sun, 28 Jan 2024 11:23:11 +0000 Received: from dfw.source.kernel.org ([139.178.84.217]) by bombadil.infradead.org with esmtps (Exim 4.97.1 #2 (Red Hat Linux)) id 1rU3Fk-00000009JJu-234R for linux-riscv@lists.infradead.org; Sun, 28 Jan 2024 11:23:10 +0000 Received: from smtp.kernel.org (transwarp.subspace.kernel.org [100.75.92.58]) by dfw.source.kernel.org (Postfix) with ESMTP id 20F0D61A10; Sun, 28 Jan 2024 11:23:07 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 71445C433B2; Sun, 28 Jan 2024 11:23:05 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1706440986; bh=/od1c7MyoQNpI2RDNFAjU9gX1pL9wSnySvV0sadsH0E=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=kf1+V7HY4GYXLpeGiqpvV0SN8/z9kSnocTgUcsX85JaCdLjsfHuJ2CxlOgVtGXU26 0GOR0OqSoCpgZrUSNJ0S1x3yyEArR//TihcKOEaWRJxvCsArknqUQYXicE0+k0s6nS hu9CDu4nyLMcNG3BihjTgFi/B2iZFyGzyFqqMAPaKfkW6tUEr6gyKyaYUUCM5SV8vL utU2gvuaVoEmx0LA3l0JEFI7T4UKVMcI5WLfxdlBLO3Ewo3tw8nXuW09m4q3TyAU6w pd5XLTnk9vYDrD6Y1M02n3tB+lk329yUZqNttz9i0q6cySadWLGPKL/LtkxfRKXZtD +l4+FOE2d54aQ== From: Jisheng Zhang To: Paul Walmsley , Palmer Dabbelt , Albert Ou Cc: linux-riscv@lists.infradead.org, linux-kernel@vger.kernel.org, Matteo Croce , kernel test robot Subject: [PATCH 2/3] riscv: optimized memmove Date: Sun, 28 Jan 2024 19:10:12 +0800 Message-ID: <20240128111013.2450-3-jszhang@kernel.org> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20240128111013.2450-1-jszhang@kernel.org> References: <20240128111013.2450-1-jszhang@kernel.org> MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20240128_032308_656578_AD7FED9B X-CRM114-Status: GOOD ( 32.27 ) X-BeenThere: linux-riscv@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-riscv" Errors-To: linux-riscv-bounces+linux-riscv=archiver.kernel.org@lists.infradead.org From: Matteo Croce When the destination buffer is before the source one, or when the buffers doesn't overlap, it's safe to use memcpy() instead, which is optimized to use a bigger data size possible. Signed-off-by: Matteo Croce Reported-by: kernel test robot Signed-off-by: Jisheng Zhang --- arch/riscv/include/asm/string.h | 4 +- arch/riscv/kernel/riscv_ksyms.c | 2 - arch/riscv/lib/Makefile | 1 - arch/riscv/lib/memmove.S | 317 -------------------------------- arch/riscv/lib/string.c | 25 +++ 5 files changed, 27 insertions(+), 322 deletions(-) delete mode 100644 arch/riscv/lib/memmove.S diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h index edf1d56e4f13..17c3b40382e1 100644 --- a/arch/riscv/include/asm/string.h +++ b/arch/riscv/include/asm/string.h @@ -18,8 +18,8 @@ extern void *memcpy(void *dest, const void *src, size_t count); extern void *__memcpy(void *dest, const void *src, size_t count); #define __HAVE_ARCH_MEMMOVE -extern asmlinkage void *memmove(void *, const void *, size_t); -extern asmlinkage void *__memmove(void *, const void *, size_t); +extern void *memmove(void *dest, const void *src, size_t count); +extern void *__memmove(void *dest, const void *src, size_t count); #define __HAVE_ARCH_STRCMP extern asmlinkage int strcmp(const char *cs, const char *ct); diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c index c69dc74e0a27..76849d0906ef 100644 --- a/arch/riscv/kernel/riscv_ksyms.c +++ b/arch/riscv/kernel/riscv_ksyms.c @@ -10,9 +10,7 @@ * Assembly functions that may be used (directly or indirectly) by modules */ EXPORT_SYMBOL(memset); -EXPORT_SYMBOL(memmove); EXPORT_SYMBOL(strcmp); EXPORT_SYMBOL(strlen); EXPORT_SYMBOL(strncmp); EXPORT_SYMBOL(__memset); -EXPORT_SYMBOL(__memmove); diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile index 5f2f94f6db17..5fa88c5a601c 100644 --- a/arch/riscv/lib/Makefile +++ b/arch/riscv/lib/Makefile @@ -1,7 +1,6 @@ # SPDX-License-Identifier: GPL-2.0-only lib-y += delay.o lib-y += memset.o -lib-y += memmove.o lib-y += strcmp.o lib-y += strlen.o lib-y += string.o diff --git a/arch/riscv/lib/memmove.S b/arch/riscv/lib/memmove.S deleted file mode 100644 index cb3e2e7ef0ba..000000000000 --- a/arch/riscv/lib/memmove.S +++ /dev/null @@ -1,317 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0-only */ -/* - * Copyright (C) 2022 Michael T. Kloos - */ - -#include -#include - -SYM_FUNC_START(__memmove) - /* - * Returns - * a0 - dest - * - * Parameters - * a0 - Inclusive first byte of dest - * a1 - Inclusive first byte of src - * a2 - Length of copy n - * - * Because the return matches the parameter register a0, - * we will not clobber or modify that register. - * - * Note: This currently only works on little-endian. - * To port to big-endian, reverse the direction of shifts - * in the 2 misaligned fixup copy loops. - */ - - /* Return if nothing to do */ - beq a0, a1, .Lreturn_from_memmove - beqz a2, .Lreturn_from_memmove - - /* - * Register Uses - * Forward Copy: a1 - Index counter of src - * Reverse Copy: a4 - Index counter of src - * Forward Copy: t3 - Index counter of dest - * Reverse Copy: t4 - Index counter of dest - * Both Copy Modes: t5 - Inclusive first multibyte/aligned of dest - * Both Copy Modes: t6 - Non-Inclusive last multibyte/aligned of dest - * Both Copy Modes: t0 - Link / Temporary for load-store - * Both Copy Modes: t1 - Temporary for load-store - * Both Copy Modes: t2 - Temporary for load-store - * Both Copy Modes: a5 - dest to src alignment offset - * Both Copy Modes: a6 - Shift ammount - * Both Copy Modes: a7 - Inverse Shift ammount - * Both Copy Modes: a2 - Alternate breakpoint for unrolled loops - */ - - /* - * Solve for some register values now. - * Byte copy does not need t5 or t6. - */ - mv t3, a0 - add t4, a0, a2 - add a4, a1, a2 - - /* - * Byte copy if copying less than (2 * SZREG) bytes. This can - * cause problems with the bulk copy implementation and is - * small enough not to bother. - */ - andi t0, a2, -(2 * SZREG) - beqz t0, .Lbyte_copy - - /* - * Now solve for t5 and t6. - */ - andi t5, t3, -SZREG - andi t6, t4, -SZREG - /* - * If dest(Register t3) rounded down to the nearest naturally - * aligned SZREG address, does not equal dest, then add SZREG - * to find the low-bound of SZREG alignment in the dest memory - * region. Note that this could overshoot the dest memory - * region if n is less than SZREG. This is one reason why - * we always byte copy if n is less than SZREG. - * Otherwise, dest is already naturally aligned to SZREG. - */ - beq t5, t3, 1f - addi t5, t5, SZREG - 1: - - /* - * If the dest and src are co-aligned to SZREG, then there is - * no need for the full rigmarole of a full misaligned fixup copy. - * Instead, do a simpler co-aligned copy. - */ - xor t0, a0, a1 - andi t1, t0, (SZREG - 1) - beqz t1, .Lcoaligned_copy - /* Fall through to misaligned fixup copy */ - -.Lmisaligned_fixup_copy: - bltu a1, a0, .Lmisaligned_fixup_copy_reverse - -.Lmisaligned_fixup_copy_forward: - jal t0, .Lbyte_copy_until_aligned_forward - - andi a5, a1, (SZREG - 1) /* Find the alignment offset of src (a1) */ - slli a6, a5, 3 /* Multiply by 8 to convert that to bits to shift */ - sub a5, a1, t3 /* Find the difference between src and dest */ - andi a1, a1, -SZREG /* Align the src pointer */ - addi a2, t6, SZREG /* The other breakpoint for the unrolled loop*/ - - /* - * Compute The Inverse Shift - * a7 = XLEN - a6 = XLEN + -a6 - * 2s complement negation to find the negative: -a6 = ~a6 + 1 - * Add that to XLEN. XLEN = SZREG * 8. - */ - not a7, a6 - addi a7, a7, (SZREG * 8 + 1) - - /* - * Fix Misalignment Copy Loop - Forward - * load_val0 = load_ptr[0]; - * do { - * load_val1 = load_ptr[1]; - * store_ptr += 2; - * store_ptr[0 - 2] = (load_val0 >> {a6}) | (load_val1 << {a7}); - * - * if (store_ptr == {a2}) - * break; - * - * load_val0 = load_ptr[2]; - * load_ptr += 2; - * store_ptr[1 - 2] = (load_val1 >> {a6}) | (load_val0 << {a7}); - * - * } while (store_ptr != store_ptr_end); - * store_ptr = store_ptr_end; - */ - - REG_L t0, (0 * SZREG)(a1) - 1: - REG_L t1, (1 * SZREG)(a1) - addi t3, t3, (2 * SZREG) - srl t0, t0, a6 - sll t2, t1, a7 - or t2, t0, t2 - REG_S t2, ((0 * SZREG) - (2 * SZREG))(t3) - - beq t3, a2, 2f - - REG_L t0, (2 * SZREG)(a1) - addi a1, a1, (2 * SZREG) - srl t1, t1, a6 - sll t2, t0, a7 - or t2, t1, t2 - REG_S t2, ((1 * SZREG) - (2 * SZREG))(t3) - - bne t3, t6, 1b - 2: - mv t3, t6 /* Fix the dest pointer in case the loop was broken */ - - add a1, t3, a5 /* Restore the src pointer */ - j .Lbyte_copy_forward /* Copy any remaining bytes */ - -.Lmisaligned_fixup_copy_reverse: - jal t0, .Lbyte_copy_until_aligned_reverse - - andi a5, a4, (SZREG - 1) /* Find the alignment offset of src (a4) */ - slli a6, a5, 3 /* Multiply by 8 to convert that to bits to shift */ - sub a5, a4, t4 /* Find the difference between src and dest */ - andi a4, a4, -SZREG /* Align the src pointer */ - addi a2, t5, -SZREG /* The other breakpoint for the unrolled loop*/ - - /* - * Compute The Inverse Shift - * a7 = XLEN - a6 = XLEN + -a6 - * 2s complement negation to find the negative: -a6 = ~a6 + 1 - * Add that to XLEN. XLEN = SZREG * 8. - */ - not a7, a6 - addi a7, a7, (SZREG * 8 + 1) - - /* - * Fix Misalignment Copy Loop - Reverse - * load_val1 = load_ptr[0]; - * do { - * load_val0 = load_ptr[-1]; - * store_ptr -= 2; - * store_ptr[1] = (load_val0 >> {a6}) | (load_val1 << {a7}); - * - * if (store_ptr == {a2}) - * break; - * - * load_val1 = load_ptr[-2]; - * load_ptr -= 2; - * store_ptr[0] = (load_val1 >> {a6}) | (load_val0 << {a7}); - * - * } while (store_ptr != store_ptr_end); - * store_ptr = store_ptr_end; - */ - - REG_L t1, ( 0 * SZREG)(a4) - 1: - REG_L t0, (-1 * SZREG)(a4) - addi t4, t4, (-2 * SZREG) - sll t1, t1, a7 - srl t2, t0, a6 - or t2, t1, t2 - REG_S t2, ( 1 * SZREG)(t4) - - beq t4, a2, 2f - - REG_L t1, (-2 * SZREG)(a4) - addi a4, a4, (-2 * SZREG) - sll t0, t0, a7 - srl t2, t1, a6 - or t2, t0, t2 - REG_S t2, ( 0 * SZREG)(t4) - - bne t4, t5, 1b - 2: - mv t4, t5 /* Fix the dest pointer in case the loop was broken */ - - add a4, t4, a5 /* Restore the src pointer */ - j .Lbyte_copy_reverse /* Copy any remaining bytes */ - -/* - * Simple copy loops for SZREG co-aligned memory locations. - * These also make calls to do byte copies for any unaligned - * data at their terminations. - */ -.Lcoaligned_copy: - bltu a1, a0, .Lcoaligned_copy_reverse - -.Lcoaligned_copy_forward: - jal t0, .Lbyte_copy_until_aligned_forward - - 1: - REG_L t1, ( 0 * SZREG)(a1) - addi a1, a1, SZREG - addi t3, t3, SZREG - REG_S t1, (-1 * SZREG)(t3) - bne t3, t6, 1b - - j .Lbyte_copy_forward /* Copy any remaining bytes */ - -.Lcoaligned_copy_reverse: - jal t0, .Lbyte_copy_until_aligned_reverse - - 1: - REG_L t1, (-1 * SZREG)(a4) - addi a4, a4, -SZREG - addi t4, t4, -SZREG - REG_S t1, ( 0 * SZREG)(t4) - bne t4, t5, 1b - - j .Lbyte_copy_reverse /* Copy any remaining bytes */ - -/* - * These are basically sub-functions within the function. They - * are used to byte copy until the dest pointer is in alignment. - * At which point, a bulk copy method can be used by the - * calling code. These work on the same registers as the bulk - * copy loops. Therefore, the register values can be picked - * up from where they were left and we avoid code duplication - * without any overhead except the call in and return jumps. - */ -.Lbyte_copy_until_aligned_forward: - beq t3, t5, 2f - 1: - lb t1, 0(a1) - addi a1, a1, 1 - addi t3, t3, 1 - sb t1, -1(t3) - bne t3, t5, 1b - 2: - jalr zero, 0x0(t0) /* Return to multibyte copy loop */ - -.Lbyte_copy_until_aligned_reverse: - beq t4, t6, 2f - 1: - lb t1, -1(a4) - addi a4, a4, -1 - addi t4, t4, -1 - sb t1, 0(t4) - bne t4, t6, 1b - 2: - jalr zero, 0x0(t0) /* Return to multibyte copy loop */ - -/* - * Simple byte copy loops. - * These will byte copy until they reach the end of data to copy. - * At that point, they will call to return from memmove. - */ -.Lbyte_copy: - bltu a1, a0, .Lbyte_copy_reverse - -.Lbyte_copy_forward: - beq t3, t4, 2f - 1: - lb t1, 0(a1) - addi a1, a1, 1 - addi t3, t3, 1 - sb t1, -1(t3) - bne t3, t4, 1b - 2: - ret - -.Lbyte_copy_reverse: - beq t4, t3, 2f - 1: - lb t1, -1(a4) - addi a4, a4, -1 - addi t4, t4, -1 - sb t1, 0(t4) - bne t4, t3, 1b - 2: - -.Lreturn_from_memmove: - ret - -SYM_FUNC_END(__memmove) -SYM_FUNC_ALIAS_WEAK(memmove, __memmove) -SYM_FUNC_ALIAS(__pi_memmove, __memmove) -SYM_FUNC_ALIAS(__pi___memmove, __memmove) diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c index 5f9c83ec548d..20677c8067da 100644 --- a/arch/riscv/lib/string.c +++ b/arch/riscv/lib/string.c @@ -119,3 +119,28 @@ void *memcpy(void *dest, const void *src, size_t count) __weak __alias(__memcpy) EXPORT_SYMBOL(memcpy); void *__pi_memcpy(void *dest, const void *src, size_t count) __alias(__memcpy); void *__pi___memcpy(void *dest, const void *src, size_t count) __alias(__memcpy); + +/* + * Simply check if the buffer overlaps an call memcpy() in case, + * otherwise do a simple one byte at time backward copy. + */ +void *__memmove(void *dest, const void *src, size_t count) +{ + if (dest < src || src + count <= dest) + return __memcpy(dest, src, count); + + if (dest > src) { + const char *s = src + count; + char *tmp = dest + count; + + while (count--) + *--tmp = *--s; + } + return dest; +} +EXPORT_SYMBOL(__memmove); + +void *memmove(void *dest, const void *src, size_t count) __weak __alias(__memmove); +EXPORT_SYMBOL(memmove); +void *__pi_memmove(void *dest, const void *src, size_t count) __alias(__memmove); +void *__pi___memmove(void *dest, const void *src, size_t count) __alias(__memmove); From patchwork Sun Jan 28 11:10:13 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jisheng Zhang X-Patchwork-Id: 13534439 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 480D7C47258 for ; Sun, 28 Jan 2024 11:23:46 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:MIME-Version:References:In-Reply-To: Message-ID:Date:Subject:Cc:To:From:Reply-To:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=T6HJDzERmVFNUJx/KOdquD6aTWMT+hkjl949fVPeBoc=; b=r/LlDMoWtjGIpK 2dMuTuCdIdu/ko1ovW9cTx+7d2ODC2m3+vq2f2Qb3HxGZe9i3NdcqCvaxR+FkjRkmOjXr6PYQIdmA H4os3ysghTqLsCND+6g5dL8TPE05Btt5ohVHf1y9SgmzuF10hoAW0WZKOyp6IKvjNOQEgc0RPq9/I gQufq+aB8Nv9REm3nh1dLqXgYmobGffs8nqnlaoAYbKKr8wI/MatGcpQXB4oCV2JiQoMlHDoAen9c gaeekpFFzGTzAWCgvCuS5ZyrWMmuMcIOaKwB5ISYrDn0JBSGOmfLKh4i82sNedK/nRhVnysG/VG1y h5XZGO0rBPj4VAiebIeA==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.97.1 #2 (Red Hat Linux)) id 1rU3GI-00000009JXA-3fiw; Sun, 28 Jan 2024 11:23:42 +0000 Received: from dfw.source.kernel.org ([139.178.84.217]) by bombadil.infradead.org with esmtps (Exim 4.97.1 #2 (Red Hat Linux)) id 1rU3GF-00000009JVK-23cH for linux-riscv@lists.infradead.org; Sun, 28 Jan 2024 11:23:41 +0000 Received: from smtp.kernel.org (transwarp.subspace.kernel.org [100.75.92.58]) by dfw.source.kernel.org (Postfix) with ESMTP id EE2AD61A2A; Sun, 28 Jan 2024 11:23:08 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 53F28C433C7; Sun, 28 Jan 2024 11:23:07 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1706440988; bh=51aqju5D7ZfZtgqtq/RpE3mm4RKMUFkOzrpLBoF/U2U=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=J7Fx9I4u43VhttKk0N/v3sTyoEKBYpBz6adEaiBKTBC2kSSAueq5qjAZA8Je3V79J knK7DA7OSuxibwQV4FI70Is4ZLCIp7MU40N6dbgAz93rx9OFZ795DQknOcGCGT3mib cLUpkRyEX5UPLHgwZv1u+Gg8p6HQM22RtuJuwyCNcwP+bIUBxDaeX41o/N27VC8oEb lxB6xMm/lHFLI8Enazh9LmgQjR+AK0hAnYauTWEQOXyGW7sScr18UxED14cQMhwLLG ELuDanpu/23DcPxqbFlQBZa2gSiZAFswntsc5NkB9pWWTA45MhknJehttjJcKNcBQJ jhPXcTovWha1Q== From: Jisheng Zhang To: Paul Walmsley , Palmer Dabbelt , Albert Ou Cc: linux-riscv@lists.infradead.org, linux-kernel@vger.kernel.org, Matteo Croce Subject: [PATCH 3/3] riscv: optimized memset Date: Sun, 28 Jan 2024 19:10:13 +0800 Message-ID: <20240128111013.2450-4-jszhang@kernel.org> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20240128111013.2450-1-jszhang@kernel.org> References: <20240128111013.2450-1-jszhang@kernel.org> MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20240128_032339_661921_471AED31 X-CRM114-Status: GOOD ( 21.14 ) X-BeenThere: linux-riscv@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-riscv" Errors-To: linux-riscv-bounces+linux-riscv=archiver.kernel.org@lists.infradead.org From: Matteo Croce The generic memset is defined as a byte at time write. This is always safe, but it's slower than a 4 byte or even 8 byte write. Write a generic memset which fills the data one byte at time until the destination is aligned, then fills using the largest size allowed, and finally fills the remaining data one byte at time. Signed-off-by: Matteo Croce Signed-off-by: Jisheng Zhang --- arch/riscv/include/asm/string.h | 4 +- arch/riscv/kernel/riscv_ksyms.c | 2 - arch/riscv/lib/Makefile | 1 - arch/riscv/lib/memset.S | 113 -------------------------------- arch/riscv/lib/string.c | 41 ++++++++++++ arch/riscv/purgatory/Makefile | 5 +- 6 files changed, 44 insertions(+), 122 deletions(-) delete mode 100644 arch/riscv/lib/memset.S diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h index 17c3b40382e1..3d45d61ddab9 100644 --- a/arch/riscv/include/asm/string.h +++ b/arch/riscv/include/asm/string.h @@ -10,8 +10,8 @@ #include #define __HAVE_ARCH_MEMSET -extern asmlinkage void *memset(void *, int, size_t); -extern asmlinkage void *__memset(void *, int, size_t); +extern void *memset(void *s, int c, size_t count); +extern void *__memset(void *s, int c, size_t count); #define __HAVE_ARCH_MEMCPY extern void *memcpy(void *dest, const void *src, size_t count); diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c index 76849d0906ef..9a07cbf67095 100644 --- a/arch/riscv/kernel/riscv_ksyms.c +++ b/arch/riscv/kernel/riscv_ksyms.c @@ -9,8 +9,6 @@ /* * Assembly functions that may be used (directly or indirectly) by modules */ -EXPORT_SYMBOL(memset); EXPORT_SYMBOL(strcmp); EXPORT_SYMBOL(strlen); EXPORT_SYMBOL(strncmp); -EXPORT_SYMBOL(__memset); diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile index 5fa88c5a601c..b20cadccff78 100644 --- a/arch/riscv/lib/Makefile +++ b/arch/riscv/lib/Makefile @@ -1,6 +1,5 @@ # SPDX-License-Identifier: GPL-2.0-only lib-y += delay.o -lib-y += memset.o lib-y += strcmp.o lib-y += strlen.o lib-y += string.o diff --git a/arch/riscv/lib/memset.S b/arch/riscv/lib/memset.S deleted file mode 100644 index 35f358e70bdb..000000000000 --- a/arch/riscv/lib/memset.S +++ /dev/null @@ -1,113 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0-only */ -/* - * Copyright (C) 2013 Regents of the University of California - */ - - -#include -#include - -/* void *memset(void *, int, size_t) */ -SYM_FUNC_START(__memset) - move t0, a0 /* Preserve return value */ - - /* Defer to byte-oriented fill for small sizes */ - sltiu a3, a2, 16 - bnez a3, 4f - - /* - * Round to nearest XLEN-aligned address - * greater than or equal to start address - */ - addi a3, t0, SZREG-1 - andi a3, a3, ~(SZREG-1) - beq a3, t0, 2f /* Skip if already aligned */ - /* Handle initial misalignment */ - sub a4, a3, t0 -1: - sb a1, 0(t0) - addi t0, t0, 1 - bltu t0, a3, 1b - sub a2, a2, a4 /* Update count */ - -2: /* Duff's device with 32 XLEN stores per iteration */ - /* Broadcast value into all bytes */ - andi a1, a1, 0xff - slli a3, a1, 8 - or a1, a3, a1 - slli a3, a1, 16 - or a1, a3, a1 -#ifdef CONFIG_64BIT - slli a3, a1, 32 - or a1, a3, a1 -#endif - - /* Calculate end address */ - andi a4, a2, ~(SZREG-1) - add a3, t0, a4 - - andi a4, a4, 31*SZREG /* Calculate remainder */ - beqz a4, 3f /* Shortcut if no remainder */ - neg a4, a4 - addi a4, a4, 32*SZREG /* Calculate initial offset */ - - /* Adjust start address with offset */ - sub t0, t0, a4 - - /* Jump into loop body */ - /* Assumes 32-bit instruction lengths */ - la a5, 3f -#ifdef CONFIG_64BIT - srli a4, a4, 1 -#endif - add a5, a5, a4 - jr a5 -3: - REG_S a1, 0(t0) - REG_S a1, SZREG(t0) - REG_S a1, 2*SZREG(t0) - REG_S a1, 3*SZREG(t0) - REG_S a1, 4*SZREG(t0) - REG_S a1, 5*SZREG(t0) - REG_S a1, 6*SZREG(t0) - REG_S a1, 7*SZREG(t0) - REG_S a1, 8*SZREG(t0) - REG_S a1, 9*SZREG(t0) - REG_S a1, 10*SZREG(t0) - REG_S a1, 11*SZREG(t0) - REG_S a1, 12*SZREG(t0) - REG_S a1, 13*SZREG(t0) - REG_S a1, 14*SZREG(t0) - REG_S a1, 15*SZREG(t0) - REG_S a1, 16*SZREG(t0) - REG_S a1, 17*SZREG(t0) - REG_S a1, 18*SZREG(t0) - REG_S a1, 19*SZREG(t0) - REG_S a1, 20*SZREG(t0) - REG_S a1, 21*SZREG(t0) - REG_S a1, 22*SZREG(t0) - REG_S a1, 23*SZREG(t0) - REG_S a1, 24*SZREG(t0) - REG_S a1, 25*SZREG(t0) - REG_S a1, 26*SZREG(t0) - REG_S a1, 27*SZREG(t0) - REG_S a1, 28*SZREG(t0) - REG_S a1, 29*SZREG(t0) - REG_S a1, 30*SZREG(t0) - REG_S a1, 31*SZREG(t0) - addi t0, t0, 32*SZREG - bltu t0, a3, 3b - andi a2, a2, SZREG-1 /* Update count */ - -4: - /* Handle trailing misalignment */ - beqz a2, 6f - add a3, t0, a2 -5: - sb a1, 0(t0) - addi t0, t0, 1 - bltu t0, a3, 5b -6: - ret -SYM_FUNC_END(__memset) -SYM_FUNC_ALIAS_WEAK(memset, __memset) diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c index 20677c8067da..022edda68f1c 100644 --- a/arch/riscv/lib/string.c +++ b/arch/riscv/lib/string.c @@ -144,3 +144,44 @@ void *memmove(void *dest, const void *src, size_t count) __weak __alias(__memmov EXPORT_SYMBOL(memmove); void *__pi_memmove(void *dest, const void *src, size_t count) __alias(__memmove); void *__pi___memmove(void *dest, const void *src, size_t count) __alias(__memmove); + +void *__memset(void *s, int c, size_t count) +{ + union types dest = { .as_u8 = s }; + + if (count >= MIN_THRESHOLD) { + unsigned long cu = (unsigned long)c; + + /* Compose an ulong with 'c' repeated 4/8 times */ +#ifdef CONFIG_ARCH_HAS_FAST_MULTIPLIER + cu *= 0x0101010101010101UL; +#else + cu |= cu << 8; + cu |= cu << 16; + /* Suppress warning on 32 bit machines */ + cu |= (cu << 16) << 16; +#endif + if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) { + /* + * Fill the buffer one byte at time until + * the destination is word aligned. + */ + for (; count && dest.as_uptr & WORD_MASK; count--) + *dest.as_u8++ = c; + } + + /* Copy using the largest size allowed */ + for (; count >= BYTES_LONG; count -= BYTES_LONG) + *dest.as_ulong++ = cu; + } + + /* copy the remainder */ + while (count--) + *dest.as_u8++ = c; + + return s; +} +EXPORT_SYMBOL(__memset); + +void *memset(void *s, int c, size_t count) __weak __alias(__memset); +EXPORT_SYMBOL(memset); diff --git a/arch/riscv/purgatory/Makefile b/arch/riscv/purgatory/Makefile index 8b940ff04895..a31513346951 100644 --- a/arch/riscv/purgatory/Makefile +++ b/arch/riscv/purgatory/Makefile @@ -1,7 +1,7 @@ # SPDX-License-Identifier: GPL-2.0 OBJECT_FILES_NON_STANDARD := y -purgatory-y := purgatory.o sha256.o entry.o string.o ctype.o memset.o +purgatory-y := purgatory.o sha256.o entry.o string.o ctype.o purgatory-y += strcmp.o strlen.o strncmp.o riscv_string.o targets += $(purgatory-y) @@ -16,9 +16,6 @@ $(obj)/string.o: $(srctree)/lib/string.c FORCE $(obj)/ctype.o: $(srctree)/lib/ctype.c FORCE $(call if_changed_rule,cc_o_c) -$(obj)/memset.o: $(srctree)/arch/riscv/lib/memset.S FORCE - $(call if_changed_rule,as_o_S) - $(obj)/strcmp.o: $(srctree)/arch/riscv/lib/strcmp.S FORCE $(call if_changed_rule,as_o_S)