[v10,4/4] arm64: support batched/deferred tlb shootdown during page reclamation/migration

From: Barry Song <v-songbaohua@oppo.com>

From: Barry Song <v-songbaohua@oppo.com>

on x86, batched and deferred tlb shootdown has lead to 90%
performance increase on tlb shootdown. on arm64, HW can do
tlb shootdown without software IPI. But sync tlbi is still
quite expensive.

Even running a simplest program which requires swapout can
prove this is true,
 #include <sys/types.h>
 #include <unistd.h>
 #include <sys/mman.h>
 #include <string.h>

 int main()
 {
 #define SIZE (1 * 1024 * 1024)
         volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);

         memset(p, 0x88, SIZE);

         for (int k = 0; k < 10000; k++) {
                 /* swap in */
                 for (int i = 0; i < SIZE; i += 4096) {
                         (void)p[i];
                 }

                 /* swap out */
                 madvise(p, SIZE, MADV_PAGEOUT);
         }
 }

Perf result on snapdragon 888 with 8 cores by using zRAM
as the swap block device.

 ~ # perf record taskset -c 4 ./a.out
 [ perf record: Woken up 10 times to write data ]
 [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
 ~ # perf report
 # To display the perf.data header info, please use --header/--header-only options.
 # To display the perf.data header info, please use --header/--header-only options.
 #
 #
 # Total Lost Samples: 0
 #
 # Samples: 60K of event 'cycles'
 # Event count (approx.): 35706225414
 #
 # Overhead  Command  Shared Object      Symbol
 # ........  .......  .................  ......
 #
    21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
     8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
     6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
     5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
     3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
     3.49%  a.out    [kernel.kallsyms]  [k] memset64
     1.63%  a.out    [kernel.kallsyms]  [k] clear_page
     1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
     1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
     1.23%  a.out    [kernel.kallsyms]  [k] xas_load
     1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock

ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
swapping in/out a page mapped by only one process. If the
page is mapped by multiple processes, typically, like more
than 100 on a phone, the overhead would be much higher as
we have to run tlb flush 100 times for one single page.
Plus, tlb flush overhead will increase with the number
of CPU cores due to the bad scalability of tlb shootdown
in HW, so those ARM64 servers should expect much higher
overhead.

Further perf annonate shows 95% cpu time of ptep_clear_flush
is actually used by the final dsb() to wait for the completion
of tlb flush. This provides us a very good chance to leverage
the existing batched tlb in kernel. The minimum modification
is that we only send async tlbi in the first stage and we send
dsb while we have to sync in the second stage.

With the above simplest micro benchmark, collapsed time to
finish the program decreases around 5%.

Typical collapsed time w/o patch:
 ~ # time taskset -c 4 ./a.out
 0.21user 14.34system 0:14.69elapsed
w/ patch:
 ~ # time taskset -c 4 ./a.out
 0.22user 13.45system 0:13.80elapsed

Also tested with benchmark in the commit on Kunpeng920 arm64 server
and observed an improvement around 12.5% with command
`time ./swap_bench`.
        w/o             w/
real    0m13.460s       0m11.771s
user    0m0.248s        0m0.279s
sys     0m12.039s       0m11.458s

Originally it's noticed a 16.99% overhead of ptep_clear_flush()
which has been eliminated by this patch:

[root@localhost yang]# perf record -- ./swap_bench && perf report
[...]
16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush

It is tested on 4,8,128 CPU platforms and shows to be beneficial on
large systems but may not have improvement on small systems like on
a 4 CPU platform. So make this depends on EXPERT at this stage for
tests on more small platforms.

Also this patch improve the performance of page migration. Using pmbench
and tries to migrate the pages of pmbench between node 0 and node 1 for
100 times for 1G memory, this patch decrease the time used around 20%
(prev 18.338318910 sec after 13.981866350 sec) and saved the time used
by ptep_clear_flush().

Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 .../features/vm/TLB/arch-support.txt          |  2 +-
 arch/arm64/Kconfig                            |  1 +
 arch/arm64/include/asm/tlbbatch.h             | 12 +++++
 arch/arm64/include/asm/tlbflush.h             | 48 +++++++++++++++++--
 4 files changed, 59 insertions(+), 4 deletions(-)
 create mode 100644 arch/arm64/include/asm/tlbbatch.h

Message ID	20230710083914.18336-5-yangyicong@huawei.com (mailing list archive)
State	Superseded
Headers	show Return-Path: <linux-riscv-bounces+linux-riscv=archiver.kernel.org@lists.infradead.org> X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id CEC83EB64D9 for <linux-riscv@archiver.kernel.org>; Mon, 10 Jul 2023 08:41:32 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:MIME-Version:References:In-Reply-To: Message-ID:Date:Subject:CC:To:From:Reply-To:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=YEEO9PE2zEDOpnSYgsGrhqPnqj/hQwUhWdHEMBwWOxw=; b=Eg8yWl3zEtfNC/ Mbzv8KzRMELf23NV217NKLE2mgWo7hbTouVLDVRR1KrbKprITTY2fa0n98vZGzWD8guoaUKQW+K/k GytWgoAh4Nrxw2nfSfgyL+/IYvMmpX1VVIv7WHd7a4a7uJAFSrkgr800CtSFLIn/TIugdm7QlB9SF eTBPbiRa56xqxuuxqeQxeK8EZdEIxWqV8551fBirGo1pJCxjyFoC87363YR3Xq43oppBEqziqImvo Aqq0TnZFdRk/vxVpeTKZLC+pxptyV2l346gM0toblISGm8f36mhvcCLI0A9Pq7M9Or+hXjkRlAfkK BnUAWFiHDIJ6eRdfQWhw==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.96 #2 (Red Hat Linux)) id 1qImST-00Asa5-03; Mon, 10 Jul 2023 08:41:25 +0000 Received: from szxga08-in.huawei.com ([45.249.212.255]) by bombadil.infradead.org with esmtps (Exim 4.96 #2 (Red Hat Linux)) id 1qImSO-00AsSh-0W; Mon, 10 Jul 2023 08:41:22 +0000 Received: from canpemm500009.china.huawei.com (unknown [172.30.72.57]) by szxga08-in.huawei.com (SkyGuard) with ESMTP id 4QzyBd1pnCz1FDnF; Mon, 10 Jul 2023 16:40:41 +0800 (CST) Received: from localhost.localdomain (10.50.163.32) by canpemm500009.china.huawei.com (7.192.105.203) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2507.27; Mon, 10 Jul 2023 16:41:12 +0800 From: Yicong Yang <yangyicong@huawei.com> To: <akpm@linux-foundation.org>, <catalin.marinas@arm.com>, <linux-mm@kvack.org>, <linux-arm-kernel@lists.infradead.org>, <x86@kernel.org>, <mark.rutland@arm.com>, <ryan.roberts@arm.com>, <will@kernel.org>, <anshuman.khandual@arm.com>, <linux-doc@vger.kernel.org> CC: <corbet@lwn.net>, <peterz@infradead.org>, <arnd@arndb.de>, <punit.agrawal@bytedance.com>, <linux-kernel@vger.kernel.org>, <darren@os.amperecomputing.com>, <yangyicong@hisilicon.com>, <huzhanyuan@oppo.com>, <lipeifeng@oppo.com>, <zhangshiming@oppo.com>, <guojian@oppo.com>, <realmz6@gmail.com>, <linux-mips@vger.kernel.org>, <openrisc@lists.librecores.org>, <linuxppc-dev@lists.ozlabs.org>, <linux-riscv@lists.infradead.org>, <linux-s390@vger.kernel.org>, Barry Song <21cnbao@gmail.com>, <wangkefeng.wang@huawei.com>, <xhao@linux.alibaba.com>, <prime.zeng@hisilicon.com>, <Jonathan.Cameron@Huawei.com>, Barry Song <v-songbaohua@oppo.com>, Nadav Amit <namit@vmware.com>, Mel Gorman <mgorman@suse.de> Subject: [PATCH v10 4/4] arm64: support batched/deferred tlb shootdown during page reclamation/migration Date: Mon, 10 Jul 2023 16:39:14 +0800 Message-ID: <20230710083914.18336-5-yangyicong@huawei.com> X-Mailer: git-send-email 2.31.0 In-Reply-To: <20230710083914.18336-1-yangyicong@huawei.com> References: <20230710083914.18336-1-yangyicong@huawei.com> MIME-Version: 1.0 X-Originating-IP: [10.50.163.32] X-ClientProxiedBy: dggems704-chm.china.huawei.com (10.3.19.181) To canpemm500009.china.huawei.com (7.192.105.203) X-CFilter-Loop: Reflected X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20230710_014120_617383_46ABF18B X-CRM114-Status: GOOD ( 26.46 ) X-BeenThere: linux-riscv@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: <linux-riscv.lists.infradead.org> List-Unsubscribe: <http://lists.infradead.org/mailman/options/linux-riscv>, <mailto:linux-riscv-request@lists.infradead.org?subject=unsubscribe> List-Archive: <http://lists.infradead.org/pipermail/linux-riscv/> List-Post: <mailto:linux-riscv@lists.infradead.org> List-Help: <mailto:linux-riscv-request@lists.infradead.org?subject=help> List-Subscribe: <http://lists.infradead.org/mailman/listinfo/linux-riscv>, <mailto:linux-riscv-request@lists.infradead.org?subject=subscribe> Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-riscv" <linux-riscv-bounces@lists.infradead.org> Errors-To: linux-riscv-bounces+linux-riscv=archiver.kernel.org@lists.infradead.org
Series	arm64: support batched/deferred tlb shootdown during page reclamation/migration \| expand [v10,0/4] arm64: support batched/deferred tlb shootdown during page reclamation/migration [v10,1/4] mm/tlbbatch: Introduce arch_tlbbatch_should_defer() [v10,2/4] mm/tlbbatch: Rename and extend some functions [v10,3/4] mm/tlbbatch: Introduce arch_flush_tlb_batched_pending() [v10,4/4] arm64: support batched/deferred tlb shootdown during page reclamation/migration

Context	Check	Description
conchuod/cover_letter	success	Series has a cover letter
conchuod/tree_selection	success	Guessed tree name to be for-next at HEAD e8605e8fdf42
conchuod/fixes_present	success	Fixes tag not required for -next series
conchuod/maintainers_pattern	success	MAINTAINERS pattern errors before the patch: 4 and now 4
conchuod/verify_signedoff	success	Signed-off-by tag matches author and committer
conchuod/kdoc	success	Errors and warnings before: 0 this patch: 0
conchuod/build_rv64_clang_allmodconfig	success	Errors and warnings before: 8 this patch: 8
conchuod/module_param	success	Was 0 now: 0
conchuod/build_rv64_gcc_allmodconfig	success	Errors and warnings before: 8 this patch: 8
conchuod/build_rv32_defconfig	success	Build OK
conchuod/dtb_warn_rv64	success	Errors and warnings before: 3 this patch: 3
conchuod/header_inline	success	No static functions without inline keyword in header files
conchuod/checkpatch	warning	WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
conchuod/build_rv64_nommu_k210_defconfig	success	Build OK
conchuod/verify_fixes	success	No Fixes tag
conchuod/build_rv64_nommu_virt_defconfig	success	Build OK

[v10,4/4] arm64: support batched/deferred tlb shootdown during page reclamation/migration

Checks

Commit Message

Comments

Patch