[RESEND,v10,09/12] mm: implement LUF(Lazy Unmap Flush) deferring tlb flush when folios get unmapped

A new mechanism, LUF(Lazy Unmap Flush), defers tlb flush until folios
that have been unmapped and freed eventually get allocated again.  It's
safe for folios that had been mapped read-only and were unmapped, since
the contents of the folios don't change while they stay in pcp or buddy,
so we can still read the data through the stale tlb entries.

tlb flush can be deferred when folios get unmapped, as long as the
required tlb flush is guaranteed to be performed before the folios
actually become used, and only if none of the corresponding ptes have
write permission.  Otherwise, stale tlb entries could be used to access
folios that have already been reused, corrupting or leaking data.

To achieve that:

   1. For the folios that map only to non-writable tlb entries, prevent
      tlb flush during unmapping but perform it just before the folios
      actually become used, out of buddy or pcp.

   2. When any non-writable ptes change to writable e.g. through fault
      handler, give up the luf mechanism and perform the required tlb
      flush right away.

   3. When a writable mapping is created e.g. through mmap(), give up
      the luf mechanism and perform the required tlb flush right away.
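
The three rules above can be sketched as a small userspace model.  This
is only an illustration under simplifying assumptions: it tracks a
single pending flush per address space, and every name in it (toy_mm,
toy_unmap() and so on) is hypothetical and does not come from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model; none of these names exist in the kernel. */
struct toy_mm {
	bool flush_pending;	/* a tlb flush was deferred at unmap time */
	int nr_flushes;		/* how many flushes the model performed */
};

static void toy_tlb_flush(struct toy_mm *mm)
{
	mm->nr_flushes++;
	mm->flush_pending = false;
}

/* Rule 1, first half: defer the flush only for read-only ptes. */
static void toy_unmap(struct toy_mm *mm, bool pte_writable)
{
	if (pte_writable)
		toy_tlb_flush(mm);	/* stale writable entries are unsafe */
	else
		mm->flush_pending = true;
}

/* Rule 1, second half: flush just before a folio leaves pcp or buddy. */
static void toy_alloc(struct toy_mm *mm)
{
	if (mm->flush_pending)
		toy_tlb_flush(mm);
	/* A folio is never handed out with a deferred flush outstanding. */
	assert(!mm->flush_pending);
}

/* Rules 2 and 3: a pte or mapping becoming writable ends the deferral. */
static void toy_write_enable(struct toy_mm *mm)
{
	if (mm->flush_pending)
		toy_tlb_flush(mm);
}
```

In this model, two read-only unmaps followed by one allocation cost a
single flush instead of two, which is where the saving comes from.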

Regardless of the type of workload used for performance evaluation, the
result should be positive thanks to the unconditional reduction of tlb
flushes, tlb misses and interrupts.  For the test, I picked one of the
most popular and heavy workloads, llama.cpp, which is an
LLM(Large Language Model) inference engine.

The result would depend on memory latency and how often reclaim runs,
which determine tlb miss overhead and how many times unmapping happens,
respectively.  On my system, the result shows:

   1. tlb flushes are reduced by about 95%.
   2. tlb misses(itlb) are reduced by about 80%.
   3. tlb misses(dtlb store) are reduced by about 57%.
   4. tlb misses(dtlb load) are reduced by about 24%.
   5. tlb shootdown interrupts are reduced by about 95%.
   6. The test program runtime is reduced by about 5%.

The test environment and the results are as follows:

   Machine: bare metal, x86_64, Intel(R) Xeon(R) Gold 6430
   CPU: 1 socket 64 core with hyper-threading on
   NUMA: 2 nodes (64 CPUs DRAM 42GB, no CPUs CXL expander 98GB)
   Config: swap off, numa balancing tiering on, demotion enabled

   The test set:

      llama.cpp/main -m $(70G_model1) -p "who are you?" -s 1 -t 15 -n 20 &
      llama.cpp/main -m $(70G_model2) -p "who are you?" -s 1 -t 15 -n 20 &
      llama.cpp/main -m $(70G_model3) -p "who are you?" -s 1 -t 15 -n 20 &
      wait

      where -t: nr of threads, -s: seed used to make the runtime stable,
      -n: nr of tokens that determines the runtime, -p: prompt to ask,
      -m: LLM model to use.

   Run the test set 10 times successively with caches dropped every run
   via 'echo 3 > /proc/sys/vm/drop_caches'.  Each inference prints its
   runtime when it finishes.

   1. Runtime from the output of llama.cpp:

   BEFORE
   ------
   llama_print_timings:       total time = 1002461.95 ms /    24 tokens
   llama_print_timings:       total time = 1044978.38 ms /    24 tokens
   llama_print_timings:       total time = 1000653.09 ms /    24 tokens
   llama_print_timings:       total time = 1047104.80 ms /    24 tokens
   llama_print_timings:       total time = 1069430.36 ms /    24 tokens
   llama_print_timings:       total time = 1068201.16 ms /    24 tokens
   llama_print_timings:       total time = 1078092.59 ms /    24 tokens
   llama_print_timings:       total time = 1073200.45 ms /    24 tokens
   llama_print_timings:       total time = 1067136.00 ms /    24 tokens
   llama_print_timings:       total time = 1076442.56 ms /    24 tokens
   llama_print_timings:       total time = 1004142.64 ms /    24 tokens
   llama_print_timings:       total time = 1042942.65 ms /    24 tokens
   llama_print_timings:       total time =  999933.76 ms /    24 tokens
   llama_print_timings:       total time = 1046548.83 ms /    24 tokens
   llama_print_timings:       total time = 1068671.48 ms /    24 tokens
   llama_print_timings:       total time = 1068285.76 ms /    24 tokens
   llama_print_timings:       total time = 1077789.63 ms /    24 tokens
   llama_print_timings:       total time = 1071558.93 ms /    24 tokens
   llama_print_timings:       total time = 1066181.55 ms /    24 tokens
   llama_print_timings:       total time = 1076767.53 ms /    24 tokens
   llama_print_timings:       total time = 1004065.63 ms /    24 tokens
   llama_print_timings:       total time = 1044522.13 ms /    24 tokens
   llama_print_timings:       total time =  999725.33 ms /    24 tokens
   llama_print_timings:       total time = 1047510.77 ms /    24 tokens
   llama_print_timings:       total time = 1068010.27 ms /    24 tokens
   llama_print_timings:       total time = 1068999.31 ms /    24 tokens
   llama_print_timings:       total time = 1077648.05 ms /    24 tokens
   llama_print_timings:       total time = 1071378.96 ms /    24 tokens
   llama_print_timings:       total time = 1066326.32 ms /    24 tokens
   llama_print_timings:       total time = 1077088.92 ms /    24 tokens

   AFTER
   -----
   llama_print_timings:       total time =  988522.03 ms /    24 tokens
   llama_print_timings:       total time =  997204.52 ms /    24 tokens
   llama_print_timings:       total time =  996605.86 ms /    24 tokens
   llama_print_timings:       total time =  991985.50 ms /    24 tokens
   llama_print_timings:       total time = 1035143.31 ms /    24 tokens
   llama_print_timings:       total time =  993660.18 ms /    24 tokens
   llama_print_timings:       total time =  983082.14 ms /    24 tokens
   llama_print_timings:       total time =  990431.36 ms /    24 tokens
   llama_print_timings:       total time =  992707.09 ms /    24 tokens
   llama_print_timings:       total time =  992673.27 ms /    24 tokens
   llama_print_timings:       total time =  989285.43 ms /    24 tokens
   llama_print_timings:       total time =  996710.06 ms /    24 tokens
   llama_print_timings:       total time =  996534.64 ms /    24 tokens
   llama_print_timings:       total time =  991344.17 ms /    24 tokens
   llama_print_timings:       total time = 1035210.84 ms /    24 tokens
   llama_print_timings:       total time =  994714.13 ms /    24 tokens
   llama_print_timings:       total time =  984184.15 ms /    24 tokens
   llama_print_timings:       total time =  990909.45 ms /    24 tokens
   llama_print_timings:       total time =  991881.48 ms /    24 tokens
   llama_print_timings:       total time =  993918.03 ms /    24 tokens
   llama_print_timings:       total time =  990061.34 ms /    24 tokens
   llama_print_timings:       total time =  998076.69 ms /    24 tokens
   llama_print_timings:       total time =  997082.59 ms /    24 tokens
   llama_print_timings:       total time =  990677.58 ms /    24 tokens
   llama_print_timings:       total time = 1036054.94 ms /    24 tokens
   llama_print_timings:       total time =  994125.93 ms /    24 tokens
   llama_print_timings:       total time =  982467.01 ms /    24 tokens
   llama_print_timings:       total time =  990191.60 ms /    24 tokens
   llama_print_timings:       total time =  993319.24 ms /    24 tokens
   llama_print_timings:       total time =  992540.57 ms /    24 tokens

   2. tlb shootdowns from 'cat /proc/interrupts':

   BEFORE
   ------
   TLB:
   125553646  141418810  161932620  176853972  186655697  190399283
   192143823  196414038  192872439  193313658  193395617  192521416
   190788161  195067598  198016061  193607347  194293972  190786732
   191545637  194856822  191801931  189634535  190399803  196365922
   195268398  190115840  188050050  193194908  195317617  190820190
   190164820  185556071  226797214  229592631  216112464  209909495
   205575979  205950252  204948111  197999795  198892232  205287952
   199344631  195015158  195869844  198858745  195692876  200961904
   203463252  205921722  199850838  206145986  199613202  199961345
   200129577  203020521  207873649  203697671  197093386  204243803
   205993323  200934664  204193128  194435376  TLB shootdowns

   AFTER
   -----
   TLB:
     5648092    6610142    7032849    7882308    8088518    8352310
     8656536    8705136    8647426    8905583    8985408    8704522
     8884344    9026261    8929974    8869066    8877575    8810096
     8770984    8754503    8801694    8865925    8787524    8656432
     8755912    8682034    8773935    8832925    8797997    8515777
     8481240    8891258   10595243   10285973    9756935    9573681
     9398968    9069244    9242984    8899009    9310690    9029095
     9069758    9105825    9092703    9270202    9460287    9258546
     9180415    9232723    9270611    9175020    9490420    9360316
     9420818    9057663    9525631    9310152    9152242    8654483
     9181804    9050847    8919916    8883856  TLB shootdowns

   3. tlb numbers from 'perf stat' per test set:

   BEFORE
   ------
   3163679332	dTLB-load-misses
   2017751856	dTLB-store-misses
   327092903	iTLB-load-misses
   1357543886	tlb:tlb_flush

   AFTER
   -----
   2394694609	dTLB-load-misses
   861144167	dTLB-store-misses
   64055579	iTLB-load-misses
   69175002	tlb:tlb_flush
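
The percentages quoted earlier follow from the 'perf stat' counters
above as (before - after) / before.  A minimal sketch of that
arithmetic, where reduction_pct() and print_reductions() are
hypothetical helper names used only for illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Percentage by which 'after' is lower than 'before'. */
static double reduction_pct(uint64_t before, uint64_t after)
{
	assert(before != 0 && after <= before);
	return 100.0 * (double)(before - after) / (double)before;
}

/* Counters copied from the BEFORE/AFTER 'perf stat' output above. */
static const struct {
	const char *event;
	uint64_t before;
	uint64_t after;
} perf_counters[] = {
	{ "dTLB-load-misses",  3163679332ULL, 2394694609ULL },
	{ "dTLB-store-misses", 2017751856ULL,  861144167ULL },
	{ "iTLB-load-misses",   327092903ULL,   64055579ULL },
	{ "tlb:tlb_flush",     1357543886ULL,   69175002ULL },
};

/* Print one line per event with its relative reduction. */
static void print_reductions(void)
{
	size_t i;

	for (i = 0; i < sizeof(perf_counters) / sizeof(perf_counters[0]); i++)
		printf("%-18s reduced by %.1f%%\n", perf_counters[i].event,
		       reduction_pct(perf_counters[i].before,
				     perf_counters[i].after));
}
```

Calling print_reductions() should give roughly 24%, 57%, 80% and 95%
for the four events, in line with the summary list.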

Signed-off-by: Byungchul Park <byungchul@sk.com>
---
 include/linux/sched.h |   9 ++
 mm/internal.h         |  43 +++++-
 mm/memory.c           |   8 ++
 mm/mmap.c             |   8 ++
 mm/rmap.c             | 308 +++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 366 insertions(+), 10 deletions(-)

Message ID	20240520021734.21527-10-byungchul@sk.com (mailing list archive)
State	New
Headers
   From: Byungchul Park <byungchul@sk.com>
   To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
   Cc: kernel_team@skhynix.com, akpm@linux-foundation.org,
       ying.huang@intel.com, vernhao@tencent.com,
       mgorman@techsingularity.net, hughd@google.com,
       willy@infradead.org, david@redhat.com, peterz@infradead.org,
       luto@kernel.org, tglx@linutronix.de, mingo@redhat.com,
       bp@alien8.de, dave.hansen@linux.intel.com, rjgolo@gmail.com
   Date: Mon, 20 May 2024 11:17:31 +0900
   In-Reply-To: <20240520021734.21527-1-byungchul@sk.com>
   References: <20240520021734.21527-1-byungchul@sk.com>
   X-Mailer: git-send-email 2.17.1
Series	LUF(Lazy Unmap Flush) reducing tlb numbers over 90%
   [RESEND,v10,00/12] LUF(Lazy Unmap Flush) reducing tlb numbers over 90%
   [RESEND,v10,01/12] x86/tlb: add APIs manipulating tlb batch's arch data
   [RESEND,v10,02/12] arm64: tlbflush: add APIs manipulating tlb batch's arch data
   [RESEND,v10,03/12] riscv, tlb: add APIs manipulating tlb batch's arch data
   [RESEND,v10,04/12] x86/tlb, riscv/tlb, mm/rmap: separate arch_tlbbatch_clear() out of arch_tlbbatch…
   [RESEND,v10,05/12] mm: buddy: make room for a new variable, ugen, in struct page
   [RESEND,v10,06/12] mm: add folio_put_ugen() to deliver unmap generation number to pcp or buddy
   [RESEND,v10,07/12] mm: add a parameter, unmap generation number, to free_unref_folios()
   [RESEND,v10,08/12] mm/rmap: recognize read-only tlb entries during batched tlb flush
   [RESEND,v10,09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped
   [RESEND,v10,10/12] mm: separate move/undo parts from migrate_pages_batch()
   [RESEND,v10,11/12] mm, migrate: apply luf mechanism to unmapping during migration
   [RESEND,v10,12/12] mm, vmscan: apply luf mechanism to unmapping during folio reclaim
