Message ID: 20230526234435.662652-1-yuzhao@google.com (mailing list archive)
Series: mm/kvm: locklessly clear the accessed bit
TLDR
====
Apache Spark spent 12% less time sorting four billion random integers
twenty times (in ~4 hours) after this patchset [1].

Hardware
========
HOST $ lscpu
Architecture:        aarch64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Vendor ID:           ARM
Model name:          Neoverse-N1
Model:               1
Thread(s) per core:  1
Core(s) per socket:  64
Socket(s):           2
Stepping:            r3p1
Frequency boost:     disabled
CPU max MHz:         2800.0000
CPU min MHz:         1000.0000
BogoMIPS:            50.00
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
Caches (sum of all):
  L1d:               8 MiB (128 instances)
  L1i:               8 MiB (128 instances)
  L2:                128 MiB (128 instances)
NUMA:
  NUMA node(s):      2
  NUMA node0 CPU(s): 0-63
  NUMA node1 CPU(s): 64-127
Vulnerabilities:
  Itlb multihit:     Not affected
  L1tf:              Not affected
  Mds:               Not affected
  Meltdown:          Not affected
  Mmio stale data:   Not affected
  Retbleed:          Not affected
  Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:        Mitigation; __user pointer sanitization
  Spectre v2:        Mitigation; CSV2, BHB
  Srbds:             Not affected
  Tsx async abort:   Not affected

HOST $ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0-63
node 0 size: 257730 MB
node 0 free: 1447 MB
node 1 cpus: 64-127
node 1 size: 256877 MB
node 1 free: 256093 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

HOST $ cat /sys/class/nvme/nvme0/model
INTEL SSDPF21Q800GB

HOST $ cat /sys/class/nvme/nvme0/numa_node
0

Software
========
HOST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"

HOST $ uname -a
Linux arm 6.4.0-rc4 #1 SMP Sat Jun 3 05:30:06 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux

HOST $ cat /proc/swaps
Filename        Type       Size       Used       Priority
/dev/nvme0n1p2  partition  466838356  116922112  -2

HOST $ cat /sys/kernel/mm/lru_gen/enabled
0x000b

HOST $ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

HOST $ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise madvise [never]

HOST $ qemu-system-aarch64 --version
QEMU emulator version 6.2.0 (Debian 1:6.2+dfsg-2ubuntu6.6)
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers

GUEST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.2 LTS"

GUEST $ java --version
openjdk 17.0.7 2023-04-18
OpenJDK Runtime Environment (build 17.0.7+7-Ubuntu-0ubuntu122.04.2)
OpenJDK 64-Bit Server VM (build 17.0.7+7-Ubuntu-0ubuntu122.04.2, mixed mode, sharing)

GUEST $ spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.4.0
      /_/

Using Scala version 2.12.17, OpenJDK 64-Bit Server VM, 17.0.7
Branch HEAD
Compiled by user xinrong.meng on 2023-04-07T02:18:01Z
Revision 87a5442f7ed96b11051d8a9333476d080054e5a0
Url https://github.com/apache/spark
Type --help for more information.
Procedure
=========
HOST $ sudo numactl -N 0 -m 0 qemu-system-aarch64 \
    -M virt,accel=kvm -cpu host -smp 64 -m 300g -nographic -nic user \
    -bios /usr/share/qemu-efi-aarch64/QEMU_EFI.fd \
    -drive if=virtio,format=raw,file=/dev/nvme0n1p1

GUEST $ cat gen.scala
import java.io._
import scala.collection.mutable.ArrayBuffer

object GenData {
    def main(args: Array[String]): Unit = {
        val file = new File("/dev/shm/dataset.txt")
        val writer = new BufferedWriter(new FileWriter(file))
        val buf = ArrayBuffer(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)
        for (_ <- 0 until 400000000) {
            for (i <- 0 until 10) {
                buf.update(i, scala.util.Random.nextLong())
            }
            writer.write(s"${buf.mkString(",")}\n")
        }
        writer.close()
    }
}
GenData.main(Array())

GUEST $ cat sort.scala
import java.time.temporal.ChronoUnit
import org.apache.spark.sql.SparkSession

object SparkSort {
    def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().getOrCreate()
        val file = sc.textFile("/dev/shm/dataset.txt", 64)
        val results = file.flatMap(_.split(",")).map(x => (x, 1)).sortByKey().takeOrdered(10)
        results.foreach(println)
        spark.stop()
    }
}
SparkSort.main(Array())

GUEST $ cat run_spark.sh
export SPARK_LOCAL_DIRS=/dev/shm/

spark-shell <gen.scala

start=$SECONDS
for ((i = 0; i < 20; i++))
do
    spark-3.4.0-bin-hadoop3/bin/spark-shell --master "local[64]" --driver-memory 160g <sort.scala
done
echo "wall time: $((SECONDS - start))"

Results
=======
                       Before [1]      After      Change
----------------------------------------------------------
Wall time (seconds)    14455           12865      -12%

Notes
=====
[1] "mm: rmap: Don't flush TLB after checking PTE young for page
    reference" was included so that the comparison is apples to
    apples.
    https://lore.kernel.org/r/20220706112041.3831-1-21cnbao@gmail.com/
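[Editorial note: the following is a hypothetical variant of the run_spark.sh
shown in the procedure above, not part of the original report. It runs the
same workload but also records the wall time of each of the 20 sort
iterations, so run-to-run variance can be reported alongside the total.
The file name iter_times.txt is made up.]

  #!/bin/bash
  # Sketch: same Spark sort loop as run_spark.sh above, with per-iteration
  # timing appended to iter_times.txt.
  export SPARK_LOCAL_DIRS=/dev/shm/

  spark-shell <gen.scala

  start=$SECONDS
  for ((i = 0; i < 20; i++)); do
      iter_start=$SECONDS
      spark-3.4.0-bin-hadoop3/bin/spark-shell --master "local[64]" \
          --driver-memory 160g <sort.scala
      echo "iteration $i: $((SECONDS - iter_start))s" >>iter_times.txt
  done
  echo "wall time: $((SECONDS - start))"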
TLDR
====
Memcached achieved 10% more operations per second (in ~4 hours) after
this patchset [1].

Hardware
========
HOST $ lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              184
On-line CPU(s) list: 0-183
Model name:          POWER9 (raw), altivec supported
Model:               2.2 (pvr 004e 1202)
Thread(s) per core:  4
Core(s) per socket:  23
Socket(s):           2
CPU max MHz:         3000.0000
CPU min MHz:         2300.0000
Caches (sum of all):
  L1d:               1.4 MiB (46 instances)
  L1i:               1.4 MiB (46 instances)
  L2:                12 MiB (24 instances)
  L3:                240 MiB (24 instances)
NUMA:
  NUMA node(s):      2
  NUMA node0 CPU(s): 0-91
  NUMA node1 CPU(s): 92-183
Vulnerabilities:
  Itlb multihit:     Not affected
  L1tf:              Mitigation; RFI Flush, L1D private per thread
  Mds:               Not affected
  Meltdown:          Mitigation; RFI Flush, L1D private per thread
  Mmio stale data:   Not affected
  Retbleed:          Not affected
  Spec store bypass: Mitigation; Kernel entry/exit barrier (eieio)
  Spectre v1:        Mitigation; __user pointer sanitization, ori31 speculation barrier enabled
  Spectre v2:        Mitigation; Indirect branch serialisation (kernel only), Indirect branch cache disabled, Software link stack flush
  Srbds:             Not affected
  Tsx async abort:   Not affected

HOST $ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0-91
node 0 size: 261659 MB
node 0 free: 259152 MB
node 1 cpus: 92-183
node 1 size: 261713 MB
node 1 free: 261076 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10

HOST $ cat /sys/class/nvme/nvme0/model
INTEL SSDPF21Q800GB

HOST $ cat /sys/class/nvme/nvme0/numa_node
0

Software
========
HOST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04 LTS"

HOST $ uname -a
Linux ppc 6.3.0 #1 SMP Sun Jun 4 18:26:37 UTC 2023 ppc64le ppc64le ppc64le GNU/Linux

HOST $ cat /proc/swaps
Filename        Type       Size       Used  Priority
/dev/nvme0n1p2  partition  466838272  0     -2

HOST $ cat /sys/kernel/mm/lru_gen/enabled
0x0009

HOST $ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

HOST $ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise madvise [never]

HOST $ qemu-system-ppc64 --version
QEMU emulator version 6.2.0 (Debian 1:6.2+dfsg-2ubuntu6.6)
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers

GUEST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"

GUEST $ cat /etc/memcached.conf
...
-t 92
-m 262144
-B binary
-s /var/run/memcached/memcached.sock
-a 0766

GUEST $ memtier_benchmark -v
memtier_benchmark 1.4.0
Copyright (C) 2011-2022 Redis Ltd.
This is free software.  You may redistribute copies of it under the terms
of the GNU General Public License <http://www.gnu.org/licenses/gpl.html>.
There is NO WARRANTY, to the extent permitted by law.

Procedure
=========
HOST $ sudo numactl -N 0 -m 0 qemu-system-ppc64 \
    -M pseries,accel=kvm,kvm-type=HV -cpu host -smp 92 -m 270g \
    -nographic -nic user \
    -drive if=virtio,format=raw,file=/dev/nvme0n1p1

GUEST $ memtier_benchmark -S /var/run/memcached/memcached.sock \
    -P memcache_binary -c 1 -t 92 --pipeline 1 --ratio 1:0 \
    --key-minimum=1 --key-maximum=120000000 --key-pattern=P:P \
    -n allkeys -d 2000

GUEST $ memtier_benchmark -S /var/run/memcached/memcached.sock \
    -P memcache_binary -c 1 -t 92 --pipeline 1 --ratio 0:1 \
    --key-minimum=1 --key-maximum=120000000 --key-pattern=R:R \
    -n allkeys --randomize --distinct-client-seed

Results
=======
              Before [1]    After        Change
-------------------------------------------------
Ops/sec       721586.10     800210.12    +10%
Avg. Latency  0.12546       0.11260      -10%
p50 Latency   0.08700       0.08700      N/C
p99 Latency   0.28700       0.24700      -13%

Notes
=====
[1] "mm: rmap: Don't flush TLB after checking PTE young for page
    reference" was included so that the comparison is apples to
    apples.
    https://lore.kernel.org/r/20220706112041.3831-1-21cnbao@gmail.com/
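[Editorial note: the two memtier_benchmark commands above run as separate
phases: the first populates 120 million 2000-byte values over the binary
memcached protocol, the second measures random reads. The sketch below is
a hypothetical wrapper (the file name run_memtier.sh is made up) that
chains them with the same parameters, assuming memcached is already
running with the configuration shown above.]

  #!/bin/bash
  # Sketch: run the populate phase, then the random-read measurement phase.
  SOCK=/var/run/memcached/memcached.sock

  # Phase 1: writes only (--ratio 1:0), sequential key pattern.
  memtier_benchmark -S $SOCK -P memcache_binary -c 1 -t 92 --pipeline 1 \
      --ratio 1:0 --key-minimum=1 --key-maximum=120000000 \
      --key-pattern=P:P -n allkeys -d 2000

  # Phase 2: reads only (--ratio 0:1), random key pattern.
  memtier_benchmark -S $SOCK -P memcache_binary -c 1 -t 92 --pipeline 1 \
      --ratio 0:1 --key-minimum=1 --key-maximum=120000000 \
      --key-pattern=R:R -n allkeys --randomize --distinct-client-seed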
TLDR
====
Multichase in 64 microVMs achieved 6% more total samples (in ~4 hours)
after this patchset [1].

Hardware
========
HOST $ lscpu
Architecture:            x86_64
CPU op-mode(s):          32-bit, 64-bit
Address sizes:           43 bits physical, 48 bits virtual
Byte Order:              Little Endian
CPU(s):                  128
On-line CPU(s) list:     0-127
Vendor ID:               AuthenticAMD
Model name:              AMD Ryzen Threadripper PRO 3995WX 64-Cores
CPU family:              23
Model:                   49
Thread(s) per core:      2
Core(s) per socket:      64
Socket(s):               1
Stepping:                0
Frequency boost:         disabled
CPU max MHz:             4308.3979
CPU min MHz:             2200.0000
BogoMIPS:                5390.20
Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ...
Virtualization features:
  Virtualization:        AMD-V
Caches (sum of all):
  L1d:                   2 MiB (64 instances)
  L1i:                   2 MiB (64 instances)
  L2:                    32 MiB (64 instances)
  L3:                    256 MiB (16 instances)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-127
Vulnerabilities:
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Mitigation; untrained return thunk; SMT enabled with STIBP protection
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

HOST $ numactl -H
available: 1 nodes (0)
node 0 cpus: 0-127
node 0 size: 257542 MB
node 0 free: 224855 MB
node distances:
node   0
  0:  10

HOST $ cat /sys/class/nvme/nvme0/model
INTEL SSDPF21Q800GB

HOST $ cat /sys/class/nvme/nvme0/numa_node
0

Software
========
HOST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"

HOST $ uname -a
Linux x86 6.4.0-rc5+ #1 SMP PREEMPT_DYNAMIC Wed Jun 7 22:17:47 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

HOST $ cat /proc/swaps
Filename        Type       Size       Used  Priority
/dev/nvme0n1p2  partition  466838356  0     -2

HOST $ cat /sys/kernel/mm/lru_gen/enabled
0x000f

HOST $ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

HOST $ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise madvise [never]

Procedure
=========
HOST $ git clone https://github.com/google/multichase

HOST $ <Build multichase>
HOST $ <Unpack /boot/initrd.img into ./initrd/>

HOST $ cp multichase/multichase ./initrd/bin/
HOST $ sed -i \
    "/^maybe_break top$/i multichase -t 2 -m 4g -n 28800; poweroff" \
    ./initrd/init

HOST $ <Pack ./initrd/ into ./initrd.img>

HOST $ cat run_microvms.sh
memcgs=64

run() {
    path=/sys/fs/cgroup/memcg$1

    mkdir $path
    echo $BASHPID >$path/cgroup.procs

    qemu-system-x86_64 -M microvm,accel=kvm -cpu host -smp 2 -m 6g \
        -nographic -kernel /boot/vmlinuz -initrd ./initrd.img \
        -append "console=ttyS0 loglevel=0"
}

for ((memcg = 0; memcg < $memcgs; memcg++)); do
    run $memcg &
done

wait

Results
=======
                 Before [1]    After    Change
------------------------------------------------
Total samples    6824          7237     +6%

Notes
=====
[1] "mm: rmap: Don't flush TLB after checking PTE young for page
    reference" was included so that the comparison is apples to
    apples.
    https://lore.kernel.org/r/20220706112041.3831-1-21cnbao@gmail.com/
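[Editorial note: since each microVM prints its multichase output to the
serial console (which -nographic routes to the qemu process's stdout), a
hypothetical variant of the run() helper above could save each guest's
console to a per-VM log for aggregating "Total samples" after `wait`. The
./logs/ path is made up; the rest of run_microvms.sh is unchanged.]

  # Sketch: same run() as in run_microvms.sh above, with the guest console
  # captured to ./logs/microvm$1.log.
  mkdir -p ./logs

  run() {
      path=/sys/fs/cgroup/memcg$1

      mkdir $path
      echo $BASHPID >$path/cgroup.procs

      qemu-system-x86_64 -M microvm,accel=kvm -cpu host -smp 2 -m 6g \
          -nographic -kernel /boot/vmlinuz -initrd ./initrd.img \
          -append "console=ttyS0 loglevel=0" >./logs/microvm$1.log 2>&1
  }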
On 5/27/23 01:44, Yu Zhao wrote:
> TLDR
> ====
> This patchset adds a fast path to clear the accessed bit without
> taking kvm->mmu_lock. It can significantly improve the performance of
> guests when the host is under heavy memory pressure.
>
> ChromeOS has been using a similar approach [1] since mid 2021 and it
> was proven successful on tens of millions of devices.
>
> This v2 addressed previous requests [2] on refactoring code, removing
> inaccurate/redundant texts, etc.
>
> [1] https://crrev.com/c/2987928
> [2] https://lore.kernel.org/r/20230217041230.2417228-1-yuzhao@google.com/

From the KVM point of view the patches look good (though I wouldn't
mind if Nicholas took a look at the ppc part). Jason's comments on the
MMU notifier side are promising as well. Can you send v3 with Oliver's
comments addressed?

Thanks,

Paolo
On Fri, 09 Jun 2023 01:59:35 +0100,
Yu Zhao <yuzhao@google.com> wrote:
>
> TLDR
> ====
> Apache Spark spent 12% less time sorting four billion random integers
> twenty times (in ~4 hours) after this patchset [1].

Why are the 3 architectures you have considered being evaluated with 3
different benchmarks? I am not suggesting you cherry-picked the best
results, but I'd really like to see a variety of benchmarks that
exercise this stuff differently.

Thanks,

M.
On Thu, Jun 8, 2023 at 6:59 PM Yu Zhao <yuzhao@google.com> wrote:
>
> TLDR
> ====
> Multichase in 64 microVMs achieved 6% more total samples (in ~4 hours)
> after this patchset [1].
>
> Hardware
> ========
[...]
>
> Software
> ========
[...]
>
> Procedure
> =========
> HOST $ git clone https://github.com/google/multichase
>
> HOST $ <Build multichase>
> HOST $ <Unpack /boot/initrd.img into ./initrd/>
>
> HOST $ cp multichase/multichase ./initrd/bin/
> HOST $ sed -i \
>     "/^maybe_break top$/i multichase -t 2 -m 4g -n 28800; poweroff" \

I was reminded that I missed one parameter above, i.e.,

      "/^maybe_break top$/i multichase -N -t 2 -m 4g -n 28800; poweroff" \
                                       ^^

>     ./initrd/init
>
> HOST $ <Pack ./initrd/ into ./initrd.img>
>
> HOST $ cat run_microvms.sh
> memcgs=64
>
> run() {
>     path=/sys/fs/cgroup/memcg$1
>
>     mkdir $path
>     echo $BASHPID >$path/cgroup.procs

And one line here:

      echo 4000m >$path/memory.min  # or the largest size that doesn't cause OOM kills

>     qemu-system-x86_64 -M microvm,accel=kvm -cpu host -smp 2 -m 6g \
>         -nographic -kernel /boot/vmlinuz -initrd ./initrd.img \
>         -append "console=ttyS0 loglevel=0"
> }
>
> for ((memcg = 0; memcg < $memcgs; memcg++)); do
>     run $memcg &
> done
>
> wait
>
> Results
> =======
>                  Before [1]    After    Change
> ------------------------------------------------
> Total samples    6824          7237     +6%
>
> Notes
> =====
> [1] "mm: rmap: Don't flush TLB after checking PTE young for page
>     reference" was included so that the comparison is apples to
>     apples.
>     https://lore.kernel.org/r/20220706112041.3831-1-21cnbao@gmail.com/
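[Editorial note: for readers following the thread, the sketch below
consolidates the two corrections quoted above (the added -N flag and the
memory.min line) into the original procedure. It is assembled from the
fixes stated in the email, not text from the original report; the 4000m
value is the example given and may need tuning per the inline comment.]

  # Corrected initrd command line (note the added -N):
  HOST $ sed -i \
      "/^maybe_break top$/i multichase -N -t 2 -m 4g -n 28800; poweroff" \
      ./initrd/init

  # Corrected run() from run_microvms.sh, with the memory.min line added:
  run() {
      path=/sys/fs/cgroup/memcg$1

      mkdir $path
      echo $BASHPID >$path/cgroup.procs
      echo 4000m >$path/memory.min  # or the largest size that doesn't cause OOM kills

      qemu-system-x86_64 -M microvm,accel=kvm -cpu host -smp 2 -m 6g \
          -nographic -kernel /boot/vmlinuz -initrd ./initrd.img \
          -append "console=ttyS0 loglevel=0"
  }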
On Fri, Jun 9, 2023 at 7:04 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Fri, 09 Jun 2023 01:59:35 +0100,
> Yu Zhao <yuzhao@google.com> wrote:
> >
> > TLDR
> > ====
> > Apache Spark spent 12% less time sorting four billion random integers
> > twenty times (in ~4 hours) after this patchset [1].
>
> Why are the 3 architectures you have considered being evaluated with 3
> different benchmarks?

I was hoping people with a special interest in different archs might try
to reproduce the benchmarks that I didn't report (but did cover) and see
what happens.

> I am not suggesting you cherry-picked the best results

I'm generally very conservative when reporting *synthetic* results. For
example, the same memcached benchmark used on powerpc yielded a >50%
improvement on aarch64, because the default Ubuntu Kconfig uses a 64KB
base page size for powerpc but 4KB for aarch64. (Before the series, the
reclaim (swap) path takes kvm->mmu_lock for *write* on O(nr of all pages
to consider); after the series, it becomes O(actual nr of pages to swap),
which is <10% given how the benchmark was set up.)

         Ops/sec      Avg. Latency  p50 Latency  p99 Latency  p99.9 Latency
---------------------------------------------------------------------------
Before   639511.40    0.09940       0.04700      0.27100      22.52700
After    974184.60    0.06471       0.04700      0.15900      3.75900

> but I'd really like to see a variety of benchmarks that exercise this
> stuff differently.

I'd be happy to try other synthetic workloads that people think are
relatively representative. Also, I've backported the series and started
an A/B experiment involving ~1 million devices (real-world workloads).
We should have the preliminary results by the time I post the next
version.
On Fri, Jun 9, 2023 at 3:08 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 5/27/23 01:44, Yu Zhao wrote:
> > TLDR
> > ====
> > This patchset adds a fast path to clear the accessed bit without
> > taking kvm->mmu_lock. It can significantly improve the performance of
> > guests when the host is under heavy memory pressure.
> >
> > ChromeOS has been using a similar approach [1] since mid 2021 and it
> > was proven successful on tens of millions of devices.
> >
> > This v2 addressed previous requests [2] on refactoring code, removing
> > inaccurate/redundant texts, etc.
> >
> > [1] https://crrev.com/c/2987928
> > [2] https://lore.kernel.org/r/20230217041230.2417228-1-yuzhao@google.com/
>
> From the KVM point of view the patches look good (though I wouldn't
> mind if Nicholas took a look at the ppc part). Jason's comments on the
> MMU notifier side are promising as well. Can you send v3 with Oliver's
> comments addressed?

Thanks. I'll address all the comments in v3 and post it asap.

Meanwhile, some updates on the recent progress from my side:

1. I've asked some downstream kernels to pick up v2 for testing; the
   Archlinux Zen kernel did. I don't really expect its enthusiastic
   testers to find this series relevant to their use cases. But who
   knows.

2. I've also asked openbenchmarking.org to run their popular highmem
   benchmark suites with v2. Hopefully they'll have some independent
   results soon.

3. I've backported v2 to v5.15 and v6.1 and started an A/B experiment
   involving ~1 million devices, as I mentioned in another email in this
   thread. I should have some results to share when posting v3.