From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Yu Zhao, Roman Gushchin, Johannes Weiner, Michal Hocko,
    Hugh Dickins, Nhat Pham, Yuanchu Xie, Suren Baghdasaryan,
    "T.J. Mercier", Kairui Song
Subject: [RFC PATCH 0/4] Refault distance checking for MGLRU
Date: Wed, 26 Jul 2023 02:57:28 +0800
Message-ID: <20230725185733.43929-1-ryncsn@gmail.com>
X-Patchwork-Id: 13326947

From: Kairui Song

Hi, linux-mm

I noticed that MGLRU does not work well for certain workloads. This was
observed on some instances on heavily stressed machines.

I found that this is related to refault distance detection. The problem
appears when the file page working set size exceeds total memory, and the
access distance of file pages (the left-shift time of a page before it
gets activated, considering that the LRU starts from the right) is also
larger than total memory. All file pages are then stuck in the oldest
generation and are repeatedly read in and evicted; few get activated and
stay in memory.

This series tries to fix the problem by reworking the refault distance
detection to better fit MGLRU, and also tries to use a unified algorithm
for both MGLRU and the inactive/active LRU.

Patch 1/4 reworks the refault distance detection model for the
inactive/active LRU.
Patches 2/4 and 3/4 are simplification and preparation.
Patch 4/4 applies the modified refault distance detection to MGLRU.

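For readers less familiar with the term, here is a minimal sketch of the conventional refault distance check, along the lines of what the comments in mm/workingset.c describe for the inactive/active LRU. It is a toy model for illustration only, not the reworked model from this series: `ToyLru`, `active_pages`, `evict()` and `should_activate()` are hypothetical names, and the real kernel code tracks considerably more state.

```python
from dataclasses import dataclass


@dataclass
class ToyLru:
    """Toy model of one LRU; every name here is hypothetical."""
    active_pages: int         # size of the set treated as the working set
    nonresident_age: int = 0  # counter advanced on every eviction

    def evict(self) -> int:
        """Evict one page and return the snapshot kept as its shadow entry."""
        snapshot = self.nonresident_age
        self.nonresident_age += 1
        return snapshot

    def should_activate(self, shadow: int) -> bool:
        """On refault, activate only if the refault distance fits the working set."""
        refault_distance = self.nonresident_age - shadow
        return refault_distance <= self.active_pages


if __name__ == "__main__":
    lru = ToyLru(active_pages=100)
    shadow = lru.evict()
    for _ in range(50):
        lru.evict()
    # Short refault distance: the page would have stayed resident with a
    # slightly larger inactive list, so it is activated.
    print("activate after 50 more evictions:", lru.should_activate(shadow))
    for _ in range(200):
        lru.evict()
    # Long refault distance: the page is treated as a one-off access.
    print("activate after 250 more evictions:", lru.should_activate(shadow))
```

In this toy model, a working set whose access distance is always larger than the threshold never gets activated, which is the kind of behaviour described above.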
The following benchmark shows a 5x improvement:

To simulate the workload, I set up a 3-replica mongodb cluster using
docker, each replica in a standalone cgroup, set to use 5G of cache and
10G of oplog, on a 32G VM. The benchmark is done using
https://github.com/apavlo/py-tpcc.git, modified to run the STOCK_LEVEL
query only, to simulate slow queries and get a stable result.

Before the patch (with 10G swap; the result does not change whether swap
is on or not):

$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 904 seconds
------------------------------------------------------------------
                  Executed        Time (µs)       Rate
  STOCK_LEVEL     503             27150226136.4   0.02 txn/s
------------------------------------------------------------------
  TOTAL           503             27150226136.4   0.02 txn/s

$ cat /proc/vmstat | grep working
workingset_nodes 53391
workingset_refault_anon 0
workingset_refault_file 23856735
workingset_activate_anon 0
workingset_activate_file 23845737
workingset_restore_anon 0
workingset_restore_file 18280692
workingset_nodereclaim 1024

$ free -m
        total   used   free  shared  buff/cache  available
Mem:    31837   6752    379      23       24706      24607
Swap:   10239      0  10239

After the patch (with 10G swap on the same disk; similar result using
ZRAM):

$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 903 seconds
------------------------------------------------------------------
                  Executed        Time (µs)       Rate
  STOCK_LEVEL     2575            27094953498.8   0.10 txn/s
------------------------------------------------------------------
  TOTAL           2575            27094953498.8   0.10 txn/s

$ cat /proc/vmstat | grep working
workingset_nodes 78249
workingset_refault_anon 10139
workingset_refault_file 23001863
workingset_activate_anon 7238
workingset_activate_file 6718032
workingset_restore_anon 7432
workingset_restore_file 6719406
workingset_nodereclaim 9747

$ free -m
        total   used   free  shared  buff/cache  available
Mem:    31837   7376    320       3       24140      24014
Swap:   10239   1662   8577

The performance is 5x better than before, and the idle anon pages now
get swapped out as expected.

Testing with lower stress also shows an improvement.

I also checked the benchmark with memtier/memcached and fio, using a
setup similar to the one in commit ac35a4902374 but scaled down to fit
in my test environment:

memtier test (with 16G ramdisk as swap and 2G cgroup limit):
  memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 \
    -t 12 -B binary &
  memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys \
    --key-minimum=1 --key-maximum=24000000 --key-pattern=P:P -c 1 \
    -t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6

fio test (with 16G ramdisk on /mnt and 4G cgroup limit):
  fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \
    --buffered=1 --ioengine=io_uring --iodepth=128 \
    --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
    --rw=randread --random_distribution=random --norandommap \
    --time_based --ramp_time=5m --runtime=5m --group_reporting

Before this patch:
memcached:
            Ops/sec  Hits/sec  Misses/sec  Avg. Latency  p50 Latency  p99 Latency  p99.9 Latency     KB/sec
Best       52832.79      0.00        0.00       1.82042      1.70300      4.54300        6.27100  105641.69
Worst      46613.56      0.00        0.00       2.05686      1.77500      7.80700       11.83900   93206.05
Avg (6x)   51024.85      0.00        0.00       1.88506      1.73500      5.43900        9.47100  102026.64
fio:
read: IOPS=2211k, BW=8637MiB/s (9056MB/s)(2530GiB/300001msec)

After this patch:
memcached:
            Ops/sec  Avg. Latency  p50 Latency  p99 Latency  p99.9 Latency     KB/sec
Best       54218.92       1.76930      1.65500      4.41500        6.27100  108413.34
Worst      47640.13       2.01495      1.74300      7.64700       11.64700   95258.72
Avg (6x)   51408.33       1.86988      1.71900      5.43900        9.34300  102793.42
fio:
read: IOPS=2166k, BW=8462MiB/s (8873MB/s)(2479GiB/300001msec)

memcached looks OK, but there is a 2% performance drop in the fio test.
After some profiling, this is mainly caused by the extra atomic
operations and new functions; there seems to be no drop in LRU accuracy.

Sending this as an RFC because I'm not entirely sure whether this is the
right way to fix the issue, or whether this is a generic issue or should
be considered more of a misconfiguration. Any suggestions on how I
should test it are welcome.

Signed-off-by: Kairui Song

Kairui Song (4):
  workingset: simplify and use a more intuitive model
  workingset: simplify lru_gen_test_recent
  lru_gen: convert avg_total and avg_refaulted to atomic
  workingset, lru_gen: apply refault-distance based re-activation

 include/linux/mmzone.h |   4 +-
 include/linux/swap.h   |   2 -
 mm/swap.c              |   1 -
 mm/vmscan.c            |  18 ++-
 mm/workingset.c        | 315 ++++++++++++++++++++++-------------------
 5 files changed, 179 insertions(+), 161 deletions(-)
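
As a cross-check of the "5x" and "2%" figures quoted above, the arithmetic can be sketched as follows. The numbers are copied from the mongodb and fio results in this mail. The assumption that the tpcc "Rate" column is the executed count divided by the cumulative "Time" column in seconds is an inference from the printed values, not something confirmed from the py-tpcc source; `rate()` is a hypothetical helper.

```python
def rate(executed: int, total_time_us: float) -> float:
    """Transactions per second of cumulative execution time (assumed formula)."""
    return executed / (total_time_us / 1_000_000)


# STOCK_LEVEL results from the two tpcc.py runs quoted in this mail.
before = rate(503, 27150226136.4)
after = rate(2575, 27094953498.8)
speedup = after / before

# fio read IOPS (in thousands) before and after the series.
fio_before_kiops = 2211
fio_after_kiops = 2166
fio_drop_pct = (fio_before_kiops - fio_after_kiops) / fio_before_kiops * 100

if __name__ == "__main__":
    print(f"tpcc rate before: {before:.2f} txn/s, after: {after:.2f} txn/s")
    print(f"tpcc speedup: {speedup:.2f}x")
    print(f"fio IOPS drop: {fio_drop_pct:.2f}%")
```

Under that assumption the computed rates round to the 0.02 and 0.10 txn/s shown in the tpcc output, the ratio comes out at a little over 5x, and the fio drop at about 2%.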