From patchwork Tue Oct 22 17:55:12 2024
X-Patchwork-Submitter: Kairui Song
X-Patchwork-Id: 13846012
From: Kairui Song <ryncsn@gmail.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Kalesh Singh, "Huang, Ying", linux-kernel@vger.kernel.org,
	Kairui Song
Subject: [PATCH v2] mm, swap: avoid over reclaim of full clusters
Date: Wed, 23 Oct 2024 01:55:12 +0800
Message-ID: <20241022175512.10398-1-ryncsn@gmail.com>
X-Mailer: git-send-email 2.47.0
Reply-To: Kairui Song <ryncsn@gmail.com>
MIME-Version: 1.0
From: Kairui Song

When running low on usable slots, the cluster allocator will try to
reclaim full clusters aggressively to recover HAS_CACHE slots. This
guarantees that as long as there are any usable slots, HAS_CACHE or
not, the swap device stays usable and the workload won't go OOM early.

Before the cluster allocator, the swap allocator failed easily when the
device was filled up with reclaimable HAS_CACHE slots, which can be
reproduced with the following simple program:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define SIZE 8192UL * 1024UL * 1024UL

int main(int argc, char **argv) {
	long tmp;
	char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	memset(p, 0, SIZE);
	madvise(p, SIZE, MADV_PAGEOUT);
	for (unsigned long i = 0; i < SIZE; ++i)
		tmp += p[i];
	getchar(); /* Pause */
	return 0;
}

Set up an 8G non-ramdisk swap device: the first run of the program
swaps out 8G of RAM successfully. But if the same program is run again
while the first run is paused, the second run can't swap out all 8G of
memory, since half of the swap device is now pinned by HAS_CACHE. The
old allocator had a random scan that might reclaim part of the
HAS_CACHE slots by luck, but it was unreliable.

The new allocator added an aggressive reclaim of full clusters when the
device is low on usable slots. But when multiple CPUs see the device
running low on usable slots at the same time, they run into a
thundering herd problem.

This is an observable problem on large machines with massively parallel
workloads, as full cluster reclaim is slower on a large swap device and
a higher number of CPUs makes things worse.

Testing with a 128G ZRAM on a 48c96t system: when the swap device is
very close to full (e.g. 124G / 128G), building the Linux kernel with
make -j96 in a 1G memory cgroup hangs (not a softlockup, though),
spinning in full cluster reclaim for about 5 minutes before going OOM.

To solve this, split the full reclaim into two parts:

- Instead of doing a synchronous aggressive reclaim whenever the device
  is low, do only one aggressive reclaim, in a kworker, when the device
  is strictly full. This still ensures that in the worst case the
  device won't become unusable because of HAS_CACHE slots.

- To avoid allocations (especially higher order ones) suffering from
  HAS_CACHE filling up clusters while the kworker is not responsive
  enough, do one synchronous scan every time the free list is drained,
  and scan only one cluster. This is similar to the old random reclaim,
  keeps the full clusters rotated, and has minimal latency. It should
  provide a fair reclaim strategy suitable for most workloads; a toy
  sketch of the two modes follows below.
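Not part of the patch, an illustration only: below is a minimal
userspace sketch of the two reclaim modes described above. Every name
in it (toy_cluster, reclaim_full_clusters, NR_CLUSTERS) is made up for
this example; the real logic, including the locking and the
inuse_pages-based scan budget, is only what the diff below shows.

#include <stdbool.h>
#include <stdio.h>

#define NR_CLUSTERS 8
#define SLOTS_PER_CLUSTER 4

/* Each slot models a swap entry pinned only by SWAP_HAS_CACHE */
struct toy_cluster {
	bool has_cache[SLOTS_PER_CLUSTER];
};

static struct toy_cluster clusters[NR_CLUSTERS];
static int full_head; /* models the rotating si->full_clusters list */

static int reclaim_one(struct toy_cluster *c)
{
	int freed = 0;

	for (int i = 0; i < SLOTS_PER_CLUSTER; i++) {
		if (c->has_cache[i]) {
			c->has_cache[i] = false; /* drop the cached copy */
			freed++;
		}
	}
	return freed;
}

/*
 * force == false: the allocation path scans exactly one cluster and
 * rotates the list, so its latency is small and bounded.
 * force == true: the worker walks all clusters (the kernel bounds this
 * by inuse_pages / SWAPFILE_CLUSTER instead of a fixed count).
 */
static int reclaim_full_clusters(bool force)
{
	int to_scan = force ? NR_CLUSTERS : 1;
	int freed = 0;

	while (to_scan-- > 0) {
		freed += reclaim_one(&clusters[full_head]);
		full_head = (full_head + 1) % NR_CLUSTERS;
	}
	return freed;
}

int main(void)
{
	for (int i = 0; i < NR_CLUSTERS; i++)
		for (int j = 0; j < SLOTS_PER_CLUSTER; j++)
			clusters[i].has_cache[j] = true;

	printf("inline pass freed %d slots\n", reclaim_full_clusters(false));
	printf("worker pass freed %d slots\n", reclaim_full_clusters(true));
	return 0;
}

The point is just the split: the synchronous caller pays for at most
one cluster, while the expensive full walk only ever runs in the
kworker.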
Fixes: 2cacbdfdee65 ("mm: swap: add a adaptive full cluster cache reclaim")
Signed-off-by: Kairui Song
---
Update from V1:
https://lore.kernel.org/linux-mm/20240925175241.46679-1-ryncsn@gmail.com/
- Resend patch, minor adjustment of commit message.

 include/linux/swap.h |  1 +
 mm/swapfile.c        | 49 +++++++++++++++++++++++++++-----------------
 2 files changed, 31 insertions(+), 19 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index ca533b478c21..f3e0ac20c2e8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -335,6 +335,7 @@ struct swap_info_struct {
 					 * list.
 					 */
 	struct work_struct discard_work; /* discard worker */
+	struct work_struct reclaim_work; /* reclaim worker */
 	struct list_head discard_clusters; /* discard clusters list */
 	struct plist_node avail_lists[]; /*
 					   * entries in swap_avail_heads, one
diff --git a/mm/swapfile.c b/mm/swapfile.c
index b0915f3fab31..46bd4b1a3c07 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -731,15 +731,16 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigne
 	return offset;
 }
 
-static void swap_reclaim_full_clusters(struct swap_info_struct *si)
+/* Return true if reclaimed a whole cluster */
+static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
 {
 	long to_scan = 1;
 	unsigned long offset, end;
 	struct swap_cluster_info *ci;
 	unsigned char *map = si->swap_map;
-	int nr_reclaim, total_reclaimed = 0;
+	int nr_reclaim;
 
-	if (atomic_long_read(&nr_swap_pages) <= SWAPFILE_CLUSTER)
+	if (force)
 		to_scan = si->inuse_pages / SWAPFILE_CLUSTER;
 
 	while (!list_empty(&si->full_clusters)) {
@@ -749,28 +750,36 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si)
 		end = min(si->max, offset + SWAPFILE_CLUSTER);
 		to_scan--;
 
+		spin_unlock(&si->lock);
 		while (offset < end) {
 			if (READ_ONCE(map[offset]) == SWAP_HAS_CACHE) {
-				spin_unlock(&si->lock);
 				nr_reclaim = __try_to_reclaim_swap(si, offset,
 								   TTRS_ANYWAY | TTRS_DIRECT);
-				spin_lock(&si->lock);
-				if (nr_reclaim > 0) {
-					offset += nr_reclaim;
-					total_reclaimed += nr_reclaim;
-					continue;
-				} else if (nr_reclaim < 0) {
-					offset += -nr_reclaim;
+				if (nr_reclaim) {
+					offset += abs(nr_reclaim);
 					continue;
 				}
 			}
 			offset++;
 		}
-		if (to_scan <= 0 || total_reclaimed)
+		spin_lock(&si->lock);
+
+		if (to_scan <= 0)
 			break;
 	}
 }
 
+static void swap_reclaim_work(struct work_struct *work)
+{
+	struct swap_info_struct *si;
+
+	si = container_of(work, struct swap_info_struct, reclaim_work);
+
+	spin_lock(&si->lock);
+	swap_reclaim_full_clusters(si, true);
+	spin_unlock(&si->lock);
+}
+
 /*
  * Try to get swap entries with specified order from current cpu's swap entry
  * pool (a cluster). This might involve allocating a new cluster for current CPU
@@ -800,6 +809,10 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		goto done;
 	}
 
+	/* Try reclaim from full clusters if free clusters list is drained */
+	if (vm_swap_full())
+		swap_reclaim_full_clusters(si, false);
+
 	if (order < PMD_ORDER) {
 		unsigned int frags = 0;
 
@@ -881,13 +894,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 	}
 
 done:
-	/* Try reclaim from full clusters if device is nearfull */
-	if (vm_swap_full() && (!found || (si->pages - si->inuse_pages) < SWAPFILE_CLUSTER)) {
-		swap_reclaim_full_clusters(si);
-		if (!found && !order && si->pages != si->inuse_pages)
-			goto new_cluster;
-	}
-
 	cluster->next[order] = offset;
 	return found;
 }
@@ -922,6 +928,9 @@ static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
 		si->lowest_bit = si->max;
 		si->highest_bit = 0;
 		del_from_avail_list(si);
+
+		if (vm_swap_full())
+			schedule_work(&si->reclaim_work);
 	}
 }
 
@@ -2816,6 +2825,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	wait_for_completion(&p->comp);
 
 	flush_work(&p->discard_work);
+	flush_work(&p->reclaim_work);
 	destroy_swap_extents(p);
 
 	if (p->flags & SWP_CONTINUED)
@@ -3376,6 +3386,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		return PTR_ERR(si);
 
 	INIT_WORK(&si->discard_work, swap_discard_work);
+	INIT_WORK(&si->reclaim_work, swap_reclaim_work);
 
 	name = getname(specialfile);
 	if (IS_ERR(name)) {