From patchwork Wed Jun 21 02:31:01 2023
X-Patchwork-Submitter: Yosry Ahmed
X-Patchwork-Id: 13286578
Date: Wed, 21 Jun 2023 02:31:01 +0000
Message-ID: <20230621023101.432780-1-yosryahmed@google.com>
Subject: [PATCH 2/2] mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
From: Yosry Ahmed
To: Andrew Morton
Cc: Yu Zhao, Johannes Weiner, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Yosry Ahmed

When memory.reclaim was introduced, it became the first case where
cgroup_reclaim() is true for the root cgroup. Johannes concluded [1]
that for most cases this is okay, except for one case. Historically,
kswapd would throttle reclaim on a node if a lot of pages marked for
reclaim are under writeback (aka the node is congested). This occurred
by setting the LRUVEC_CONGESTED bit in lruvec->flags. The bit would be
cleared when the node is balanced.

Similarly, cgroup reclaim would set the same bit when an lruvec is
congested, and clear it on the way out of reclaim (to throttle local
reclaimers).

Before the introduction of memory.reclaim, the root memcg was the only
target of kswapd reclaim, and non-root memcgs were the only targets of
cgroup reclaim, so they would never interfere. Using the same bit for
both was fine.

After memory.reclaim, it is possible for cgroup reclaim on the root
cgroup to clear the bit set by kswapd. This would result in reclaim on
the node being unthrottled before the node is balanced.

Fix this by introducing separate bits for cgroup-level and node-level
congestion. kswapd can unthrottle an lruvec that is marked as congested
by cgroup reclaim (as the entire node should no longer be congested),
but not vice versa (to prevent premature unthrottling before the entire
node is balanced).

[1] https://lore.kernel.org/lkml/20230405200150.GA35884@cmpxchg.org/

Reported-by: Johannes Weiner
Closes: https://lore.kernel.org/lkml/20230405200150.GA35884@cmpxchg.org/
Signed-off-by: Yosry Ahmed
---
 include/linux/mmzone.h | 18 +++++++++++++++---
 mm/vmscan.c            | 19 ++++++++++++-------
 2 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3e822335f214..d863698a84e0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -293,9 +293,21 @@ static inline bool is_active_lru(enum lru_list lru)
 #define ANON_AND_FILE 2
 
 enum lruvec_flags {
-	LRUVEC_CONGESTED,		/* lruvec has many dirty pages
-					 * backed by a congested BDI
-					 */
+	/*
+	 * An lruvec has many dirty pages backed by a congested BDI:
+	 * 1. LRUVEC_CGROUP_CONGESTED is set by cgroup-level reclaim.
+	 *    It can be cleared by cgroup reclaim or kswapd.
+	 * 2. LRUVEC_NODE_CONGESTED is set by kswapd node-level reclaim.
+	 *    It can only be cleared by kswapd.
+	 *
+	 * Essentially, kswapd can unthrottle an lruvec throttled by cgroup
+	 * reclaim, but not vice versa. This only applies to the root cgroup.
+	 * The goal is to prevent cgroup reclaim on the root cgroup (e.g.
+	 * memory.reclaim) from unthrottling an unbalanced node (that was
+	 * throttled by kswapd).
+	 */
+	LRUVEC_CGROUP_CONGESTED,
+	LRUVEC_NODE_CONGESTED,
 };
 
 #endif /* !__GENERATING_BOUNDS_H */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0dbbf718c53e..c22e4e7368da 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6592,10 +6592,13 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	 * Legacy memcg will stall in page writeback so avoid forcibly
 	 * stalling in reclaim_throttle().
 	 */
-	if ((current_is_kswapd() ||
-	     (cgroup_reclaim(sc) && writeback_throttling_sane(sc))) &&
-	    sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
-		set_bit(LRUVEC_CONGESTED, &target_lruvec->flags);
+	if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested) {
+		if (cgroup_reclaim(sc) && writeback_throttling_sane(sc))
+			set_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags);
+
+		if (current_is_kswapd())
+			set_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags);
+	}
 
 	/*
 	 * Stall direct reclaim for IO completions if the lruvec is
@@ -6605,7 +6608,8 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	 */
 	if (!current_is_kswapd() && current_may_throttle() &&
 	    !sc->hibernation_mode &&
-	    test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
+	    (test_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags) ||
+	     test_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags)))
 		reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED);
 
 	if (should_continue_reclaim(pgdat, nr_node_reclaimed, sc))
@@ -6862,7 +6866,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 
 			lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup,
 						   zone->zone_pgdat);
-			clear_bit(LRUVEC_CONGESTED, &lruvec->flags);
+			clear_bit(LRUVEC_CGROUP_CONGESTED, &lruvec->flags);
 		}
 	}
 
@@ -7251,7 +7255,8 @@ static void clear_pgdat_congested(pg_data_t *pgdat)
 {
 	struct lruvec *lruvec = mem_cgroup_lruvec(NULL, pgdat);
 
-	clear_bit(LRUVEC_CONGESTED, &lruvec->flags);
+	clear_bit(LRUVEC_NODE_CONGESTED, &lruvec->flags);
+	clear_bit(LRUVEC_CGROUP_CONGESTED, &lruvec->flags);
 	clear_bit(PGDAT_DIRTY, &pgdat->flags);
 	clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
 }
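
Illustration only, not part of the patch: a minimal user-space sketch of
the two-bit scheme described in the changelog. It assumes a plain
unsigned long flags word and non-atomic helpers in place of the kernel's
set_bit()/clear_bit()/test_bit(). Every model_* name and struct
lruvec_model are hypothetical. The writeback_throttling_sane() check is
folded into the is_cgroup_reclaim argument, and the other conditions on
the throttling path (current_may_throttle(), hibernation mode) are left
out.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit numbers mirror the enum introduced by the patch. */
enum lruvec_flags {
	LRUVEC_CGROUP_CONGESTED,
	LRUVEC_NODE_CONGESTED,
};

/* Hypothetical stand-in for struct lruvec: only the flags word is modelled. */
struct lruvec_model {
	unsigned long flags;
};

/* Non-atomic stand-ins for the kernel's set_bit(), clear_bit(), test_bit(). */
static void model_set_bit(int nr, unsigned long *addr)
{
	*addr |= 1UL << nr;
}

static void model_clear_bit(int nr, unsigned long *addr)
{
	*addr &= ~(1UL << nr);
}

static bool model_test_bit(int nr, const unsigned long *addr)
{
	return (*addr >> nr) & 1UL;
}

/* As in shrink_node(): each kind of reclaimer sets only its own bit. */
static void model_mark_congested(struct lruvec_model *lruvec, bool is_kswapd,
				 bool is_cgroup_reclaim)
{
	if (is_cgroup_reclaim)
		model_set_bit(LRUVEC_CGROUP_CONGESTED, &lruvec->flags);
	if (is_kswapd)
		model_set_bit(LRUVEC_NODE_CONGESTED, &lruvec->flags);
}

/* As in do_try_to_free_pages(): cgroup reclaim clears only the cgroup bit. */
static void model_cgroup_reclaim_exit(struct lruvec_model *lruvec)
{
	model_clear_bit(LRUVEC_CGROUP_CONGESTED, &lruvec->flags);
}

/* As in clear_pgdat_congested(): a balanced node clears both bits. */
static void model_node_balanced(struct lruvec_model *lruvec)
{
	model_clear_bit(LRUVEC_NODE_CONGESTED, &lruvec->flags);
	model_clear_bit(LRUVEC_CGROUP_CONGESTED, &lruvec->flags);
}

/* Reclaimers are throttled while either bit is set. */
static bool model_should_throttle(const struct lruvec_model *lruvec)
{
	return model_test_bit(LRUVEC_CGROUP_CONGESTED, &lruvec->flags) ||
	       model_test_bit(LRUVEC_NODE_CONGESTED, &lruvec->flags);
}

/* Walk through the scenario from the changelog on the root lruvec. */
static void model_root_scenario(void)
{
	struct lruvec_model root = { 0 };

	/* kswapd finds the node congested. */
	model_mark_congested(&root, true, false);

	/* Root memory.reclaim runs, also sees congestion, then exits. */
	model_mark_congested(&root, false, true);
	model_cgroup_reclaim_exit(&root);

	/* The node-level bit survives, so the node stays throttled. */
	assert(model_should_throttle(&root));

	/* Only kswapd balancing the node lifts the throttle. */
	model_node_balanced(&root);
	assert(!model_should_throttle(&root));
}
```

With the single shared bit from before the patch, the step modelled by
model_cgroup_reclaim_exit() would have cleared kswapd's mark as well,
which is the premature unthrottling the changelog describes.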