From patchwork Thu Apr 13 21:43:26 2023
X-Patchwork-Submitter: Kalesh Singh
X-Patchwork-Id: 13210687
Date: Thu, 13 Apr 2023 14:43:26 -0700
X-Mailer: git-send-email 2.40.0.634.g4ca3ef3211-goog
Message-ID: <20230413214326.2147568-1-kaleshsingh@google.com>
Subject: [PATCH] mm: Multi-gen LRU: remove wait_event_killable()
From: Kalesh Singh
To: yuzhao@google.com, minchan@google.com, akpm@linux-foundation.org
Cc: surenb@google.com, wvw@google.com, android-mm@google.com,
    kernel-team@android.com, Kalesh Singh, Minchan Kim,
    Oleksandr Natalenko, "Jan Alexander Steffens (heftig)",
    Suleiman Souhlal, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Android 14 and later default to MGLRU [1] and field telemetry showed
occasional long tail latency (>100ms) in the reclaim path.

Tracing revealed priority inversion in the reclaim path. In
try_to_inc_max_seq(), when high priority tasks were blocked on
wait_event_killable(), the preemption of the low priority task to call
wake_up_all() caused those high priority tasks to wait longer than
necessary. In general, this problem is not different from others of its
kind, e.g., one caused by mutex_lock(). However, it is specific to MGLRU
because it introduced the new wait queue lruvec->mm_state.wait.

The purpose of this new wait queue is to avoid the thundering herd
problem. If many direct reclaimers rush into try_to_inc_max_seq(), only
one can succeed, i.e., the one to wake up the rest, and the rest who
failed might cause premature OOM kills if they do not wait. So far there
is no evidence supporting this scenario, based on how often the wait has
been hit. And this begs the question how useful the wait queue is in
practice.

Based on Minchan's recommendation, which is in line with his commit
6d4675e60135 ("mm: don't be stuck to rmap lock on reclaim path") and the
rest of the MGLRU code which also uses trylock when possible, remove the
wait queue.
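To make the trylock approach concrete in a self-contained way, below is a
rough userspace sketch (pthread-based, illustrative only; try_to_inc_seq()
and the locking granularity are simplified stand-ins, not the kernel
implementation). Racing "reclaimers" that lose pthread_mutex_trylock()
back off immediately and keep working at the old sequence number, rather
than sleeping on a wait queue until the winner calls wake_up_all():

/*
 * Illustrative userspace sketch, not kernel code.
 * Build: cc -pthread sketch.c
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t seq_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long max_seq;

/* advance max_seq once; losers return immediately instead of sleeping */
static bool try_to_inc_seq(unsigned long seen_seq)
{
	bool success = false;

	if (pthread_mutex_trylock(&seq_lock))
		return false;	/* another walker holds the lock; don't wait */

	if (max_seq == seen_seq) {
		max_seq++;	/* stand-in for the expensive aging work */
		success = true;
	}

	pthread_mutex_unlock(&seq_lock);
	return success;
}

static void *reclaimer(void *arg)
{
	unsigned long seen_seq = (unsigned long)arg;	/* caller's snapshot */

	if (!try_to_inc_seq(seen_seq))
		printf("lost the race; reclaiming at the old seq, no waiting\n");
	return NULL;
}

int main(void)
{
	pthread_t threads[4];
	unsigned long seen_seq = max_seq;	/* snapshot before the race */

	for (int i = 0; i < 4; i++)
		pthread_create(&threads[i], NULL, reclaimer, (void *)seen_seq);
	for (int i = 0; i < 4; i++)
		pthread_join(threads[i], NULL);

	printf("max_seq = %lu\n", max_seq);
	return 0;
}

The kernel-side analogue in this patch is walk_mm() rechecking max_seq and
bailing out with -EBUSY, instead of losers sleeping on
lruvec->mm_state.wait until the winner wakes them.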
[1] https://android-review.googlesource.com/q/I7ed7fbfd6ef9ce10053347528125dd98c39e50bf

Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
Cc: Yu Zhao
Cc: Minchan Kim
Reported-by: Wei Wang
Suggested-by: Minchan Kim
Signed-off-by: Kalesh Singh
Acked-by: Yu Zhao
---
 include/linux/mmzone.h |   8 +--
 mm/vmscan.c            | 112 +++++++++++++++-------------------------
 2 files changed, 42 insertions(+), 78 deletions(-)

base-commit: de4664485abbc0529b1eec44d0061bbfe58a28fb

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9fb1b03b83b2..4509ac2b54a6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -453,18 +453,14 @@ enum {
 struct lru_gen_mm_state {
 	/* set to max_seq after each iteration */
 	unsigned long seq;
-	/* where the current iteration continues (inclusive) */
+	/* where the current iteration continues after */
 	struct list_head *head;
-	/* where the last iteration ended (exclusive) */
+	/* where the last iteration ended before */
 	struct list_head *tail;
-	/* to wait for the last page table walker to finish */
-	struct wait_queue_head wait;
 	/* Bloom filters flip after each iteration */
 	unsigned long *filters[NR_BLOOM_FILTERS];
 	/* the mm stats for debugging */
 	unsigned long stats[NR_HIST_GENS][NR_MM_STATS];
-	/* the number of concurrent page table walkers */
-	int nr_walkers;
 };
 
 struct lru_gen_mm_walk {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c1c5e8b24b8..a038fe70dda0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3394,18 +3394,13 @@ void lru_gen_del_mm(struct mm_struct *mm)
 	for_each_node(nid) {
 		struct lruvec *lruvec = get_lruvec(memcg, nid);
 
-		/* where the last iteration ended (exclusive) */
+		/* where the current iteration continues after */
+		if (lruvec->mm_state.head == &mm->lru_gen.list)
+			lruvec->mm_state.head = lruvec->mm_state.head->prev;
+
+		/* where the last iteration ended before */
 		if (lruvec->mm_state.tail == &mm->lru_gen.list)
 			lruvec->mm_state.tail = lruvec->mm_state.tail->next;
-
-		/* where the current iteration continues (inclusive) */
-		if (lruvec->mm_state.head != &mm->lru_gen.list)
-			continue;
-
-		lruvec->mm_state.head = lruvec->mm_state.head->next;
-		/* the deletion ends the current iteration */
-		if (lruvec->mm_state.head == &mm_list->fifo)
-			WRITE_ONCE(lruvec->mm_state.seq, lruvec->mm_state.seq + 1);
 	}
 
 	list_del_init(&mm->lru_gen.list);
@@ -3501,68 +3496,54 @@ static bool iterate_mm_list(struct lruvec *lruvec, struct lru_gen_mm_walk *walk,
 			    struct mm_struct **iter)
 {
 	bool first = false;
-	bool last = true;
+	bool last = false;
 	struct mm_struct *mm = NULL;
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
 	struct lru_gen_mm_state *mm_state = &lruvec->mm_state;
 
 	/*
-	 * There are four interesting cases for this page table walker:
-	 * 1. It tries to start a new iteration of mm_list with a stale max_seq;
-	 *    there is nothing left to do.
-	 * 2. It's the first of the current generation, and it needs to reset
-	 *    the Bloom filter for the next generation.
-	 * 3. It reaches the end of mm_list, and it needs to increment
-	 *    mm_state->seq; the iteration is done.
-	 * 4. It's the last of the current generation, and it needs to reset the
-	 *    mm stats counters for the next generation.
+	 * mm_state->seq is incremented after each iteration of mm_list. There
+	 * are three interesting cases for this page table walker:
+	 * 1. It tries to start a new iteration with a stale max_seq: there is
+	 *    nothing left to do.
+	 * 2. It started the next iteration: it needs to reset the Bloom filter
+	 *    so that a fresh set of PTE tables can be recorded.
+	 * 3. It ended the current iteration: it needs to reset the mm stats
+	 *    counters and tell its caller to increment max_seq.
 	 */
 	spin_lock(&mm_list->lock);
 
 	VM_WARN_ON_ONCE(mm_state->seq + 1 < walk->max_seq);
-	VM_WARN_ON_ONCE(*iter && mm_state->seq > walk->max_seq);
-	VM_WARN_ON_ONCE(*iter && !mm_state->nr_walkers);
 
-	if (walk->max_seq <= mm_state->seq) {
-		if (!*iter)
-			last = false;
+	if (walk->max_seq <= mm_state->seq)
 		goto done;
-	}
 
-	if (!mm_state->nr_walkers) {
-		VM_WARN_ON_ONCE(mm_state->head && mm_state->head != &mm_list->fifo);
+	if (!mm_state->head)
+		mm_state->head = &mm_list->fifo;
 
-		mm_state->head = mm_list->fifo.next;
+	if (mm_state->head == &mm_list->fifo)
 		first = true;
-	}
-
-	while (!mm && mm_state->head != &mm_list->fifo) {
-		mm = list_entry(mm_state->head, struct mm_struct, lru_gen.list);
 
+	do {
 		mm_state->head = mm_state->head->next;
+		if (mm_state->head == &mm_list->fifo) {
+			WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
+			last = true;
+			break;
+		}
 
 		/* force scan for those added after the last iteration */
-		if (!mm_state->tail || mm_state->tail == &mm->lru_gen.list) {
-			mm_state->tail = mm_state->head;
+		if (!mm_state->tail || mm_state->tail == mm_state->head) {
+			mm_state->tail = mm_state->head->next;
 			walk->force_scan = true;
 		}
 
+		mm = list_entry(mm_state->head, struct mm_struct, lru_gen.list);
 		if (should_skip_mm(mm, walk))
 			mm = NULL;
-	}
-
-	if (mm_state->head == &mm_list->fifo)
-		WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
+	} while (!mm);
 done:
-	if (*iter && !mm)
-		mm_state->nr_walkers--;
-	if (!*iter && mm)
-		mm_state->nr_walkers++;
-
-	if (mm_state->nr_walkers)
-		last = false;
-
 	if (*iter || last)
 		reset_mm_stats(lruvec, walk, last);
@@ -3590,9 +3571,9 @@ static bool iterate_mm_list_nowalk(struct lruvec *lruvec, unsigned long max_seq)
 
 	VM_WARN_ON_ONCE(mm_state->seq + 1 < max_seq);
 
-	if (max_seq > mm_state->seq && !mm_state->nr_walkers) {
-		VM_WARN_ON_ONCE(mm_state->head && mm_state->head != &mm_list->fifo);
-
+	if (max_seq > mm_state->seq) {
+		mm_state->head = NULL;
+		mm_state->tail = NULL;
 		WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
 		reset_mm_stats(lruvec, NULL, true);
 		success = true;
@@ -4192,10 +4173,6 @@ static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
 
 		walk_pmd_range(&val, addr, next, args);
 
-		/* a racy check to curtail the waiting time */
-		if (wq_has_sleeper(&walk->lruvec->mm_state.wait))
-			return 1;
-
 		if (need_resched() || walk->batched >= MAX_LRU_BATCH) {
 			end = (addr | ~PUD_MASK) + 1;
 			goto done;
@@ -4228,8 +4205,14 @@ static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
 	walk->next_addr = FIRST_USER_ADDRESS;
 
 	do {
+		DEFINE_MAX_SEQ(lruvec);
+
 		err = -EBUSY;
 
+		/* another thread might have called inc_max_seq() */
+		if (walk->max_seq != max_seq)
+			break;
+
 		/* folio_update_gen() requires stable folio_memcg() */
 		if (!mem_cgroup_trylock_pages(memcg))
 			break;
@@ -4462,25 +4445,12 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
 		success = iterate_mm_list(lruvec, walk, &mm);
 		if (mm)
 			walk_mm(lruvec, mm, walk);
-
-		cond_resched();
 	} while (mm);
 done:
-	if (!success) {
-		if (sc->priority <= DEF_PRIORITY - 2)
-			wait_event_killable(lruvec->mm_state.wait,
-					    max_seq < READ_ONCE(lrugen->max_seq));
-		return false;
-	}
+	if (success)
+		inc_max_seq(lruvec, can_swap, force_scan);
 
-	VM_WARN_ON_ONCE(max_seq != READ_ONCE(lrugen->max_seq));
-
-	inc_max_seq(lruvec, can_swap, force_scan);
-	/* either this sees any waiters or they will see updated max_seq */
-	if (wq_has_sleeper(&lruvec->mm_state.wait))
-		wake_up_all(&lruvec->mm_state.wait);
-
-	return true;
+	return success;
 }
 
 /******************************************************************************
@@ -6122,7 +6092,6 @@ void lru_gen_init_lruvec(struct lruvec *lruvec)
 			INIT_LIST_HEAD(&lrugen->folios[gen][type][zone]);
 
 	lruvec->mm_state.seq = MIN_NR_GENS;
-	init_waitqueue_head(&lruvec->mm_state.wait);
 }
 
 #ifdef CONFIG_MEMCG
@@ -6155,7 +6124,6 @@ void lru_gen_exit_memcg(struct mem_cgroup *memcg)
 	for_each_node(nid) {
 		struct lruvec *lruvec = get_lruvec(memcg, nid);
 
-		VM_WARN_ON_ONCE(lruvec->mm_state.nr_walkers);
 		VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
 					   sizeof(lruvec->lrugen.nr_pages)));