From patchwork Thu Nov 9 04:59:06 2023
X-Patchwork-Submitter: Byungchul Park
X-Patchwork-Id: 13450609
From: Byungchul Park
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: kernel_team@skhynix.com, akpm@linux-foundation.org, ying.huang@intel.com, namit@vmware.com, xhao@linux.alibaba.com, mgorman@techsingularity.net, hughd@google.com, willy@infradead.org, david@redhat.com, peterz@infradead.org, luto@kernel.org, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com
Subject: [v4 1/3] mm/rmap: Recognize read-only TLB entries during batched TLB flush
Date: Thu, 9 Nov 2023 13:59:06 +0900
Message-Id: <20231109045908.54996-2-byungchul@sk.com>
In-Reply-To: <20231109045908.54996-1-byungchul@sk.com>
References: <20231109045908.54996-1-byungchul@sk.com>

Functionally, no change. This is preparation for the migrc mechanism,
which needs to recognize read-only TLB entries and make use of them to
batch TLB flushes more aggressively.
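To illustrate the intended split, here is a rough user-space sketch, not
kernel code: the struct below is a simplified stand-in for struct
tlbflush_unmap_batch and its arch-specific cpumask, and the helpers only
mirror the spirit of set_tlb_ubc_flush_pending(), fold_ubc_ro(),
arch_tlbbatch_fold() and arch_tlbbatch_clear(). Pending flushes for
read-only PTEs are collected in a second batch and merged into the main
batch just before the actual flush:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for struct tlbflush_unmap_batch. */
struct batch {
	uint64_t cpumask;	/* one bit per CPU that still needs a flush */
	bool flush_required;
};

static struct batch tlb_ubc;	/* entries from writable mappings */
static struct batch tlb_ubc_ro;	/* entries from read-only mappings */

/* Roughly what set_tlb_ubc_flush_pending() does: pick a batch by writability. */
static void add_pending(int cpu, bool writable)
{
	struct batch *b = writable ? &tlb_ubc : &tlb_ubc_ro;

	b->cpumask |= UINT64_C(1) << cpu;
	b->flush_required = true;
}

/* Roughly what fold_ubc_ro() does: merge the read-only batch into the main one. */
static void fold_ubc_ro(void)
{
	if (!tlb_ubc_ro.flush_required)
		return;
	tlb_ubc.cpumask |= tlb_ubc_ro.cpumask;	/* like cpumask_or() */
	tlb_ubc.flush_required = true;
	tlb_ubc_ro.cpumask = 0;			/* like cpumask_clear() */
	tlb_ubc_ro.flush_required = false;
}

int main(void)
{
	add_pending(0, true);	/* writable PTE: flushed the normal way */
	add_pending(3, false);	/* read-only PTE: kept aside for now */

	fold_ubc_ro();		/* what try_to_unmap_flush() does first */
	printf("flush cpumask: %#llx\n", (unsigned long long)tlb_ubc.cpumask);
	return 0;
}

Keeping the read-only entries apart is what later allows a migration path
to consume them separately instead of flushing right away.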
Signed-off-by: Byungchul Park
---
 arch/x86/include/asm/tlbflush.h |  3 +++
 arch/x86/mm/tlb.c               | 11 +++++++++++
 include/linux/sched.h           |  1 +
 mm/internal.h                   |  4 ++++
 mm/rmap.c                       | 30 +++++++++++++++++++++++++++++-
 5 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 25726893c6f4..5c618a8821de 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -292,6 +292,9 @@ static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
 }
 
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+extern void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch);
+extern void arch_tlbbatch_fold(struct arch_tlbflush_unmap_batch *bdst,
+			       struct arch_tlbflush_unmap_batch *bsrc);
 
 static inline bool pte_flags_need_flush(unsigned long oldflags,
 					unsigned long newflags,
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 453ea95b667d..d3c89a3d91eb 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1274,6 +1274,17 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 	put_cpu();
 }
 
+void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch)
+{
+	cpumask_clear(&batch->cpumask);
+}
+
+void arch_tlbbatch_fold(struct arch_tlbflush_unmap_batch *bdst,
+			struct arch_tlbflush_unmap_batch *bsrc)
+{
+	cpumask_or(&bdst->cpumask, &bdst->cpumask, &bsrc->cpumask);
+}
+
 /*
  * Blindly accessing user memory from NMI context can be dangerous
  * if we're in the middle of switching the current user task or
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 77f01ac385f7..8a31527d9ed8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1324,6 +1324,7 @@ struct task_struct {
 #endif
 
 	struct tlbflush_unmap_batch	tlb_ubc;
+	struct tlbflush_unmap_batch	tlb_ubc_ro;
 
 	/* Cache last used pipe for splice(): */
 	struct pipe_inode_info		*splice_pipe;
diff --git a/mm/internal.h b/mm/internal.h
index 30cf724ddbce..9764b240e259 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -861,6 +861,7 @@ extern struct workqueue_struct *mm_percpu_wq;
 void try_to_unmap_flush(void);
 void try_to_unmap_flush_dirty(void);
 void flush_tlb_batched_pending(struct mm_struct *mm);
+void fold_ubc_ro(void);
 #else
 static inline void try_to_unmap_flush(void)
 {
@@ -871,6 +872,9 @@ static inline void try_to_unmap_flush_dirty(void)
 static inline void flush_tlb_batched_pending(struct mm_struct *mm)
 {
 }
+static inline void fold_ubc_ro(void)
+{
+}
 #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
 
 extern const struct trace_print_flags pageflag_names[];
diff --git a/mm/rmap.c b/mm/rmap.c
index 9f795b93cf40..c787ae94b4c6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -605,6 +605,28 @@ struct anon_vma *folio_lock_anon_vma_read(struct folio *folio,
 }
 
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+
+void fold_ubc_ro(void)
+{
+	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+	struct tlbflush_unmap_batch *tlb_ubc_ro = &current->tlb_ubc_ro;
+
+	if (!tlb_ubc_ro->flush_required)
+		return;
+
+	/*
+	 * Fold tlb_ubc_ro's data to tlb_ubc.
+	 */
+	arch_tlbbatch_fold(&tlb_ubc->arch, &tlb_ubc_ro->arch);
+	tlb_ubc->flush_required = true;
+
+	/*
+	 * Reset tlb_ubc_ro's data.
+	 */
+	arch_tlbbatch_clear(&tlb_ubc_ro->arch);
+	tlb_ubc_ro->flush_required = false;
+}
+
 /*
  * Flush TLB entries for recently unmapped pages from remote CPUs. It is
  * important if a PTE was dirty when it was unmapped that it's flushed
@@ -615,6 +637,7 @@ void try_to_unmap_flush(void)
 {
 	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
 
+	fold_ubc_ro();
 	if (!tlb_ubc->flush_required)
 		return;
 
@@ -645,13 +668,18 @@ void try_to_unmap_flush_dirty(void)
 static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
 				      unsigned long uaddr)
 {
-	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+	struct tlbflush_unmap_batch *tlb_ubc;
 	int batch;
 	bool writable = pte_dirty(pteval);
 
 	if (!pte_accessible(mm, pteval))
 		return;
 
+	if (pte_write(pteval) || writable)
+		tlb_ubc = &current->tlb_ubc;
+	else
+		tlb_ubc = &current->tlb_ubc_ro;
+
 	arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr);
 	tlb_ubc->flush_required = true;
 

From patchwork Thu Nov 9 04:59:07 2023
X-Patchwork-Submitter: Byungchul Park
X-Patchwork-Id: 13450611
From: Byungchul Park
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: kernel_team@skhynix.com, akpm@linux-foundation.org, ying.huang@intel.com, namit@vmware.com, xhao@linux.alibaba.com, mgorman@techsingularity.net, hughd@google.com, willy@infradead.org, david@redhat.com, peterz@infradead.org, luto@kernel.org, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com
Subject: [v4 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration
Date: Thu, 9 Nov 2023 13:59:07 +0900
Message-Id: <20231109045908.54996-3-byungchul@sk.com>
In-Reply-To: <20231109045908.54996-1-byungchul@sk.com>
References: <20231109045908.54996-1-byungchul@sk.com>

Implementation of the MIGRC mechanism, which stands for 'Migration Read
Copy'. While working with tiered memory, e.g. CXL memory, we always pay
the migration overhead at either promotion or demotion, and TLB shootdown
turned out to be a big part of it that is worth getting rid of if
possible.

Fortunately, the TLB flush can be deferred if both the source and
destination folios of a migration are kept until all the required TLB
flushes have been done, provided the target PTE entries are read-only
or, more precisely, do not have write permission. Otherwise, the folio
might get corrupted.

To achieve that:

   1. For folios that map only to non-writable TLB entries, skip the TLB
      flush at migration by keeping both the source and destination
      folios, which will be handled later at a better time.

   2. When any non-writable TLB entry changes to writable, e.g. through
      the fault handler, give up the migrc mechanism and perform the
      required TLB flush right away.

The measurement result:

   Architecture - x86_64
   QEMU - kvm enabled, host cpu
   Numa - 2 nodes (16 CPUs 1GB, no CPUs 8GB)
   Linux Kernel - v6.6-rc5, numa balancing tiering on, demotion enabled
   Benchmark - XSBench -p 50000000 (-p option makes the runtime longer)

   run 'perf stat' using events:
   1) itlb.itlb_flush
   2) tlb_flush.dtlb_thread
   3) tlb_flush.stlb_any
   4) dTLB-load-misses
   5) dTLB-store-misses
   6) iTLB-load-misses

   run 'cat /proc/vmstat' and pick:
   1) numa_pages_migrated
   2) pgmigrate_success
   3) nr_tlb_remote_flush
   4) nr_tlb_remote_flush_received
   5) nr_tlb_local_flush_all
   6) nr_tlb_local_flush_one

BEFORE - mainline v6.6-rc5
------------------------------------------
$ perf stat -a \
	-e itlb.itlb_flush \
	-e tlb_flush.dtlb_thread \
	-e tlb_flush.stlb_any \
	-e dTLB-load-misses \
	-e dTLB-store-misses \
	-e iTLB-load-misses \
	./XSBench -p 50000000

Performance counter stats for 'system wide':

	20953405	itlb.itlb_flush
	114886593	tlb_flush.dtlb_thread
	88267015	tlb_flush.stlb_any
	115304095543	dTLB-load-misses
	163904743	dTLB-store-misses
	608486259	iTLB-load-misses

	556.787113849 seconds time elapsed

$ cat /proc/vmstat
...
numa_pages_migrated 3378748
pgmigrate_success 7720310
nr_tlb_remote_flush 751464
nr_tlb_remote_flush_received 10742115
nr_tlb_local_flush_all 21899
nr_tlb_local_flush_one 740157
...

AFTER - mainline v6.6-rc5 + migrc
------------------------------------------
$ perf stat -a \
	-e itlb.itlb_flush \
	-e tlb_flush.dtlb_thread \
	-e tlb_flush.stlb_any \
	-e dTLB-load-misses \
	-e dTLB-store-misses \
	-e iTLB-load-misses \
	./XSBench -p 50000000

Performance counter stats for 'system wide':

	4353555		itlb.itlb_flush
	72482780	tlb_flush.dtlb_thread
	68226458	tlb_flush.stlb_any
	114331610808	dTLB-load-misses
	116084771	dTLB-store-misses
	377180518	iTLB-load-misses

	552.667718220 seconds time elapsed

$ cat /proc/vmstat
...
numa_pages_migrated 3339325
pgmigrate_success 7642363
nr_tlb_remote_flush 192913
nr_tlb_remote_flush_received 2327426
nr_tlb_local_flush_all 25759
nr_tlb_local_flush_one 740454
...
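Before the diff itself, a rough user-space model of the idea (not kernel
code: the folio and migrc_req structures and helper names below are
simplified stand-ins for what the patch implements, with no per-node
lists, page flags, reference counting or locking). Instead of flushing
at migration time, the source copy of a read-only-mapped folio is kept
alive and the CPUs that would have been flushed are remembered, so one
combined flush can be issued later, either lazily or as soon as one of
the kept folios is about to become writable:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct folio { int id; bool pending; };

/* One deferred-flush request: kept source folios + merged CPU mask. */
struct migrc_req {
	struct folio *kept[16];
	int nr;
	uint64_t cpumask;
};

static struct migrc_req pending;

/* Called at migration instead of flushing, only for read-only mappings. */
static void migrc_defer(struct folio *src, uint64_t cpus)
{
	src->pending = true;
	pending.kept[pending.nr++] = src;
	pending.cpumask |= cpus;		/* like arch_tlbbatch_fold() */
	printf("defer TLB flush for folio %d\n", src->id);
}

/* One combined flush, after which the kept source copies can be freed. */
static void migrc_flush_free(void)
{
	if (!pending.nr)
		return;
	printf("flush CPUs %#llx, free %d kept folios\n",
	       (unsigned long long)pending.cpumask, pending.nr);
	for (int i = 0; i < pending.nr; i++)
		pending.kept[i]->pending = false;
	pending.nr = 0;
	pending.cpumask = 0;
}

/* Write fault on a pending folio: give up deferring, flush right away. */
static void write_fault(struct folio *f)
{
	if (f->pending)
		migrc_flush_free();
}

int main(void)
{
	struct folio a = { .id = 1 }, b = { .id = 2 };

	migrc_defer(&a, 0x5);	/* read-only mapped on CPUs 0 and 2 */
	migrc_defer(&b, 0x2);	/* read-only mapped on CPU 1 */
	write_fault(&a);	/* forces the single combined flush */
	return 0;
}

The real patch additionally accounts the kept folios per zone
(migrc_pending_nr) so the page allocator can treat them as nearly free
memory and trigger the combined flush itself when it runs short.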
Signed-off-by: Byungchul Park
---
 include/linux/mm_types.h       |  21 ++++
 include/linux/mmzone.h         |   9 ++
 include/linux/page-flags.h     |   4 +
 include/linux/sched.h          |   6 +
 include/trace/events/mmflags.h |   3 +-
 mm/internal.h                  |  57 +++++++++
 mm/memory.c                    |  11 ++
 mm/migrate.c                   | 215 +++++++++++++++++++++++++++++++++
 mm/page_alloc.c                |  17 ++-
 mm/rmap.c                      |  11 +-
 10 files changed, 349 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 36c5b43999e6..202de9b09bd6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1372,4 +1372,25 @@ enum {
 	/* See also internal only FOLL flags in mm/internal.h */
 };
 
+struct migrc_req {
+	/*
+	 * folios pending for TLB flush
+	 */
+	struct list_head folios;
+
+	/*
+	 * for hanging to the associated numa node
+	 */
+	struct llist_node llnode;
+
+	/*
+	 * architecture specific data for batched TLB flush
+	 */
+	struct arch_tlbflush_unmap_batch arch;
+
+	/*
+	 * associated numa node
+	 */
+	int nid;
+};
 #endif /* _LINUX_MM_TYPES_H */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4106fbc5b4b3..b79ac8053c6a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -980,6 +980,11 @@ struct zone {
 	/* Zone statistics */
 	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
 	atomic_long_t		vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
+
+	/*
+	 * the number of folios pending for TLB flush in the zone
+	 */
+	atomic_t		migrc_pending_nr;
 } ____cacheline_internodealigned_in_smp;
 
 enum pgdat_flags {
@@ -1398,6 +1403,10 @@ typedef struct pglist_data {
 #ifdef CONFIG_MEMORY_FAILURE
 	struct memory_failure_stats mf_stats;
 #endif
+	/*
+	 * migrc requests including folios pending for TLB flush
+	 */
+	struct llist_head migrc_reqs;
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 5c02720c53a5..ec7c178bfb49 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -136,6 +136,7 @@ enum pageflags {
 	PG_arch_2,
 	PG_arch_3,
 #endif
+	PG_migrc,		/* Page is under migrc's control */
 	__NR_PAGEFLAGS,
 
 	PG_readahead = PG_reclaim,
@@ -589,6 +590,9 @@ TESTCLEARFLAG(Young, young, PF_ANY)
 PAGEFLAG(Idle, idle, PF_ANY)
 #endif
 
+TESTCLEARFLAG(Migrc, migrc, PF_ANY)
+__PAGEFLAG(Migrc, migrc, PF_ANY)
+
 /*
  * PageReported() is used to track reported free pages within the Buddy
  * allocator. We can use the non-atomic version of the test and set
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8a31527d9ed8..a2c16d21e365 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1326,6 +1326,12 @@ struct task_struct {
 	struct tlbflush_unmap_batch	tlb_ubc;
 	struct tlbflush_unmap_batch	tlb_ubc_ro;
 
+	/*
+	 * if all the mappings of a folio during unmap are RO so that
+	 * migrc can work on it
+	 */
+	bool				can_migrc;
+
 	/* Cache last used pipe for splice(): */
 	struct pipe_inode_info		*splice_pipe;
 
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 1478b9dd05fa..dafe302444d9 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -118,7 +118,8 @@
 	DEF_PAGEFLAG_NAME(mappedtodisk),				\
 	DEF_PAGEFLAG_NAME(reclaim),					\
 	DEF_PAGEFLAG_NAME(swapbacked),					\
-	DEF_PAGEFLAG_NAME(unevictable)					\
+	DEF_PAGEFLAG_NAME(unevictable),					\
+	DEF_PAGEFLAG_NAME(migrc)					\
 IF_HAVE_PG_MLOCK(mlocked)						\
 IF_HAVE_PG_UNCACHED(uncached)						\
 IF_HAVE_PG_HWPOISON(hwpoison)						\
diff --git a/mm/internal.h b/mm/internal.h
index 9764b240e259..a2b6f0321729 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1158,4 +1158,61 @@ struct vma_prepare {
 	struct vm_area_struct *remove;
 	struct vm_area_struct *remove2;
 };
+
+/*
+ * Initialize the page when allocated from buddy allocator.
+ */
+static inline void migrc_init_page(struct page *p)
+{
+	__ClearPageMigrc(p);
+}
+
+/*
+ * Check if the folio is pending for TLB flush and then clear the flag.
+ */
+static inline bool migrc_unpend_if_pending(struct folio *f)
+{
+	return folio_test_clear_migrc(f);
+}
+
+/*
+ * Reset the indicator indicating there are no writable mappings at the
+ * beginning of every rmap traverse for unmap. Migrc can work only when
+ * all the mappings are RO.
+ */
+static inline void can_migrc_init(void)
+{
+	current->can_migrc = true;
+}
+
+/*
+ * Mark the folio is not applicable to migrc once it found a writble or
+ * dirty pte during rmap traverse for unmap.
+ */
+static inline void can_migrc_fail(void)
+{
+	current->can_migrc = false;
+}
+
+/*
+ * Check if all the mappings are RO and RO mappings even exist.
+ */
+static inline bool can_migrc_test(void)
+{
+	return current->can_migrc && current->tlb_ubc_ro.flush_required;
+}
+
+/*
+ * Return the number of folios pending TLB flush that have yet to get
+ * freed in the zone.
+ */
+static inline int migrc_pending_nr_in_zone(struct zone *z)
+{
+	return atomic_read(&z->migrc_pending_nr);
+}
+
+/*
+ * Perform TLB flush needed and free the folios in the node.
+ */
+bool migrc_flush_free_folios(nodemask_t *nodes);
 #endif /* __MM_INTERNAL_H */
diff --git a/mm/memory.c b/mm/memory.c
index 6c264d2f969c..5287ea1639cc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3359,6 +3359,17 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 	if (vmf->page)
 		folio = page_folio(vmf->page);
 
+	/*
+	 * This folio has its read copy to prevent inconsistency while
+	 * deferring TLB flushes. However, the problem might arise if
+	 * it's going to become writable.
+	 *
+	 * To prevent it, give up the deferring TLB flushes and perform
+	 * TLB flush right away.
+	 */
+	if (folio && migrc_unpend_if_pending(folio))
+		migrc_flush_free_folios(NULL);
+
 	/*
 	 * Shared mapping: we are guaranteed to have VM_WRITE and
 	 * FAULT_FLAG_WRITE set at this point.
diff --git a/mm/migrate.c b/mm/migrate.c
index 2053b54556ca..9ab7794b0390 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -57,6 +57,162 @@
 
 #include "internal.h"
 
+/*
+ * Marks the folios as pending for TLB flush.
+ */
+static void migrc_mark_pending(struct folio *fsrc, struct folio *fdst)
+{
+	folio_get(fsrc);
+	__folio_set_migrc(fsrc);
+	__folio_set_migrc(fdst);
+}
+
+static bool migrc_under_processing(struct folio *fsrc, struct folio *fdst)
+{
+	/*
+	 * case1. folio_test_migrc(fsrc) && !folio_test_migrc(fdst):
+	 *
+	 * fsrc was already under migrc's control even before the
+	 * current migration. Migrc doesn't work with it this time.
+	 *
+	 * case2. !folio_test_migrc(fsrc) && !folio_test_migrc(fdst):
+	 *
+	 * This is the normal case that is not migrc's interest.
+	 *
+	 * case3. folio_test_migrc(fsrc) && folio_test_migrc(fdst):
+	 *
+	 * Only the case that migrc works on.
+	 */
+	return folio_test_migrc(fsrc) && folio_test_migrc(fdst);
+}
+
+static void migrc_undo_folios(struct folio *fsrc, struct folio *fdst)
+{
+	/*
+	 * TLB flushes needed are already done at this moment so the
+	 * flag doesn't have to be kept.
+	 */
+	__folio_clear_migrc(fsrc);
+	__folio_clear_migrc(fdst);
+	folio_put(fsrc);
+}
+
+static void migrc_expand_req(struct folio *fsrc, struct folio *fdst,
+			     struct migrc_req *req)
+{
+	if (req->nid == -1)
+		req->nid = folio_nid(fsrc);
+
+	/*
+	 * All the nids in a req should be the same.
+	 */
+	WARN_ON(req->nid != folio_nid(fsrc));
+
+	list_add(&fsrc->lru, &req->folios);
+	atomic_inc(&folio_zone(fsrc)->migrc_pending_nr);
+}
+
+/*
+ * Prepares for gathering folios pending for TLB flushes, try to
+ * allocate objects needed, initialize them and make them ready.
+ */
+static struct migrc_req *migrc_req_start(void)
+{
+	struct migrc_req *req;
+
+	req = kmalloc(sizeof(struct migrc_req), GFP_KERNEL);
+	if (!req)
+		return NULL;
+
+	arch_tlbbatch_clear(&req->arch);
+	INIT_LIST_HEAD(&req->folios);
+	req->nid = -1;
+
+	return req;
+}
+
+/*
+ * Hang the request with the collected folios to the corresponding node.
+ */
+static void migrc_req_end(struct migrc_req *req)
+{
+	if (!req)
+		return;
+
+	if (list_empty(&req->folios)) {
+		kfree(req);
+		return;
+	}
+
+	llist_add(&req->llnode, &NODE_DATA(req->nid)->migrc_reqs);
+}
+
+/*
+ * Gather folios and architecture specific data to handle.
+ */
+static void migrc_gather(struct list_head *folios,
+			 struct arch_tlbflush_unmap_batch *arch,
+			 struct llist_head *reqs)
+{
+	struct llist_node *nodes;
+	struct migrc_req *req;
+	struct migrc_req *req2;
+
+	nodes = llist_del_all(reqs);
+	if (!nodes)
+		return;
+
+	llist_for_each_entry_safe(req, req2, nodes, llnode) {
+		arch_tlbbatch_fold(arch, &req->arch);
+		list_splice(&req->folios, folios);
+		kfree(req);
+	}
+}
+
+bool migrc_flush_free_folios(nodemask_t *nodes)
+{
+	struct folio *f, *f2;
+	int nid;
+	struct arch_tlbflush_unmap_batch arch;
+	LIST_HEAD(folios);
+
+	if (!nodes)
+		nodes = &node_possible_map;
+	arch_tlbbatch_clear(&arch);
+
+	for_each_node_mask(nid, *nodes)
+		migrc_gather(&folios, &arch, &NODE_DATA(nid)->migrc_reqs);
+
+	if (list_empty(&folios))
+		return false;
+
+	arch_tlbbatch_flush(&arch);
+	list_for_each_entry_safe(f, f2, &folios, lru) {
+		atomic_dec(&folio_zone(f)->migrc_pending_nr);
+		folio_put(f);
+	}
+	return true;
+}
+
+static void fold_ubc_ro_to_migrc(struct migrc_req *req)
+{
+	struct tlbflush_unmap_batch *tlb_ubc_ro = &current->tlb_ubc_ro;
+
+	if (!tlb_ubc_ro->flush_required)
+		return;
+
+	/*
+	 * Fold tlb_ubc_ro's data to the request.
+	 */
+	arch_tlbbatch_fold(&req->arch, &tlb_ubc_ro->arch);
+
+	/*
+	 * Reset tlb_ubc_ro's data.
+	 */
+	arch_tlbbatch_clear(&tlb_ubc_ro->arch);
+	tlb_ubc_ro->flush_required = false;
+}
+
 bool isolate_movable_page(struct page *page, isolate_mode_t mode)
 {
 	struct folio *folio = folio_get_nontail_page(page);
@@ -379,6 +535,7 @@ static int folio_expected_refs(struct address_space *mapping,
 		struct folio *folio)
 {
 	int refs = 1;
+
 	if (!mapping)
 		return refs;
 
@@ -406,6 +563,12 @@ int folio_migrate_mapping(struct address_space *mapping,
 	int expected_count = folio_expected_refs(mapping, folio) + extra_count;
 	long nr = folio_nr_pages(folio);
 
+	/*
+	 * Migrc mechanism increased the reference count.
+	 */
+	if (migrc_under_processing(folio, newfolio))
+		expected_count++;
+
 	if (!mapping) {
 		/* Anonymous page without mapping */
 		if (folio_ref_count(folio) != expected_count)
@@ -1620,16 +1783,25 @@ static int migrate_pages_batch(struct list_head *from,
 	LIST_HEAD(unmap_folios);
 	LIST_HEAD(dst_folios);
 	bool nosplit = (reason == MR_NUMA_MISPLACED);
+	struct migrc_req *mreq = NULL;
 
 	VM_WARN_ON_ONCE(mode != MIGRATE_ASYNC &&
 			!list_empty(from) && !list_is_singular(from));
 
+	/*
+	 * Apply migrc only to numa migration for now.
+	 */
+	if (reason == MR_DEMOTION || reason == MR_NUMA_MISPLACED)
+		mreq = migrc_req_start();
+
 	for (pass = 0; pass < nr_pass && retry; pass++) {
 		retry = 0;
 		thp_retry = 0;
 		nr_retry_pages = 0;
 
 		list_for_each_entry_safe(folio, folio2, from, lru) {
+			bool can_migrc;
+
 			is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
 			nr_pages = folio_nr_pages(folio);
 
@@ -1657,9 +1829,21 @@ static int migrate_pages_batch(struct list_head *from,
 				continue;
 			}
 
+			can_migrc_init();
 			rc = migrate_folio_unmap(get_new_folio, put_new_folio,
 					private, folio, &dst, mode, reason,
 					ret_folios);
+			/*
+			 * can_migrc is true only if:
+			 *
+			 * 1. struct migrc_req has been allocated &&
+			 * 2. There's no writable mapping at all &&
+			 * 3. There's read-only mapping found &&
+			 * 4. Not under migrc's control already
+			 */
+			can_migrc = mreq && can_migrc_test() &&
+				    !folio_test_migrc(folio);
+
 			/*
 			 * The rules are:
 			 *	Success: folio will be freed
@@ -1720,6 +1904,19 @@ static int migrate_pages_batch(struct list_head *from,
 			case MIGRATEPAGE_UNMAP:
 				list_move_tail(&folio->lru, &unmap_folios);
 				list_add_tail(&dst->lru, &dst_folios);
+
+				if (can_migrc) {
+					/*
+					 * To use ->lru exclusively, just
+					 * mark the page flag for now.
+					 *
+					 * The folio will be queued to
+					 * the current migrc request on
+					 * move success below.
+					 */
+					migrc_mark_pending(folio, dst);
+					fold_ubc_ro_to_migrc(mreq);
+				}
 				break;
 			default:
 				/*
@@ -1733,6 +1930,11 @@ static int migrate_pages_batch(struct list_head *from,
 				stats->nr_failed_pages += nr_pages;
 				break;
 			}
+			/*
+			 * Done with the current folio. Fold the ro
+			 * batch data gathered, to the normal batch.
+			 */
+			fold_ubc_ro();
 		}
 	}
 	nr_failed += retry;
@@ -1774,6 +1976,14 @@ static int migrate_pages_batch(struct list_head *from,
 			case MIGRATEPAGE_SUCCESS:
 				stats->nr_succeeded += nr_pages;
 				stats->nr_thp_succeeded += is_thp;
+
+				/*
+				 * Now that it's safe to use ->lru,
+				 * queue the folio to the current migrc
+				 * request.
+				 */
+				if (migrc_under_processing(folio, dst))
+					migrc_expand_req(folio, dst, mreq);
 				break;
 			default:
 				nr_failed++;
@@ -1791,6 +2001,8 @@ static int migrate_pages_batch(struct list_head *from,
 	rc = rc_saved ? : nr_failed;
 out:
+	migrc_req_end(mreq);
+
 	/* Cleanup remaining folios */
 	dst = list_first_entry(&dst_folios, struct folio, lru);
 	dst2 = list_next_entry(dst, lru);
@@ -1798,6 +2010,9 @@ static int migrate_pages_batch(struct list_head *from,
 		int page_was_mapped = 0;
 		struct anon_vma *anon_vma = NULL;
 
+		if (migrc_under_processing(folio, dst))
+			migrc_undo_folios(folio, dst);
+
 		__migrate_folio_extract(dst, &page_was_mapped, &anon_vma);
 		migrate_folio_undo_src(folio, page_was_mapped, anon_vma,
 				       true, ret_folios);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 95546f376302..914e93ab598e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1535,6 +1535,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 
 	set_page_owner(page, order, gfp_flags);
 	page_table_check_alloc(page, order);
+
+	for (i = 0; i != 1 << order; ++i)
+		migrc_init_page(page + i);
 }
 
 static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
@@ -2839,6 +2842,8 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 	long min = mark;
 	int o;
 
+	free_pages += migrc_pending_nr_in_zone(z);
+
 	/* free_pages may go negative - that's OK */
 	free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags);
 
@@ -2933,7 +2938,7 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
 	long usable_free;
 	long reserved;
 
-	usable_free = free_pages;
+	usable_free = free_pages + migrc_pending_nr_in_zone(z);
 	reserved = __zone_watermark_unusable_free(z, 0, alloc_flags);
 
 	/* reserved may over estimate high-atomic reserves. */
@@ -3121,6 +3126,16 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 				       gfp_mask)) {
 			int ret;
 
+			/*
+			 * Free the pending folios so that the remaining
+			 * code can use the updated vmstats and check
+			 * zone_watermark_fast() again.
+			 */
+			migrc_flush_free_folios(ac->nodemask);
+			if (zone_watermark_fast(zone, order, mark,
+					ac->highest_zoneidx, alloc_flags, gfp_mask))
+				goto try_this_zone;
+
 			if (has_unaccepted_memory()) {
 				if (try_to_accept_memory(zone, order))
 					goto try_this_zone;
diff --git a/mm/rmap.c b/mm/rmap.c
index c787ae94b4c6..8786c14e08c9 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -605,7 +605,6 @@ struct anon_vma *folio_lock_anon_vma_read(struct folio *folio,
 }
 
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
-
 void fold_ubc_ro(void)
 {
 	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
@@ -675,9 +674,15 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
 	if (!pte_accessible(mm, pteval))
 		return;
 
-	if (pte_write(pteval) || writable)
+	if (pte_write(pteval) || writable) {
 		tlb_ubc = &current->tlb_ubc;
-	else
+
+		/*
+		 * migrc cannot work with the folio once it found a
+		 * writable or dirty mapping of it.
+		 */
+		can_migrc_fail();
+	} else
 		tlb_ubc = &current->tlb_ubc_ro;
 
 	arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr);

From patchwork Thu Nov 9 04:59:08 2023
X-Patchwork-Submitter: Byungchul Park
X-Patchwork-Id: 13450610
From: Byungchul Park
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: kernel_team@skhynix.com, akpm@linux-foundation.org, ying.huang@intel.com, namit@vmware.com, xhao@linux.alibaba.com, mgorman@techsingularity.net, hughd@google.com, willy@infradead.org, david@redhat.com, peterz@infradead.org, luto@kernel.org, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com
Subject: [v4 3/3] mm: Pause migrc mechanism at high memory pressure
Date: Thu, 9 Nov 2023 13:59:08 +0900
Message-Id: <20231109045908.54996-4-byungchul@sk.com>
In-Reply-To: <20231109045908.54996-1-byungchul@sk.com>
References: <20231109045908.54996-1-byungchul@sk.com>

A regression was observed when the system is under high memory pressure
with swap on: migrc keeps expanding its pending queue while the page
allocator keeps flushing the queue and freeing folios at the same time,
which is pointless. So temporarily prevent migrc from expanding its
pending queue in that condition.
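The pause logic can be summarized with a small user-space sketch (the
migrc_pause()/migrc_resume()/migrc_paused() names follow the patch below;
everything else here is a simplified stand-in, not kernel code): the
allocator slow path bumps a counter while it is handling memory pressure,
and the migration path falls back to the normal immediate flush whenever
the counter is non-zero:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int migrc_pause_cnt;

static void migrc_pause(void)  { atomic_fetch_add(&migrc_pause_cnt, 1); }
static void migrc_resume(void) { atomic_fetch_sub(&migrc_pause_cnt, 1); }
static bool migrc_paused(void) { return atomic_load(&migrc_pause_cnt) != 0; }

/* Stand-in for migrate_pages_batch() deciding whether to use migrc. */
static void migrate_batch(void)
{
	if (migrc_paused())
		printf("memory pressure: flush immediately, no deferral\n");
	else
		printf("start a migrc request and defer the TLB flush\n");
}

/* Stand-in for __alloc_pages_slowpath() under memory pressure. */
static void alloc_slowpath(void)
{
	migrc_pause();		/* stop the pending queue from growing */
	migrate_batch();	/* a concurrent migration falls back */
	migrc_resume();
}

int main(void)
{
	migrate_batch();	/* normal case: deferral allowed */
	alloc_slowpath();	/* pressure case: deferral paused */
	return 0;
}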
Signed-off-by: Byungchul Park
---
 mm/internal.h   | 17 ++++++++++++++++
 mm/migrate.c    | 53 ++++++++++++++++++++++++++++++++++++++++++++++++-
 mm/page_alloc.c | 13 ++++++++++++
 3 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/mm/internal.h b/mm/internal.h
index a2b6f0321729..971f2dded4a6 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1159,6 +1159,8 @@ struct vma_prepare {
 	struct vm_area_struct *remove2;
 };
 
+extern atomic_t migrc_pause_cnt;
+
 /*
  * Initialize the page when allocated from buddy allocator.
  */
@@ -1202,6 +1204,21 @@ static inline bool can_migrc_test(void)
 	return current->can_migrc && current->tlb_ubc_ro.flush_required;
 }
 
+static inline void migrc_pause(void)
+{
+	atomic_inc(&migrc_pause_cnt);
+}
+
+static inline void migrc_resume(void)
+{
+	atomic_dec(&migrc_pause_cnt);
+}
+
+static inline bool migrc_paused(void)
+{
+	return !!atomic_read(&migrc_pause_cnt);
+}
+
 /*
  * Return the number of folios pending TLB flush that have yet to get
  * freed in the zone.
diff --git a/mm/migrate.c b/mm/migrate.c
index 9ab7794b0390..bde4f49d0144 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -100,6 +100,16 @@ static void migrc_undo_folios(struct folio *fsrc, struct folio *fdst)
 static void migrc_expand_req(struct folio *fsrc, struct folio *fdst,
 			     struct migrc_req *req)
 {
+	/*
+	 * If migrc has been paused in the middle of unmap because of
+	 * high memory pressure, then the folios that have already been
+	 * marked as pending should get back.
+	 */
+	if (!req) {
+		migrc_undo_folios(fsrc, fdst);
+		return;
+	}
+
 	if (req->nid == -1)
 		req->nid = folio_nid(fsrc);
 
@@ -147,6 +157,12 @@ static void migrc_req_end(struct migrc_req *req)
 	llist_add(&req->llnode, &NODE_DATA(req->nid)->migrc_reqs);
 }
 
+/*
+ * Increase on entry of handling high memory pressure e.g. direct
+ * reclaim, decrease on the exit. See __alloc_pages_slowpath().
+ */
+atomic_t migrc_pause_cnt = ATOMIC_INIT(0);
+
 /*
  * Gather folios and architecture specific data to handle.
  */
@@ -213,6 +229,31 @@ static void fold_ubc_ro_to_migrc(struct migrc_req *req)
 	tlb_ubc_ro->flush_required = false;
 }
 
+static void fold_migrc_to_ubc(struct migrc_req *req)
+{
+	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+
+	if (!req)
+		return;
+
+	/*
+	 * Fold the req's data to tlb_ubc.
+	 */
+	arch_tlbbatch_fold(&tlb_ubc->arch, &req->arch);
+
+	/*
+	 * Reset the req's data.
+	 */
+	arch_tlbbatch_clear(&req->arch);
+
+	/*
+	 * req->arch might be empty. However, conservatively set
+	 * ->flush_required to true so that try_to_unmap_flush() can
+	 * check it anyway.
+	 */
+	tlb_ubc->flush_required = true;
+}
+
 bool isolate_movable_page(struct page *page, isolate_mode_t mode)
 {
 	struct folio *folio = folio_get_nontail_page(page);
@@ -1791,7 +1832,7 @@ static int migrate_pages_batch(struct list_head *from,
 	/*
 	 * Apply migrc only to numa migration for now.
 	 */
-	if (reason == MR_DEMOTION || reason == MR_NUMA_MISPLACED)
+	if (!migrc_paused() && (reason == MR_DEMOTION || reason == MR_NUMA_MISPLACED))
 		mreq = migrc_req_start();
 
 	for (pass = 0; pass < nr_pass && retry; pass++) {
@@ -1829,6 +1870,16 @@ static int migrate_pages_batch(struct list_head *from,
 				continue;
 			}
 
+			/*
+			 * In case that the system is in high memory
+			 * pressure, give up migrc mechanism this turn.
+			 */
+			if (unlikely(mreq && migrc_paused())) {
+				fold_migrc_to_ubc(mreq);
+				migrc_req_end(mreq);
+				mreq = NULL;
+			}
+
 			can_migrc_init();
 			rc = migrate_folio_unmap(get_new_folio, put_new_folio,
 					private, folio, &dst, mode, reason,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 914e93ab598e..c920ad48f741 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3926,6 +3926,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	unsigned int cpuset_mems_cookie;
 	unsigned int zonelist_iter_cookie;
 	int reserve_flags;
+	bool migrc_paused = false;
 
 restart:
 	compaction_retries = 0;
@@ -4057,6 +4058,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (page)
 		goto got_pg;
 
+	/*
+	 * The system is in very high memory pressure. Pause migrc from
+	 * expanding its pending queue temporarily.
+	 */
+	if (!migrc_paused) {
+		migrc_pause();
+		migrc_paused = true;
+		migrc_flush_free_folios(NULL);
+	}
+
 	/* Caller is not willing to reclaim, we can't balance anything */
 	if (!can_direct_reclaim)
 		goto nopage;
@@ -4184,6 +4195,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	warn_alloc(gfp_mask, ac->nodemask,
 			"page allocation failure: order:%u", order);
 got_pg:
+	if (migrc_paused)
+		migrc_resume();
 	return page;
 }