From patchwork Wed Jul 12 04:38:35 2023
X-Patchwork-Submitter: Hugh Dickins
X-Patchwork-Id: 13309564
Date: Tue, 11 Jul 2023 21:38:35 -0700 (PDT)
From: Hugh Dickins
To: Andrew Morton
cc: Mike Kravetz, Mike Rapoport, "Kirill A. Shutemov", Matthew Wilcox,
    David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
    Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
    Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
    SeongJae Park, Lorenzo Stoakes, Huang Ying, Naoya Horiguchi,
    Christophe Leroy, Zack Rusin, Jason Gunthorpe, Axel Rasmussen,
    Anshuman Khandual, Pasha Tatashin, Miaohe Lin, Minchan Kim,
    Christoph Hellwig, Song Liu, Thomas Hellstrom, Russell King,
    "David S. Miller",
Miller" , Michael Ellerman , "Aneesh Kumar K.V" , Heiko Carstens , Christian Borntraeger , Claudio Imbrenda , Alexander Gordeev , Gerald Schaefer , Vasily Gorbik , Jann Horn , Vishal Moola , Vlastimil Babka , Zi Yan , linux-arm-kernel@lists.infradead.org, sparclinux@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v3 07/13] s390: add pte_free_defer() for pgtables sharing page In-Reply-To: <7cd843a9-aa80-14f-5eb2-33427363c20@google.com> Message-ID: <94eccf5f-264c-8abe-4567-e77f4b4e14a@google.com> References: <7cd843a9-aa80-14f-5eb2-33427363c20@google.com> MIME-Version: 1.0 X-Rspamd-Queue-Id: 4A78F12000A X-Rspam-User: X-Rspamd-Server: rspam11 X-Stat-Signature: p8mskr7ymzh7jjnah4h1mm3zcepas1ah X-HE-Tag: 1689136722-495086 X-HE-Meta: U2FsdGVkX18XX86iOlPpOb2eKMUUGYJdGqY1nKps7ShQxdhVQZU9PYctQT6tqd3NUUQ3gKLySNOIyrqe7xVXl+VyKG/SwqXFWuGJVGEp2h6K1btA5n1CA3jt8nZ2EwUoEgP3u50Reo43imAMxBapAxEVZ66guas8Uii0bL7y6ldgnwFkSipQy0hccEEFf8jk7m+EmilRaTc7I35lZqTY5HucZx50oRllUz+SRR5UVu2eNrA4Sp1sVpEp45SrC2uodP2H3URjfiNLGetE8mHZ+13gBOoXpIGTL7FydN1kzQZkdlmz4hPGwaoRymuzawZsKXvL6hwW6Z0yQlVBnIi3/GOSnjKXBqJEZaDX2UtwavVMXAE9XW24lFdNT2yRSKFL67soIiFqUtSCZgCQFfUyfc8dTdRG8HauW3HkjmhSJXCFAwDGinWUoUk+MBHo2v2llChTnyxQxzyFc3/6u9yYAm+si4AdA280IeWuJ94puQpKqNVmc+NQpo8RydKkSEWFNyYcBWN3JJyRhKW2znx7UpbqxnkCh67sds9zA7+nej52BIEKkbDSlW3u/ZxDoHCuEE9mj2a3J2/RF+jCslX38f3zlY5mqIYtLsDZIZLBCZ/2hVHeXtkwMAey1vc6RlsC12RmcAoosp055itcADCo5Ds5tHosFcpHW2N9NGZbOL1Ara2c12MNAv6RpkjoNfZtBuADDUK3v+ImOMcgMU1sbvyAq37o9knnuwbb2PUe3ddKZz4RcqWSCnya9cTyCO2Hw+TTKJFQkvvh0b11EM5DDLp0Ldy+7X5Z9S6927jtagpZgrMWw5Ihq7DAk7rrTnYYVT5WeTo1TzxoVDcc8A3IvXFMyhyBqmSyATGsz6WCzmERDKhRbdRdJRLb1XVZQkvUldOj0tctNh8YfwF4V/6caVwKsYg2HUFs8k16oXRwVbvwON2m9pnMXktM/mnQS+fX1/ELMdfIs9a+eNicp3w qzwGy2DY NS5lM3vTuRnJY1aSSwwLzVTWuKvTmuTxk++i+r5gtiviOxyUUzT1gDYV7GNKWxf0s+/d72jBkgEd8R33O6Ex8ES41/AYxqWT4UobyU4vA8OYrm08EhdIrQOlame3HzIYnr4B63gbOyTpbbjdza7RpXJ1fF1wTBqsipe9ZZYOkpsf0O1SUa+KSJbGbQ5PEXvy84hABEllyFGO9oR3Z/q+hbYch4bDm5YDCo9kv4IFsTafYcKNbn1zUhLtqRx4hMI8UvbonbIpuomHJt2pEjEmINAR9szLKc7bxktjYccq9G0HD4/phKwfWi2PUAAAmThfMA1xMPtUclsdDd9I06f3zF/H6u0Rf2WtkPnkjgqqPCzWfQ1R99rj7i8XdLSE6OXLj19HS2L6VqsmMm1RLcssL8JqAZmz5/QLtpYCVGsbGQ2hAPTfBbO5fasqOTrXfWhnHy+eZ5VFTEZza1PF+P7aMEEmXjruAR3YI0yRd6Tuhd20+9AQ8MX8Fg8sdCVVvCdDihEUtl/ejZkb8XLxYDeSKZLfKDQ== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Add s390-specific pte_free_defer(), to free table page via call_rcu(). pte_free_defer() will be called inside khugepaged's retract_page_tables() loop, where allocating extra memory cannot be relied upon. This precedes the generic version to avoid build breakage from incompatible pgtable_t. This version is more complicated than others: because s390 fits two 2K page tables into one 4K page (so page->rcu_head must be shared between both halves), and already uses page->lru (which page->rcu_head overlays) to list any free halves; with clever management by page->_refcount bits. Build upon the existing management, adjusted to follow a new rule: that a page is never on the free list if pte_free_defer() was used on either half (marked by PageActive). And for simplicity, delay calling RCU until both halves are freed. Not adding back unallocated fragments to the list in pte_free_defer() can result in wasting some amount of memory for pagetables, depending on how long the allocated fragment will stay in use. 
Signed-off-by: Hugh Dickins
Reviewed-by: Gerald Schaefer
Tested-by: Alexander Gordeev
Acked-by: Alexander Gordeev
---
 arch/s390/include/asm/pgalloc.h |  4 ++
 arch/s390/mm/pgalloc.c          | 80 +++++++++++++++++++++++++++++------
 2 files changed, 72 insertions(+), 12 deletions(-)

diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
index 17eb618f1348..89a9d5ef94f8 100644
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -143,6 +143,10 @@ static inline void pmd_populate(struct mm_struct *mm,
 #define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
 #define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
 
+/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 void vmem_map_init(void);
 void *vmem_crst_alloc(unsigned long val);
 pte_t *vmem_pte_alloc(void);
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 66ab68db9842..760b4ace475e 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
  * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
  * while the PP bits are never used, nor such a page is added to or removed
  * from mm_context_t::pgtable_list.
+ *
+ * pte_free_defer() overrides those rules: it takes the page off pgtable_list,
+ * and prevents both 2K fragments from being reused. pte_free_defer() has to
+ * guarantee that its pgtable cannot be reused before the RCU grace period
+ * has elapsed (which page_table_free_rcu() does not actually guarantee).
+ * But for simplicity, because page->rcu_head overlays page->lru, and because
+ * the RCU callback might not be called before the mm_context_t has been freed,
+ * pte_free_defer() in this implementation prevents both fragments from being
+ * reused, and delays making the call to RCU until both fragments are freed.
  */
 unsigned long *page_table_alloc(struct mm_struct *mm)
 {
@@ -261,7 +270,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
 					table += PTRS_PER_PTE;
 				atomic_xor_bits(&page->_refcount,
 							0x01U << (bit + 24));
-				list_del(&page->lru);
+				list_del_init(&page->lru);
 			}
 		}
 		spin_unlock_bh(&mm->context.lock);
@@ -281,6 +290,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
 	table = (unsigned long *) page_to_virt(page);
 	if (mm_alloc_pgste(mm)) {
 		/* Return 4K page table with PGSTEs */
+		INIT_LIST_HEAD(&page->lru);
 		atomic_xor_bits(&page->_refcount, 0x03U << 24);
 		memset64((u64 *)table, _PAGE_INVALID, PTRS_PER_PTE);
 		memset64((u64 *)table + PTRS_PER_PTE, 0, PTRS_PER_PTE);
@@ -300,7 +310,9 @@ static void page_table_release_check(struct page *page, void *table,
 {
 	char msg[128];
 
-	if (!IS_ENABLED(CONFIG_DEBUG_VM) || !mask)
+	if (!IS_ENABLED(CONFIG_DEBUG_VM))
+		return;
+	if (!mask && list_empty(&page->lru))
 		return;
 	snprintf(msg, sizeof(msg),
 		 "Invalid pgtable %p release half 0x%02x mask 0x%02x",
@@ -308,6 +320,15 @@ static void page_table_release_check(struct page *page, void *table,
 	dump_page(page, msg);
 }
 
+static void pte_free_now(struct rcu_head *head)
+{
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	pgtable_pte_page_dtor(page);
+	__free_page(page);
+}
+
 void page_table_free(struct mm_struct *mm, unsigned long *table)
 {
 	unsigned int mask, bit, half;
@@ -325,10 +346,17 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
 		 */
 		mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
 		mask >>= 24;
-		if (mask & 0x03U)
+		if ((mask & 0x03U) && !PageActive(page)) {
+			/*
+			 * Other half is allocated, and neither half has had
+			 * its free deferred: add page to head of list, to make
+			 * this freed half available for immediate reuse.
+			 */
 			list_add(&page->lru, &mm->context.pgtable_list);
-		else
-			list_del(&page->lru);
+		} else {
+			/* If page is on list, now remove it. */
+			list_del_init(&page->lru);
+		}
 		spin_unlock_bh(&mm->context.lock);
 		mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
 		mask >>= 24;
@@ -342,8 +370,10 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
 	}
 
 	page_table_release_check(page, table, half, mask);
-	pgtable_pte_page_dtor(page);
-	__free_page(page);
+	if (TestClearPageActive(page))
+		call_rcu(&page->rcu_head, pte_free_now);
+	else
+		pte_free_now(&page->rcu_head);
 }
 
 void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
@@ -370,10 +400,18 @@ void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
 	 */
 	mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
 	mask >>= 24;
-	if (mask & 0x03U)
+	if ((mask & 0x03U) && !PageActive(page)) {
+		/*
+		 * Other half is allocated, and neither half has had
+		 * its free deferred: add page to end of list, to make
+		 * this freed half available for reuse once its pending
+		 * bit has been cleared by __tlb_remove_table().
+		 */
 		list_add_tail(&page->lru, &mm->context.pgtable_list);
-	else
-		list_del(&page->lru);
+	} else {
+		/* If page is on list, now remove it. */
+		list_del_init(&page->lru);
+	}
 	spin_unlock_bh(&mm->context.lock);
 	table = (unsigned long *) ((unsigned long) table | (0x01U << bit));
 	tlb_remove_table(tlb, table);
@@ -403,10 +441,28 @@ void __tlb_remove_table(void *_table)
 	}
 
 	page_table_release_check(page, table, half, mask);
-	pgtable_pte_page_dtor(page);
-	__free_page(page);
+	if (TestClearPageActive(page))
+		call_rcu(&page->rcu_head, pte_free_now);
+	else
+		pte_free_now(&page->rcu_head);
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	struct page *page;
+
+	page = virt_to_page(pgtable);
+	SetPageActive(page);
+	page_table_free(mm, (unsigned long *)pgtable);
+	/*
+	 * page_table_free() does not do the pgste gmap_unlink() which
+	 * page_table_free_rcu() does: warn us if pgste ever reaches here.
+	 */
+	WARN_ON_ONCE(mm_alloc_pgste(mm));
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 /*
  * Base infrastructure required to generate basic asces, region, segment,
  * and page tables that do not make use of enhanced features like EDAT1.
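The list rule the patch enforces can be condensed into a toy model.  The
userspace C program below is purely illustrative, not kernel code: the
struct fields stand in for the _refcount AA bits, PageActive, and the
page's linkage on mm->context.pgtable_list.  Once either 2K half has gone
through pte_free_defer(), the page stays off the reuse list, and the
final free of the 4K page is routed through call_rcu().

#include <stdbool.h>
#include <stdio.h>

/* Toy model of one 4K page holding two 2K pgtable fragments */
struct pt_page {
	unsigned int alloc_mask;	/* bits 0/1 ~ the AA _refcount bits */
	bool deferred;			/* ~ PageActive: a half was deferred */
	bool on_free_list;		/* ~ on mm->context.pgtable_list */
};

static void free_half(struct pt_page *p, unsigned int half, bool defer)
{
	if (defer)
		p->deferred = true;	/* pte_free_defer() sets PageActive */
	p->alloc_mask &= ~(1U << half);
	/* New rule: never keep a deferred page on the reuse list */
	p->on_free_list = p->alloc_mask && !p->deferred;
	if (!p->alloc_mask)		/* both halves now free */
		printf("free page via %s\n",
		       p->deferred ? "call_rcu(pte_free_now)" : "pte_free_now()");
}

int main(void)
{
	struct pt_page p = { .alloc_mask = 0x3 };

	free_half(&p, 0, true);		/* deferred: page leaves the free list */
	free_half(&p, 1, false);	/* last half: freed after RCU grace period */
	return 0;
}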