From patchwork Mon Nov 7 15:39:22 2022
X-Patchwork-Submitter: Joao Martins
X-Patchwork-Id: 13034572
From: Joao Martins
To: linux-mm@kvack.org
Cc: Muchun Song, Mike Kravetz, Andrew Morton, Joao Martins
Subject: [PATCH v2] mm/hugetlb_vmemmap: remap head page to newly allocated page
Date: Mon, 7 Nov 2022 15:39:22 +0000
Message-Id: <20221107153922.77094-1-joao.m.martins@oracle.com>
Today, with `hugetlb_free_vmemmap=on`, the struct page memory that is freed back to the page allocator looks as follows: for a 2M hugetlb page the first 4K vmemmap page is reused to remap the remaining 7 vmemmap pages, and for a 1G hugetlb page it remaps the remaining 4095 vmemmap pages. Essentially, this breaks off the first 4K of a potentially contiguous chunk of memory of 32K (for 2M hugetlb pages) or 16M (for 1G hugetlb pages). For this reason, the memory freed back to the page allocator cannot be used by hugetlb to allocate huge pages of the same size, but only huge pages of a smaller size:

Trying to assign a 64G node to hugetlb (on a 128G 2-node guest, each node having 64G):

* Before allocation:

Free pages count per migrate type at order    0     1     2     3     4     5     6     7     8     9    10 ...
Node 0, zone Normal, type Movable           340   100    32    15     1     2     0     0     0     1 15558

$ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
31987

* After:

Node 0, zone Normal, type Movable         30893 32006 31515     7     0     0     0     0     0     0     0

Notice how the memory freed back is put into the 4K / 8K / 16K page pools, and a total of 31974 pages (63948M) was allocated.

To fix this behaviour, rather than remapping onto one of the vmemmap pages being freed (thus breaking the contiguous block of memory backing the struct pages), repopulate the head vmemmap page with a newly allocated page: copy the data from the currently mapped vmemmap page, and then remap the head to this new page. Additionally, change the remap_pte callback to look at the newly added walk::head_page, which needs to be mapped r/w, compared to the tail page vmemmap reuse which uses r/o. The new head page is allocated by the caller of vmemmap_remap_free(), given that on restore it should still use the same code path as before.
Note that, because one hugepage is remapped at a time, only one free 4K page at a time is needed to remap the head page. Should allocating the new page fail, the one that's already mapped is reused, just like before. As a result, for every 64G of contiguous hugepages it can give back 1G more of contiguous memory, while needing in total 128M of new 4K pages (for 2M hugetlb) or 256K (for 1G hugetlb).

After the changes, try to assign a 64G node to hugetlb (on a 128G 2-node guest, each node with 64G):

* Before allocation:

Free pages count per migrate type at order    0     1     2     3     4     5     6     7     8     9    10 ...
Node 0, zone Normal, type Movable             1     1     1     0     0     1     0     0     1     1 15564

$ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
32394

* After:

Node 0, zone Normal, type Movable             0    50    97   108    96    81    70    46    18     0     0

In the example above, 407 more hugetlb 2M pages are allocated, i.e. 814M out of the 32394 (64796M) allocated. So the memory freed back is indeed being reused by hugetlb, and no massive amount of order-0..order-2 pages accumulates unused.

Signed-off-by: Joao Martins
---
Changes since v1[0]:
* Drop the rw argument and check walk::head_page directly when there's no reuse_page set (similar to Muchun Song's suggestion to adjust this inside the remap_pte callback)
* Adjust the TLB flush to cover the head page vaddr too (Muchun Song)
* Simplify the remap of the head page in vmemmap_pte_range()
* Check that start is aligned to PAGE_SIZE in vmemmap_remap_free()

I've kept the same structure as in v1, compared to the chunk Muchun pasted in the v1 thread[1], and thus I am not altering the calling convention of vmemmap_remap_free()/vmemmap_restore_pte(). The remapping of the head page is not exactly a page that is reused, compared to the r/o tail vmemmap pages remapping. So it is a tiny semantic change, albeit with the same outcome in practice of changing the PTE and freeing the page, just with different permissions.
It also made it simpler to gracefully fail in case of page allocation failure, and the logic simpler to follow, IMHO. Let me know otherwise if I followed the wrong thinking.

[0] https://lore.kernel.org/linux-mm/20220802180309.19340-1-joao.m.martins@oracle.com/
[1] https://lore.kernel.org/linux-mm/Yun1bJsnK%2F6MFc0b@FVFYT0MHHV2J/
---
 mm/hugetlb_vmemmap.c | 59 ++++++++++++++++++++++++++++++++++++++------
 1 file changed, 52 insertions(+), 7 deletions(-)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 7898c2c75e35..4298c44578e3 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -22,6 +22,7 @@
  *
  * @remap_pte:		called for each lowest-level entry (PTE).
  * @nr_walked:		the number of walked pte.
+ * @head_page:		the page which replaces the head vmemmap page.
  * @reuse_page:		the page which is reused for the tail vmemmap pages.
  * @reuse_addr:		the virtual address of the @reuse_page page.
  * @vmemmap_pages:	the list head of the vmemmap pages that can be freed
@@ -31,6 +32,7 @@ struct vmemmap_remap_walk {
 	void			(*remap_pte)(pte_t *pte, unsigned long addr,
 					     struct vmemmap_remap_walk *walk);
 	unsigned long		nr_walked;
+	struct page		*head_page;
 	struct page		*reuse_page;
 	unsigned long		reuse_addr;
 	struct list_head	*vmemmap_pages;
@@ -105,10 +107,26 @@ static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
 	 * remapping (which is calling @walk->remap_pte).
 	 */
 	if (!walk->reuse_page) {
-		walk->reuse_page = pte_page(*pte);
+		struct page *page = pte_page(*pte);
+
+		/*
+		 * Copy the data from the original head, and remap to
+		 * the newly allocated page.
+		 */
+		if (walk->head_page) {
+			memcpy(page_address(walk->head_page),
+			       page_address(page), PAGE_SIZE);
+			walk->remap_pte(pte, addr, walk);
+			page = walk->head_page;
+		}
+
+		walk->reuse_page = page;
+
 		/*
-		 * Because the reuse address is part of the range that we are
-		 * walking, skip the reuse address range.
+		 * Because the reuse address is part of the range that
+		 * we are walking or the head page was remapped to a
+		 * new page, skip the reuse address range.
 		 */
 		addr += PAGE_SIZE;
 		pte++;
@@ -204,11 +222,11 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
 	} while (pgd++, addr = next, addr != end);
 
 	/*
-	 * We only change the mapping of the vmemmap virtual address range
-	 * [@start + PAGE_SIZE, end), so we only need to flush the TLB which
+	 * We change the mapping of the vmemmap virtual address range
+	 * [@start, end), so we only need to flush the TLB which
 	 * belongs to the range.
 	 */
-	flush_tlb_kernel_range(start + PAGE_SIZE, end);
+	flush_tlb_kernel_range(start, end);
 
 	return 0;
 }
@@ -244,9 +262,21 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
 	 * to the tail pages.
 	 */
 	pgprot_t pgprot = PAGE_KERNEL_RO;
-	pte_t entry = mk_pte(walk->reuse_page, pgprot);
+	struct page *reuse = walk->reuse_page;
 	struct page *page = pte_page(*pte);
+	pte_t entry;
 
+	/*
+	 * When there's no walk::reuse_page, it means we allocated a new head
+	 * page (stored in walk::head_page) and copied from the old head page.
+	 * In that case use the walk::head_page as the page to remap.
+	 */
+	if (!reuse) {
+		pgprot = PAGE_KERNEL;
+		reuse = walk->head_page;
+	}
+
+	entry = mk_pte(reuse, pgprot);
 	list_add_tail(&page->lru, walk->vmemmap_pages);
 	set_pte_at(&init_mm, addr, pte, entry);
 }
@@ -315,6 +345,21 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
 		.reuse_addr	= reuse,
 		.vmemmap_pages	= &vmemmap_pages,
 	};
+	gfp_t gfp_mask = GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;
+	int nid = page_to_nid((struct page *)start);
+	struct page *page = NULL;
+
+	/*
+	 * Allocate a new head vmemmap page to avoid breaking a contiguous
+	 * block of struct page memory when freeing it back to the page
+	 * allocator in free_vmemmap_page_list(). This will allow the likely
+	 * contiguous struct page backing memory to be kept contiguous and
+	 * allow for more allocations of hugepages. Fallback to the
+	 * currently mapped head page should it fail to allocate.
+	 */
+	if (IS_ALIGNED((unsigned long)start, PAGE_SIZE))
+		page = alloc_pages_node(nid, gfp_mask, 0);
+	walk.head_page = page;
 
 	/*
 	 * In order to make remapping routine most efficient for the huge pages,