From patchwork Fri Mar 15 09:18:14 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Baolin Wang
X-Patchwork-Id: 13593191
From: Baolin Wang
To: akpm@linux-foundation.org
Cc: david@redhat.com, mgorman@techsingularity.net, wangkefeng.wang@huawei.com,
 jhubbard@nvidia.com, ying.huang@intel.com, 21cnbao@gmail.com,
 ryan.roberts@arm.com, baolin.wang@linux.alibaba.com, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
Subject: [RFC PATCH v2] mm: support multi-size THP numa balancing
Date: Fri, 15 Mar 2024 17:18:14 +0800
Message-Id: <903bf13fc3e68b8dc1f256570d78b55b2dd9c96f.1710493587.git.baolin.wang@linux.alibaba.com>
X-Mailer: git-send-email 2.39.3

Anonymous page allocation already supports multi-size THP (mTHP), but NUMA
balancing still prohibits mTHP migration even when the mapping is exclusive,
which is unreasonable. Thus let's support NUMA balancing for exclusive mTHP
first.

Allow scanning mTHP:
Commit 859d4adc3415 ("mm: numa: do not trap faults on shared data section
pages") skips NUMA page migration for shared CoW pages to avoid migrating
shared data segments. In addition, commit 80d47f5de5e3 ("mm: don't try to
NUMA-migrate COW pages that have other uses") changed the check to use
page_count() to avoid migrating GUP pages, which also causes mTHP to be
skipped during NUMA scanning. Theoretically, we can use
folio_maybe_dma_pinned() to detect the GUP case; although there is still a
GUP race, that issue seems to have been resolved by commit 80d47f5de5e3.
Meanwhile, use folio_estimated_sharers() to skip shared CoW pages, though
this is not a precise sharers count. To check whether the folio is shared,
ideally we want to make sure every page is mapped by the same process, but
doing that seems expensive, and using the estimated mapcount seems to work
when running the autonuma benchmark.

Allow migrating mTHP:
As mentioned in the previous thread[1], large folios are more susceptible to
false sharing, which leads to pages ping-ponging back and forth during NUMA
balancing, and this is currently hard to resolve. Therefore, as a start to
supporting mTHP NUMA balancing, only exclusive mappings are allowed to
perform NUMA migration, to avoid false sharing with large folios. Similarly,
use the estimated mapcount to skip shared mappings, which seems to work in
most cases (?). We already use folio_estimated_sharers() to skip shared
mappings in migrate_misplaced_folio() for NUMA balancing, with no real
complaints so far.

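To make the two checks described above easier to reason about, here is a
minimal userspace sketch of the decision logic only. It is not kernel code:
struct folio_model and the numa_*_skips() helpers are hypothetical names
invented for illustration, and their fields merely stand in for the results
of folio_test_large(), folio_maybe_dma_pinned() and
folio_estimated_sharers(). All other conditions in change_pte_range() and
do_numa_page() are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the folio state the patch inspects. */
struct folio_model {
	bool large;            /* models folio_test_large() */
	bool maybe_dma_pinned; /* models folio_maybe_dma_pinned() */
	int estimated_sharers; /* models folio_estimated_sharers() */
};

/*
 * Scan side (the check changed in change_pte_range()): in a CoW mapping,
 * skip folios that may be pinned or that appear to be shared.
 */
static bool numa_scan_skips(const struct folio_model *f, bool cow_mapping)
{
	return cow_mapping &&
	       (f->maybe_dma_pinned || f->estimated_sharers > 1);
}

/*
 * Fault side (the check changed in do_numa_page()): skip migration only
 * for large folios that appear to be shared.
 */
static bool numa_fault_skips(const struct folio_model *f)
{
	return f->large && f->estimated_sharers > 1;
}

/* Small self-check documenting the intended behaviour of the model. */
static void numa_model_selftest(void)
{
	struct folio_model exclusive_mthp = { true, false, 1 };
	struct folio_model shared_mthp = { true, false, 2 };
	struct folio_model pinned_small = { false, true, 1 };

	/* An exclusive mTHP is now scanned and may be migrated. */
	assert(!numa_scan_skips(&exclusive_mthp, true));
	assert(!numa_fault_skips(&exclusive_mthp));

	/* A shared mTHP is skipped on both paths. */
	assert(numa_scan_skips(&shared_mthp, true));
	assert(numa_fault_skips(&shared_mthp));

	/* A possibly pinned small folio is skipped at scan time only. */
	assert(numa_scan_skips(&pinned_small, true));
	assert(!numa_fault_skips(&pinned_small));
}
```

Calling numa_model_selftest() from a test harness exercises the three cases.
Note that the sharers value is only an estimate in the kernel as well, so
the model inherits the same imprecision the commit message points out.
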
Performance data:
Machine environment: 2 nodes, 128 cores Intel(R) Xeon(R) Platinum
Base: 2024-3-15 mm-unstable branch
Enable mTHP=64K to run autonuma-benchmark

Base without the patch:
numa01                 222.97
numa01_THREAD_ALLOC    115.78
numa02                  13.04
numa02_SMT              14.69

Base with the patch:
numa01                 125.36
numa01_THREAD_ALLOC     44.58
numa02                   9.22
numa02_SMT               7.46

[1] https://lore.kernel.org/all/20231117100745.fnpijbk4xgmals3k@techsingularity.net/
Signed-off-by: Baolin Wang
---
Changes from RFC v1:
 - Add some performance data per Huang, Ying.
 - Allow mTHP scanning per David Hildenbrand.
 - Avoid sharing mapping for numa balancing to avoid false sharing.
 - Add more commit message.
---
 mm/memory.c   | 9 +++++----
 mm/mprotect.c | 3 ++-
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index f2bc6dd15eb8..b9d5d88c5a76 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5059,7 +5059,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	int last_cpupid;
 	int target_nid;
 	pte_t pte, old_pte;
-	int flags = 0;
+	int flags = 0, nr_pages = 0;
 
 	/*
 	 * The pte cannot be used safely until we verify, while holding the page
@@ -5089,8 +5089,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	if (!folio || folio_is_zone_device(folio))
 		goto out_map;
 
-	/* TODO: handle PTE-mapped THP */
-	if (folio_test_large(folio))
+	/* Avoid large folio false sharing */
+	if (folio_test_large(folio) && folio_estimated_sharers(folio) > 1)
 		goto out_map;
 
 	/*
@@ -5112,6 +5112,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 		flags |= TNF_SHARED;
 
 	nid = folio_nid(folio);
+	nr_pages = folio_nr_pages(folio);
 	/*
 	 * For memory tiering mode, cpupid of slow memory page is used
 	 * to record page access time. So use default value.
@@ -5148,7 +5149,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 
 out:
 	if (nid != NUMA_NO_NODE)
-		task_numa_fault(last_cpupid, nid, 1, flags);
+		task_numa_fault(last_cpupid, nid, nr_pages, flags);
 	return 0;
 out_map:
 	/*
diff --git a/mm/mprotect.c b/mm/mprotect.c
index f8a4544b4601..f0b9c974aaae 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -129,7 +129,8 @@ static long change_pte_range(struct mmu_gather *tlb,
 
 				/* Also skip shared copy-on-write pages */
 				if (is_cow_mapping(vma->vm_flags) &&
-				    folio_ref_count(folio) != 1)
+				    (folio_maybe_dma_pinned(folio) ||
+				     folio_estimated_sharers(folio) > 1))
 					continue;
 
 				/*
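
As a quick way to read the autonuma-benchmark figures quoted in the commit
message, the sketch below computes the relative reduction for each test. It
assumes the figures are elapsed times where lower is better, which the
message does not state explicitly. struct bench_row, rows[] and
reduction_pct() are hypothetical names used only for this illustration; the
numbers are copied from the performance data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One benchmark result: value without the patch and with the patch. */
struct bench_row {
	const char *name;
	double base;
	double patched;
};

/* Figures copied from the performance data in the commit message. */
static const struct bench_row rows[] = {
	{ "numa01",              222.97, 125.36 },
	{ "numa01_THREAD_ALLOC", 115.78,  44.58 },
	{ "numa02",               13.04,   9.22 },
	{ "numa02_SMT",           14.69,   7.46 },
};

/* Relative reduction in percent: (base - patched) / base * 100. */
static double reduction_pct(double base, double patched)
{
	assert(base > 0.0);
	return (base - patched) / base * 100.0;
}

/* Print one line per benchmark with its relative reduction. */
static void print_reductions(void)
{
	size_t i;

	for (i = 0; i < sizeof(rows) / sizeof(rows[0]); i++)
		printf("%-22s %6.1f%%\n", rows[i].name,
		       reduction_pct(rows[i].base, rows[i].patched));
}
```

Under that assumption the reductions come out at roughly 44% for numa01, 61%
for numa01_THREAD_ALLOC, 29% for numa02 and 49% for numa02_SMT.
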