From patchwork Thu Oct 20 07:49:01 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Baolin Wang
X-Patchwork-Id: 13012764
From: Baolin Wang
To: akpm@linux-foundation.org
Cc: david@redhat.com, ying.huang@intel.com, ziy@nvidia.com, shy828301@gmail.com,
 baolin.wang@linux.alibaba.com, jingshan@linux.alibaba.com, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
Subject: [PATCH 2/2] mm: migrate: Try again if THP split is failed due to page refcnt
Date: Thu, 20 Oct 2022 15:49:01 +0800
X-Mailer: git-send-email 1.8.3.1

When creating a virtual machine, we use memfd_create() to get a file
descriptor that can be used to create shared memory mappings with
mmap(). The mmap() call sets the MAP_POPULATE flag to allocate physical
pages for the virtual machine up front. When allocating physical pages
for the guest, the host can fall back to allocating CMA pages when more
than half of the zone's free memory is in the CMA area.

In the guest OS, when an application wants to transfer data using DMA,
QEMU calls the VFIO_IOMMU_MAP_DMA ioctl to longterm-pin the DMA pages
and create IOMMU mappings for them. However, we found that the
VFIO_IOMMU_MAP_DMA ioctl sometimes fails to longterm-pin the physical
pages.

After some investigation, we found that the pages used for the DMA
mapping can include CMA pages, and the longterm pin can fail when those
CMA pages cannot be migrated. The migration can fail because of a
temporary reference count or a memory allocation failure. In that case
the VFIO_IOMMU_MAP_DMA ioctl returns an error, and the application
fails to start.

I observed one migration failure case (which is not easy to reproduce)
in which the 'thp_migration_fail' count is 1 and the
'thp_split_page_failed' count is also 1. That means a THP in the CMA
area was being migrated, but a new THP could not be allocated due to
memory fragmentation, so migration tried to split the THP. The THP
split also failed, probably because of a temporary reference count on
the THP. The temporary reference count can be caused by dropping page
caches (I observed a drop caches operation in the system), although the
shmem page caches could not actually be dropped because they were
already dirty at that time.

For a THP split failure caused by a temporary reference count, we can
try again to mitigate the migration failure, according to the previous
discussion [1]. To that end, split_huge_page_to_list() returns -EAGAIN
instead of -EBUSY when the split fails because of the page reference
count, and migrate_pages() retries the THP when it sees -EAGAIN for an
MR_LONGTERM_PIN migration.
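For anyone trying to confirm the same failure signature, the two counters mentioned above are exported through /proc/vmstat. The following is only a minimal userspace sketch for checking them; vmstat_lookup() and vmstat_read() are hypothetical helper names made up for this note, and the fixed 16 KiB buffer is an assumption that happens to be large enough on the systems I would expect, not a guarantee.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up one "name value" line in /proc/vmstat-style text.
 * Returns 0 and stores the value in *out on success, -1 if the
 * counter is not present. (Hypothetical helper, not a kernel API.)
 */
int vmstat_lookup(const char *text, const char *name,
		  unsigned long long *out)
{
	size_t len = strlen(name);
	const char *p = text;

	while (p && *p) {
		/* Match the full counter name followed by a space. */
		if (!strncmp(p, name, len) && p[len] == ' ')
			return sscanf(p + len, "%llu", out) == 1 ? 0 : -1;
		/* Advance to the start of the next line, if any. */
		p = strchr(p, '\n');
		if (p)
			p++;
	}
	return -1;
}

/*
 * Read a vmstat file (normally "/proc/vmstat") and look up one counter.
 * Assumption: the file fits in 16 KiB; anything beyond that is ignored.
 */
int vmstat_read(const char *path, const char *name,
		unsigned long long *out)
{
	char buf[16384];
	size_t n;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return vmstat_lookup(buf, name, out);
}
```

Calling vmstat_read("/proc/vmstat", "thp_migration_fail", &v) and vmstat_read("/proc/vmstat", "thp_split_page_failed", &v) before and after the failing VFIO_IOMMU_MAP_DMA call should show whether both counters moved together.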
[1] https://lore.kernel.org/all/470dc638-a300-f261-94b4-e27250e42f96@redhat.com/
Signed-off-by: Baolin Wang
---
 mm/huge_memory.c |  4 ++--
 mm/migrate.c     | 18 +++++++++++++++---
 2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ad17c8d..a79f03b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2666,7 +2666,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	 * split PMDs
 	 */
 	if (!can_split_folio(folio, &extra_pins)) {
-		ret = -EBUSY;
+		ret = -EAGAIN;
 		goto out_unlock;
 	}
 
@@ -2716,7 +2716,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 			xas_unlock(&xas);
 		local_irq_enable();
 		remap_page(folio, folio_nr_pages(folio));
-		ret = -EBUSY;
+		ret = -EAGAIN;
 	}
 
 out_unlock:
diff --git a/mm/migrate.c b/mm/migrate.c
index 8e5eb6e..55c7855 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1506,9 +1506,21 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 				if (is_thp) {
 					nr_thp_failed++;
 					/* THP NUMA faulting doesn't split THP to retry. */
-					if (!nosplit && !try_split_thp(page, &thp_split_pages)) {
-						nr_thp_split++;
-						break;
+					if (!nosplit) {
+						rc = try_split_thp(page, &thp_split_pages);
+						if (!rc) {
+							nr_thp_split++;
+							break;
+						} else if (reason == MR_LONGTERM_PIN &&
+							   rc == -EAGAIN) {
+							/*
+							 * Try again to split THP to mitigate
+							 * the failure of longterm pinning.
+							 */
+							thp_retry++;
+							nr_retry_pages += nr_subpages;
+							break;
+						}
 					}
 				} else if (!no_subpage_counting) {
 					nr_failed++;
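
For review convenience, the decision made by the mm/migrate.c hunk above can be modelled in plain userspace C. This is only a sketch of the control flow, not kernel code: the model_* enums and model_thp_fallback() are hypothetical names invented here, and split_rc stands in for the value returned by try_split_thp().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's migrate_reason values. */
enum model_reason { MODEL_MR_COMPACTION, MODEL_MR_LONGTERM_PIN };

/* What the caller does with a THP that could not be migrated whole. */
enum model_action {
	MODEL_MIGRATE_SUBPAGES,	/* split worked: migrate the base pages */
	MODEL_RETRY_THP,	/* split hit a transient refcount: retry */
	MODEL_FAIL_THP,		/* give up on this THP */
};

/*
 * split_rc models the return value of try_split_thp(): 0 on success,
 * -EAGAIN when the split failed because of the page reference count,
 * or another negative errno otherwise.
 */
enum model_action model_thp_fallback(bool nosplit,
				     enum model_reason reason,
				     int split_rc)
{
	/* THP NUMA faulting does not split the THP at all. */
	if (nosplit)
		return MODEL_FAIL_THP;
	if (!split_rc)
		return MODEL_MIGRATE_SUBPAGES;
	/* Only longterm pinning retries, and only on -EAGAIN. */
	if (reason == MODEL_MR_LONGTERM_PIN && split_rc == -EAGAIN)
		return MODEL_RETRY_THP;
	return MODEL_FAIL_THP;
}
```

The model makes the narrow scope of the change visible: a migration reason other than longterm pinning, or a split error other than -EAGAIN, still ends in the pre-existing failure path.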