From patchwork Tue Sep 10 23:43:56 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Ackerley Tng
X-Patchwork-Id: 13799498
Date: Tue, 10 Sep 2024 23:43:56 +0000
Mime-Version: 1.0
X-Mailer: git-send-email 2.46.0.598.g6f2099f65c-goog
Subject: [RFC PATCH 25/39] KVM: guest_memfd: Split HugeTLB pages for guest_memfd use
From: Ackerley Tng
To: tabba@google.com, quic_eberman@quicinc.com, roypat@amazon.co.uk,
 jgg@nvidia.com, peterx@redhat.com, david@redhat.com, rientjes@google.com,
 fvdl@google.com, jthoughton@google.com, seanjc@google.com,
 pbonzini@redhat.com, zhiquan1.li@intel.com, fan.du@intel.com,
 jun.miao@intel.com, isaku.yamahata@intel.com, muchun.song@linux.dev,
 mike.kravetz@oracle.com
Cc: erdemaktas@google.com, vannapurve@google.com, ackerleytng@google.com,
 qperret@google.com, jhubbard@nvidia.com, willy@infradead.org,
 shuah@kernel.org, brauner@kernel.org, bfoster@redhat.com,
 kent.overstreet@linux.dev, pvorel@suse.cz, rppt@kernel.org,
 richard.weiyang@gmail.com, anup@brainfault.org, haibo1.xu@intel.com,
 ajones@ventanamicro.com, vkuznets@redhat.com,
 maciej.wieczor-retman@intel.com, pgonda@google.com, oliver.upton@linux.dev,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, kvm@vger.kernel.org,
 linux-kselftest@vger.kernel.org, linux-fsdevel@kvack.org
From: Vishal Annapurve

In this patch, newly allocated HugeTLB pages are split into 4K regular
pages before providing them to the requester (fallocate() or KVM). The
pages are then reconstructed/merged back into HugeTLB pages before they
are returned to HugeTLB.

This is an intermediate step to build page splitting/merging
functionality before allowing guest_memfd files to be mmap()ed.
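For illustration only, the sketch below shows one userspace sequence that
would exercise the new path: creating a guest_memfd backed by HugeTLB,
allocating its pages with fallocate() (each huge page is split into 4K
pages on allocation), then punching a hole so the split pages are
reconstructed and returned to HugeTLB. This is not part of the patch; the
KVM_GUEST_MEMFD_HUGETLB and KVM_GUEST_MEMFD_HUGE_2MB flag names are
assumed from earlier patches in this RFC series and are not an
established ABI.

  /* Illustrative sketch; HugeTLB flag names are assumed from this RFC. */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <sys/ioctl.h>
  #include <linux/falloc.h>
  #include <linux/kvm.h>

  #define GMEM_SIZE (512UL << 20)

  static int exercise_split_and_reconstruct(int vm_fd)
  {
  	struct kvm_create_guest_memfd gmem = {
  		.size = GMEM_SIZE,
  		/* Hypothetical flags requesting 2M HugeTLB backing. */
  		.flags = KVM_GUEST_MEMFD_HUGETLB | KVM_GUEST_MEMFD_HUGE_2MB,
  	};
  	int gmem_fd;

  	gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
  	if (gmem_fd < 0)
  		return -1;

  	/* Allocation path: each 2M HugeTLB folio is split into 4K folios. */
  	if (fallocate(gmem_fd, 0, 0, GMEM_SIZE))
  		return -1;

  	/* Truncation path: split folios are reconstructed, then freed back. */
  	if (fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
  		      0, GMEM_SIZE))
  		return -1;

  	return gmem_fd;
  }

On the truncation side of this sequence, the patch gathers the first
sub-folio of each huge page, reconstructs the HugeTLB folio, and only
then returns it to HugeTLB.
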
Co-developed-by: Ackerley Tng
Signed-off-by: Ackerley Tng
Co-developed-by: Vishal Annapurve
Signed-off-by: Vishal Annapurve
---
 virt/kvm/guest_memfd.c | 299 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 281 insertions(+), 18 deletions(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index eacbfdb950d1..8151df2c03e5 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -229,31 +229,206 @@ static int kvm_gmem_hugetlb_filemap_add_folio(struct address_space *mapping,
 	return 0;
 }
 
+struct kvm_gmem_split_stash {
+	struct {
+		unsigned long _flags_2;
+		unsigned long _head_2;
+
+		void *_hugetlb_subpool;
+		void *_hugetlb_cgroup;
+		void *_hugetlb_cgroup_rsvd;
+		void *_hugetlb_hwpoison;
+	};
+	void *hugetlb_private;
+};
+
+static int kvm_gmem_hugetlb_stash_metadata(struct folio *folio)
+{
+	struct kvm_gmem_split_stash *stash;
+
+	stash = kmalloc(sizeof(*stash), GFP_KERNEL);
+	if (!stash)
+		return -ENOMEM;
+
+	stash->_flags_2 = folio->_flags_2;
+	stash->_head_2 = folio->_head_2;
+	stash->_hugetlb_subpool = folio->_hugetlb_subpool;
+	stash->_hugetlb_cgroup = folio->_hugetlb_cgroup;
+	stash->_hugetlb_cgroup_rsvd = folio->_hugetlb_cgroup_rsvd;
+	stash->_hugetlb_hwpoison = folio->_hugetlb_hwpoison;
+	stash->hugetlb_private = folio_get_private(folio);
+
+	folio_change_private(folio, (void *)stash);
+
+	return 0;
+}
+
+static int kvm_gmem_hugetlb_unstash_metadata(struct folio *folio)
+{
+	struct kvm_gmem_split_stash *stash;
+
+	stash = folio_get_private(folio);
+
+	if (!stash)
+		return -EINVAL;
+
+	folio->_flags_2 = stash->_flags_2;
+	folio->_head_2 = stash->_head_2;
+	folio->_hugetlb_subpool = stash->_hugetlb_subpool;
+	folio->_hugetlb_cgroup = stash->_hugetlb_cgroup;
+	folio->_hugetlb_cgroup_rsvd = stash->_hugetlb_cgroup_rsvd;
+	folio->_hugetlb_hwpoison = stash->_hugetlb_hwpoison;
+	folio_change_private(folio, stash->hugetlb_private);
+
+	kfree(stash);
+
+	return 0;
+}
+
+/**
+ * Reconstruct a HugeTLB folio from a contiguous block of folios where the first
+ * of the contiguous folios is @folio.
+ *
+ * The size of the contiguous block is of huge_page_size(@h). All the folios in
+ * the block are checked to have a refcount of 1 before reconstruction. After
+ * reconstruction, the reconstructed folio has a refcount of 1.
+ *
+ * Return 0 on success and negative error otherwise.
+ */
+static int kvm_gmem_hugetlb_reconstruct_folio(struct hstate *h, struct folio *folio)
+{
+	int ret;
+
+	WARN_ON((folio->index & (huge_page_order(h) - 1)) != 0);
+
+	ret = kvm_gmem_hugetlb_unstash_metadata(folio);
+	if (ret)
+		return ret;
+
+	if (!prep_compound_gigantic_folio(folio, huge_page_order(h))) {
+		kvm_gmem_hugetlb_stash_metadata(folio);
+		return -ENOMEM;
+	}
+
+	__folio_set_hugetlb(folio);
+
+	folio_set_count(folio, 1);
+
+	hugetlb_vmemmap_optimize_folio(h, folio);
+
+	return 0;
+}
+
+/* Basically folio_set_order(folio, 1) without the checks. */
+static inline void kvm_gmem_folio_set_order(struct folio *folio, unsigned int order)
+{
+	folio->_flags_1 = (folio->_flags_1 & ~0xffUL) | order;
+#ifdef CONFIG_64BIT
+	folio->_folio_nr_pages = 1U << order;
+#endif
+}
+
+/**
+ * Split a HugeTLB @folio of size huge_page_size(@h).
+ *
+ * After splitting, each split folio has a refcount of 1. There are no checks on
+ * refcounts before splitting.
+ *
+ * Return 0 on success and negative error otherwise.
+ */
+static int kvm_gmem_hugetlb_split_folio(struct hstate *h, struct folio *folio)
+{
+	int ret;
+
+	ret = hugetlb_vmemmap_restore_folio(h, folio);
+	if (ret)
+		return ret;
+
+	ret = kvm_gmem_hugetlb_stash_metadata(folio);
+	if (ret) {
+		hugetlb_vmemmap_optimize_folio(h, folio);
+		return ret;
+	}
+
+	kvm_gmem_folio_set_order(folio, 0);
+
+	destroy_compound_gigantic_folio(folio, huge_page_order(h));
+	__folio_clear_hugetlb(folio);
+
+	/*
+	 * Remove the first folio from h->hugepage_activelist since it is no
+	 * longer a HugeTLB page. The other split pages should not be on any
+	 * lists.
+	 */
+	hugetlb_folio_list_del(folio);
+
+	return 0;
+}
+
 static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
 							     pgoff_t index)
 {
+	struct folio *allocated_hugetlb_folio;
+	pgoff_t hugetlb_first_subpage_index;
+	struct page *hugetlb_first_subpage;
 	struct kvm_gmem_hugetlb *hgmem;
-	struct folio *folio;
+	struct page *requested_page;
 	int ret;
+	int i;
 
 	hgmem = kvm_gmem_hgmem(inode);
-	folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
-	if (IS_ERR(folio))
-		return folio;
+	allocated_hugetlb_folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
+	if (IS_ERR(allocated_hugetlb_folio))
+		return allocated_hugetlb_folio;
+
+	requested_page = folio_file_page(allocated_hugetlb_folio, index);
+	hugetlb_first_subpage = folio_file_page(allocated_hugetlb_folio, 0);
+	hugetlb_first_subpage_index = index & (huge_page_mask(hgmem->h) >> PAGE_SHIFT);
 
-	/* TODO: Fix index here to be aligned to huge page size. */
-	ret = kvm_gmem_hugetlb_filemap_add_folio(
-		inode->i_mapping, folio, index, htlb_alloc_mask(hgmem->h));
+	ret = kvm_gmem_hugetlb_split_folio(hgmem->h, allocated_hugetlb_folio);
 	if (ret) {
-		folio_put(folio);
+		folio_put(allocated_hugetlb_folio);
 		return ERR_PTR(ret);
 	}
 
+	for (i = 0; i < pages_per_huge_page(hgmem->h); ++i) {
+		struct folio *folio = page_folio(nth_page(hugetlb_first_subpage, i));
+
+		ret = kvm_gmem_hugetlb_filemap_add_folio(inode->i_mapping,
+							 folio,
+							 hugetlb_first_subpage_index + i,
+							 htlb_alloc_mask(hgmem->h));
+		if (ret) {
+			/* TODO: handle cleanup properly. */
+			pr_err("Handle cleanup properly index=%lx, ret=%d\n",
+			       hugetlb_first_subpage_index + i, ret);
+			dump_page(nth_page(hugetlb_first_subpage, i), "check");
+			return ERR_PTR(ret);
		}
+
+		/*
+		 * Skip unlocking for the requested index since
+		 * kvm_gmem_get_folio() returns a locked folio.
+		 *
+		 * Do folio_put() to drop the refcount that came with the folio,
+		 * from splitting the folio. Splitting the folio has a refcount
+		 * to be in line with hugetlb_alloc_folio(), which returns a
+		 * folio with refcount 1.
+		 *
+		 * Skip folio_put() for requested index since
+		 * kvm_gmem_get_folio() returns a folio with refcount 1.
+		 */
+		if (hugetlb_first_subpage_index + i != index) {
+			folio_unlock(folio);
+			folio_put(folio);
+		}
+	}
+
 	spin_lock(&inode->i_lock);
 	inode->i_blocks += blocks_per_huge_page(hgmem->h);
 	spin_unlock(&inode->i_lock);
 
-	return folio;
+	return page_folio(requested_page);
 }
 
 static struct folio *kvm_gmem_get_hugetlb_folio(struct inode *inode,
@@ -365,7 +540,9 @@ static inline void kvm_gmem_hugetlb_filemap_remove_folio(struct folio *folio)
 
 /**
  * Removes folios in range [@lstart, @lend) from page cache/filemap (@mapping),
- * returning the number of pages freed.
+ * returning the number of HugeTLB pages freed.
+ *
+ * @lend - @lstart must be a multiple of the HugeTLB page size.
  */
 static int kvm_gmem_hugetlb_filemap_remove_folios(struct address_space *mapping,
 						  struct hstate *h,
@@ -373,37 +550,69 @@ static int kvm_gmem_hugetlb_filemap_remove_folios(struct address_space *mapping,
 {
 	const pgoff_t end = lend >> PAGE_SHIFT;
 	pgoff_t next = lstart >> PAGE_SHIFT;
+	LIST_HEAD(folios_to_reconstruct);
 	struct folio_batch fbatch;
+	struct folio *folio, *tmp;
 	int num_freed = 0;
+	int i;
 
+	/*
+	 * TODO: Iterate over huge_page_size(h) blocks to avoid taking and
+	 * releasing hugetlb_fault_mutex_table[hash] lock so often. When
+	 * truncating, lstart and lend should be clipped to the size of this
+	 * guest_memfd file, otherwise there would be too many iterations.
+	 */
 	folio_batch_init(&fbatch);
 	while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
-		int i;
 		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
 			struct folio *folio;
 			pgoff_t hindex;
 			u32 hash;
 
 			folio = fbatch.folios[i];
+
 			hindex = folio->index >> huge_page_order(h);
 			hash = hugetlb_fault_mutex_hash(mapping, hindex);
-
 			mutex_lock(&hugetlb_fault_mutex_table[hash]);
+
+			/*
+			 * Collect first pages of HugeTLB folios for
+			 * reconstruction later.
+			 */
+			if ((folio->index & ~(huge_page_mask(h) >> PAGE_SHIFT)) == 0)
+				list_add(&folio->lru, &folios_to_reconstruct);
+
+			/*
+			 * Before removing from filemap, take a reference so
+			 * sub-folios don't get freed. Don't free the sub-folios
+			 * until after reconstruction.
+			 */
+			folio_get(folio);
+
 			kvm_gmem_hugetlb_filemap_remove_folio(folio);
-			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 
-			num_freed++;
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		}
 		folio_batch_release(&fbatch);
 		cond_resched();
 	}
 
+	list_for_each_entry_safe(folio, tmp, &folios_to_reconstruct, lru) {
+		kvm_gmem_hugetlb_reconstruct_folio(h, folio);
+		hugetlb_folio_list_move(folio, &h->hugepage_activelist);
+
+		folio_put(folio);
+		num_freed++;
+	}
+
 	return num_freed;
 }
 
 /**
  * Removes folios in range [@lstart, @lend) from page cache of inode, updates
  * inode metadata and hugetlb reservations.
+ *
+ * @lend - @lstart must be a multiple of the HugeTLB page size.
  */
 static void kvm_gmem_hugetlb_truncate_folios_range(struct inode *inode,
 						   loff_t lstart, loff_t lend)
@@ -427,6 +636,56 @@ static void kvm_gmem_hugetlb_truncate_folios_range(struct inode *inode,
 	spin_unlock(&inode->i_lock);
 }
 
+/**
+ * Zeroes offsets [@start, @end) in a folio from @mapping.
+ *
+ * [@start, @end) must be within the same folio.
+ */
+static void kvm_gmem_zero_partial_page(
+	struct address_space *mapping, loff_t start, loff_t end)
+{
+	struct folio *folio;
+	pgoff_t idx = start >> PAGE_SHIFT;
+
+	folio = filemap_lock_folio(mapping, idx);
+	if (IS_ERR(folio))
+		return;
+
+	start = offset_in_folio(folio, start);
+	end = offset_in_folio(folio, end);
+	if (!end)
+		end = folio_size(folio);
+
+	folio_zero_segment(folio, (size_t)start, (size_t)end);
+	folio_unlock(folio);
+	folio_put(folio);
+}
+
+/**
+ * Zeroes all pages in range [@start, @end) in @mapping.
+ *
+ * hugetlb_zero_partial_page() would work if this had been a full page, but is
+ * not suitable since the pages have been split.
+ *
+ * truncate_inode_pages_range() isn't the right function because it removes
+ * pages from the page cache; this function only zeroes the pages.
+ */
+static void kvm_gmem_hugetlb_zero_split_pages(struct address_space *mapping,
+					      loff_t start, loff_t end)
+{
+	loff_t aligned_start;
+	loff_t index;
+
+	aligned_start = round_up(start, PAGE_SIZE);
+
+	kvm_gmem_zero_partial_page(mapping, start, min(aligned_start, end));
+
+	for (index = aligned_start; index < end; index += PAGE_SIZE) {
+		kvm_gmem_zero_partial_page(mapping, index,
+					   min((loff_t)(index + PAGE_SIZE), end));
+	}
+}
+
 static void kvm_gmem_hugetlb_truncate_range(struct inode *inode, loff_t lstart,
 					    loff_t lend)
 {
@@ -442,8 +701,8 @@ static void kvm_gmem_hugetlb_truncate_range(struct inode *inode, loff_t lstart,
 	full_hpage_end = round_down(lend, hsize);
 
 	if (lstart < full_hpage_start) {
-		hugetlb_zero_partial_page(h, inode->i_mapping, lstart,
-					  full_hpage_start);
+		kvm_gmem_hugetlb_zero_split_pages(inode->i_mapping, lstart,
+						  full_hpage_start);
 	}
 
 	if (full_hpage_end > full_hpage_start) {
@@ -452,8 +711,8 @@ static void kvm_gmem_hugetlb_truncate_range(struct inode *inode, loff_t lstart,
 	}
 
 	if (lend > full_hpage_end) {
-		hugetlb_zero_partial_page(h, inode->i_mapping, full_hpage_end,
-					  lend);
+		kvm_gmem_hugetlb_zero_split_pages(inode->i_mapping, full_hpage_end,
+						  lend);
 	}
 }
 
@@ -1060,6 +1319,10 @@ __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
 
 	if (folio_test_hwpoison(folio)) {
 		folio_unlock(folio);
+		/*
+		 * TODO: this folio may be part of a HugeTLB folio. Perhaps
+		 * reconstruct and then free page?
+		 */
 		folio_put(folio);
 		return ERR_PTR(-EHWPOISON);
 	}
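
As a side note for reviewers, the [@start, @end) decomposition used by
kvm_gmem_hugetlb_zero_split_pages() above can be sanity-checked in
isolation. The standalone program below (illustrative only, not part of
the patch) mirrors the same round_up()/min() arithmetic on an example
unaligned range:

  #include <stdio.h>

  #define PAGE_SIZE 4096L

  static long round_up_l(long x, long a) { return (x + a - 1) / a * a; }
  static long min_l(long a, long b) { return a < b ? a : b; }

  int main(void)
  {
  	long start = 5000, end = 13000;	/* example unaligned range */
  	long aligned_start = round_up_l(start, PAGE_SIZE);
  	long index;

  	/* Leading partial page; empty when start is already page-aligned. */
  	printf("zero [%ld, %ld)\n", start, min_l(aligned_start, end));

  	/* Remaining pages, clipped to end, one page-cache folio at a time. */
  	for (index = aligned_start; index < end; index += PAGE_SIZE)
  		printf("zero [%ld, %ld)\n", index, min_l(index + PAGE_SIZE, end));

  	return 0;
  }

For start=5000 and end=13000 this prints [5000, 8192), [8192, 12288) and
[12288, 13000), matching how the helper zeroes a partial head page, whole
middle pages and a partial tail page one 4K folio at a time.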