From patchwork Wed Dec 7 20:30:28 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Peter Xu X-Patchwork-Id: 13067581 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 20EFAC4708D for ; Wed, 7 Dec 2022 20:30:49 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 96D498E0008; Wed, 7 Dec 2022 15:30:48 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 8F68D8E0005; Wed, 7 Dec 2022 15:30:48 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 5C5018E0008; Wed, 7 Dec 2022 15:30:48 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 2E7628E0005 for ; Wed, 7 Dec 2022 15:30:48 -0500 (EST) Received: from smtpin02.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id F3D68AB67F for ; Wed, 7 Dec 2022 20:30:47 +0000 (UTC) X-FDA: 80216653776.02.4703D46 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by imf22.hostedemail.com (Postfix) with ESMTP id 8F65BC0005 for ; Wed, 7 Dec 2022 20:30:47 +0000 (UTC) Authentication-Results: imf22.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=YBL683Qo; dmarc=pass (policy=none) header.from=redhat.com; spf=pass (imf22.hostedemail.com: domain of peterx@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=peterx@redhat.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1670445047; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=XK/5vqUd1QlikRxebc+rXesIul2oExyAtkR9smyGVX0=; b=bPPSCGqziATJZlez2YTFqZ8RO1iWkhov0t+lbu2KXBLS4NzCBNxdChw2eSNW9bP+u/CaS1 YLZUwCJEngNK+SgJSBJ2f1z0lYoArNZhbTxhKlbILURWWG5rO1GZtNDdcnRjT0yWHt1g/V LcLJkX3clj6NPeSvm6jx7TH/ob7GfTU= ARC-Authentication-Results: i=1; imf22.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=YBL683Qo; dmarc=pass (policy=none) header.from=redhat.com; spf=pass (imf22.hostedemail.com: domain of peterx@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=peterx@redhat.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1670445047; a=rsa-sha256; cv=none; b=VTJD69zeJfrXqdLubwfoAn+rQ6OsAWutGn4dpRVRePmyfJ/ASp4Ufqfm2t0K3M3Yt9YVUU wduZcrxgqFfuzjXQCBlCO00eTlhM0k9z5VsdeiDXvVW2FeL/q/ERFQ+ZBH8Oxl4K4mgmw3 f9QxZOHf5N2M4VAeZJbB0h8IA2Tzay0= DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1670445047; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=XK/5vqUd1QlikRxebc+rXesIul2oExyAtkR9smyGVX0=; b=YBL683QoLr3TGrxzVQ4yz2/Iqwhw2O1mFP0n+9bmqVjfrZcrsWKxCL0OT5e4tUpSBPaTTx OHQKh+7qFAQ1yhBuAOD4f48fJiQezMoOuVZ9tgr/NFxu3Mi8k8/82g04FLhjQuR3haAZUU A2fsklmEXNoySCuOpwva2FP1IZTkxm8= Received: from mail-qv1-f71.google.com (mail-qv1-f71.google.com [209.85.219.71]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_128_GCM_SHA256) id us-mta-367-xhHXBTEIMd-D8DO2Mc0D_A-1; Wed, 07 Dec 2022 15:30:45 -0500 X-MC-Unique: xhHXBTEIMd-D8DO2Mc0D_A-1 Received: by mail-qv1-f71.google.com with SMTP id on28-20020a056214449c00b004bbf12d7976so37739937qvb.18 for ; Wed, 07 Dec 2022 12:30:45 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=XK/5vqUd1QlikRxebc+rXesIul2oExyAtkR9smyGVX0=; b=EfzfqwFrBpghvJGXeV6CiSjpfD19A8DTUWBTdBSex5r4OH+MAUqLuLqJYEOUnhI2Wf SfByu9B8P0LPN+GiJGKQuwLMLLZgYpqES5IXKifWaJPiWuLw8i2nUtWIZbtAK2DoX5hO My0qS0s6cOVAtsomqKHc+wgQYHeHa4acKWEwdxeKPBTRVNIxXQK/D26CXmhC1c6PuemN Tkf6ADABb59tkBqYiAY0vMzm/bRr1VZ3NYasoQiz3E91uzov5O7efUj6l2Wo38nrJ5Ed 5qM6Ur8Pey6kFBZqvi8qdTQPFR2NcZhdLJk4h91cIphsoAv6IR5sCY67xsfhOpw9slYi YI1A== X-Gm-Message-State: ANoB5pmRe3jEXtD7ogITTCt2EPnJU21y1ajrt2WMTrxdeQUuh2XRIFru IWmSZzWumAaZhGzYHlC/Uis4ELMPkEDWeyDPT+N4cK+w8XvwjWcPua6e/zDKoFf9iDDdmiVW0Em WFyXPXbxWpiNKfQNQNcbittIFYDn3vidT0gNI4ezly09QhPX8QrbSUeOijvfF X-Received: by 2002:ac8:57cd:0:b0:3a6:7b53:9b20 with SMTP id w13-20020ac857cd000000b003a67b539b20mr2152374qta.12.1670445044556; Wed, 07 Dec 2022 12:30:44 -0800 (PST) X-Google-Smtp-Source: AA0mqf5LbXfVhI/+vq/w+LG4yNKtnSGIgip174q5+ZLxE8DPe02Msdaz4r+WtR6QivWpqpG5VnnIiA== X-Received: by 2002:ac8:57cd:0:b0:3a6:7b53:9b20 with SMTP id w13-20020ac857cd000000b003a67b539b20mr2152353qta.12.1670445044228; Wed, 07 Dec 2022 12:30:44 -0800 (PST) Received: from x1n.redhat.com (bras-base-aurron9127w-grc-46-70-31-27-79.dsl.bell.ca. [70.31.27.79]) by smtp.gmail.com with ESMTPSA id dc53-20020a05620a523500b006fefa5f7fcesm855594qkb.10.2022.12.07.12.30.42 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 07 Dec 2022 12:30:43 -0800 (PST) From: Peter Xu To: linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Muchun Song , John Hubbard , Andrea Arcangeli , James Houghton , Jann Horn , Rik van Riel , Miaohe Lin , Andrew Morton , Mike Kravetz , peterx@redhat.com, David Hildenbrand , Nadav Amit Subject: [PATCH v2 04/10] mm/hugetlb: Move swap entry handling into vma lock when faulted Date: Wed, 7 Dec 2022 15:30:28 -0500 Message-Id: <20221207203034.650899-5-peterx@redhat.com> X-Mailer: git-send-email 2.37.3 In-Reply-To: <20221207203034.650899-1-peterx@redhat.com> References: <20221207203034.650899-1-peterx@redhat.com> MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-type: text/plain X-Rspam-User: X-Rspamd-Server: rspam02 X-Rspamd-Queue-Id: 8F65BC0005 X-Stat-Signature: h13ryikgsfwtddm1xfr4bxbjg3i1hxxy X-Spamd-Result: default: False [2.10 / 9.00]; BAYES_HAM(-6.00)[100.00%]; SORBS_IRL_BL(3.00)[209.85.219.71:received]; R_MISSING_CHARSET(2.50)[]; SUSPICIOUS_RECIPS(1.50)[]; MID_CONTAINS_FROM(1.00)[]; RCVD_NO_TLS_LAST(0.10)[]; MIME_GOOD(-0.10)[text/plain]; BAD_REP_POLICIES(0.10)[]; PREVIOUSLY_DELIVERED(0.00)[linux-mm@kvack.org]; R_DKIM_ALLOW(0.00)[redhat.com:s=mimecast20190719]; FROM_HAS_DN(0.00)[]; RCPT_COUNT_TWELVE(0.00)[14]; FROM_EQ_ENVFROM(0.00)[]; ARC_NA(0.00)[]; MIME_TRACE(0.00)[0:+]; TAGGED_RCPT(0.00)[]; DMARC_POLICY_ALLOW(0.00)[redhat.com,none]; TO_MATCH_ENVRCPT_SOME(0.00)[]; ARC_SIGNED(0.00)[hostedemail.com:s=arc-20220608:i=1]; DKIM_TRACE(0.00)[redhat.com:+]; R_SPF_ALLOW(0.00)[+ip4:170.10.129.0/24]; RCVD_COUNT_THREE(0.00)[4]; TO_DN_SOME(0.00)[]; RCVD_VIA_SMTP_AUTH(0.00)[] X-HE-Tag: 1670445047-684177 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: In hugetlb_fault(), there used to have a special path to handle swap entry at the entrance using huge_pte_offset(). That's unsafe because huge_pte_offset() for a pmd sharable range can access freed pgtables if without any lock to protect the pgtable from being freed after pmd unshare. Here the simplest solution to make it safe is to move the swap handling to be after the vma lock being held. We may need to take the fault mutex on either migration or hwpoison entries now (also the vma lock, but that's really needed), however neither of them is hot path. Note that the vma lock cannot be released in hugetlb_fault() when the migration entry is detected, because in migration_entry_wait_huge() the pgtable page will be used again (by taking the pgtable lock), so that also need to be protected by the vma lock. Modify migration_entry_wait_huge() so that it must be called with vma read lock held, and properly release the lock in __migration_entry_wait_huge(). Reviewed-by: Mike Kravetz Signed-off-by: Peter Xu Reviewed-by: John Hubbard --- include/linux/swapops.h | 6 ++++-- mm/hugetlb.c | 36 +++++++++++++++--------------------- mm/migrate.c | 25 +++++++++++++++++++++---- 3 files changed, 40 insertions(+), 27 deletions(-) diff --git a/include/linux/swapops.h b/include/linux/swapops.h index a70b5c3a68d7..b134c5eb75cb 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -337,7 +337,8 @@ extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep, extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd, unsigned long address); #ifdef CONFIG_HUGETLB_PAGE -extern void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl); +extern void __migration_entry_wait_huge(struct vm_area_struct *vma, + pte_t *ptep, spinlock_t *ptl); extern void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte); #endif /* CONFIG_HUGETLB_PAGE */ #else /* CONFIG_MIGRATION */ @@ -366,7 +367,8 @@ static inline void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep, static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd, unsigned long address) { } #ifdef CONFIG_HUGETLB_PAGE -static inline void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl) { } +static inline void __migration_entry_wait_huge(struct vm_area_struct *vma, + pte_t *ptep, spinlock_t *ptl) { } static inline void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte) { } #endif /* CONFIG_HUGETLB_PAGE */ static inline int is_writable_migration_entry(swp_entry_t entry) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index c8a6673fe5b4..49f73677a418 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5824,22 +5824,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, int need_wait_lock = 0; unsigned long haddr = address & huge_page_mask(h); - ptep = huge_pte_offset(mm, haddr, huge_page_size(h)); - if (ptep) { - /* - * Since we hold no locks, ptep could be stale. That is - * OK as we are only making decisions based on content and - * not actually modifying content here. - */ - entry = huge_ptep_get(ptep); - if (unlikely(is_hugetlb_entry_migration(entry))) { - migration_entry_wait_huge(vma, ptep); - return 0; - } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry))) - return VM_FAULT_HWPOISON_LARGE | - VM_FAULT_SET_HINDEX(hstate_index(h)); - } - /* * Serialize hugepage allocation and instantiation, so that we don't * get spurious allocation failures if two CPUs race to instantiate @@ -5854,10 +5838,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, * Acquire vma lock before calling huge_pte_alloc and hold * until finished with ptep. This prevents huge_pmd_unshare from * being called elsewhere and making the ptep no longer valid. - * - * ptep could have already be assigned via huge_pte_offset. That - * is OK, as huge_pte_alloc will return the same value unless - * something has changed. */ hugetlb_vma_lock_read(vma); ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h)); @@ -5886,8 +5866,22 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, * fault, and is_hugetlb_entry_(migration|hwpoisoned) check will * properly handle it. */ - if (!pte_present(entry)) + if (!pte_present(entry)) { + if (unlikely(is_hugetlb_entry_migration(entry))) { + /* + * Release fault lock first because the vma lock is + * needed to guard the huge_pte_lockptr() later in + * migration_entry_wait_huge(). The vma lock will + * be released there. + */ + mutex_unlock(&hugetlb_fault_mutex_table[hash]); + migration_entry_wait_huge(vma, ptep); + return 0; + } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry))) + ret = VM_FAULT_HWPOISON_LARGE | + VM_FAULT_SET_HINDEX(hstate_index(h)); goto out_mutex; + } /* * If we are going to COW/unshare the mapping later, we examine the diff --git a/mm/migrate.c b/mm/migrate.c index 48584b032ea9..d14f1f3ab073 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -333,24 +333,41 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd, } #ifdef CONFIG_HUGETLB_PAGE -void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl) +void __migration_entry_wait_huge(struct vm_area_struct *vma, + pte_t *ptep, spinlock_t *ptl) { pte_t pte; + /* + * The vma read lock must be taken, which will be released before + * the function returns. It makes sure the pgtable page (along + * with its spin lock) not be freed in parallel. + */ + hugetlb_vma_assert_locked(vma); + spin_lock(ptl); pte = huge_ptep_get(ptep); - if (unlikely(!is_hugetlb_entry_migration(pte))) + if (unlikely(!is_hugetlb_entry_migration(pte))) { spin_unlock(ptl); - else + hugetlb_vma_unlock_read(vma); + } else { + /* + * If migration entry existed, safe to release vma lock + * here because the pgtable page won't be freed without the + * pgtable lock released. See comment right above pgtable + * lock release in migration_entry_wait_on_locked(). + */ + hugetlb_vma_unlock_read(vma); migration_entry_wait_on_locked(pte_to_swp_entry(pte), NULL, ptl); + } } void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte) { spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), vma->vm_mm, pte); - __migration_entry_wait_huge(pte, ptl); + __migration_entry_wait_huge(vma, pte, ptl); } #endif