From patchwork Wed Jan 29 11:54:02 2025
X-Patchwork-Submitter: David Hildenbrand
X-Patchwork-Id: 13953677
From: David Hildenbrand
To: linux-kernel@vger.kernel.org
Cc: linux-doc@vger.kernel.org, dri-devel@lists.freedesktop.org,
 linux-mm@kvack.org, nouveau@lists.freedesktop.org,
 David Hildenbrand, Andrew Morton, Jérôme Glisse, Jonathan Corbet,
 Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul, Danilo Krummrich,
 David Airlie, Simona Vetter, "Liam R. Howlett", Lorenzo Stoakes,
 Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu, Alistair Popple,
 Jason Gunthorpe
Subject: [PATCH v1 04/12] mm/rmap: implement make_device_exclusive() using
 folio_walk instead of rmap walk
Date: Wed, 29 Jan 2025 12:54:02 +0100
Message-ID: <20250129115411.2077152-5-david@redhat.com>
X-Mailer: git-send-email 2.48.1
In-Reply-To: <20250129115411.2077152-1-david@redhat.com>
References: <20250129115411.2077152-1-david@redhat.com>

We require a writable PTE and only support anonymous folios: we can only
have exactly one PTE pointing at that page, which we can just look up
using a folio walk, avoiding the rmap walk and the anon VMA lock.

So let's stop doing an rmap walk and perform a folio walk instead, so we
can easily just modify a single PTE and avoid relying on rmap/mapcounts.

We now effectively work on a single PTE instead of multiple PTEs of
a large folio, allowing for conversion of individual PTEs from
non-exclusive to device-exclusive -- note that the other way always
worked on single PTEs.

We can drop the MMU_NOTIFY_EXCLUSIVE MMU notifier call and document why
that is not required: GUP will already take care of the
MMU_NOTIFY_EXCLUSIVE call if required (there is already a
device-exclusive entry) when it does not find a present PTE, has to
trigger a fault, and ends up in remove_device_exclusive_entry().

Note that the PTE is always writable, and we can always create a
writable-device-exclusive entry.

With this change, device-exclusive is fully compatible with THPs /
large folios. We still require PMD-sized THPs to get PTE-mapped, and
supporting PMD-mapped THP (without the PTE-remapping) is a different
endeavour that might not be worth it at this point.

This gets rid of the "folio_mapcount()" usage and lets us fix ordinary
rmap walks (migration/swapout) next. Spell out that messing with the
mapcount is wrong and must be fixed.

Signed-off-by: David Hildenbrand
---
 mm/rmap.c | 188 ++++++++++++++++--------------------------------------
 1 file changed, 55 insertions(+), 133 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 676df4fba5b0..49ffac6d27f8 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2375,131 +2375,6 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
 }
 
 #ifdef CONFIG_DEVICE_PRIVATE
-struct make_exclusive_args {
-	struct mm_struct *mm;
-	unsigned long address;
-	void *owner;
-	bool valid;
-};
-
-static bool page_make_device_exclusive_one(struct folio *folio,
-		struct vm_area_struct *vma, unsigned long address, void *priv)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
-	struct make_exclusive_args *args = priv;
-	pte_t pteval;
-	struct page *subpage;
-	bool ret = true;
-	struct mmu_notifier_range range;
-	swp_entry_t entry;
-	pte_t swp_pte;
-	pte_t ptent;
-
-	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0,
-				      vma->vm_mm, address, min(vma->vm_end,
-				      address + folio_size(folio)),
-				      args->owner);
-	mmu_notifier_invalidate_range_start(&range);
-
-	while (page_vma_mapped_walk(&pvmw)) {
-		/* Unexpected PMD-mapped THP? */
-		VM_BUG_ON_FOLIO(!pvmw.pte, folio);
-
-		ptent = ptep_get(pvmw.pte);
-		if (!pte_present(ptent)) {
-			ret = false;
-			page_vma_mapped_walk_done(&pvmw);
-			break;
-		}
-
-		subpage = folio_page(folio,
-				pte_pfn(ptent) - folio_pfn(folio));
-		address = pvmw.address;
-
-		/* Nuke the page table entry. */
-		flush_cache_page(vma, address, pte_pfn(ptent));
-		pteval = ptep_clear_flush(vma, address, pvmw.pte);
-
-		/* Set the dirty flag on the folio now the pte is gone. */
-		if (pte_dirty(pteval))
-			folio_mark_dirty(folio);
-
-		/*
-		 * Check that our target page is still mapped at the expected
-		 * address.
-		 */
-		if (args->mm == mm && args->address == address &&
-		    pte_write(pteval))
-			args->valid = true;
-
-		/*
-		 * Store the pfn of the page in a special migration
-		 * pte. do_swap_page() will wait until the migration
-		 * pte is removed and then restart fault handling.
-		 */
-		if (pte_write(pteval))
-			entry = make_writable_device_exclusive_entry(
-							page_to_pfn(subpage));
-		else
-			entry = make_readable_device_exclusive_entry(
-							page_to_pfn(subpage));
-		swp_pte = swp_entry_to_pte(entry);
-		if (pte_soft_dirty(pteval))
-			swp_pte = pte_swp_mksoft_dirty(swp_pte);
-		if (pte_uffd_wp(pteval))
-			swp_pte = pte_swp_mkuffd_wp(swp_pte);
-
-		set_pte_at(mm, address, pvmw.pte, swp_pte);
-
-		/*
-		 * There is a reference on the page for the swap entry which has
-		 * been removed, so shouldn't take another.
-		 */
-		folio_remove_rmap_pte(folio, subpage, vma);
-	}
-
-	mmu_notifier_invalidate_range_end(&range);
-
-	return ret;
-}
-
-/**
- * folio_make_device_exclusive - Mark the folio exclusively owned by a device.
- * @folio: The folio to replace page table entries for.
- * @mm: The mm_struct where the folio is expected to be mapped.
- * @address: Address where the folio is expected to be mapped.
- * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier callbacks
- *
- * Tries to remove all the page table entries which are mapping this
- * folio and replace them with special device exclusive swap entries to
- * grant a device exclusive access to the folio.
- *
- * Context: Caller must hold the folio lock.
- * Return: false if the page is still mapped, or if it could not be unmapped
- * from the expected address. Otherwise returns true (success).
- */
-static bool folio_make_device_exclusive(struct folio *folio,
-		struct mm_struct *mm, unsigned long address, void *owner)
-{
-	struct make_exclusive_args args = {
-		.mm = mm,
-		.address = address,
-		.owner = owner,
-		.valid = false,
-	};
-	struct rmap_walk_control rwc = {
-		.rmap_one = page_make_device_exclusive_one,
-		.done = folio_not_mapped,
-		.anon_lock = folio_lock_anon_vma_read,
-		.arg = &args,
-	};
-
-	rmap_walk(folio, &rwc);
-
-	return args.valid && !folio_mapcount(folio);
-}
-
 /**
  * make_device_exclusive() - Mark an address for exclusive use by a device
  * @mm: mm_struct of associated target process
@@ -2530,9 +2405,12 @@ static bool folio_make_device_exclusive(struct folio *folio,
 struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 		void *owner, struct folio **foliop)
 {
-	struct folio *folio;
+	struct folio *folio, *fw_folio;
+	struct vm_area_struct *vma;
+	struct folio_walk fw;
 	struct page *page;
-	long npages;
+	swp_entry_t entry;
+	pte_t swp_pte;
 
 	mmap_assert_locked(mm);
 
@@ -2540,12 +2418,16 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 	 * Fault in the page writable and try to lock it; note that if the
 	 * address would already be marked for exclusive use by the device,
 	 * the GUP call would undo that first by triggering a fault.
+	 *
+	 * If any other device would already map this page exclusively, the
+	 * fault will trigger a conversion to an ordinary
+	 * (non-device-exclusive) PTE and issue a MMU_NOTIFY_EXCLUSIVE.
 	 */
-	npages = get_user_pages_remote(mm, addr, 1,
-				       FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
-				       &page, NULL);
-	if (npages != 1)
-		return ERR_PTR(npages);
+	page = get_user_page_vma_remote(mm, addr,
+					FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
+					&vma);
+	if (IS_ERR(page))
+		return page;
 	folio = page_folio(page);
 
 	if (!folio_test_anon(folio) || folio_test_hugetlb(folio)) {
@@ -2558,11 +2440,51 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 		return ERR_PTR(-EBUSY);
 	}
 
-	if (!folio_make_device_exclusive(folio, mm, addr, owner)) {
+	/*
+	 * Let's do a second walk and make sure we still find the same page
+	 * mapped writable. If we don't find what we expect, we will trigger
+	 * GUP again to fix it up. Note that a page of an anonymous folio can
+	 * only be mapped writable using exactly one page table mapping
+	 * ("exclusive"), so there cannot be other mappings.
+	 */
+	fw_folio = folio_walk_start(&fw, vma, addr, 0);
+	if (fw_folio != folio || fw.page != page ||
+	    fw.level != FW_LEVEL_PTE || !pte_write(fw.pte)) {
+		if (fw_folio)
+			folio_walk_end(&fw, vma);
 		folio_unlock(folio);
 		folio_put(folio);
 		return ERR_PTR(-EBUSY);
 	}
+
+	/* Nuke the page table entry so we get the uptodate dirty bit. */
+	flush_cache_page(vma, addr, page_to_pfn(page));
+	fw.pte = ptep_clear_flush(vma, addr, fw.ptep);
+
+	/* Set the dirty flag on the folio now the pte is gone. */
+	if (pte_dirty(fw.pte))
+		folio_mark_dirty(folio);
+
+	/*
+	 * Store the pfn of the page in a special device-exclusive non-swap pte.
+	 * do_swap_page() will trigger the conversion back while holding the
+	 * folio lock.
+	 */
+	entry = make_writable_device_exclusive_entry(page_to_pfn(page));
+	swp_pte = swp_entry_to_pte(entry);
+	if (pte_soft_dirty(fw.pte))
+		swp_pte = pte_swp_mksoft_dirty(swp_pte);
+	/* The pte is writable, uffd-wp does not apply. */
+	set_pte_at(mm, addr, fw.ptep, swp_pte);
+
+	/*
+	 * TODO: The device-exclusive non-swap PTE holds a folio reference but
+	 * does not count as a mapping (mapcount), which is wrong and must be
+	 * fixed, otherwise RMAP walks don't behave as expected.
+	 */
+	folio_remove_rmap_pte(folio, page, vma);
+
+	folio_walk_end(&fw, vma);
 	*foliop = folio;
 	return page;
 }
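
For readers who want to reason about the new single-PTE flow outside the
kernel, here is a rough userspace toy model of the steps the patch performs
after the folio walk: re-check that the PTE is still the expected present,
writable mapping, clear it, transfer the dirty bit to the folio, and install
a writable device-exclusive entry that keeps soft-dirty. Every name in it
(toy_pte, toy_folio, the TOY_* flags, toy_make_device_exclusive()) is
hypothetical and invented for illustration; it is not kernel API, and it
ignores locking, TLB/cache flushing, rmap accounting and MMU notifiers.

```c
/* Userspace toy model only; none of these names exist in the kernel. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum {
	TOY_PRESENT    = 1u << 0,
	TOY_WRITE      = 1u << 1,
	TOY_DIRTY      = 1u << 2,
	TOY_SOFT_DIRTY = 1u << 3,
	TOY_DEV_EXCL   = 1u << 4,	/* non-present, writable device-exclusive */
};

struct toy_pte {
	uint64_t pfn;
	unsigned int flags;
};

struct toy_folio {
	uint64_t first_pfn;
	unsigned int nr_pages;
	bool dirty;
};

/*
 * Convert the single PTE at *ptep into a device-exclusive entry.
 * Returns false if the PTE is not the expected present, writable mapping
 * of @pfn inside @folio (the real code would return -EBUSY so the caller
 * can retry).
 */
static bool toy_make_device_exclusive(struct toy_folio *folio,
				      struct toy_pte *ptep, uint64_t pfn)
{
	struct toy_pte old;

	assert(folio && ptep);

	/* Second look: same page, still present and writable? */
	if (!(ptep->flags & TOY_PRESENT) || !(ptep->flags & TOY_WRITE) ||
	    ptep->pfn != pfn)
		return false;
	if (pfn < folio->first_pfn ||
	    pfn - folio->first_pfn >= folio->nr_pages)
		return false;

	/* Clear the entry first so the dirty bit we read is final. */
	old = *ptep;
	ptep->flags = 0;

	/* Transfer the dirty bit to the folio now that the PTE is gone. */
	if (old.flags & TOY_DIRTY)
		folio->dirty = true;

	/* Install the device-exclusive entry, preserving soft-dirty only. */
	ptep->pfn = pfn;
	ptep->flags = TOY_DEV_EXCL | (old.flags & TOY_SOFT_DIRTY);
	return true;
}
```

Only the one PTE passed in is touched; the other PTEs of the same (large)
toy folio stay as they are, which is the property the commit message
describes as working on a single PTE instead of all PTEs of a large folio.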