From patchwork Thu Mar 14 16:12:59 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: David Hildenbrand
X-Patchwork-Id: 13592598
From: David Hildenbrand
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, David Hildenbrand, Andrew Morton,
 "Darrick J. Wong", John Hubbard, Jason Gunthorpe, Hugh Dickins
Subject: [PATCH v1 1/2] mm/madvise: make MADV_POPULATE_(READ|WRITE) handle
 VM_FAULT_RETRY properly
Date: Thu, 14 Mar 2024 17:12:59 +0100
Message-ID: <20240314161300.382526-2-david@redhat.com>
In-Reply-To: <20240314161300.382526-1-david@redhat.com>
References: <20240314161300.382526-1-david@redhat.com>

Darrick reports that in some cases where pread() would fail with -EIO
and mmap()+access would generate a SIGBUS signal, MADV_POPULATE_READ /
MADV_POPULATE_WRITE will keep retrying forever and not fail with
-EFAULT.

While the madvise() call can be interrupted by a signal, this is not
the desired behavior. MADV_POPULATE_READ / MADV_POPULATE_WRITE should
behave like page faults in that case: fail and not retry forever. A
reproducer can be found at [1].

The reason is that __get_user_pages(), as called by
faultin_vma_page_range(), will not handle VM_FAULT_RETRY in a proper
way: it will simply return 0 when VM_FAULT_RETRY happened, making
madvise_populate()->faultin_vma_page_range() retry again and again,
never setting FOLL_TRIED->FAULT_FLAG_TRIED for __get_user_pages().

__get_user_pages_locked() does what we want, but duplicating that logic
in faultin_vma_page_range() feels wrong.
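To make the failure mode concrete, here is a minimal sketch of such a
trigger (not Darrick's reproducer from [1]; this variant provokes
SIGBUS-on-access by mapping past EOF rather than via filesystem -EIO,
and assumes a kernel and libc that expose MADV_POPULATE_READ):

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long psize = sysconf(_SC_PAGESIZE);
	int fd = open("populate-test", O_RDWR | O_CREAT | O_TRUNC, 0600);
	char *addr;

	if (fd < 0)
		err(1, "open");
	/* One page of file data; the second mapped page lies beyond EOF. */
	if (ftruncate(fd, psize))
		err(1, "ftruncate");
	addr = mmap(NULL, 2 * psize, PROT_READ, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED)
		err(1, "mmap");
	/*
	 * Accessing the page past EOF would raise SIGBUS, so populating
	 * it should fail with EFAULT instead of retrying forever.
	 */
	if (madvise(addr, 2 * psize, MADV_POPULATE_READ))
		perror("madvise");	/* expect EFAULT ("Bad address") */
	return 0;
}

Without the fix below, the madvise() call keeps retrying; with it, the
call returns -EFAULT as documented.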
So let's use __get_user_pages_locked() instead, which will detect
VM_FAULT_RETRY and set FOLL_TRIED when retrying, making the fault
handler return VM_FAULT_SIGBUS (VM_FAULT_ERROR) at some point,
propagating -EFAULT from faultin_page() to __get_user_pages(), all the
way to madvise_populate().

But there is an issue: __get_user_pages_locked() will end up re-taking
the MM lock and then __get_user_pages() will do another VMA lookup. In
the meantime, the VMA layout could have changed and we'd fail with
different error codes than we'd want to. As __get_user_pages() will
currently do a new VMA lookup either way, let it do the VMA handling in
a different way, controlled by a new FOLL_MADV_POPULATE flag,
effectively moving these checks from madvise_populate() +
faultin_page_range() in there.

With this change, Darrick's reproducer properly fails with -EFAULT, as
documented for MADV_POPULATE_READ / MADV_POPULATE_WRITE.

[1] https://lore.kernel.org/all/20240313171936.GN1927156@frogsfrogsfrogs/

Reported-by: Darrick J. Wong
Closes: https://lore.kernel.org/all/20240311223815.GW1927156@frogsfrogsfrogs/
Fixes: 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables")
Signed-off-by: David Hildenbrand
---
 mm/gup.c      | 54 ++++++++++++++++++++++++++++++---------------------
 mm/internal.h | 10 ++++++----
 mm/madvise.c  | 17 ++--------------
 3 files changed, 40 insertions(+), 41 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index df83182ec72d5..f6d55635742f5 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1206,6 +1206,22 @@ static long __get_user_pages(struct mm_struct *mm,
 
 		/* first iteration or cross vma bound */
 		if (!vma || start >= vma->vm_end) {
+			/*
+			 * MADV_POPULATE_(READ|WRITE) wants to handle VMA
+			 * lookups+error reporting differently.
+			 */
+			if (gup_flags & FOLL_MADV_POPULATE) {
+				vma = vma_lookup(mm, start);
+				if (!vma) {
+					ret = -ENOMEM;
+					goto out;
+				}
+				if (check_vma_flags(vma, gup_flags)) {
+					ret = -EINVAL;
+					goto out;
+				}
+				goto retry;
+			}
 			vma = gup_vma_lookup(mm, start);
 			if (!vma && in_gate_area(mm, start)) {
 				ret = get_gate_page(mm, start & PAGE_MASK,
@@ -1683,35 +1699,35 @@ long populate_vma_page_range(struct vm_area_struct *vma,
 }
 
 /*
- * faultin_vma_page_range() - populate (prefault) page tables inside the
- *			      given VMA range readable/writable
+ * faultin_page_range() - populate (prefault) page tables inside the
+ *			  given range readable/writable
  *
  * This takes care of mlocking the pages, too, if VM_LOCKED is set.
  *
- * @vma: target vma
+ * @mm: the mm to populate page tables in
  * @start: start address
  * @end: end address
  * @write: whether to prefault readable or writable
  * @locked: whether the mmap_lock is still held
  *
- * Returns either number of processed pages in the vma, or a negative error
- * code on error (see __get_user_pages()).
+ * Returns either number of processed pages in the MM, or a negative error
+ * code on error (see __get_user_pages()). Note that this function reports
+ * errors related to VMAs, such as incompatible mappings, as expected by
+ * MADV_POPULATE_(READ|WRITE).
  *
- * vma->vm_mm->mmap_lock must be held. The range must be page-aligned and
- * covered by the VMA. If it's released, *@locked will be set to 0.
+ * The range must be page-aligned.
+ *
+ * mm->mmap_lock must be held. If it's released, *@locked will be set to 0.
  */
-long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start,
-			    unsigned long end, bool write, int *locked)
+long faultin_page_range(struct mm_struct *mm, unsigned long start,
+			unsigned long end, bool write, int *locked)
 {
-	struct mm_struct *mm = vma->vm_mm;
 	unsigned long nr_pages = (end - start) / PAGE_SIZE;
 	int gup_flags;
 	long ret;
 
 	VM_BUG_ON(!PAGE_ALIGNED(start));
 	VM_BUG_ON(!PAGE_ALIGNED(end));
-	VM_BUG_ON_VMA(start < vma->vm_start, vma);
-	VM_BUG_ON_VMA(end > vma->vm_end, vma);
 	mmap_assert_locked(mm);
 
 	/*
@@ -1723,19 +1739,13 @@ long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start,
 	 *		  a poisoned page.
 	 * !FOLL_FORCE: Require proper access permissions.
 	 */
-	gup_flags = FOLL_TOUCH | FOLL_HWPOISON | FOLL_UNLOCKABLE;
+	gup_flags = FOLL_TOUCH | FOLL_HWPOISON | FOLL_UNLOCKABLE |
+		    FOLL_MADV_POPULATE;
 	if (write)
 		gup_flags |= FOLL_WRITE;
 
-	/*
-	 * We want to report -EINVAL instead of -EFAULT for any permission
-	 * problems or incompatible mappings.
-	 */
-	if (check_vma_flags(vma, gup_flags))
-		return -EINVAL;
-
-	ret = __get_user_pages(mm, start, nr_pages, gup_flags,
-			       NULL, locked);
+	ret = __get_user_pages_locked(mm, start, nr_pages, NULL, locked,
+				      gup_flags);
 	lru_add_drain();
 	return ret;
 }
diff --git a/mm/internal.h b/mm/internal.h
index d1c69119b24fb..a57dd5156cf84 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -686,9 +686,8 @@ struct anon_vma *folio_anon_vma(struct folio *folio);
 void unmap_mapping_folio(struct folio *folio);
 extern long populate_vma_page_range(struct vm_area_struct *vma,
 		unsigned long start, unsigned long end, int *locked);
-extern long faultin_vma_page_range(struct vm_area_struct *vma,
-				   unsigned long start, unsigned long end,
-				   bool write, int *locked);
+extern long faultin_page_range(struct mm_struct *mm, unsigned long start,
+			       unsigned long end, bool write, int *locked);
 extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
 			       unsigned long bytes);
 
@@ -1127,10 +1126,13 @@ enum {
 	FOLL_FAST_ONLY = 1 << 20,
 	/* allow unlocking the mmap lock */
 	FOLL_UNLOCKABLE = 1 << 21,
+	/* VMA lookup+checks compatible with MADV_POPULATE_(READ|WRITE) */
+	FOLL_MADV_POPULATE = 1 << 22,
 };
 
 #define INTERNAL_GUP_FLAGS (FOLL_TOUCH | FOLL_TRIED | FOLL_REMOTE | FOLL_PIN | \
-			    FOLL_FAST_ONLY | FOLL_UNLOCKABLE)
+			    FOLL_FAST_ONLY | FOLL_UNLOCKABLE | \
+			    FOLL_MADV_POPULATE)
 
 /*
  * Indicates for which pages that are write-protected in the page table,
diff --git a/mm/madvise.c b/mm/madvise.c
index 44a498c94158c..1a073fcc4c0c0 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -908,27 +908,14 @@ static long madvise_populate(struct vm_area_struct *vma,
 {
 	const bool write = behavior == MADV_POPULATE_WRITE;
 	struct mm_struct *mm = vma->vm_mm;
-	unsigned long tmp_end;
 	int locked = 1;
 	long pages;
 
 	*prev = vma;
 
 	while (start < end) {
-		/*
-		 * We might have temporarily dropped the lock. For example,
-		 * our VMA might have been split.
-		 */
-		if (!vma || start >= vma->vm_end) {
-			vma = vma_lookup(mm, start);
-			if (!vma)
-				return -ENOMEM;
-		}
-
-		tmp_end = min_t(unsigned long, end, vma->vm_end);
 		/* Populate (prefault) page tables readable/writable. */
-		pages = faultin_vma_page_range(vma, start, tmp_end, write,
-					       &locked);
+		pages = faultin_page_range(mm, start, end, write, &locked);
 		if (!locked) {
 			mmap_read_lock(mm);
 			locked = 1;
@@ -949,7 +936,7 @@ static long madvise_populate(struct vm_area_struct *vma,
 			pr_warn_once("%s: unhandled return value: %ld\n",
 				     __func__, pages);
 			fallthrough;
-		case -ENOMEM:
+		case -ENOMEM: /* No VMA or out of memory. */
 			return -ENOMEM;
 		}
 	}
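
For context, the preallocation use case MADV_POPULATE_(READ|WRITE) was
introduced for (see the Fixes: commit) looks roughly like the sketch
below; names and sizes are illustrative, and it assumes a kernel and
libc that expose MADV_POPULATE_WRITE. With the fix above, a range that
cannot be populated reports EFAULT instead of retrying indefinitely,
while ENOMEM still means no VMA or out of memory:

#include <err.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 16UL << 20;	/* 16 MiB, illustrative */
	/* Anonymous region whose page tables we want prefaulted writable. */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		err(1, "mmap");
	/*
	 * Populate all pages up front, as a VM monitor would for guest
	 * memory, so later accesses cannot fault.
	 */
	if (madvise(p, len, MADV_POPULATE_WRITE))
		err(1, "madvise(MADV_POPULATE_WRITE)");
	puts("prefaulted");
	return 0;
}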