[v1,0/2] mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properly


Message ID

20240314161300.382526-1-david@redhat.com (mailing list archive)

Headers

From: David Hildenbrand <david@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org,
	David Hildenbrand <david@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Darrick J . Wong" <djwong@kernel.org>,
	John Hubbard <jhubbard@nvidia.com>,
	Jason Gunthorpe <jgg@nvidia.com>,
	Hugh Dickins <hughd@google.com>
Subject: [PATCH v1 0/2] mm/madvise: make MADV_POPULATE_(READ|WRITE) handle
 VM_FAULT_RETRY properly
Date: Thu, 14 Mar 2024 17:12:58 +0100
Message-ID: <20240314161300.382526-1-david@redhat.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Series

mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properly

Message

David Hildenbrand March 14, 2024, 4:12 p.m. UTC

Darrick reports that in some cases where pread() would fail with -EIO and
mmap()+access would generate a SIGBUS signal, MADV_POPULATE_READ /
MADV_POPULATE_WRITE will keep retrying forever and not fail with -EFAULT.

It all boils down to missing VM_FAULT_RETRY handling. Let's try to handle
that in a better way, similar to how ordinary GUP handles it.

Details in patch #1. In short, move special MADV_POPULATE_(READ|WRITE)
VMA handling into __get_user_pages(), and make faultin_page_range()
call __get_user_pages_locked(), which handles VM_FAULT_RETRY. Further,
avoid the now-useless madvise VMA walk, because __get_user_pages() will
perform the VMA lookup either way.
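
Why unhandled retries can turn into an endless loop can be illustrated
with a toy user-space model. This is not kernel code: every name below is
hypothetical, and the model assumes a fault handler that reports a hard
error only once the caller marks the attempt as a retry.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical outcome codes; they only play the roles of a "retry" and a
 * "hard error" fault result and are not kernel definitions.
 */
enum toy_fault { TOY_FAULT_RETRY, TOY_FAULT_ERROR };

/*
 * Toy fault handler for a page whose backing I/O always fails.
 * Assumption: the first attempt asks for a retry, and the hard error is
 * reported only once the caller flags the attempt as a retry.
 */
static enum toy_fault toy_handle_fault(bool tried)
{
	return tried ? TOY_FAULT_ERROR : TOY_FAULT_RETRY;
}

/*
 * Fault in one page, giving up after max_attempts. Returns the number of
 * attempts needed to reach a final result, or -1 if the loop was still
 * retrying when the cap was hit. With track_retry == false the caller
 * never tells the handler that it has already tried.
 */
static int toy_populate(bool track_retry, int max_attempts)
{
	bool tried = false;

	assert(max_attempts > 0);
	for (int attempt = 1; attempt <= max_attempts; attempt++) {
		if (toy_handle_fault(tried) != TOY_FAULT_RETRY)
			return attempt;
		if (track_retry)
			tried = true;
	}
	return -1;
}
```

In this model, toy_populate(true, 1000) reaches the error on the second
attempt and returns 2, while toy_populate(false, 1000) keeps getting the
retry result until the cap and returns -1. Without the cap, the second
caller would loop forever.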

I briefly played with handling the FOLL_MADV_POPULATE checks in
__get_user_pages() a bit differently, integrating them with existing
handling, but it ended up looking worse. So I decided to keep it simple.

Likely, we need better selftests, but the reproducer from Darrick might
be a bit hard to convert into a simple selftest.

Note that using mlock() in Darrick's reproducer results in a similar
endless retry. Likely, that is not what we want, and we should handle
VM_FAULT_RETRY in populate_vma_page_range() / __mm_populate() as well.
However, similarly using __get_user_pages_locked() might be more
complicated, because of the advanced VMA handling in
populate_vma_page_range().

Further, most populate_vma_page_range() callers simply ignore the return
values, so it's unclear in which cases we expect to just silently fail, or
where we'd want to retry+fail or endlessly retry instead.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Hugh Dickins <hughd@google.com>

David Hildenbrand (2):
  mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY
    properly
  mm/madvise: don't perform madvise VMA walk for
    MADV_POPULATE_(READ|WRITE)

 mm/gup.c      | 54 ++++++++++++++++++++++++++++++---------------------
 mm/internal.h | 10 ++++++----
 mm/madvise.c  | 43 +++++++++++++---------------------------
 3 files changed, 52 insertions(+), 55 deletions(-)


base-commit: f48159f866f422371bb1aad10eb4d05b29ca4d8c

Comments

Darrick J. Wong March 15, 2024, 2:25 a.m. UTC | #1

On Thu, Mar 14, 2024 at 05:12:58PM +0100, David Hildenbrand wrote:
> Darrick reports that in some cases where pread() would fail with -EIO and
> mmap()+access would generate a SIGBUS signal, MADV_POPULATE_READ /
> MADV_POPULATE_WRITE will keep retrying forever and not fail with -EFAULT.
> 
> It all boils down to missing VM_FAULT_RETRY handling. Let's try to handle
> that in a better way, similar to how ordinary GUP handles it.
> 
> Details in patch #1. In short, move special MADV_POPULATE_(READ|WRITE)
> VMA handling into __get_user_pages(), and make faultin_page_range()
> call __get_user_pages_locked(), which handles VM_FAULT_RETRY. Further,
> avoid the now-useless madvise VMA walk, because __get_user_pages() will
> perform the VMA lookup either way.
> 
> I briefly played with handling the FOLL_MADV_POPULATE checks in
> __get_user_pages() a bit differently, integrating them with existing
> handling, but it ended up looking worse. So I decided to keep it simple.
> 
> Likely, we need better selftests, but the reproducer from Darrick might
> be a bit hard to convert into a simple selftest.

No worries, I can convert my reproducer into an fstest.  I actually had
no idea that there were so many madvise flags; it's tempting to wire up
fsx and fsstress so that the long soak group tests will exercise them.

> Note that using mlock() in Darrick's reproducer results in a similar
> endless retry. Likely, that is not what we want, and we should handle
> VM_FAULT_RETRY in populate_vma_page_range() / __mm_populate() as well.
> However, similarly using __get_user_pages_locked() might be more
> complicated, because of the advanced VMA handling in
> populate_vma_page_range().
> 
> Further, most populate_vma_page_range() callers simply ignore the return
> values, so it's unclear in which cases we expect to just silently fail, or
> where we'd want to retry+fail or endlessly retry instead.

With this patchset applied, my reproducer no longer gets stuck in an
infinite loop.  I'll throw this at fstests overnight and see if anything
else falls out.  Thank you!

--D

> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Darrick J. Wong <djwong@kernel.org>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Hugh Dickins <hughd@google.com>
> 
> David Hildenbrand (2):
>   mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY
>     properly
>   mm/madvise: don't perform madvise VMA walk for
>     MADV_POPULATE_(READ|WRITE)
> 
>  mm/gup.c      | 54 ++++++++++++++++++++++++++++++---------------------
>  mm/internal.h | 10 ++++++----
>  mm/madvise.c  | 43 +++++++++++++---------------------------
>  3 files changed, 52 insertions(+), 55 deletions(-)
> 
> 
> base-commit: f48159f866f422371bb1aad10eb4d05b29ca4d8c
> -- 
> 2.43.2
>

Darrick J. Wong March 17, 2024, 4:50 p.m. UTC | #2

On Thu, Mar 14, 2024 at 05:12:58PM +0100, David Hildenbrand wrote:
> Darrick reports that in some cases where pread() would fail with -EIO and
> mmap()+access would generate a SIGBUS signal, MADV_POPULATE_READ /
> MADV_POPULATE_WRITE will keep retrying forever and not fail with -EFAULT.
> 
> It all boils down to missing VM_FAULT_RETRY handling. Let's try to handle
> that in a better way, similar to how ordinary GUP handles it.
> 
> Details in patch #1. In short, move special MADV_POPULATE_(READ|WRITE)
> VMA handling into __get_user_pages(), and make faultin_page_range()
> call __get_user_pages_locked(), which handles VM_FAULT_RETRY. Further,
> avoid the now-useless madvise VMA walk, because __get_user_pages() will
> perform the VMA lookup either way.
> 
> I briefly played with handling the FOLL_MADV_POPULATE checks in
> __get_user_pages() a bit differently, integrating them with existing
> handling, but it ended up looking worse. So I decided to keep it simple.
> 
> Likely, we need better selftests, but the reproducer from Darrick might
> be a bit hard to convert into a simple selftest.
> 
> Note that using mlock() in Darrick's reproducer results in a similar
> endless retry. Likely, that is not what we want, and we should handle
> VM_FAULT_RETRY in populate_vma_page_range() / __mm_populate() as well.
> However, similarly using __get_user_pages_locked() might be more
> complicated, because of the advanced VMA handling in
> populate_vma_page_range().
> 
> Further, most populate_vma_page_range() callers simply ignore the return
> values, so it's unclear in which cases we expect to just silently fail, or
> where we'd want to retry+fail or endlessly retry instead.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Darrick J. Wong <djwong@kernel.org>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Hugh Dickins <hughd@google.com>

After a few days I haven't seen any problems, so
Tested-by: Darrick J. Wong <djwong@kernel.org>

--D

> 
> David Hildenbrand (2):
>   mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY
>     properly
>   mm/madvise: don't perform madvise VMA walk for
>     MADV_POPULATE_(READ|WRITE)
> 
>  mm/gup.c      | 54 ++++++++++++++++++++++++++++++---------------------
>  mm/internal.h | 10 ++++++----
>  mm/madvise.c  | 43 +++++++++++++---------------------------
>  3 files changed, 52 insertions(+), 55 deletions(-)
> 
> 
> base-commit: f48159f866f422371bb1aad10eb4d05b29ca4d8c
> -- 
> 2.43.2
>