Message ID: 20200623052059.1893966-1-david@fromorbit.com (mailing list archive)
State: Accepted
Series: xfs: use MMAPLOCK around filemap_map_pages()
On Tue, Jun 23, 2020 at 8:21 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> The page faultround path ->map_pages is implemented in XFS via
> filemap_map_pages(). This function checks that pages found in page
> cache lookups have not raced with truncate based invalidation by
> checking page->mapping is correct and page->index is within EOF.
>
> However, we've known for a long time that this is not sufficient to
> protect against races with invalidations done by operations that do
> not change EOF. e.g. hole punching and other fallocate() based
> direct extent manipulations. The way we protect against these
> races is we wrap the page fault operations in a XFS_MMAPLOCK_SHARED
> lock so they serialise against fallocate and truncate before calling
> into the filemap function that processes the fault.
>
> Do the same for XFS's ->map_pages implementation to close this
> potential data corruption issue.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

I wonder... should xfs_file_fadvise(POSIX_FADV_WILLNEED) also be taking
XFS_MMAPLOCK_SHARED instead of XFS_IOLOCK_SHARED?

Not that it matters that much?

Thanks,
Amir.
On Tue, Jun 23, 2020 at 11:54:39AM +0300, Amir Goldstein wrote:
> On Tue, Jun 23, 2020 at 8:21 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > [...]
> >
> > Do the same for XFS's ->map_pages implementation to close this
> > potential data corruption issue.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
>
> Reviewed-by: Amir Goldstein <amir73il@gmail.com>
>
> I wonder... should xfs_file_fadvise(POSIX_FADV_WILLNEED) also be taking
> XFS_MMAPLOCK_SHARED instead of XFS_IOLOCK_SHARED?

No. The MMAPLOCK is only to be used in the page fault path because we
can't use the IOLOCK in that path or we will deadlock. i.e. MMAPLOCK is
exclusively for IO path locking in page fault contexts, IOLOCK is
exclusively for IO path locking in syscall and kernel task contexts.

> Not that it matters that much?

Using the right lock for the IO context actually matters a lot :)

Cheers,
Dave.
On Tue, Jun 23, 2020 at 03:20:59PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> The page faultround path ->map_pages is implemented in XFS via
> filemap_map_pages(). This function checks that pages found in page
> cache lookups have not raced with truncate based invalidation by
> checking page->mapping is correct and page->index is within EOF.
>
> However, we've known for a long time that this is not sufficient to
> protect against races with invalidations done by operations that do
> not change EOF. e.g. hole punching and other fallocate() based
> direct extent manipulations. The way we protect against these
> races is we wrap the page fault operations in a XFS_MMAPLOCK_SHARED
> lock so they serialise against fallocate and truncate before calling
> into the filemap function that processes the fault.
>
> Do the same for XFS's ->map_pages implementation to close this
> potential data corruption issue.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Looks reasonable:

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_file.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 7b05f8fd7b3d..4b185a907432 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1266,10 +1266,23 @@ xfs_filemap_pfn_mkwrite(
>  	return __xfs_filemap_fault(vmf, PE_SIZE_PTE, true);
>  }
>
> +static void
> +xfs_filemap_map_pages(
> +	struct vm_fault		*vmf,
> +	pgoff_t			start_pgoff,
> +	pgoff_t			end_pgoff)
> +{
> +	struct inode		*inode = file_inode(vmf->vma->vm_file);
> +
> +	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> +	filemap_map_pages(vmf, start_pgoff, end_pgoff);
> +	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> +}
> +
>  static const struct vm_operations_struct xfs_file_vm_ops = {
>  	.fault		= xfs_filemap_fault,
>  	.huge_fault	= xfs_filemap_huge_fault,
> -	.map_pages	= filemap_map_pages,
> +	.map_pages	= xfs_filemap_map_pages,
>  	.page_mkwrite	= xfs_filemap_page_mkwrite,
>  	.pfn_mkwrite	= xfs_filemap_pfn_mkwrite,
>  };
> --
> 2.26.2.761.g0e0b3e54be
>
On Tue, Jun 23, 2020 at 03:20:59PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> The page faultround path ->map_pages is implemented in XFS via

What does "faultround" mean? I'm pretty convinced that this is merely
another round of whackamole wrt taking the MMAPLOCK before relying on or
doing anything to pages in the page cache, I just can't tell if
'faultround' is jargon or typo.

--D

> filemap_map_pages(). This function checks that pages found in page
> cache lookups have not raced with truncate based invalidation by
> checking page->mapping is correct and page->index is within EOF.
>
> However, we've known for a long time that this is not sufficient to
> protect against races with invalidations done by operations that do
> not change EOF. e.g. hole punching and other fallocate() based
> direct extent manipulations. The way we protect against these
> races is we wrap the page fault operations in a XFS_MMAPLOCK_SHARED
> lock so they serialise against fallocate and truncate before calling
> into the filemap function that processes the fault.
>
> Do the same for XFS's ->map_pages implementation to close this
> potential data corruption issue.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> [...]
On Tue, Jun 23, 2020 at 02:19:10PM -0700, Darrick J. Wong wrote:
> On Tue, Jun 23, 2020 at 03:20:59PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > The page faultround path ->map_pages is implemented in XFS via
>
> What does "faultround" mean?

Typo - fault-around.

i.e. when we take a read page fault, the do_read_fault() code first
opportunistically tries to map a range of pages surrounding the
faulted page into the PTEs, not just the faulted page. It uses
->map_pages() to do the page cache lookups for cached pages in the
range of the fault-around and then inserts them into the PTEs if it
finds any.

If the fault-around pass did not find the page fault page in cache
(i.e. vmf->page remains null) then it calls into do_fault(), which
ends up calling ->fault, where we then lock the MMAPLOCK and call
into filemap_fault() to populate the page cache and read the data
into it.

So, essentially, fault-around is a mechanism to reduce page faults
in the situation where previous readahead has brought adjacent pages
into the page cache, by optimistically mapping up to
fault_around_bytes into PTEs on any given read page fault.

> I'm pretty convinced that this is merely another round of whackamole wrt
> taking the MMAPLOCK before relying on or doing anything to pages in the
> page cache, I just can't tell if 'faultround' is jargon or typo.

Well, it's whack-a-mole in that this is the first time I've actually
looked at what map_pages does. I knew there was fault-around in the
page fault path, but I didn't know that it had its own method for
page cache lookups and pte mapping, nor that it had its own
truncate race checks to ensure it didn't map pages invalidated by
truncate into the PTEs.

There's so much technical debt hidden in the kernel code base. The
fact we're still finding places that assume only truncate can
invalidate the page cache over a file range indicates just how deep
this debt runs...

-Dave.
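The control flow Dave describes can be sketched roughly as follows (hypothetical stand-in types and names; the real logic lives in mm/memory.c's do_read_fault()):

```c
#include <stddef.h>

/* Simplified sketch of the fault-around flow described above.
 * These are hypothetical stand-ins, not the kernel's types. */
struct vmf_sketch { void *page; };

/* ->map_pages(): only maps pages that are already present and uptodate
 * in the page cache into PTEs; performs no IO and no allocation. */
static void map_pages_around(struct vmf_sketch *vmf, int page_cached)
{
    if (page_cached)
        vmf->page = (void *)1;  /* pretend the faulted page was cached */
}

/* ->fault(): the slow path -- XFS takes MMAPLOCK_SHARED here and calls
 * filemap_fault() to populate the cache and read the data in. */
static void full_fault(struct vmf_sketch *vmf)
{
    vmf->page = (void *)1;
}

static int do_read_fault_sketch(int page_cached)
{
    struct vmf_sketch vmf = { NULL };

    map_pages_around(&vmf, page_cached);  /* opportunistic fault-around */
    if (vmf.page == NULL)                 /* fault page not in cache */
        full_fault(&vmf);                 /* fall back to ->fault */
    return vmf.page != NULL;
}
```

Either way a page ends up mapped; the point of the patch is that only the slow path took the MMAPLOCK, while the fault-around path did not.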
On Wed, Jun 24, 2020 at 08:14:31AM +1000, Dave Chinner wrote:
> On Tue, Jun 23, 2020 at 02:19:10PM -0700, Darrick J. Wong wrote:
> > On Tue, Jun 23, 2020 at 03:20:59PM +1000, Dave Chinner wrote:
> > > The page faultround path ->map_pages is implemented in XFS via
> >
> > What does "faultround" mean?
>
> Typo - fault-around.
>
> [...]
>
> So, essentially, fault-around is a mechanism to reduce page faults
> in the situation where previous readahead has brought adjacent pages
> into the page cache by optimistically mapping up to
> fault_around_bytes into PTEs on any given read page fault.

<nod> Thanks for the explanation.

/me wonders if someone could please check all the *_ops that point to
generic helpers to see if we're missing obvious things like lock
taking. Particularly someone who wants to learn about xfs' locking
strategy; I promise it won't let out a ton of bees.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> There's so much technical debt hidden in the kernel code base. The
> fact we're still finding places that assume only truncate can
> invalidate the page cache over a file range indicates just how deep
> this debt runs...
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
> /me wonders if someone could please check all the *_ops that point to
> generic helpers to see if we're missing obvious things like lock
> taking. Particularly someone who wants to learn about xfs' locking
> strategy; I promise it won't let out a ton of bees.

The list was compiled manually by auditing 'git grep '_operations.*=' fs/xfs'
structs for non xfs_/iomap_/noop_ functions. I am not sure if all iomap_
functions are safe in that respect, but I suppose those were done recently
with sufficient xfs developer review...

fs/xfs/xfs_aops.c:const struct address_space_operations xfs_address_space_operations = {
	.error_remove_page = generic_error_remove_page,

generic_error_remove_page() calls truncate_inode_page() without MMAPLOCK.
Is that safe? Not sure.

fs/xfs/xfs_file.c:static const struct vm_operations_struct xfs_file_vm_ops = {
	.map_pages = filemap_map_pages,

Fixed by $SUBJECT.

fs/xfs/xfs_file.c:const struct file_operations xfs_file_operations = {
	.splice_read = generic_file_splice_read,

Will call xfs_file_read_iter, so looks fine.

	.splice_write = iter_file_splice_write,

Will call xfs_file_write_iter, so looks fine.

	.get_unmapped_area = thp_get_unmapped_area,

Looks fine?

fs/xfs/xfs_file.c:const struct file_operations xfs_dir_file_operations = {
	.read = generic_read_dir,
	.llseek = generic_file_llseek,

No page cache, no dio, no worries?

Thanks,
Amir.
On Tue, Jun 30, 2020 at 06:23:12PM +0300, Amir Goldstein wrote:
> > /me wonders if someone could please check all the *_ops that point to
> > generic helpers to see if we're missing obvious things like lock
> > taking. Particularly someone who wants to learn about xfs' locking
> > strategy; I promise it won't let out a ton of bees.
>
> The list was compiled manually by auditing 'git grep '_operations.*=' fs/xfs'
> structs for non xfs_/iomap_/noop_ functions.
> I am not sure if all iomap_ functions are safe in that respect, but I suppose
> those were done recently with sufficient xfs developers review...

The iomap functions shouldn't be taking/releasing any locks at all; it's
up to the filesystem to provide the concurrency controls.

> fs/xfs/xfs_aops.c:const struct address_space_operations
> xfs_address_space_operations = {
> 	.error_remove_page = generic_error_remove_page,
>
> generic_error_remove_page() calls truncate_inode_page() without MMAPLOCK
> Is that safe? not sure

/me has a funny feeling it isn't, since this does the same thing to the
pagecache as a holepunch.

> fs/xfs/xfs_file.c:static const struct vm_operations_struct xfs_file_vm_ops = {
> 	.map_pages = filemap_map_pages,
>
> Fixed by $SUBJECT
>
> fs/xfs/xfs_file.c:const struct file_operations xfs_file_operations = {
> 	.splice_read = generic_file_splice_read,
>
> Will call xfs_file_read_iter, so looks fine
>
> 	.splice_write = iter_file_splice_write,
>
> Will call xfs_file_write_iter, so looks fine
>
> 	.get_unmapped_area = thp_get_unmapped_area,
>
> Looks fine?
>
> fs/xfs/xfs_file.c:const struct file_operations xfs_dir_file_operations = {
> 	.read = generic_read_dir,
> 	.llseek = generic_file_llseek,
>
> No page cache, no dio, no worries?

Right.

--D

> Thanks,
> Amir.
On Tue, Jun 23, 2020 at 03:20:59PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> [...]
>
> Do the same for XFS's ->map_pages implementation to close this
> potential data corruption issue.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks ok, notwithstanding the (semirelated) questions that Amir
unearthed elsewhere in this thread...

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
> [...]
On Tue, Jun 30, 2020 at 11:26:45AM -0700, Darrick J. Wong wrote:
> On Tue, Jun 30, 2020 at 06:23:12PM +0300, Amir Goldstein wrote:
> > [...]
> >
> > fs/xfs/xfs_aops.c:const struct address_space_operations
> > xfs_address_space_operations = {
> > 	.error_remove_page = generic_error_remove_page,
> >
> > generic_error_remove_page() calls truncate_inode_page() without MMAPLOCK
> > Is that safe? not sure
>
> /me has a funny feeling it isn't, since this does the same thing to the
> pagecache as a holepunch.

And I really can't tell, because this comes from the convoluted
hardware memory error path. We know that path is completely screwed
up w.r.t. memory errors in pmem and DAX filesystems. Hence I think
there's a good chance it's completely screwed up for cached
file-backed page cache pages, too.

Indeed, look at the comment in me_pagecache_clean():

	/*
	 * Truncation is a bit tricky. Enable it per file system for now.
	 *
	 * Open: to take i_mutex or not for this? Right now we don't.
	 */
	return truncate_error_page(p, pfn, mapping);

the call path is:

	me_pagecache_clean()
	  truncate_error_page()
	    ->error_remove_page()

IOWs, the authors of this code did not know what to do, and like the
DAX failure stuff, merged the code without resolving the fundamental
issues around interfacing with filesystem owned pages directly...

I don't really have the time to look at it in any more depth right
now. I also suspect the memory failure code is a path we simply
cannot exercise in any useful manner, so it's unlikely that we'll
ever be able to tell if this stuff works correctly or not....

Cheers,
-Dave.
On Tue, Jun 23, 2020 at 8:21 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> [...]
>
> Do the same for XFS's ->map_pages implementation to close this
> potential data corruption issue.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> [...]

It appears that ext4, f2fs, gfs2, orangefs, zonefs also need this fix.

zonefs does not support hole punching, so it may not need to use
mmap_sem at all.

It is interesting to look at how this bug came to be duplicated in so
many filesystems, because there are lessons to be learned.

Commit f1820361f83d ("mm: implement ->map_pages for page cache")
added the ->map_pages() operation and its commit message said:

"...It should be safe to use filemap_map_pages() for ->map_pages() if
filesystem use filemap_fault() for ->fault()."

At the time, all of the aforementioned filesystems used filemap_fault()
for ->fault().

But since then, ext4, xfs, f2fs and just recently gfs2 have added a
filesystem ->fault() operation.

orangefs has added vm_operations since, and zonefs was added since,
probably copying the mmap_sem handling from ext4. Both have a
filesystem ->fault() operation.

It was surprising for me to see that some of the filesystem developers
who signed off on the added ->fault() operations are not strangers to
mm. The recent gfs2 change was even reviewed by an established mm
developer [1].

So what can we learn from this case study? How could we fix the
interface to avoid repeating the same mistake in the future?

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/20200703113801.GD25523@casper.infradead.org/
On Sat 12-09-20 09:19:11, Amir Goldstein wrote:
> On Tue, Jun 23, 2020 at 8:21 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > [...]
> >
> > Do the same for XFS's ->map_pages implementation to close this
> > potential data corruption issue.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > [...]
>
> It appears that ext4, f2fs, gfs2, orangefs, zonefs also need this fix
>
> zonefs does not support hole punching, so it may not need to use
> mmap_sem at all.
>
> It is interesting to look at how this bug came to be duplicated in so
> many filesystems, because there are lessons to be learned.
>
> Commit f1820361f83d ("mm: implement ->map_pages for page cache")
> added the ->map_pages() operation and its commit message said:
>
> "...It should be safe to use filemap_map_pages() for ->map_pages() if
> filesystem use filemap_fault() for ->fault()."
>
> At the time, all of the aforementioned filesystems used filemap_fault()
> for ->fault().
>
> But since then, ext4, xfs, f2fs and just recently gfs2 have added a
> filesystem ->fault() operation.
> orangefs has added vm_operations since, and zonefs was added since,
> probably copying the mmap_sem handling from ext4. Both have a filesystem
> ->fault() operation.

A standard pattern of copying a bug from one place into many. Sadly
it's happening all the time for stuff that's complex enough that only a
few people (if anybody) are carrying all the details in their head.

> It was surprising for me to see that some of the filesystem developers
> signed on the added ->fault() operations are not strangers to mm. The
> recent gfs2 change was even reviewed by an established mm developer
> [1].

Well, people do miss things... And this stuff is a twisted maze, so it
is easy to miss something even for an experienced developer.

> So what can we learn from this case study? How could we fix the interface to
> avoid repeating the same mistake in the future?

IMO the serialization between the page cache and various fs operations
is just too complex, with too many special corner cases. But that's
difficult to change while keeping all the features and performance. So
the best realistic answer I have (and this is not meant to discourage
anybody from trying to implement a simpler scheme of page-cache -
filesystem interaction :) is that we should have added an fstest when
the XFS fix landed, which would then hopefully catch the attention of
other fs maintainers (at least those that do run fstests).

Honza
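A regression test along the lines Jan suggests would race a hole punch against faults on a mapping of the same range. A minimal userspace sketch of the two ingredients (the file path and sizes are arbitrary; a real fstest would run the punch and the faulting reads concurrently in a loop, and would run on the filesystem under test rather than wherever the path lands):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef FALLOC_FL_KEEP_SIZE
#define FALLOC_FL_KEEP_SIZE  0x01
#define FALLOC_FL_PUNCH_HOLE 0x02
#endif

/* Explicit prototype so the sketch builds without _GNU_SOURCE. */
extern int fallocate(int fd, int mode, off_t offset, off_t len);

/* Skeleton of the reproducer idea: punch a hole under an active mmap.
 * This only shows the two operations the test needs; the actual race
 * requires running them concurrently. */
static int run_punch_vs_mmap(const char *path)
{
    char buf[4096];
    int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);
    if (fd < 0)
        return 1;

    memset(buf, 'x', sizeof(buf));
    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
        close(fd);
        unlink(path);
        return 1;
    }

    char *map = mmap(NULL, sizeof(buf), PROT_READ, MAP_SHARED, fd, 0);
    if (map != MAP_FAILED) {
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      0, (off_t)sizeof(buf)) == 0)
            /* the fault this read triggers must never land on a page
             * the punch has already freed; a correct filesystem reads
             * back zeroes from the hole */
            printf("first byte after punch: %d\n", map[0]);
        else
            puts("fallocate(PUNCH_HOLE) unsupported here, skipping");
        munmap(map, sizeof(buf));
    }
    close(fd);
    unlink(path);
    return 0;
}
```

The corruption window is timing-dependent, so a real test would hammer both sides from separate threads and check that the mapping never exposes stale or freed page contents.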
Could the xfs mmap lock documentation please be cleaned up? For example,
the xfs_ilock description says:

>  * In addition to i_rwsem in the VFS inode, the xfs inode contains 2
>  * multi-reader locks: i_mmap_lock and the i_lock. This routine allows
>  * various combinations of the locks to be obtained.

The field in struct xfs_inode is called i_mmaplock though, not
i_mmap_lock. In addition, struct inode has an i_mmap_rwsem field which
is also referred to as i_mmap_lock. As if that isn't irritating enough.

Thanks,
Andreas
On Sat 12-09-20 09:19:11, Amir Goldstein wrote:
> On Tue, Jun 23, 2020 at 8:21 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > [...]
> >
> > Do the same for XFS's ->map_pages implementation to close this
> > potential data corruption issue.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > [...]
>
> It appears that ext4, f2fs, gfs2, orangefs, zonefs also need this fix
>
> zonefs does not support hole punching, so it may not need to use
> mmap_sem at all.

So I've written an ext4 fix for this, but before actually posting it,
Nikolay, working on a btrfs fix, asked why exactly filemap_map_pages()
is actually a problem, and I think he's right that it actually isn't a
problem. The thing is: filemap_map_pages() never does any page mapping
or IO. It only creates PTEs for uptodate pages that are already in the
page cache. As such it is a rather different beast compared to the
fault handler from the fs POV and does not need protection from hole
punching (the current serialization on the page lock and checking of
page->mapping is enough).

That being said, I agree this is subtle, and the moment someone adds
e.g. a readahead call into filemap_map_pages() we have a real problem.
I'm not sure how to prevent this risk...

Honza
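Jan's argument rests on the per-page validation that the fault-around path performs; very roughly (a simplified sketch with hypothetical types, not the kernel's code):

```c
#include <stddef.h>

/* Hypothetical stand-in for a page-cache page; not the kernel's type. */
struct page_sk {
    int   locked;    /* non-zero: someone else holds the page lock */
    void *mapping;   /* NULL once truncate/hole-punch invalidates it */
    int   uptodate;
};

/* Map the page into a PTE only if it is still attached to this file's
 * mapping and uptodate.  Pages already invalidated by truncate or hole
 * punch have page->mapping cleared under the page lock, so they are
 * skipped.  Note, however, that nothing here prevents the page from
 * being invalidated and freed right after the check, which is exactly
 * the window debated in the rest of the thread. */
static int try_map_one(struct page_sk *page, void *file_mapping)
{
    if (page->locked)          /* trylock failed: just skip this page */
        return 0;
    page->locked = 1;

    int ok = page->mapping == file_mapping && page->uptodate;

    page->locked = 0;
    return ok;                 /* 1: create the PTE, 0: skip */
}
```

The check-then-map sequence is what makes the question subtle: the validation is sound for pages that were invalidated before the lookup, and the dispute is only about pages invalidated between the check and the subsequent PTE zap.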
On Wed, Sep 16, 2020 at 05:58:51PM +0200, Jan Kara wrote:
> On Sat 12-09-20 09:19:11, Amir Goldstein wrote:
> > [...]
> >
> > It appears that ext4, f2fs, gfs2, orangefs, zonefs also need this fix
> >
> > zonefs does not support hole punching, so it may not need to use
> > mmap_sem at all.
>
> So I've written an ext4 fix for this but before actually posting it Nikolay
> working on btrfs fix asked why exactly is filemap_map_pages() actually a
> problem and I think he's right it actually isn't a problem. The thing is:
> filemap_map_pages() never does any page mapping or IO. It only creates PTEs
> for uptodate pages that are already in page cache.

So....
P0 p1 hole punch starts takes XFS_MMAPLOCK_EXCL truncate_pagecache_range() unmap_mapping_range(start, end) <clears ptes> <read fault> do_fault_around() ->map_pages filemap_map_pages() page mapping valid, page is up to date maps PTEs <fault done> truncate_inode_pages_range() truncate_cleanup_page(page) invalidates page delete_from_page_cache_batch(page) frees page <pte now points to a freed page> That doesn't seem good to me. Sure, maybe the page hasn't been freed back to the free lists because of elevated refcounts. But it's been released by the filesystem and not longer in the page cache so nothing good can come of this situation... AFAICT, this race condition exists for the truncate case as well as filemap_map_pages() doesn't have a "page beyond inode size" check in it. However, exposure here is very limited in the truncate case because truncate_setsize()->truncate_pagecache() zaps the PTEs again after invalidating the page cache. Either way, adding the XFS_MMAPLOCK_SHARED around filemap_map_pages() avoids the race condition for fallocate and truncate operations for XFS... > As such it is a rather > different beast compared to the fault handler from fs POV and does not need > protection from hole punching (current serialization on page lock and > checking of page->mapping is enough). > That being said I agree this is subtle and the moment someone adds e.g. a > readahead call into filemap_map_pages() we have a real problem. I'm not > sure how to prevent this risk... Subtle, yes. So subtle, in fact, I fail to see any reason why the above race cannot occur as there's no obvious serialisation in the page cache between PTE zapping and page invalidation to prevent a fault from coming in an re-establishing the PTEs before invalidation occurs. Cheers, Dave.
On Thu, 17 Sep 2020, Dave Chinner wrote:
>
> So....
>
> P0					p1
>
> hole punch starts
> takes XFS_MMAPLOCK_EXCL
> truncate_pagecache_range()
>   unmap_mapping_range(start, end)
>   <clears ptes>
>					<read fault>
>					do_fault_around()
>					->map_pages
>					filemap_map_pages()
>					  page mapping valid,
>					  page is up to date
>					  maps PTEs
>					<fault done>
>   truncate_inode_pages_range()
>     truncate_cleanup_page(page)
>       invalidates page
>     delete_from_page_cache_batch(page)
>       frees page
>					<pte now points to a freed page>

No. filemap_map_pages() checks page->mapping after trylock_page(),
before setting up the pte; and truncate_cleanup_page() does a one-page
unmap_mapping_range() if page_mapped(), while holding page lock.

(Of course, there's a different thread, in which less reliance on page
lock is being discussed, but that would be a future thing.)

Hugh
On Thu, Sep 17, 2020 at 11:44:54AM +1000, Dave Chinner wrote:
> So....
>
> P0					p1
>
> hole punch starts
> takes XFS_MMAPLOCK_EXCL
> truncate_pagecache_range()

... locks page ...

>   unmap_mapping_range(start, end)
>   <clears ptes>
>					<read fault>
>					do_fault_around()
>					->map_pages
>					filemap_map_pages()

... trylock page fails ...

>					  page mapping valid,
>					  page is up to date
>					  maps PTEs
>					<fault done>
>   truncate_inode_pages_range()
>     truncate_cleanup_page(page)
>       invalidates page
>     delete_from_page_cache_batch(page)
>       frees page
>					<pte now points to a freed page>
>
> That doesn't seem good to me.
>
> Sure, maybe the page hasn't been freed back to the free lists
> because of elevated refcounts. But it's been released by the
> filesystem and is no longer in the page cache, so nothing good can
> come of this situation...
>
> AFAICT, this race condition exists for the truncate case as well,
> as filemap_map_pages() doesn't have a "page beyond inode size" check
> in it. However, exposure here is very limited in the truncate case
> because truncate_setsize()->truncate_pagecache() zaps the PTEs
> again after invalidating the page cache.
>
> Either way, adding the XFS_MMAPLOCK_SHARED around
> filemap_map_pages() avoids the race condition for fallocate and
> truncate operations for XFS...
>
> > As such it is a rather different beast compared to the fault handler
> > from the fs POV and does not need protection from hole punching (the
> > current serialization on the page lock and checking of page->mapping
> > is enough).
> > That being said, I agree this is subtle, and the moment someone adds
> > e.g. a readahead call into filemap_map_pages() we have a real problem.
> > I'm not sure how to prevent this risk...
>
> Subtle, yes. So subtle, in fact, I fail to see any reason why the
> above race cannot occur, as there's no obvious serialisation in the
> page cache between PTE zapping and page invalidation to prevent a
> fault from coming in and re-establishing the PTEs before invalidation
> occurs.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
On 17.09.20 г. 4:44 ч., Dave Chinner wrote:
> On Wed, Sep 16, 2020 at 05:58:51PM +0200, Jan Kara wrote:
>> On Sat 12-09-20 09:19:11, Amir Goldstein wrote:
>>> On Tue, Jun 23, 2020 at 8:21 AM Dave Chinner <david@fromorbit.com> wrote:

<snip>

>
> So....
>
> P0					p1
>
> hole punch starts
> takes XFS_MMAPLOCK_EXCL
> truncate_pagecache_range()
>   unmap_mapping_range(start, end)
>   <clears ptes>
>					<read fault>
>					do_fault_around()
>					->map_pages
>					filemap_map_pages()
>					  page mapping valid,
>					  page is up to date
>					  maps PTEs
>					<fault done>
>   truncate_inode_pages_range()
>     truncate_cleanup_page(page)
>       invalidates page
>     delete_from_page_cache_batch(page)
>       frees page
>					<pte now points to a freed page>
>
> That doesn't seem good to me.
>
> Sure, maybe the page hasn't been freed back to the free lists
> because of elevated refcounts. But it's been released by the
> filesystem and is no longer in the page cache, so nothing good can
> come of this situation...
>
> AFAICT, this race condition exists for the truncate case as well,
> as filemap_map_pages() doesn't have a "page beyond inode size" check
> in it.

(It's not relevant to the discussion at hand but for the sake of
completeness):

It does have a check:

	max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
	if (page->index >= max_idx)
		goto unlock;

<snip>
On Wed, Sep 16, 2020 at 07:04:46PM -0700, Hugh Dickins wrote:
> On Thu, 17 Sep 2020, Dave Chinner wrote:
> >
> > So....
> >
> > P0					p1
> >
> > hole punch starts
> > takes XFS_MMAPLOCK_EXCL
> > truncate_pagecache_range()
> >   unmap_mapping_range(start, end)
> >   <clears ptes>
> >					<read fault>
> >					do_fault_around()
> >					->map_pages
> >					filemap_map_pages()
> >					  page mapping valid,
> >					  page is up to date
> >					  maps PTEs
> >					<fault done>
> >   truncate_inode_pages_range()
> >     truncate_cleanup_page(page)
> >       invalidates page
> >     delete_from_page_cache_batch(page)
> >       frees page
> >					<pte now points to a freed page>
>
> No. filemap_map_pages() checks page->mapping after trylock_page(),
> before setting up the pte; and truncate_cleanup_page() does a one-page
> unmap_mapping_range() if page_mapped(), while holding page lock.

Ok, fair, I missed that.

So why does truncate_pagecache() talk about fault races and require
a second unmap range after the invalidation "for correctness" if
this sort of race cannot happen?

Why is that different to truncate_pagecache_range(), which -doesn't-
do that second removal? It's called for more than just hole punch -
from the filesystem's perspective holepunch should do exactly the
same as truncate to the page cache, and for things like
COLLAPSE_RANGE it is absolutely essential because the data in that
range is -not zero- and will be stale if the mappings are not
invalidated completely....

Also, if page->mapping == NULL is sufficient to detect an invalidated
page in all cases, then why does page_cache_delete() explicitly
leave page->index intact:

	page->mapping = NULL;
	/* Leave page->index set: truncation lookup relies upon it */

Cheers,

Dave.
On Thu 17-09-20 08:37:17, Nikolay Borisov wrote:
> On 17.09.20 г. 4:44 ч., Dave Chinner wrote:
> > On Wed, Sep 16, 2020 at 05:58:51PM +0200, Jan Kara wrote:
> >> On Sat 12-09-20 09:19:11, Amir Goldstein wrote:
> >>> On Tue, Jun 23, 2020 at 8:21 AM Dave Chinner <david@fromorbit.com> wrote:
>
> <snip>
>
> > So....
> >
> > P0					p1
> >
> > hole punch starts
> > takes XFS_MMAPLOCK_EXCL
> > truncate_pagecache_range()
> >   unmap_mapping_range(start, end)
> >   <clears ptes>
> >					<read fault>
> >					do_fault_around()
> >					->map_pages
> >					filemap_map_pages()
> >					  page mapping valid,
> >					  page is up to date
> >					  maps PTEs
> >					<fault done>
> >   truncate_inode_pages_range()
> >     truncate_cleanup_page(page)
> >       invalidates page
> >     delete_from_page_cache_batch(page)
> >       frees page
> >					<pte now points to a freed page>
> >
> > That doesn't seem good to me.
> >
> > Sure, maybe the page hasn't been freed back to the free lists
> > because of elevated refcounts. But it's been released by the
> > filesystem and is no longer in the page cache, so nothing good can
> > come of this situation...
> >
> > AFAICT, this race condition exists for the truncate case as well,
> > as filemap_map_pages() doesn't have a "page beyond inode size" check
> > in it.
>
> (It's not relevant to the discussion at hand but for the sake of
> completeness):
>
> It does have a check:
>
> 	max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
> 	if (page->index >= max_idx)
> 		goto unlock;

Yes, but this does something meaningful only for truncate. For other
operations such as hole punch this check doesn't bring anything. That's
why only filesystems supporting hole punching and similar advanced
operations need an equivalent of mmap_lock.

								Honza
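[Editor's note: Jan's point that the i_size check is blind to hole punching can be demonstrated from userspace. The sketch below (not from the thread; `punch_hole_demo` is an illustrative name) punches a hole under an established mapping: i_size never moves, yet the mapped data changes, so any EOF-based check in the fault path cannot notice. It assumes Linux and a filesystem that supports `FALLOC_FL_PUNCH_HOLE` (ext4, XFS, tmpfs); it quietly skips when punching is unsupported.]

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Punch a hole under an established mapping; i_size never changes,
 * yet the mapped contents do.  Returns 0 when the demo behaves as
 * described (or when the filesystem cannot punch holes). */
int punch_hole_demo(void)
{
	char path[] = "/tmp/punchXXXXXX";
	int fd = mkstemp(path);
	if (fd < 0)
		return 1;
	unlink(path);

	char buf[4096];
	memset(buf, 'A', sizeof(buf));
	if (pwrite(fd, buf, sizeof(buf), 0) != sizeof(buf))
		return 1;

	char *map = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;
	if (map[0] != 'A')		/* fault the page in: coherent */
		return 1;

	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      0, 4096) != 0)
		return 0;		/* fs can't punch holes: skip */

	struct stat st;
	if (fstat(fd, &st) != 0 || st.st_size != 4096)
		return 1;		/* i_size is untouched ... */
	if (map[0] != 0)		/* ... but a refault reads the hole */
		return 1;

	munmap(map, 4096);
	close(fd);
	return 0;
}
```

The punch invalidated the page and zapped the PTE, so the next access refaults and reads zeros, all while `page->index` stays comfortably below EOF.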
On Thu, 17 Sep 2020, Dave Chinner wrote:
> On Wed, Sep 16, 2020 at 07:04:46PM -0700, Hugh Dickins wrote:
> > On Thu, 17 Sep 2020, Dave Chinner wrote:
> > > <pte now points to a freed page>
> >
> > No. filemap_map_pages() checks page->mapping after trylock_page(),
> > before setting up the pte; and truncate_cleanup_page() does a one-page
> > unmap_mapping_range() if page_mapped(), while holding page lock.
>
> Ok, fair, I missed that.
>
> So why does truncate_pagecache() talk about fault races and require
> a second unmap range after the invalidation "for correctness" if
> this sort of race cannot happen?

I thought the comment

	 * unmap_mapping_range is called twice, first simply for
	 * efficiency so that truncate_inode_pages does fewer
	 * single-page unmaps.  However after this first call, and
	 * before truncate_inode_pages finishes, it is possible for
	 * private pages to be COWed, which remain after
	 * truncate_inode_pages finishes, hence the second
	 * unmap_mapping_range call must be made for correctness.

explains it fairly well. It's because POSIX demanded that when a file
is truncated, the user will get SIGBUS on trying to access even the
COWed pages beyond EOF in a MAP_PRIVATE mapping. Page lock on the
cache page does not serialize the pages COWed from it very well.

But there's no such SIGBUS requirement in the case of hole-punching,
and trying to unmap those pages, racily instantiated just after the
punching cursor passed, would probably do more harm than good.

> Why is that different to truncate_pagecache_range(), which -doesn't-
> do that second removal? It's called for more than just hole_punch -
> from the filesystem's perspective holepunch should do exactly the
> same as truncate to the page cache, and for things like
> COLLAPSE_RANGE it is absolutely essential because the data in that
> range is -not zero- and will be stale if the mappings are not
> invalidated completely....

I can't speak to COLLAPSE_RANGE.

> Also, if page->mapping == NULL is sufficient to detect an invalidated
> page in all cases, then why does page_cache_delete() explicitly
> leave page->index intact:
>
>	page->mapping = NULL;
>	/* Leave page->index set: truncation lookup relies upon it */

Because there was, and I think still is (but might it now be xarrayed
away?), code (mainly in mm/truncate.c) which finds it convenient to
check page->index for end of range, without necessitating the overhead
of getting page lock. I've no doubt it's a (minor) optimization that
could be discarded if there were ever a need to invalidate page->index
when deleting; but nobody has required that yet.

Hugh
On Thu, Sep 17, 2020 at 12:47:10AM -0700, Hugh Dickins wrote:
> On Thu, 17 Sep 2020, Dave Chinner wrote:
> > On Wed, Sep 16, 2020 at 07:04:46PM -0700, Hugh Dickins wrote:
> > > On Thu, 17 Sep 2020, Dave Chinner wrote:
> > > > <pte now points to a freed page>
> > >
> > > No. filemap_map_pages() checks page->mapping after trylock_page(),
> > > before setting up the pte; and truncate_cleanup_page() does a one-page
> > > unmap_mapping_range() if page_mapped(), while holding page lock.
> >
> > Ok, fair, I missed that.
> >
> > So why does truncate_pagecache() talk about fault races and require
> > a second unmap range after the invalidation "for correctness" if
> > this sort of race cannot happen?
>
> I thought the comment
>	 * unmap_mapping_range is called twice, first simply for
>	 * efficiency so that truncate_inode_pages does fewer
>	 * single-page unmaps.  However after this first call, and
>	 * before truncate_inode_pages finishes, it is possible for
>	 * private pages to be COWed, which remain after
>	 * truncate_inode_pages finishes, hence the second
>	 * unmap_mapping_range call must be made for correctness.
> explains it fairly well.

Not to me. It explains what the code is doing, and the why is simply
"correctness". I have no idea what "correctness" actually means in
this context because there is no reference to what correct behaviour
should be. Nor do I have any idea why COW faults might behave
differently to a normal read/write page fault...

> It's because POSIX demanded that when a file
> is truncated, the user will get SIGBUS on trying to access even the
> COWed pages beyond EOF in a MAP_PRIVATE mapping. Page lock on the
> cache page does not serialize the pages COWed from it very well.

And there's the "why". I don't find the "page lock doesn't
serialise COW faults very well" particularly reassuring in this
case....

> But there's no such SIGBUS requirement in the case of hole-punching,
> and trying to unmap those pages racily instantiated just after the
> punching cursor passed, would probably do more harm than good.

There isn't a SIGBUS requirement for fallocate operations, just a
"don't expose stale data to userspace" requirement.

FWIW, how does a COW fault even work with file backed pages? We can
only have a single page attached to the inode address space for a given
offset, so if there's been a COW fault and a new page faulted in for
the write fault in that VMA, doesn't that imply the user data then
written to that page is never going to be written back to storage
because the COW page is not tracked by the inode address space?

> > Why is that different to truncate_pagecache_range(), which -doesn't-
> > do that second removal? It's called for more than just hole_punch -
> > from the filesystem's perspective holepunch should do exactly the
> > same as truncate to the page cache, and for things like
> > COLLAPSE_RANGE it is absolutely essential because the data in that
> > range is -not zero- and will be stale if the mappings are not
> > invalidated completely....
>
> I can't speak to COLLAPSE_RANGE.

It moves data around, doesn't replace data with zeros. Hence the
contents of any page that isn't invalidated entirely by
truncate_pagecache_range() are now entirely incorrect...

> > Also, if page->mapping == NULL is sufficient to detect an invalidated
> > page in all cases, then why does page_cache_delete() explicitly
> > leave page->index intact:
> >
> >	page->mapping = NULL;
> >	/* Leave page->index set: truncation lookup relies upon it */
>
> Because there was, and I think still is (but might it now be xarrayed
> away?), code (mainly in mm/truncate.c) which finds it convenient to
> check page->index for end of range, without necessitating the overhead
> of getting page lock. I've no doubt it's a (minor) optimization that
> could be discarded if there were ever a need to invalidate page->index
> when deleting; but nobody has required that yet.

And that's exactly my concern w.r.t. fallocate based invalidation:
checking the page is beyond EOF without locking the page or checking
the mapping does not detect pages invalidated by hole punching and
other fallocate() operations, because page->index on the invalidated
pages is never beyond EOF....

Cheers,

Dave.
On Mon 21-09-20 18:26:00, Dave Chinner wrote:
> On Thu, Sep 17, 2020 at 12:47:10AM -0700, Hugh Dickins wrote:
> > It's because POSIX demanded that when a file
> > is truncated, the user will get SIGBUS on trying to access even the
> > COWed pages beyond EOF in a MAP_PRIVATE mapping. Page lock on the
> > cache page does not serialize the pages COWed from it very well.
>
> And there's the "why". I don't find the "page lock doesn't
> serialise COW faults very well" particularly reassuring in this
> case....
>
> > But there's no such SIGBUS requirement in the case of hole-punching,
> > and trying to unmap those pages racily instantiated just after the
> > punching cursor passed, would probably do more harm than good.
>
> There isn't a SIGBUS requirement for fallocate operations, just a
> "don't expose stale data to userspace" requirement.
>
> FWIW, how does a COW fault even work with file backed pages? We can
> only have a single page attached to the inode address space for a given
> offset, so if there's been a COW fault and a new page faulted in for
> the write fault in that VMA, doesn't that imply the user data then
> written to that page is never going to be written back to storage
> because the COW page is not tracked by the inode address space?

Correct. Private file mappings work so that on the first write fault at
some page offset we allocate an anonymous page for that offset, copy the
current contents of the corresponding file page into it, and from that
moment on it behaves as an anonymous page. Except that on truncate, we
have to unmap these anonymous pages in private file mappings as well...

								Honza
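[Editor's note: the private-mapping lifecycle Jan describes, coherent with the file until the first write fault, anonymous and divergent afterwards, can be stepped through from userspace. The sketch below (not from the thread; `private_cow_demo` is an illustrative name) is Linux-specific: the pre-COW coherence with write() is Linux behaviour, not a POSIX guarantee.]

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Show the lifecycle of a MAP_PRIVATE page: it tracks the file until
 * the first write fault, then COWs to an anonymous page and diverges.
 * Returns 0 when every step behaves as described, else the step that
 * failed. */
int private_cow_demo(void)
{
	char path[] = "/tmp/privXXXXXX";
	int fd = mkstemp(path);
	if (fd < 0)
		return 1;
	unlink(path);

	char page[4096];
	memset(page, 'A', sizeof(page));
	if (pwrite(fd, page, sizeof(page), 0) != sizeof(page))
		return 1;

	char *priv = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE, fd, 0);
	if (priv == MAP_FAILED)
		return 1;

	if (priv[0] != 'A')	/* read fault: maps the file's own page */
		return 2;
	pwrite(fd, "B", 1, 0);
	if (priv[0] != 'B')	/* write() still visible (Linux QoI) */
		return 3;

	priv[0] = 'C';		/* write fault: COW to an anon page */
	pwrite(fd, "D", 1, 0);
	if (priv[0] != 'C')	/* file updates no longer visible */
		return 4;

	char c;
	if (pread(fd, &c, 1, 0) != 1 || c != 'D')
		return 5;	/* the private write never hit the file */

	munmap(priv, 4096);
	close(fd);
	return 0;
}
```

Step 4 and step 5 together answer Dave's question: after the COW the anonymous page is no longer tracked by the inode address space, so nothing written through the private mapping is ever written back.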
On Mon, Sep 21, 2020 at 2:11 AM Jan Kara <jack@suse.cz> wrote:
>
> Except that on truncate, we have to unmap these
> anonymous pages in private file mappings as well...

I'm actually not 100% sure we strictly would need to care.

Once we've faulted in a private file mapping page, that page is
"ours". That's kind of what MAP_PRIVATE means.

If we haven't written to it, we do keep things coherent with the file,
but that's actually not required by POSIX afaik - it's a QoI issue,
and a lot of (bad) Unixes didn't do it at all.

So as long as truncate _clears_ the pages it truncates, I think we'd
actually be ok. The SIGBUS is supposed to happen, but that's really
only relevant for the _first_ access. Once we've accessed the page,
and have it mapped, the private part really means that there are no
guarantees it stays coherent.

In particular, obviously if we've written to a page, we've lost the
association with the original file entirely. And I'm pretty sure that
a private mapping is allowed to act as if it was a private copy
without that association in the first place.

That said, this _is_ a QoI thing, and in Linux we've generally tried
quite hard to stay as coherent as possible even with private mappings.

In fact, before we had real shared file mappings (in a distant past,
long long ago), we allowed read-only shared mappings because we
internally turned them into read-only private mappings, and our
private mappings were coherent.

And those "fake" read-only shared mappings actually were much better
than some other platforms' "real" shared mappings (*cough*hpux*cough*)
and actually worked with things that mixed "write()" and "mmap()" and
expected coherency.

Admittedly the only case I'm aware of that did that was nntpd or
something like that. Exactly because a lot of Unixes were *not*
coherent (either because they had actual hardware cache coherency
issues, or because their VM was not set up for it).

                Linus
On Mon, Sep 21, 2020 at 09:20:25AM -0700, Linus Torvalds wrote:
> On Mon, Sep 21, 2020 at 2:11 AM Jan Kara <jack@suse.cz> wrote:
> >
> > Except that on truncate, we have to unmap these
> > anonymous pages in private file mappings as well...
>
> I'm actually not 100% sure we strictly would need to care.
>
> Once we've faulted in a private file mapping page, that page is
> "ours". That's kind of what MAP_PRIVATE means.
>
> If we haven't written to it, we do keep things coherent with the file,
> but that's actually not required by POSIX afaik - it's a QoI issue,
> and a lot of (bad) Unixes didn't do it at all.
>
> So as long as truncate _clears_ the pages it truncates, I think we'd
> actually be ok.

We don't even need to do that ...

	"If the size of the mapped file changes after the call to mmap()
	as a result of some other operation on the mapped file, the effect
	of references to portions of the mapped region that correspond to
	added or removed portions of the file is unspecified."

https://pubs.opengroup.org/onlinepubs/9699919799/functions/mmap.html

As you say, there's a QoI here, and POSIX permits some shockingly
bad and useless implementations.
On Mon 21-09-20 18:59:43, Matthew Wilcox wrote:
> On Mon, Sep 21, 2020 at 09:20:25AM -0700, Linus Torvalds wrote:
> > On Mon, Sep 21, 2020 at 2:11 AM Jan Kara <jack@suse.cz> wrote:
> > >
> > > Except that on truncate, we have to unmap these
> > > anonymous pages in private file mappings as well...
> >
> > I'm actually not 100% sure we strictly would need to care.
> >
> > Once we've faulted in a private file mapping page, that page is
> > "ours". That's kind of what MAP_PRIVATE means.
> >
> > If we haven't written to it, we do keep things coherent with the file,
> > but that's actually not required by POSIX afaik - it's a QoI issue,
> > and a lot of (bad) Unixes didn't do it at all.
> >
> > So as long as truncate _clears_ the pages it truncates, I think we'd
> > actually be ok.
>
> We don't even need to do that ...
>
> 	"If the size of the mapped file changes after the call to mmap()
> 	as a result of some other operation on the mapped file, the effect
> 	of references to portions of the mapped region that correspond to
> 	added or removed portions of the file is unspecified."
>
> https://pubs.opengroup.org/onlinepubs/9699919799/functions/mmap.html
>
> As you say, there's a QoI here, and POSIX permits some shockingly
> bad and useless implementations.

Something from the ftruncate(2) POSIX definition [1] for comparison:

	If the effect of ftruncate() is to decrease the size of a memory
	mapped file or a shared memory object and whole pages beyond the
	new end were previously mapped, then the whole pages beyond the
	new end shall be discarded. References to discarded pages shall
	result in the generation of a SIGBUS signal.

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/ftruncate.html

Now pick... ;)

								Honza
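[Editor's note: the ftruncate() side of Jan's comparison, mandatory SIGBUS on whole pages beyond the new end, holds for shared mappings on Linux and can be checked directly. The sketch below (not from the thread; `shared_truncate_demo` and `on_sigbus` are illustrative names) mirrors the quoted POSIX ftruncate text.]

```c
#define _GNU_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf bus_env;

static void on_sigbus(int sig)
{
	(void)sig;
	siglongjmp(bus_env, 1);
}

/* Shrink a file under a MAP_SHARED mapping and show that touching a
 * whole page beyond the new end raises SIGBUS, per POSIX ftruncate().
 * Returns 0 when SIGBUS is delivered. */
int shared_truncate_demo(void)
{
	char path[] = "/tmp/shtruncXXXXXX";
	int fd = mkstemp(path);
	if (fd < 0)
		return 1;
	unlink(path);
	if (ftruncate(fd, 2 * 4096) != 0)	/* two-page file */
		return 1;

	char *map = mmap(NULL, 2 * 4096, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;
	map[4096] = 'X';	/* fault the second page in, dirty it */

	if (ftruncate(fd, 4096) != 0)	/* second page now beyond EOF */
		return 1;

	struct sigaction sa;
	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigbus;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, NULL);

	if (sigsetjmp(bus_env, 1) == 0) {
		volatile char c = map[4096];	/* should fault */
		(void)c;
		return 1;	/* no SIGBUS delivered */
	}
	return 0;		/* SIGBUS: the discarded page is gone */
}
```

This is exactly the truncate_pagecache() sequence at work: the PTE is zapped, the page invalidated, and the refault fails the "beyond i_size" check, whereas the mmap() text Matthew quotes would permit almost anything here.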
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 7b05f8fd7b3d..4b185a907432 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1266,10 +1266,23 @@ xfs_filemap_pfn_mkwrite(
 	return __xfs_filemap_fault(vmf, PE_SIZE_PTE, true);
 }
 
+static void
+xfs_filemap_map_pages(
+	struct vm_fault		*vmf,
+	pgoff_t			start_pgoff,
+	pgoff_t			end_pgoff)
+{
+	struct inode		*inode = file_inode(vmf->vma->vm_file);
+
+	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+	filemap_map_pages(vmf, start_pgoff, end_pgoff);
+	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+}
+
 static const struct vm_operations_struct xfs_file_vm_ops = {
 	.fault		= xfs_filemap_fault,
 	.huge_fault	= xfs_filemap_huge_fault,
-	.map_pages	= filemap_map_pages,
+	.map_pages	= xfs_filemap_map_pages,
 	.page_mkwrite	= xfs_filemap_page_mkwrite,
 	.pfn_mkwrite	= xfs_filemap_pfn_mkwrite,
 };