Message ID: 20200623052059.1893966-1-david@fromorbit.com (mailing list archive)
State: Accepted
Series: xfs: use MMAPLOCK around filemap_map_pages()
On Tue, Jun 23, 2020 at 8:21 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> The page faultround path ->map_pages is implemented in XFS via
> filemap_map_pages(). This function checks that pages found in page
> cache lookups have not raced with truncate based invalidation by
> checking page->mapping is correct and page->index is within EOF.
>
> However, we've known for a long time that this is not sufficient to
> protect against races with invalidations done by operations that do
> not change EOF. e.g. hole punching and other fallocate() based
> direct extent manipulations. The way we protect against these
> races is we wrap the page fault operations in a XFS_MMAPLOCK_SHARED
> lock so they serialise against fallocate and truncate before calling
> into the filemap function that processes the fault.
>
> Do the same for XFS's ->map_pages implementation to close this
> potential data corruption issue.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

I wonder... should xfs_file_fadvise(POSIX_FADV_WILLNEED) also be taking
XFS_MMAPLOCK_SHARED instead of XFS_IOLOCK_SHARED?

Not that it matters that much?

Thanks,
Amir.
On Tue, Jun 23, 2020 at 11:54:39AM +0300, Amir Goldstein wrote:
> On Tue, Jun 23, 2020 at 8:21 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > [...]
> >
> > Do the same for XFS's ->map_pages implementation to close this
> > potential data corruption issue.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
>
> Reviewed-by: Amir Goldstein <amir73il@gmail.com>
>
> I wonder... should xfs_file_fadvise(POSIX_FADV_WILLNEED) also be taking
> XFS_MMAPLOCK_SHARED instead of XFS_IOLOCK_SHARED?

No. The MMAPLOCK is only to be used in the page fault path because we
can't use the IOLOCK in that path or we will deadlock. i.e. MMAPLOCK is
exclusively for IO path locking in page fault contexts, IOLOCK is
exclusively for IO path locking in syscall and kernel task contexts.

> Not that it matters that much?

Using the right lock for the IO context actually matters a lot :)

Cheers,
Dave.
On Tue, Jun 23, 2020 at 03:20:59PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> The page faultround path ->map_pages is implemented in XFS via
> filemap_map_pages(). This function checks that pages found in page
> cache lookups have not raced with truncate based invalidation by
> checking page->mapping is correct and page->index is within EOF.
>
> However, we've known for a long time that this is not sufficient to
> protect against races with invalidations done by operations that do
> not change EOF. e.g. hole punching and other fallocate() based
> direct extent manipulations. The way we protect against these
> races is we wrap the page fault operations in a XFS_MMAPLOCK_SHARED
> lock so they serialise against fallocate and truncate before calling
> into the filemap function that processes the fault.
>
> Do the same for XFS's ->map_pages implementation to close this
> potential data corruption issue.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Looks reasonable:

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_file.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 7b05f8fd7b3d..4b185a907432 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1266,10 +1266,23 @@ xfs_filemap_pfn_mkwrite(
>  	return __xfs_filemap_fault(vmf, PE_SIZE_PTE, true);
>  }
>
> +static void
> +xfs_filemap_map_pages(
> +	struct vm_fault		*vmf,
> +	pgoff_t			start_pgoff,
> +	pgoff_t			end_pgoff)
> +{
> +	struct inode		*inode = file_inode(vmf->vma->vm_file);
> +
> +	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> +	filemap_map_pages(vmf, start_pgoff, end_pgoff);
> +	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> +}
> +
>  static const struct vm_operations_struct xfs_file_vm_ops = {
>  	.fault		= xfs_filemap_fault,
>  	.huge_fault	= xfs_filemap_huge_fault,
> -	.map_pages	= filemap_map_pages,
> +	.map_pages	= xfs_filemap_map_pages,
>  	.page_mkwrite	= xfs_filemap_page_mkwrite,
>  	.pfn_mkwrite	= xfs_filemap_pfn_mkwrite,
>  };
> --
> 2.26.2.761.g0e0b3e54be
>
On Tue, Jun 23, 2020 at 03:20:59PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> The page faultround path ->map_pages is implemented in XFS via

What does "faultround" mean? I'm pretty convinced that this is merely
another round of whackamole wrt taking the MMAPLOCK before relying on or
doing anything to pages in the page cache, I just can't tell if
'faultround' is jargon or typo.

--D

> filemap_map_pages(). This function checks that pages found in page
> cache lookups have not raced with truncate based invalidation by
> checking page->mapping is correct and page->index is within EOF.
>
> However, we've known for a long time that this is not sufficient to
> protect against races with invalidations done by operations that do
> not change EOF. e.g. hole punching and other fallocate() based
> direct extent manipulations. The way we protect against these
> races is we wrap the page fault operations in a XFS_MMAPLOCK_SHARED
> lock so they serialise against fallocate and truncate before calling
> into the filemap function that processes the fault.
>
> Do the same for XFS's ->map_pages implementation to close this
> potential data corruption issue.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> [...]
On Tue, Jun 23, 2020 at 02:19:10PM -0700, Darrick J. Wong wrote:
> On Tue, Jun 23, 2020 at 03:20:59PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > The page faultround path ->map_pages is implemented in XFS via
>
> What does "faultround" mean?

Typo - fault-around.

i.e. when we take a read page fault, the do_read_fault() code first
opportunistically tries to map a range of pages surrounding the
faulted page into the PTEs, not just the faulted page. It uses
->map_pages() to do the page cache lookups for cached pages in the
range of the fault-around and then inserts them into the PTEs if it
finds any.

If the fault-around pass did not find the page fault page in cache
(i.e. vmf->page remains null) then it calls into do_fault(), which
ends up calling ->fault, where we then lock the MMAPLOCK and call
into filemap_fault() to populate the page cache and read the data
into it.

So, essentially, fault-around is a mechanism to reduce page faults
in the situation where previous readahead has brought adjacent pages
into the page cache, by optimistically mapping up to
fault_around_bytes into PTEs on any given read page fault.

> I'm pretty convinced that this is merely another round of whackamole wrt
> taking the MMAPLOCK before relying on or doing anything to pages in the
> page cache, I just can't tell if 'faultround' is jargon or typo.

Well, it's whack-a-mole in that this is the first time I've actually
looked at what map_pages does. I knew there was fault-around in the
page fault path, but I didn't know that it had its own method for
page cache lookups and pte mapping, nor that it had its own
truncate race checks to ensure it didn't map pages invalidated by
truncate into the PTEs.

There's so much technical debt hidden in the kernel code base. The
fact we're still finding places that assume only truncate can
invalidate the page cache over a file range indicates just how deep
this debt runs...

-Dave.
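The control flow Dave describes can be sketched roughly as follows (hypothetical stand-in types and names; the real logic lives in mm/memory.c's do_read_fault()):

```c
#include <stddef.h>

/* Simplified sketch of the fault-around flow described above.
 * These are hypothetical stand-ins, not the kernel's types. */
struct vmf_sketch { void *page; };

/* ->map_pages(): only maps pages that are already present and uptodate
 * in the page cache into PTEs; performs no IO and no allocation. */
static void map_pages_around(struct vmf_sketch *vmf, int page_cached)
{
    if (page_cached)
        vmf->page = (void *)1;  /* pretend the faulted page was cached */
}

/* ->fault(): the slow path -- XFS takes MMAPLOCK_SHARED here and calls
 * filemap_fault() to populate the cache and read the data in. */
static void full_fault(struct vmf_sketch *vmf)
{
    vmf->page = (void *)1;
}

static int do_read_fault_sketch(int page_cached)
{
    struct vmf_sketch vmf = { NULL };

    map_pages_around(&vmf, page_cached);  /* opportunistic fault-around */
    if (vmf.page == NULL)                 /* fault page not in cache */
        full_fault(&vmf);                 /* fall back to ->fault */
    return vmf.page != NULL;
}
```

Either way a page ends up mapped; the point of the patch is that only the slow path took the MMAPLOCK, while the fault-around path did not.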
On Wed, Jun 24, 2020 at 08:14:31AM +1000, Dave Chinner wrote:
> On Tue, Jun 23, 2020 at 02:19:10PM -0700, Darrick J. Wong wrote:
> > On Tue, Jun 23, 2020 at 03:20:59PM +1000, Dave Chinner wrote:
> > > The page faultround path ->map_pages is implemented in XFS via
> >
> > What does "faultround" mean?
>
> Typo - fault-around.
>
> [...]
>
> So, essentially, fault-around is a mechanism to reduce page faults
> in the situation where previous readahead has brought adjacent pages
> into the page cache by optimistically mapping up to
> fault_around_bytes into PTEs on any given read page fault.

<nod> Thanks for the explanation.

/me wonders if someone could please check all the *_ops that point to
generic helpers to see if we're missing obvious things like lock
taking. Particularly someone who wants to learn about xfs' locking
strategy; I promise it won't let out a ton of bees.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> There's so much technical debt hidden in the kernel code base. The
> fact we're still finding places that assume only truncate can
> invalidate the page cache over a file range indicates just how deep
> this debt runs...
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
> /me wonders if someone could please check all the *_ops that point to
> generic helpers to see if we're missing obvious things like lock
> taking. Particularly someone who wants to learn about xfs' locking
> strategy; I promise it won't let out a ton of bees.

The list was compiled manually by auditing 'git grep '_operations.*=' fs/xfs'
structs for non xfs_/iomap_/noop_ functions. I am not sure if all iomap_
functions are safe in that respect, but I suppose those were done recently
with sufficient xfs developer review...

fs/xfs/xfs_aops.c:const struct address_space_operations xfs_address_space_operations = {
	.error_remove_page = generic_error_remove_page,

generic_error_remove_page() calls truncate_inode_page() without MMAPLOCK.
Is that safe? Not sure.

fs/xfs/xfs_file.c:static const struct vm_operations_struct xfs_file_vm_ops = {
	.map_pages = filemap_map_pages,

Fixed by $SUBJECT.

fs/xfs/xfs_file.c:const struct file_operations xfs_file_operations = {
	.splice_read = generic_file_splice_read,

Will call xfs_file_read_iter, so looks fine.

	.splice_write = iter_file_splice_write,

Will call xfs_file_write_iter, so looks fine.

	.get_unmapped_area = thp_get_unmapped_area,

Looks fine?

fs/xfs/xfs_file.c:const struct file_operations xfs_dir_file_operations = {
	.read = generic_read_dir,
	.llseek = generic_file_llseek,

No page cache, no dio, no worries?

Thanks,
Amir.
On Tue, Jun 30, 2020 at 06:23:12PM +0300, Amir Goldstein wrote:
> > /me wonders if someone could please check all the *_ops that point to
> > generic helpers to see if we're missing obvious things like lock
> > taking. Particularly someone who wants to learn about xfs' locking
> > strategy; I promise it won't let out a ton of bees.
>
> The list was compiled manually by auditing 'git grep '_operations.*=' fs/xfs'
> structs for non xfs_/iomap_/noop_ functions.
> I am not sure if all iomap_ functions are safe in that respect, but I suppose
> those were done recently with sufficient xfs developers review...

The iomap functions shouldn't be taking/releasing any locks at all; it's
up to the filesystem to provide the concurrency controls.

> fs/xfs/xfs_aops.c:const struct address_space_operations
> xfs_address_space_operations = {
> 	.error_remove_page = generic_error_remove_page,
>
> generic_error_remove_page() calls truncate_inode_page() without MMAPLOCK
> Is that safe? not sure

/me has a funny feeling it isn't, since this does the same thing to the
pagecache as a holepunch.

> fs/xfs/xfs_file.c:static const struct vm_operations_struct xfs_file_vm_ops = {
> 	.map_pages = filemap_map_pages,
>
> Fixed by $SUBJECT
>
> fs/xfs/xfs_file.c:const struct file_operations xfs_file_operations = {
> 	.splice_read = generic_file_splice_read,
>
> Will call xfs_file_read_iter, so looks fine
>
> 	.splice_write = iter_file_splice_write,
>
> Will call xfs_file_write_iter, so looks fine
>
> 	.get_unmapped_area = thp_get_unmapped_area,
>
> Looks fine?
>
> fs/xfs/xfs_file.c:const struct file_operations xfs_dir_file_operations = {
> 	.read = generic_read_dir,
> 	.llseek = generic_file_llseek,
>
> No page cache, no dio, no worries?

Right.

--D

> Thanks,
> Amir.
On Tue, Jun 23, 2020 at 03:20:59PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> [...]
>
> Do the same for XFS's ->map_pages implementation to close this
> potential data corruption issue.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks ok, notwithstanding the (semirelated) questions that Amir
unearthed elsewhere in this thread...

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
> [...]
On Tue, Jun 30, 2020 at 11:26:45AM -0700, Darrick J. Wong wrote:
> On Tue, Jun 30, 2020 at 06:23:12PM +0300, Amir Goldstein wrote:
> > [...]
> >
> > fs/xfs/xfs_aops.c:const struct address_space_operations
> > xfs_address_space_operations = {
> > 	.error_remove_page = generic_error_remove_page,
> >
> > generic_error_remove_page() calls truncate_inode_page() without MMAPLOCK
> > Is that safe? not sure
>
> /me has a funny feeling it isn't, since this does the same thing to the
> pagecache as a holepunch.

And I really can't tell, because this comes from the convoluted
hardware memory error path. We know that path is completely screwed
up w.r.t. memory errors in pmem and DAX filesystems. Hence I think
there's a good chance it's completely screwed up for cached
file-backed page cache pages, too.

Indeed, look at the comment in me_pagecache_clean():

	/*
	 * Truncation is a bit tricky. Enable it per file system for now.
	 *
	 * Open: to take i_mutex or not for this? Right now we don't.
	 */
	return truncate_error_page(p, pfn, mapping);

the call path is:

	me_pagecache_clean()
	  truncate_error_page()
	    ->error_remove_page()

IOWs, the authors of this code did not know what to do, and like the
DAX failure stuff, merged the code without resolving the fundamental
issues around interfacing with filesystem owned pages directly...

I don't really have the time to look at it in any more depth right
now. I also suspect the memory failure code is a path we simply
cannot exercise in any useful manner, so it's unlikely that we'll
ever be able to tell if this stuff works correctly or not....

Cheers,
-Dave.
On Tue, Jun 23, 2020 at 8:21 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> [...]
>
> Do the same for XFS's ->map_pages implementation to close this
> potential data corruption issue.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> [...]

It appears that ext4, f2fs, gfs2, orangefs, zonefs also need this fix.

zonefs does not support hole punching, so it may not need to use
mmap_sem at all.

It is interesting to look at how this bug came to be duplicated in so
many filesystems, because there are lessons to be learned.

Commit f1820361f83d ("mm: implement ->map_pages for page cache")
added the ->map_pages() operation and its commit message said:

"...It should be safe to use filemap_map_pages() for ->map_pages() if
filesystem use filemap_fault() for ->fault()."

At the time, all of the aforementioned filesystems used filemap_fault()
for ->fault().

But since then, ext4, xfs, f2fs and just recently gfs2 have added a
filesystem ->fault() operation.

orangefs has added vm_operations since, and zonefs was added since,
probably copying the mmap_sem handling from ext4. Both have a
filesystem ->fault() operation.

It was surprising for me to see that some of the filesystem developers
who signed off on the added ->fault() operations are not strangers to
mm. The recent gfs2 change was even reviewed by an established mm
developer [1].

So what can we learn from this case study? How could we fix the
interface to avoid repeating the same mistake in the future?

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/20200703113801.GD25523@casper.infradead.org/
On Sat 12-09-20 09:19:11, Amir Goldstein wrote:
> On Tue, Jun 23, 2020 at 8:21 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > [...]
> >
> > Do the same for XFS's ->map_pages implementation to close this
> > potential data corruption issue.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > [...]
>
> It appears that ext4, f2fs, gfs2, orangefs, zonefs also need this fix
>
> zonefs does not support hole punching, so it may not need to use
> mmap_sem at all.
>
> It is interesting to look at how this bug came to be duplicated in so
> many filesystems, because there are lessons to be learned.
>
> Commit f1820361f83d ("mm: implement ->map_pages for page cache")
> added the ->map_pages() operation and its commit message said:
>
> "...It should be safe to use filemap_map_pages() for ->map_pages() if
> filesystem use filemap_fault() for ->fault()."
>
> At the time, all of the aforementioned filesystems used filemap_fault()
> for ->fault().
>
> But since then, ext4, xfs, f2fs and just recently gfs2 have added a
> filesystem ->fault() operation.
> orangefs has added vm_operations since, and zonefs was added since,
> probably copying the mmap_sem handling from ext4. Both have a filesystem
> ->fault() operation.

A standard pattern of copying a bug from one place into many. Sadly
it's happening all the time for stuff that's complex enough that only a
few people (if anybody) are carrying all the details in their head.

> It was surprising for me to see that some of the filesystem developers
> signed on the added ->fault() operations are not strangers to mm. The
> recent gfs2 change was even reviewed by an established mm developer
> [1].

Well, people do miss things... And this stuff is a twisted maze, so it
is easy to miss something even for an experienced developer.

> So what can we learn from this case study? How could we fix the interface to
> avoid repeating the same mistake in the future?

IMO the serialization between the page cache and various fs operations
is just too complex, with too many special corner cases. But that's
difficult to change while keeping all the features and performance. So
the best realistic answer I have (and this is not meant to discourage
anybody from trying to implement a simpler scheme of page-cache -
filesystem interaction :) is that we should have added an fstest when
the XFS fix landed, which would then hopefully catch the attention of
other fs maintainers (at least those that do run fstests).

Honza
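A regression test along the lines Jan suggests would race a hole punch against faults on a mapping of the same range. A minimal userspace sketch of the two ingredients (the file path and sizes are arbitrary; a real fstest would run the punch and the faulting reads concurrently in a loop, and would run on the filesystem under test rather than wherever the path lands):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef FALLOC_FL_KEEP_SIZE
#define FALLOC_FL_KEEP_SIZE  0x01
#define FALLOC_FL_PUNCH_HOLE 0x02
#endif

/* Explicit prototype so the sketch builds without _GNU_SOURCE. */
extern int fallocate(int fd, int mode, off_t offset, off_t len);

/* Skeleton of the reproducer idea: punch a hole under an active mmap.
 * This only shows the two operations the test needs; the actual race
 * requires running them concurrently. */
static int run_punch_vs_mmap(const char *path)
{
    char buf[4096];
    int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);
    if (fd < 0)
        return 1;

    memset(buf, 'x', sizeof(buf));
    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
        close(fd);
        unlink(path);
        return 1;
    }

    char *map = mmap(NULL, sizeof(buf), PROT_READ, MAP_SHARED, fd, 0);
    if (map != MAP_FAILED) {
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      0, (off_t)sizeof(buf)) == 0)
            /* the fault this read triggers must never land on a page
             * the punch has already freed; a correct filesystem reads
             * back zeroes from the hole */
            printf("first byte after punch: %d\n", map[0]);
        else
            puts("fallocate(PUNCH_HOLE) unsupported here, skipping");
        munmap(map, sizeof(buf));
    }
    close(fd);
    unlink(path);
    return 0;
}
```

The corruption window is timing-dependent, so a real test would hammer both sides from separate threads and check that the mapping never exposes stale or freed page contents.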
Could the xfs mmap lock documentation please be cleaned up? For example,
the xfs_ilock description says:

>  * In addition to i_rwsem in the VFS inode, the xfs inode contains 2
>  * multi-reader locks: i_mmap_lock and the i_lock. This routine allows
>  * various combinations of the locks to be obtained.

The field in struct xfs_inode is called i_mmaplock though, not
i_mmap_lock. In addition, struct inode has an i_mmap_rwsem field which
is also referred to as i_mmap_lock. As if that isn't irritating enough.

Thanks,
Andreas
On Sat 12-09-20 09:19:11, Amir Goldstein wrote:
> On Tue, Jun 23, 2020 at 8:21 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > [...]
> >
> > Do the same for XFS's ->map_pages implementation to close this
> > potential data corruption issue.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > [...]
>
> It appears that ext4, f2fs, gfs2, orangefs, zonefs also need this fix
>
> zonefs does not support hole punching, so it may not need to use
> mmap_sem at all.

So I've written an ext4 fix for this, but before actually posting it,
Nikolay, working on a btrfs fix, asked why exactly filemap_map_pages()
is actually a problem, and I think he's right that it actually isn't a
problem. The thing is: filemap_map_pages() never does any page mapping
or IO. It only creates PTEs for uptodate pages that are already in the
page cache. As such it is a rather different beast compared to the
fault handler from the fs POV and does not need protection from hole
punching (the current serialization on the page lock and checking of
page->mapping is enough).

That being said, I agree this is subtle, and the moment someone adds
e.g. a readahead call into filemap_map_pages() we have a real problem.
I'm not sure how to prevent this risk...

Honza
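Jan's argument rests on the per-page validation that the fault-around path performs; very roughly (a simplified sketch with hypothetical types, not the kernel's code):

```c
#include <stddef.h>

/* Hypothetical stand-in for a page-cache page; not the kernel's type. */
struct page_sk {
    int   locked;    /* non-zero: someone else holds the page lock */
    void *mapping;   /* NULL once truncate/hole-punch invalidates it */
    int   uptodate;
};

/* Map the page into a PTE only if it is still attached to this file's
 * mapping and uptodate.  Pages already invalidated by truncate or hole
 * punch have page->mapping cleared under the page lock, so they are
 * skipped.  Note, however, that nothing here prevents the page from
 * being invalidated and freed right after the check, which is exactly
 * the window debated in the rest of the thread. */
static int try_map_one(struct page_sk *page, void *file_mapping)
{
    if (page->locked)          /* trylock failed: just skip this page */
        return 0;
    page->locked = 1;

    int ok = page->mapping == file_mapping && page->uptodate;

    page->locked = 0;
    return ok;                 /* 1: create the PTE, 0: skip */
}
```

The check-then-map sequence is what makes the question subtle: the validation is sound for pages that were invalidated before the lookup, and the dispute is only about pages invalidated between the check and the subsequent PTE zap.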
On Wed, Sep 16, 2020 at 05:58:51PM +0200, Jan Kara wrote:
> On Sat 12-09-20 09:19:11, Amir Goldstein wrote:
> > [...]
> >
> > It appears that ext4, f2fs, gfs2, orangefs, zonefs also need this fix
> >
> > zonefs does not support hole punching, so it may not need to use
> > mmap_sem at all.
>
> So I've written an ext4 fix for this but before actually posting it Nikolay
> working on btrfs fix asked why exactly is filemap_map_pages() actually a
> problem and I think he's right it actually isn't a problem. The thing is:
> filemap_map_pages() never does any page mapping or IO. It only creates PTEs
> for uptodate pages that are already in page cache.

So....
P0 p1 hole punch starts takes XFS_MMAPLOCK_EXCL truncate_pagecache_range() unmap_mapping_range(start, end) <clears ptes> <read fault> do_fault_around() ->map_pages filemap_map_pages() page mapping valid, page is up to date maps PTEs <fault done> truncate_inode_pages_range() truncate_cleanup_page(page) invalidates page delete_from_page_cache_batch(page) frees page <pte now points to a freed page> That doesn't seem good to me. Sure, maybe the page hasn't been freed back to the free lists because of elevated refcounts. But it's been released by the filesystem and not longer in the page cache so nothing good can come of this situation... AFAICT, this race condition exists for the truncate case as well as filemap_map_pages() doesn't have a "page beyond inode size" check in it. However, exposure here is very limited in the truncate case because truncate_setsize()->truncate_pagecache() zaps the PTEs again after invalidating the page cache. Either way, adding the XFS_MMAPLOCK_SHARED around filemap_map_pages() avoids the race condition for fallocate and truncate operations for XFS... > As such it is a rather > different beast compared to the fault handler from fs POV and does not need > protection from hole punching (current serialization on page lock and > checking of page->mapping is enough). > That being said I agree this is subtle and the moment someone adds e.g. a > readahead call into filemap_map_pages() we have a real problem. I'm not > sure how to prevent this risk... Subtle, yes. So subtle, in fact, I fail to see any reason why the above race cannot occur as there's no obvious serialisation in the page cache between PTE zapping and page invalidation to prevent a fault from coming in an re-establishing the PTEs before invalidation occurs. Cheers, Dave.
On Thu, 17 Sep 2020, Dave Chinner wrote:
>
> So....
>
> P0					p1
>
> hole punch starts
> takes XFS_MMAPLOCK_EXCL
> truncate_pagecache_range()
>   unmap_mapping_range(start, end)
>   <clears ptes>
>					<read fault>
>					do_fault_around()
>					->map_pages
>					filemap_map_pages()
>					  page mapping valid,
>					  page is up to date
>					  maps PTEs
>					<fault done>
>   truncate_inode_pages_range()
>     truncate_cleanup_page(page)
>       invalidates page
>     delete_from_page_cache_batch(page)
>       frees page
>					<pte now points to a freed page>

No. filemap_map_pages() checks page->mapping after trylock_page(),
before setting up the pte; and truncate_cleanup_page() does a one-page
unmap_mapping_range() if page_mapped(), while holding page lock.

(Of course, there's a different thread, in which less reliance on page
lock is being discussed, but that would be a future thing.)

Hugh
On Thu, Sep 17, 2020 at 11:44:54AM +1000, Dave Chinner wrote:
> So....
>
> P0					p1
>
> hole punch starts
> takes XFS_MMAPLOCK_EXCL
> truncate_pagecache_range()

... locks page ...

>   unmap_mapping_range(start, end)
>   <clears ptes>
>					<read fault>
>					do_fault_around()
>					->map_pages
>					filemap_map_pages()

... trylock page fails ...

>					  page mapping valid,
>					  page is up to date
>					  maps PTEs
>					<fault done>
>   truncate_inode_pages_range()
>     truncate_cleanup_page(page)
>       invalidates page
>     delete_from_page_cache_batch(page)
>       frees page
>					<pte now points to a freed page>
>
> That doesn't seem good to me.
>
> Sure, maybe the page hasn't been freed back to the free lists
> because of elevated refcounts. But it's been released by the
> filesystem and is no longer in the page cache, so nothing good can
> come of this situation...
>
> AFAICT, this race condition exists for the truncate case as well,
> as filemap_map_pages() doesn't have a "page beyond inode size" check
> in it. However, exposure here is very limited in the truncate case
> because truncate_setsize()->truncate_pagecache() zaps the PTEs
> again after invalidating the page cache.
>
> Either way, adding the XFS_MMAPLOCK_SHARED around
> filemap_map_pages() avoids the race condition for fallocate and
> truncate operations for XFS...
>
> > As such it is a rather different beast compared to the fault handler
> > from the fs POV and does not need protection from hole punching (the
> > current serialization on the page lock and checking of page->mapping
> > is enough).
> > That being said, I agree this is subtle, and the moment someone adds
> > e.g. a readahead call into filemap_map_pages() we have a real problem.
> > I'm not sure how to prevent this risk...
>
> Subtle, yes. So subtle, in fact, I fail to see any reason why the
> above race cannot occur, as there's no obvious serialisation in the
> page cache between PTE zapping and page invalidation to prevent a
> fault from coming in and re-establishing the PTEs before invalidation
> occurs.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
On 17.09.20 г. 4:44 ч., Dave Chinner wrote:
> On Wed, Sep 16, 2020 at 05:58:51PM +0200, Jan Kara wrote:
>> On Sat 12-09-20 09:19:11, Amir Goldstein wrote:
>>> On Tue, Jun 23, 2020 at 8:21 AM Dave Chinner <david@fromorbit.com> wrote:

<snip>

>
> So....
>
> P0					p1
>
> hole punch starts
> takes XFS_MMAPLOCK_EXCL
> truncate_pagecache_range()
>   unmap_mapping_range(start, end)
>   <clears ptes>
>					<read fault>
>					do_fault_around()
>					->map_pages
>					filemap_map_pages()
>					  page mapping valid,
>					  page is up to date
>					  maps PTEs
>					<fault done>
>   truncate_inode_pages_range()
>     truncate_cleanup_page(page)
>       invalidates page
>     delete_from_page_cache_batch(page)
>       frees page
>					<pte now points to a freed page>
>
> That doesn't seem good to me.
>
> Sure, maybe the page hasn't been freed back to the free lists
> because of elevated refcounts. But it's been released by the
> filesystem and is no longer in the page cache, so nothing good can
> come of this situation...
>
> AFAICT, this race condition exists for the truncate case as well,
> as filemap_map_pages() doesn't have a "page beyond inode size" check
> in it.

(It's not relevant to the discussion at hand but for the sake of
completeness):

It does have a check:

	max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
	if (page->index >= max_idx)
		goto unlock;

<snip>
On Wed, Sep 16, 2020 at 07:04:46PM -0700, Hugh Dickins wrote:
> On Thu, 17 Sep 2020, Dave Chinner wrote:
> >
> > So....
> >
> > P0					p1
> >
> > hole punch starts
> > takes XFS_MMAPLOCK_EXCL
> > truncate_pagecache_range()
> >   unmap_mapping_range(start, end)
> >   <clears ptes>
> >					<read fault>
> >					do_fault_around()
> >					->map_pages
> >					filemap_map_pages()
> >					  page mapping valid,
> >					  page is up to date
> >					  maps PTEs
> >					<fault done>
> >   truncate_inode_pages_range()
> >     truncate_cleanup_page(page)
> >       invalidates page
> >     delete_from_page_cache_batch(page)
> >       frees page
> >					<pte now points to a freed page>
>
> No. filemap_map_pages() checks page->mapping after trylock_page(),
> before setting up the pte; and truncate_cleanup_page() does a one-page
> unmap_mapping_range() if page_mapped(), while holding page lock.

Ok, fair, I missed that.

So why does truncate_pagecache() talk about fault races and require
a second unmap range after the invalidation "for correctness" if
this sort of race cannot happen?

Why is that different to truncate_pagecache_range(), which -doesn't-
do that second removal? It's called for more than just hole punch -
from the filesystem's perspective holepunch should do exactly the
same as truncate to the page cache, and for things like
COLLAPSE_RANGE it is absolutely essential because the data in that
range is -not zero- and will be stale if the mappings are not
invalidated completely....

Also, if page->mapping == NULL is sufficient to detect an invalidated
page in all cases, then why does page_cache_delete() explicitly
leave page->index intact:

	page->mapping = NULL;
	/* Leave page->index set: truncation lookup relies upon it */

Cheers,

Dave.
On Thu 17-09-20 08:37:17, Nikolay Borisov wrote:
> On 17.09.20 г. 4:44 ч., Dave Chinner wrote:
> > On Wed, Sep 16, 2020 at 05:58:51PM +0200, Jan Kara wrote:
> >> On Sat 12-09-20 09:19:11, Amir Goldstein wrote:
> >>> On Tue, Jun 23, 2020 at 8:21 AM Dave Chinner <david@fromorbit.com> wrote:
>
> <snip>
>
> > So....
> >
> > P0					p1
> >
> > hole punch starts
> > takes XFS_MMAPLOCK_EXCL
> > truncate_pagecache_range()
> >   unmap_mapping_range(start, end)
> >   <clears ptes>
> >					<read fault>
> >					do_fault_around()
> >					->map_pages
> >					filemap_map_pages()
> >					  page mapping valid,
> >					  page is up to date
> >					  maps PTEs
> >					<fault done>
> >   truncate_inode_pages_range()
> >     truncate_cleanup_page(page)
> >       invalidates page
> >     delete_from_page_cache_batch(page)
> >       frees page
> >					<pte now points to a freed page>
> >
> > That doesn't seem good to me.
> >
> > Sure, maybe the page hasn't been freed back to the free lists
> > because of elevated refcounts. But it's been released by the
> > filesystem and is no longer in the page cache, so nothing good can
> > come of this situation...
> >
> > AFAICT, this race condition exists for the truncate case as well,
> > as filemap_map_pages() doesn't have a "page beyond inode size" check
> > in it.
>
> (It's not relevant to the discussion at hand but for the sake of
> completeness):
>
> It does have a check:
>
> 	max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
> 	if (page->index >= max_idx)
> 		goto unlock;

Yes, but this does something meaningful only for truncate. For other
operations such as hole punch this check doesn't bring anything. That's
why only filesystems supporting hole punching and similar advanced
operations need an equivalent of mmap_lock.

								Honza
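[Editor's note: Jan's point that the i_size check is blind to hole punching can be demonstrated from userspace. The sketch below (not from the thread; `punch_hole_demo` is an illustrative name) punches a hole under an established mapping: i_size never moves, yet the mapped data changes, so any EOF-based check in the fault path cannot notice. It assumes Linux and a filesystem that supports `FALLOC_FL_PUNCH_HOLE` (ext4, XFS, tmpfs); it quietly skips when punching is unsupported.]

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Punch a hole under an established mapping; i_size never changes,
 * yet the mapped contents do.  Returns 0 when the demo behaves as
 * described (or when the filesystem cannot punch holes). */
int punch_hole_demo(void)
{
	char path[] = "/tmp/punchXXXXXX";
	int fd = mkstemp(path);
	if (fd < 0)
		return 1;
	unlink(path);

	char buf[4096];
	memset(buf, 'A', sizeof(buf));
	if (pwrite(fd, buf, sizeof(buf), 0) != sizeof(buf))
		return 1;

	char *map = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;
	if (map[0] != 'A')		/* fault the page in: coherent */
		return 1;

	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      0, 4096) != 0)
		return 0;		/* fs can't punch holes: skip */

	struct stat st;
	if (fstat(fd, &st) != 0 || st.st_size != 4096)
		return 1;		/* i_size is untouched ... */
	if (map[0] != 0)		/* ... but a refault reads the hole */
		return 1;

	munmap(map, 4096);
	close(fd);
	return 0;
}
```

The punch invalidated the page and zapped the PTE, so the next access refaults and reads zeros, all while `page->index` stays comfortably below EOF.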
On Thu, 17 Sep 2020, Dave Chinner wrote:
> On Wed, Sep 16, 2020 at 07:04:46PM -0700, Hugh Dickins wrote:
> > On Thu, 17 Sep 2020, Dave Chinner wrote:
> > > <pte now points to a freed page>
> >
> > No. filemap_map_pages() checks page->mapping after trylock_page(),
> > before setting up the pte; and truncate_cleanup_page() does a one-page
> > unmap_mapping_range() if page_mapped(), while holding page lock.
>
> Ok, fair, I missed that.
>
> So why does truncate_pagecache() talk about fault races and require
> a second unmap range after the invalidation "for correctness" if
> this sort of race cannot happen?

I thought the comment

	 * unmap_mapping_range is called twice, first simply for
	 * efficiency so that truncate_inode_pages does fewer
	 * single-page unmaps.  However after this first call, and
	 * before truncate_inode_pages finishes, it is possible for
	 * private pages to be COWed, which remain after
	 * truncate_inode_pages finishes, hence the second
	 * unmap_mapping_range call must be made for correctness.

explains it fairly well. It's because POSIX demanded that when a file
is truncated, the user will get SIGBUS on trying to access even the
COWed pages beyond EOF in a MAP_PRIVATE mapping. Page lock on the
cache page does not serialize the pages COWed from it very well.

But there's no such SIGBUS requirement in the case of hole-punching,
and trying to unmap those pages, racily instantiated just after the
punching cursor passed, would probably do more harm than good.

> Why is that different to truncate_pagecache_range(), which -doesn't-
> do that second removal? It's called for more than just hole_punch -
> from the filesystem's perspective holepunch should do exactly the
> same as truncate to the page cache, and for things like
> COLLAPSE_RANGE it is absolutely essential because the data in that
> range is -not zero- and will be stale if the mappings are not
> invalidated completely....

I can't speak to COLLAPSE_RANGE.

> Also, if page->mapping == NULL is sufficient to detect an invalidated
> page in all cases, then why does page_cache_delete() explicitly
> leave page->index intact:
>
>	page->mapping = NULL;
>	/* Leave page->index set: truncation lookup relies upon it */

Because there was, and I think still is (but might it now be xarrayed
away?), code (mainly in mm/truncate.c) which finds it convenient to
check page->index for end of range, without necessitating the overhead
of getting page lock. I've no doubt it's a (minor) optimization that
could be discarded if there were ever a need to invalidate page->index
when deleting; but nobody has required that yet.

Hugh
On Thu, Sep 17, 2020 at 12:47:10AM -0700, Hugh Dickins wrote:
> On Thu, 17 Sep 2020, Dave Chinner wrote:
> > On Wed, Sep 16, 2020 at 07:04:46PM -0700, Hugh Dickins wrote:
> > > On Thu, 17 Sep 2020, Dave Chinner wrote:
> > > > <pte now points to a freed page>
> > >
> > > No. filemap_map_pages() checks page->mapping after trylock_page(),
> > > before setting up the pte; and truncate_cleanup_page() does a one-page
> > > unmap_mapping_range() if page_mapped(), while holding page lock.
> >
> > Ok, fair, I missed that.
> >
> > So why does truncate_pagecache() talk about fault races and require
> > a second unmap range after the invalidation "for correctness" if
> > this sort of race cannot happen?
>
> I thought the comment
>	 * unmap_mapping_range is called twice, first simply for
>	 * efficiency so that truncate_inode_pages does fewer
>	 * single-page unmaps.  However after this first call, and
>	 * before truncate_inode_pages finishes, it is possible for
>	 * private pages to be COWed, which remain after
>	 * truncate_inode_pages finishes, hence the second
>	 * unmap_mapping_range call must be made for correctness.
> explains it fairly well.

Not to me. It explains what the code is doing, and the why is simply
"correctness". I have no idea what "correctness" actually means in
this context because there is no reference to what correct behaviour
should be. Nor do I have any idea why COW faults might behave
differently to a normal read/write page fault...

> It's because POSIX demanded that when a file
> is truncated, the user will get SIGBUS on trying to access even the
> COWed pages beyond EOF in a MAP_PRIVATE mapping. Page lock on the
> cache page does not serialize the pages COWed from it very well.

And there's the "why". I don't find the "page lock doesn't
serialise COW faults very well" particularly reassuring in this
case....

> But there's no such SIGBUS requirement in the case of hole-punching,
> and trying to unmap those pages racily instantiated just after the
> punching cursor passed, would probably do more harm than good.

There isn't a SIGBUS requirement for fallocate operations, just a
"don't expose stale data to userspace" requirement.

FWIW, how does a COW fault even work with file backed pages? We can
only have a single page attached to the inode address space for a given
offset, so if there's been a COW fault and a new page faulted in for
the write fault in that VMA, doesn't that imply the user data then
written to that page is never going to be written back to storage
because the COW page is not tracked by the inode address space?

> > Why is that different to truncate_pagecache_range(), which -doesn't-
> > do that second removal? It's called for more than just hole_punch -
> > from the filesystem's perspective holepunch should do exactly the
> > same as truncate to the page cache, and for things like
> > COLLAPSE_RANGE it is absolutely essential because the data in that
> > range is -not zero- and will be stale if the mappings are not
> > invalidated completely....
>
> I can't speak to COLLAPSE_RANGE.

It moves data around, doesn't replace data with zeros. Hence the
contents of any page that isn't invalidated entirely by
truncate_pagecache_range() are now entirely incorrect...

> > Also, if page->mapping == NULL is sufficient to detect an invalidated
> > page in all cases, then why does page_cache_delete() explicitly
> > leave page->index intact:
> >
> >	page->mapping = NULL;
> >	/* Leave page->index set: truncation lookup relies upon it */
>
> Because there was, and I think still is (but might it now be xarrayed
> away?), code (mainly in mm/truncate.c) which finds it convenient to
> check page->index for end of range, without necessitating the overhead
> of getting page lock. I've no doubt it's a (minor) optimization that
> could be discarded if there were ever a need to invalidate page->index
> when deleting; but nobody has required that yet.

And that's exactly my concern w.r.t. fallocate based invalidation:
checking the page is beyond EOF without locking the page or checking
the mapping does not detect pages invalidated by hole punching and
other fallocate() operations, because page->index on the invalidated
pages is never beyond EOF....

Cheers,

Dave.
On Mon 21-09-20 18:26:00, Dave Chinner wrote:
> On Thu, Sep 17, 2020 at 12:47:10AM -0700, Hugh Dickins wrote:
> > It's because POSIX demanded that when a file
> > is truncated, the user will get SIGBUS on trying to access even the
> > COWed pages beyond EOF in a MAP_PRIVATE mapping. Page lock on the
> > cache page does not serialize the pages COWed from it very well.
>
> And there's the "why". I don't find the "page lock doesn't
> serialise COW faults very well" particularly reassuring in this
> case....
>
> > But there's no such SIGBUS requirement in the case of hole-punching,
> > and trying to unmap those pages racily instantiated just after the
> > punching cursor passed, would probably do more harm than good.
>
> There isn't a SIGBUS requirement for fallocate operations, just a
> "don't expose stale data to userspace" requirement.
>
> FWIW, how does a COW fault even work with file backed pages? We can
> only have a single page attached to the inode address space for a given
> offset, so if there's been a COW fault and a new page faulted in for
> the write fault in that VMA, doesn't that imply the user data then
> written to that page is never going to be written back to storage
> because the COW page is not tracked by the inode address space?

Correct. Private file mappings work so that on the first write fault at
some page offset we allocate an anonymous page for that offset, copy the
current contents of the corresponding file page into it, and from that
moment on it behaves as an anonymous page. Except that on truncate, we
have to unmap these anonymous pages in private file mappings as well...

								Honza
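[Editor's note: the private-mapping lifecycle Jan describes, coherent with the file until the first write fault, anonymous and divergent afterwards, can be stepped through from userspace. The sketch below (not from the thread; `private_cow_demo` is an illustrative name) is Linux-specific: the pre-COW coherence with write() is Linux behaviour, not a POSIX guarantee.]

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Show the lifecycle of a MAP_PRIVATE page: it tracks the file until
 * the first write fault, then COWs to an anonymous page and diverges.
 * Returns 0 when every step behaves as described, else the step that
 * failed. */
int private_cow_demo(void)
{
	char path[] = "/tmp/privXXXXXX";
	int fd = mkstemp(path);
	if (fd < 0)
		return 1;
	unlink(path);

	char page[4096];
	memset(page, 'A', sizeof(page));
	if (pwrite(fd, page, sizeof(page), 0) != sizeof(page))
		return 1;

	char *priv = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE, fd, 0);
	if (priv == MAP_FAILED)
		return 1;

	if (priv[0] != 'A')	/* read fault: maps the file's own page */
		return 2;
	pwrite(fd, "B", 1, 0);
	if (priv[0] != 'B')	/* write() still visible (Linux QoI) */
		return 3;

	priv[0] = 'C';		/* write fault: COW to an anon page */
	pwrite(fd, "D", 1, 0);
	if (priv[0] != 'C')	/* file updates no longer visible */
		return 4;

	char c;
	if (pread(fd, &c, 1, 0) != 1 || c != 'D')
		return 5;	/* the private write never hit the file */

	munmap(priv, 4096);
	close(fd);
	return 0;
}
```

Step 4 and step 5 together answer Dave's question: after the COW the anonymous page is no longer tracked by the inode address space, so nothing written through the private mapping is ever written back.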
On Mon, Sep 21, 2020 at 2:11 AM Jan Kara <jack@suse.cz> wrote:
>
> Except that on truncate, we have to unmap these
> anonymous pages in private file mappings as well...

I'm actually not 100% sure we strictly would need to care.

Once we've faulted in a private file mapping page, that page is
"ours". That's kind of what MAP_PRIVATE means.

If we haven't written to it, we do keep things coherent with the file,
but that's actually not required by POSIX afaik - it's a QoI issue,
and a lot of (bad) Unixes didn't do it at all.

So as long as truncate _clears_ the pages it truncates, I think we'd
actually be ok. The SIGBUS is supposed to happen, but that's really
only relevant for the _first_ access. Once we've accessed the page,
and have it mapped, the private part really means that there are no
guarantees it stays coherent.

In particular, obviously if we've written to a page, we've lost the
association with the original file entirely. And I'm pretty sure that
a private mapping is allowed to act as if it was a private copy
without that association in the first place.

That said, this _is_ a QoI thing, and in Linux we've generally tried
quite hard to stay as coherent as possible even with private mappings.

In fact, before we had real shared file mappings (in a distant past,
long long ago), we allowed read-only shared mappings because we
internally turned them into read-only private mappings, and our
private mappings were coherent.

And those "fake" read-only shared mappings actually were much better
than some other platforms' "real" shared mappings (*cough*hpux*cough*)
and actually worked with things that mixed "write()" and "mmap()" and
expected coherency.

Admittedly the only case I'm aware of that did that was nntpd or
something like that. Exactly because a lot of Unixes were *not*
coherent (either because they had actual hardware cache coherency
issues, or because their VM was not set up for it).

                Linus
On Mon, Sep 21, 2020 at 09:20:25AM -0700, Linus Torvalds wrote:
> On Mon, Sep 21, 2020 at 2:11 AM Jan Kara <jack@suse.cz> wrote:
> >
> > Except that on truncate, we have to unmap these
> > anonymous pages in private file mappings as well...
>
> I'm actually not 100% sure we strictly would need to care.
>
> Once we've faulted in a private file mapping page, that page is
> "ours". That's kind of what MAP_PRIVATE means.
>
> If we haven't written to it, we do keep things coherent with the file,
> but that's actually not required by POSIX afaik - it's a QoI issue,
> and a lot of (bad) Unixes didn't do it at all.
>
> So as long as truncate _clears_ the pages it truncates, I think we'd
> actually be ok.

We don't even need to do that ...

	"If the size of the mapped file changes after the call to mmap()
	as a result of some other operation on the mapped file, the effect
	of references to portions of the mapped region that correspond to
	added or removed portions of the file is unspecified."

https://pubs.opengroup.org/onlinepubs/9699919799/functions/mmap.html

As you say, there's a QoI here, and POSIX permits some shockingly
bad and useless implementations.
On Mon 21-09-20 18:59:43, Matthew Wilcox wrote:
> On Mon, Sep 21, 2020 at 09:20:25AM -0700, Linus Torvalds wrote:
> > On Mon, Sep 21, 2020 at 2:11 AM Jan Kara <jack@suse.cz> wrote:
> > >
> > > Except that on truncate, we have to unmap these
> > > anonymous pages in private file mappings as well...
> >
> > I'm actually not 100% sure we strictly would need to care.
> >
> > Once we've faulted in a private file mapping page, that page is
> > "ours". That's kind of what MAP_PRIVATE means.
> >
> > If we haven't written to it, we do keep things coherent with the file,
> > but that's actually not required by POSIX afaik - it's a QoI issue,
> > and a lot of (bad) Unixes didn't do it at all.
> >
> > So as long as truncate _clears_ the pages it truncates, I think we'd
> > actually be ok.
>
> We don't even need to do that ...
>
> 	"If the size of the mapped file changes after the call to mmap()
> 	as a result of some other operation on the mapped file, the effect
> 	of references to portions of the mapped region that correspond to
> 	added or removed portions of the file is unspecified."
>
> https://pubs.opengroup.org/onlinepubs/9699919799/functions/mmap.html
>
> As you say, there's a QoI here, and POSIX permits some shockingly
> bad and useless implementations.

Something from the ftruncate(2) POSIX definition [1] for comparison:

	If the effect of ftruncate() is to decrease the size of a memory
	mapped file or a shared memory object and whole pages beyond the
	new end were previously mapped, then the whole pages beyond the
	new end shall be discarded. References to discarded pages shall
	result in the generation of a SIGBUS signal.

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/ftruncate.html

Now pick... ;)

								Honza
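[Editor's note: the ftruncate() side of Jan's comparison, mandatory SIGBUS on whole pages beyond the new end, holds for shared mappings on Linux and can be checked directly. The sketch below (not from the thread; `shared_truncate_demo` and `on_sigbus` are illustrative names) mirrors the quoted POSIX ftruncate text.]

```c
#define _GNU_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf bus_env;

static void on_sigbus(int sig)
{
	(void)sig;
	siglongjmp(bus_env, 1);
}

/* Shrink a file under a MAP_SHARED mapping and show that touching a
 * whole page beyond the new end raises SIGBUS, per POSIX ftruncate().
 * Returns 0 when SIGBUS is delivered. */
int shared_truncate_demo(void)
{
	char path[] = "/tmp/shtruncXXXXXX";
	int fd = mkstemp(path);
	if (fd < 0)
		return 1;
	unlink(path);
	if (ftruncate(fd, 2 * 4096) != 0)	/* two-page file */
		return 1;

	char *map = mmap(NULL, 2 * 4096, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;
	map[4096] = 'X';	/* fault the second page in, dirty it */

	if (ftruncate(fd, 4096) != 0)	/* second page now beyond EOF */
		return 1;

	struct sigaction sa;
	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigbus;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, NULL);

	if (sigsetjmp(bus_env, 1) == 0) {
		volatile char c = map[4096];	/* should fault */
		(void)c;
		return 1;	/* no SIGBUS delivered */
	}
	return 0;		/* SIGBUS: the discarded page is gone */
}
```

This is exactly the truncate_pagecache() sequence at work: the PTE is zapped, the page invalidated, and the refault fails the "beyond i_size" check, whereas the mmap() text Matthew quotes would permit almost anything here.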
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 7b05f8fd7b3d..4b185a907432 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1266,10 +1266,23 @@ xfs_filemap_pfn_mkwrite(
 	return __xfs_filemap_fault(vmf, PE_SIZE_PTE, true);
 }
 
+static void
+xfs_filemap_map_pages(
+	struct vm_fault		*vmf,
+	pgoff_t			start_pgoff,
+	pgoff_t			end_pgoff)
+{
+	struct inode		*inode = file_inode(vmf->vma->vm_file);
+
+	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+	filemap_map_pages(vmf, start_pgoff, end_pgoff);
+	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+}
+
 static const struct vm_operations_struct xfs_file_vm_ops = {
 	.fault		= xfs_filemap_fault,
 	.huge_fault	= xfs_filemap_huge_fault,
-	.map_pages	= filemap_map_pages,
+	.map_pages	= xfs_filemap_map_pages,
 	.page_mkwrite	= xfs_filemap_page_mkwrite,
 	.pfn_mkwrite	= xfs_filemap_pfn_mkwrite,
 };