Message ID | 416a465a9fbe1d27085883dbf652c115cd195697.1503523424.git.dodgen@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Wed 23-08-17 14:26:52, rdodgen@gmail.com wrote: > From: Randy Dodgen <dodgen@google.com> > > If an ext4 filesystem is mounted with both the DAX and read-only > options, executables on that filesystem will fail to start (claiming > 'Segmentation fault') due to the fault handler returning > VM_FAULT_SIGBUS. > > This is due to the DAX fault handler (see ext4_dax_huge_fault) > attempting to write to the journal when FAULT_FLAG_WRITE is set. This is > the wrong behavior for write faults which will lead to a COW page; in > particular, this fails for readonly mounts. > > This change avoids journal writes for faults that are expected to COW. > > It might be the case that this could be better handled in > ext4_iomap_begin / ext4_iomap_end (called via iomap_ops inside > dax_iomap_fault). These is some overlap already (e.g. grabbing journal > handles). > > Signed-off-by: Randy Dodgen <dodgen@google.com> Thanks for the verbose comment :). The patch looks good to me. You can add: Reviewed-by: Jan Kara <jack@suse.cz> Honza > --- > > This version is simplified as suggested by Ross; all fault sizes and fallbacks > are handled by dax_iomap_fault. > > fs/ext4/file.c | 15 ++++++++++++++- > 1 file changed, 14 insertions(+), 1 deletion(-) > > diff --git a/fs/ext4/file.c b/fs/ext4/file.c > index 0d7cf0cc9b87..dc1e1fb6b54c 100644 > --- a/fs/ext4/file.c > +++ b/fs/ext4/file.c > @@ -279,7 +279,20 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf, > handle_t *handle = NULL; > struct inode *inode = file_inode(vmf->vma->vm_file); > struct super_block *sb = inode->i_sb; > - bool write = vmf->flags & FAULT_FLAG_WRITE; > + > + /* > + * We have to distinguish real writes from writes which will result in a > + * COW page; COW writes should *not* poke the journal (the file will not > + * be changed). Doing so would cause unintended failures when mounted > + * read-only. > + * > + * We check for VM_SHARED rather than vmf->cow_page since the latter is > + * unset for pe_size != PE_SIZE_PTE (i.e. only in do_cow_fault); for > + * other sizes, dax_iomap_fault will handle splitting / fallback so that > + * we eventually come back with a COW page. > + */ > + bool write = (vmf->flags & FAULT_FLAG_WRITE) && > + (vmf->vma->vm_flags & VM_SHARED); > > if (write) { > sb_start_pagefault(sb); > -- > 2.14.1.342.g6490525c54-goog >
On Wed, Aug 23, 2017 at 02:26:52PM -0700, rdodgen@gmail.com wrote: > From: Randy Dodgen <dodgen@google.com> > > If an ext4 filesystem is mounted with both the DAX and read-only > options, executables on that filesystem will fail to start (claiming > 'Segmentation fault') due to the fault handler returning > VM_FAULT_SIGBUS. > > This is due to the DAX fault handler (see ext4_dax_huge_fault) > attempting to write to the journal when FAULT_FLAG_WRITE is set. This is > the wrong behavior for write faults which will lead to a COW page; in > particular, this fails for readonly mounts. > > This change avoids journal writes for faults that are expected to COW. > > It might be the case that this could be better handled in > ext4_iomap_begin / ext4_iomap_end (called via iomap_ops inside > dax_iomap_fault). These is some overlap already (e.g. grabbing journal > handles). > > Signed-off-by: Randy Dodgen <dodgen@google.com> Cool, looks good from the DAX point of view. You can add: Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
On Wed, Aug 23, 2017 at 02:26:52PM -0700, rdodgen@gmail.com wrote: > From: Randy Dodgen <dodgen@google.com> > > If an ext4 filesystem is mounted with both the DAX and read-only > options, executables on that filesystem will fail to start (claiming > 'Segmentation fault') due to the fault handler returning > VM_FAULT_SIGBUS. > > This is due to the DAX fault handler (see ext4_dax_huge_fault) > attempting to write to the journal when FAULT_FLAG_WRITE is set. This is > the wrong behavior for write faults which will lead to a COW page; in > particular, this fails for readonly mounts. > > This change avoids journal writes for faults that are expected to COW. > > It might be the case that this could be better handled in > ext4_iomap_begin / ext4_iomap_end (called via iomap_ops inside > dax_iomap_fault). These is some overlap already (e.g. grabbing journal > handles). > > Signed-off-by: Randy Dodgen <dodgen@google.com> > --- > > This version is simplified as suggested by Ross; all fault sizes and fallbacks > are handled by dax_iomap_fault. We really need to do the same for ext2 and xfs. And we really should be doing that in common VM code instead of the file system. See my recent xfs synchronous page fault series on the mess the inconsistent handling of the FAULT_FLAG_WRITE causes. I think we just need a new FAULT_FLAG_ALLOC or similar for those page faults that needs to allocate space instead of piling hacks over hacks, and make sure it's only set over the method that will actually do the allocation, as the DAX and non-DAX path are not consistent on that. Also any chance you could write an xfstest for this?
On Wed, Aug 23, 2017 at 02:26:52PM -0700, rdodgen@gmail.com wrote: > From: Randy Dodgen <dodgen@google.com> > > If an ext4 filesystem is mounted with both the DAX and read-only > options, executables on that filesystem will fail to start (claiming > 'Segmentation fault') due to the fault handler returning > VM_FAULT_SIGBUS. > > This is due to the DAX fault handler (see ext4_dax_huge_fault) > attempting to write to the journal when FAULT_FLAG_WRITE is set. This is > the wrong behavior for write faults which will lead to a COW page; in > particular, this fails for readonly mounts. > > This change avoids journal writes for faults that are expected to COW. > > It might be the case that this could be better handled in > ext4_iomap_begin / ext4_iomap_end (called via iomap_ops inside > dax_iomap_fault). These is some overlap already (e.g. grabbing journal > handles). > > Signed-off-by: Randy Dodgen <dodgen@google.com> Applied, thanks!! - Ted
On Thu, Aug 24, 2017 at 09:01:44AM -0700, Christoph Hellwig wrote: > > We really need to do the same for ext2 and xfs. And we really should > be doing that in common VM code instead of the file system. See > my recent xfs synchronous page fault series on the mess the inconsistent > handling of the FAULT_FLAG_WRITE causes. I think we just need a new > FAULT_FLAG_ALLOC or similar for those page faults that needs to > allocate space instead of piling hacks over hacks, and make sure > it's only set over the method that will actually do the allocation, > as the DAX and non-DAX path are not consistent on that. Yeah, agreed, that's the cleaner way of doing that. It does mean we'll have sweep through all of the file systems so that they DTRT with this new FAULT_FLAG_ALLOC, though, right? - Ted
On Thu, Aug 24, 2017 at 04:57:27PM -0400, Theodore Ts'o wrote: > Yeah, agreed, that's the cleaner way of doing that. It does mean > we'll have sweep through all of the file systems so that they DTRT > with this new FAULT_FLAG_ALLOC, though, right? I think the only ones that it matters for for now are the DAX fault handlers. So we can add the new flag, check it in ext2, ext4 and xfs for now and we're done. But if we eventually want to phase out the ->page_mkwrite handler it would be useful for other file systems as well.
Randy, any chance yto at least share a test script so that others can wire it up for the test suite?
On Tue, Aug 29, 2017 at 02:20:02PM -0700, Christoph Hellwig wrote: > Randy, any chance yto at least share a test script so that others > can wire it up for the test suite? I made a reproducer for my testing. I'll make an xfstest if Randy isn't able to.
An xfstest is a fine idea. I've started some work on that. On Tue, Aug 29, 2017 at 2:37 PM, Ross Zwisler <ross.zwisler@linux.intel.com> wrote: > On Tue, Aug 29, 2017 at 02:20:02PM -0700, Christoph Hellwig wrote: >> Randy, any chance yto at least share a test script so that others >> can wire it up for the test suite? > > I made a reproducer for my testing. I'll make an xfstest if Randy isn't able > to.
diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 0d7cf0cc9b87..dc1e1fb6b54c 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -279,7 +279,20 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf, handle_t *handle = NULL; struct inode *inode = file_inode(vmf->vma->vm_file); struct super_block *sb = inode->i_sb; - bool write = vmf->flags & FAULT_FLAG_WRITE; + + /* + * We have to distinguish real writes from writes which will result in a + * COW page; COW writes should *not* poke the journal (the file will not + * be changed). Doing so would cause unintended failures when mounted + * read-only. + * + * We check for VM_SHARED rather than vmf->cow_page since the latter is + * unset for pe_size != PE_SIZE_PTE (i.e. only in do_cow_fault); for + * other sizes, dax_iomap_fault will handle splitting / fallback so that + * we eventually come back with a COW page. + */ + bool write = (vmf->flags & FAULT_FLAG_WRITE) && + (vmf->vma->vm_flags & VM_SHARED); if (write) { sb_start_pagefault(sb);