dax, pmem: add support for msync

Message ID 20150902091321.GA2323@node.dhcp.inet.fi
State New

Commit Message

Kirill A. Shutemov Sept. 2, 2015, 9:13 a.m. UTC
On Wed, Sep 02, 2015 at 08:49:22AM +1000, Dave Chinner wrote:
> On Tue, Sep 01, 2015 at 01:08:04PM +0300, Kirill A. Shutemov wrote:
> > On Tue, Sep 01, 2015 at 09:38:03AM +1000, Dave Chinner wrote:
> > > On Mon, Aug 31, 2015 at 12:59:44PM -0600, Ross Zwisler wrote:
> > > Even for DAX, msync has to call vfs_fsync_range() for the filesystem to commit
> > > the backing store allocations to stable storage, so there's no
> > > getting around the fact msync is the wrong place to be flushing
> > > DAX mappings to persistent storage.
> > 
> > Why?
> > IIUC, msync() doesn't have any requirements wrt metadata, right?
> 
> Of course it does. If the backing store allocation has not been
> committed, then after a crash there will be a hole in file and
> so it will read as zeroes regardless of what data was written and
> flushed.

Any reason why backing store allocation cannot be committed on *_mkwrite?

> > > I pointed this out almost 6 months ago (i.e. that fsync was broken)
> > > and hinted at how to solve it. Fix fsync, and msync gets fixed for
> > > free:
> > > 
> > > https://lists.01.org/pipermail/linux-nvdimm/2015-March/000341.html
> > > 
> > > I've also reported to Willy that DAX write page faults don't work
> > > correctly, either. xfstests generic/080 exposes this: a read
> > > from a page followed immediately by a write to that page does not
> > > result in ->page_mkwrite being called on the write and so
> > > backing store is not allocated for the page, nor are the timestamps
> > > for the file updated. This will also result in fsync (and msync)
> > > not working properly.
> > 
> > Is that because XFS doesn't provide vm_ops->pfn_mkwrite?
> 
> I didn't know that had been committed. I don't recall seeing a pull
> request with that in it

It went through the -mm tree.

> none of the XFS DAX patches conflicted
> against it and there's been no runtime errors. I'll fix it up.
> 
> As such, shouldn't there be a check in the VM (in ->mmap callers)
> that if the vma is returned with VM_MIXEDMAP enabled,
> ->pfn_mkwrite is not NULL?  It's now clear to me that any filesystem
> that sets VM_MIXEDMAP needs to support both page_mkwrite and
> pfn_mkwrite, and such a check would have caught this immediately...

I guess it's a "both or none" case. We have VM_MIXEDMAP users who don't care
about *_mkwrite.

I'm not yet sure it would be always correct, but something like this will
catch the XFS case, without false-positive on other stuff in my KVM setup:

Comments

Boaz Harrosh Sept. 2, 2015, 9:37 a.m. UTC | #1
On 09/02/2015 12:13 PM, Kirill A. Shutemov wrote:
> On Wed, Sep 02, 2015 at 08:49:22AM +1000, Dave Chinner wrote:
>> On Tue, Sep 01, 2015 at 01:08:04PM +0300, Kirill A. Shutemov wrote:
>>> On Tue, Sep 01, 2015 at 09:38:03AM +1000, Dave Chinner wrote:
>>>> On Mon, Aug 31, 2015 at 12:59:44PM -0600, Ross Zwisler wrote:
>>>> Even for DAX, msync has to call vfs_fsync_range() for the filesystem to commit
>>>> the backing store allocations to stable storage, so there's no
>>>> getting around the fact msync is the wrong place to be flushing
>>>> DAX mappings to persistent storage.
>>>
>>> Why?
>>> IIUC, msync() doesn't have any requirements wrt metadata, right?
>>
>> Of course it does. If the backing store allocation has not been
>> committed, then after a crash there will be a hole in file and
>> so it will read as zeroes regardless of what data was written and
>> flushed.
> 
> Any reason why backing store allocation cannot be committed on *_mkwrite?
> 
>>>> I pointed this out almost 6 months ago (i.e. that fsync was broken)
>>>> and hinted at how to solve it. Fix fsync, and msync gets fixed for
>>>> free:
>>>>
>>>> https://lists.01.org/pipermail/linux-nvdimm/2015-March/000341.html
>>>>
>>>> I've also reported to Willy that DAX write page faults don't work
>>>> correctly, either. xfstests generic/080 exposes this: a read
>>>> from a page followed immediately by a write to that page does not
>>>> result in ->page_mkwrite being called on the write and so
>>>> backing store is not allocated for the page, nor are the timestamps
>>>> for the file updated. This will also result in fsync (and msync)
>>>> not working properly.
>>>
>>> Is that because XFS doesn't provide vm_ops->pfn_mkwrite?
>>
>> I didn't know that had been committed. I don't recall seeing a pull
>> request with that in it
> 
> It went through the -mm tree.
> 
>> none of the XFS DAX patches conflicted
>> against it and there's been no runtime errors. I'll fix it up.
>>
>> As such, shouldn't there be a check in the VM (in ->mmap callers)
>> that if the vma is returned with VM_MIXEDMAP enabled,
>> ->pfn_mkwrite is not NULL?  It's now clear to me that any filesystem
>> that sets VM_MIXEDMAP needs to support both page_mkwrite and
>> pfn_mkwrite, and such a check would have caught this immediately...
> 
> I guess it's a "both or none" case. We have VM_MIXEDMAP users who don't care
> about *_mkwrite.
> 
> I'm not yet sure it would be always correct, but something like this will
> catch the XFS case, without false-positive on other stuff in my KVM setup:
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 3f78bceefe5a..f2e29a541e14 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1645,6 +1645,15 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>                         vma->vm_ops = &dummy_ops;
>                 }
>  
> +               /*
> +                * Make sure that for VM_MIXEDMAP VMA has both
> +                * vm_ops->page_mkwrite and vm_ops->pfn_mkwrite or has none.
> +                */
> +               if ((vma->vm_ops->page_mkwrite || vma->vm_ops->pfn_mkwrite) &&
> +                               vma->vm_flags & VM_MIXEDMAP) {
> +                       VM_BUG_ON_VMA(!vma->vm_ops->page_mkwrite, vma);
> +                       VM_BUG_ON_VMA(!vma->vm_ops->pfn_mkwrite, vma);

BTW: the page_mkwrite is used for reading of holes that put zero-pages at the radix tree.
     One can just map a single global zero-page in pfn-mode for that.

Hi Kirill. Please don't make these BUG_ONs; it's counterproductive, believe me.
Please make them WARN_ON_ONCE(); it is not a crashing bug to work like this.
(Actually it is not a bug at all in some cases, but we can relax that when a user
 comes up)

Thanks
Boaz

> +               }
>                 addr = vma->vm_start;
>                 vm_flags = vma->vm_flags;
>         } else if (vm_flags & VM_SHARED) {
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Boaz Harrosh Sept. 2, 2015, 9:41 a.m. UTC | #2
On 09/02/2015 12:37 PM, Boaz Harrosh wrote:
>>  
>> +               /*
>> +                * Make sure that for VM_MIXEDMAP VMA has both
>> +                * vm_ops->page_mkwrite and vm_ops->pfn_mkwrite or has none.
>> +                */
>> +               if ((vma->vm_ops->page_mkwrite || vma->vm_ops->pfn_mkwrite) &&
>> +                               vma->vm_flags & VM_MIXEDMAP) {
>> +                       VM_BUG_ON_VMA(!vma->vm_ops->page_mkwrite, vma);
>> +                       VM_BUG_ON_VMA(!vma->vm_ops->pfn_mkwrite, vma);
> 
> BTW: the page_mkwrite is used for reading of holes that put zero-pages at the radix tree.
>      One can just map a single global zero-page in pfn-mode for that.
> 
> Hi Kirill. Please don't make these BUG_ONs; it's counterproductive, believe me.
> Please make them WARN_ON_ONCE(); it is not a crashing bug to work like this.
> (Actually it is not a bug at all in some cases, but we can relax that when a user
>  comes up)
> 
> Thanks
> Boaz
> 

On second thought, I do not like this patch. This is what we have xfstests for; the fact
is that test 080 catches this. For me this is enough.

An FS developer should test his code, and worst case we help him on ML, like we did
in this case.

Thanks
Boaz

>> +               }
>>                 addr = vma->vm_start;
>>                 vm_flags = vma->vm_flags;
>>         } else if (vm_flags & VM_SHARED) {
>>
> 

Kirill A. Shutemov Sept. 2, 2015, 9:47 a.m. UTC | #3
On Wed, Sep 02, 2015 at 12:41:44PM +0300, Boaz Harrosh wrote:
> On 09/02/2015 12:37 PM, Boaz Harrosh wrote:
> >>  
> >> +               /*
> >> +                * Make sure that for VM_MIXEDMAP VMA has both
> >> +                * vm_ops->page_mkwrite and vm_ops->pfn_mkwrite or has none.
> >> +                */
> >> +               if ((vma->vm_ops->page_mkwrite || vma->vm_ops->pfn_mkwrite) &&
> >> +                               vma->vm_flags & VM_MIXEDMAP) {
> >> +                       VM_BUG_ON_VMA(!vma->vm_ops->page_mkwrite, vma);
> >> +                       VM_BUG_ON_VMA(!vma->vm_ops->pfn_mkwrite, vma);
> > 
> > BTW: the page_mkwrite is used for reading of holes that put zero-pages at the radix tree.
> >      One can just map a single global zero-page in pfn-mode for that.
> > 
> > > Hi Kirill. Please don't make these BUG_ONs; it's counterproductive, believe me.

This is VM_BUG_ON, not normal BUG_ON. VM_BUG_ON is under CONFIG_DEBUG_VM 
which is disabled on production kernels.
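
To illustrate the distinction, here is a minimal userspace sketch of an assertion that is compiled out unless a debug option is set, next to a warn-once check that is always active. All DEMO_* and demo_* names are hypothetical stand-ins and not kernel code; DEMO_DEBUG_VM plays the role of CONFIG_DEBUG_VM and is left undefined here, as on a production build.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for CONFIG_DEBUG_VM; left undefined on purpose. */
/* #define DEMO_DEBUG_VM 1 */

int demo_bug_hits;    /* times the debug-only assertion fired */
int demo_warn_prints; /* times the warn-once check printed */

#ifdef DEMO_DEBUG_VM
/* Debug build: the condition is evaluated and recorded. */
#define DEMO_VM_BUG_ON(cond) do { if (cond) demo_bug_hits++; } while (0)
#else
/* Non-debug build: the condition is type-checked but never evaluated. */
#define DEMO_VM_BUG_ON(cond) do { (void)sizeof(cond); } while (0)
#endif

/* Always active, but reports at most once per call site. */
#define DEMO_WARN_ON_ONCE(cond) do {                                  \
        static bool demo_warned;                                      \
        if ((cond) && !demo_warned) {                                 \
                demo_warned = true;                                   \
                demo_warn_prints++;                                   \
                fprintf(stderr, "demo warning: %s\n", #cond);         \
        }                                                             \
} while (0)

/* Runs both kinds of check on the same condition. */
void demo_check(bool broken)
{
        DEMO_VM_BUG_ON(broken);
        DEMO_WARN_ON_ONCE(broken);
}
```

With DEMO_DEBUG_VM undefined, calling demo_check(true) repeatedly never touches demo_bug_hits, while the warn-once path reports exactly one time.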

> > > Please make them WARN_ON_ONCE(); it is not a crashing bug to work like this.
> > (Actually it is not a bug at all in some cases, but we can relax that when a user
> >  comes up)
> > 
> > Thanks
> > Boaz
> > 
> 
> On second thought, I do not like this patch. This is what we have xfstests for; the fact
> is that test 080 catches this. For me this is enough.

I don't insist on applying the patch. And I worry about false-positives.

> An FS developer should test his code, and worst case we help him on ML, like we did
> in this case.
> 
> Thanks
> Boaz
> 
> >> +               }
> >>                 addr = vma->vm_start;
> >>                 vm_flags = vma->vm_flags;
> >>         } else if (vm_flags & VM_SHARED) {
> >>
> > 
>
Boaz Harrosh Sept. 2, 2015, 10:28 a.m. UTC | #4
On 09/02/2015 12:47 PM, Kirill A. Shutemov wrote:
<>
> 
> I don't insist on applying the patch. And I worry about false-positives.
> 

Thanks, yes
Boaz

Dave Chinner Sept. 3, 2015, 12:57 a.m. UTC | #5
On Wed, Sep 02, 2015 at 12:13:21PM +0300, Kirill A. Shutemov wrote:
> On Wed, Sep 02, 2015 at 08:49:22AM +1000, Dave Chinner wrote:
> > On Tue, Sep 01, 2015 at 01:08:04PM +0300, Kirill A. Shutemov wrote:
> > > On Tue, Sep 01, 2015 at 09:38:03AM +1000, Dave Chinner wrote:
> > > > On Mon, Aug 31, 2015 at 12:59:44PM -0600, Ross Zwisler wrote:
> > > > Even for DAX, msync has to call vfs_fsync_range() for the filesystem to commit
> > > > the backing store allocations to stable storage, so there's no
> > > > getting around the fact msync is the wrong place to be flushing
> > > > DAX mappings to persistent storage.
> > > 
> > > Why?
> > > IIUC, msync() doesn't have any requirements wrt metadata, right?
> > 
> > Of course it does. If the backing store allocation has not been
> > committed, then after a crash there will be a hole in file and
> > so it will read as zeroes regardless of what data was written and
> > flushed.
> 
> Any reason why backing store allocation cannot be committed on *_mkwrite?

Oh, I could change that if you want, it'll just be ridiculously
slow because it requires journal flushes on every page fault that
needs to change the filesystem block map (i.e. every allocation and/or
every unwritten extent conversion).

Synchronous journalling requires flushing the log on every
transaction commit. That involves switching to a work queue, copying
the changes into a log buffer, issuing IO to flush the journal,
waiting for that to complete, etc. That is, synchronous journalling
incurs a minimum overhead of 4 context switches per page fault that
needs to allocate/convert backing store, along with all the CPU time
needed to process the journal commit.
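
For reference, the userspace sequence this thread is about can be sketched as below: a read from a mapped page that is still a hole, a write to the same page, then msync(). This is only an illustration using standard POSIX calls; demo_mmap_write_sync() is a hypothetical helper, not part of xfstests or the kernel, and it runs on any filesystem, not just DAX.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a one-page sparse file, read from the mapped page, write to it,
 * and msync() the mapping. Returns 0 on success, -1 on failure.
 */
int demo_mmap_write_sync(const char *path, const char *msg)
{
        long page = sysconf(_SC_PAGESIZE);
        size_t len;
        char *map;
        char probe;
        int fd, rc = -1;

        assert(path != NULL && msg != NULL);
        len = strlen(msg);
        if (page <= 0 || len > (size_t)page)
                return -1;

        fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
                return -1;

        /* Extend the file without writing: the page is a hole for now. */
        if (ftruncate(fd, (off_t)page) != 0)
                goto out_close;

        map = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED)
                goto out_close;

        /* Read fault first, then a write to the same page. */
        probe = *(volatile char *)map;
        (void)probe;
        memcpy(map, msg, len);

        /* The caller expects the data to be durable once this returns. */
        if (msync(map, (size_t)page, MS_SYNC) == 0)
                rc = 0;

        munmap(map, (size_t)page);
out_close:
        close(fd);
        return rc;
}
```

The write after the read is the step where the filesystem has to allocate backing store, which is why the cost of committing that allocation matters for every such fault.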

> diff --git a/mm/mmap.c b/mm/mmap.c
> index 3f78bceefe5a..f2e29a541e14 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1645,6 +1645,15 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>                         vma->vm_ops = &dummy_ops;
>                 }
>  
> +               /*
> +                * Make sure that for VM_MIXEDMAP VMA has both
> +                * vm_ops->page_mkwrite and vm_ops->pfn_mkwrite or has none.
> +                */
> +               if ((vma->vm_ops->page_mkwrite || vma->vm_ops->pfn_mkwrite) &&
> +                               vma->vm_flags & VM_MIXEDMAP) {
> +                       VM_BUG_ON_VMA(!vma->vm_ops->page_mkwrite, vma);
> +                       VM_BUG_ON_VMA(!vma->vm_ops->pfn_mkwrite, vma);
> +               }

That doesn't really help developers who don't use CONFIG_DEBUG_VM, i.e.
it's the FS developers that you need to warn, not VM developers.
In this case a WARN_ON_ONCE is probably more appropriate.

Cheers,

Dave.

Patch

diff --git a/mm/mmap.c b/mm/mmap.c
index 3f78bceefe5a..f2e29a541e14 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1645,6 +1645,15 @@  unsigned long mmap_region(struct file *file, unsigned long addr,
                        vma->vm_ops = &dummy_ops;
                }
 
+               /*
+                * Make sure that for VM_MIXEDMAP VMA has both
+                * vm_ops->page_mkwrite and vm_ops->pfn_mkwrite or has none.
+                */
+               if ((vma->vm_ops->page_mkwrite || vma->vm_ops->pfn_mkwrite) &&
+                               vma->vm_flags & VM_MIXEDMAP) {
+                       VM_BUG_ON_VMA(!vma->vm_ops->page_mkwrite, vma);
+                       VM_BUG_ON_VMA(!vma->vm_ops->pfn_mkwrite, vma);
+               }
                addr = vma->vm_start;
                vm_flags = vma->vm_flags;
        } else if (vm_flags & VM_SHARED) {
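
The "both or none" rule that the patch asserts can be modelled in userspace as a plain predicate. This is a sketch only: the struct layouts, the DEMO_VM_MIXEDMAP value, and every demo_* name are hypothetical simplifications, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_VM_MIXEDMAP 0x1UL /* hypothetical flag value, not the kernel's */

/* Simplified stand-in for the two handlers discussed in the thread. */
struct demo_vm_ops {
        int (*page_mkwrite)(void);
        int (*pfn_mkwrite)(void);
};

/* Simplified stand-in for a VMA: flags plus an operations table. */
struct demo_vma {
        unsigned long vm_flags;
        const struct demo_vm_ops *vm_ops;
};

/* Placeholder handler so that tables can be populated. */
int demo_mkwrite_stub(void)
{
        return 0;
}

/*
 * Returns true when the VMA satisfies the rule: a mixed-map VMA provides
 * either both mkwrite handlers or neither. Other VMAs always pass.
 */
bool demo_mkwrite_ops_consistent(const struct demo_vma *vma)
{
        bool has_page, has_pfn;

        assert(vma != NULL && vma->vm_ops != NULL);
        if (!(vma->vm_flags & DEMO_VM_MIXEDMAP))
                return true;

        has_page = vma->vm_ops->page_mkwrite != NULL;
        has_pfn = vma->vm_ops->pfn_mkwrite != NULL;
        return has_page == has_pfn;
}
```

A table that sets only page_mkwrite on a mixed-map VMA, which is the XFS case described in the thread, is the one combination this predicate rejects.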