diff mbox

[6/7] mm, fs: introduce file_operations->post_mmap()

Message ID 20170925231404.32723-7-ross.zwisler@linux.intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Ross Zwisler Sept. 25, 2017, 11:14 p.m. UTC
When mappings are created the vma->vm_flags that they use vary based on
whether the inode being mapped is using DAX or not.  This setup happens in
XFS via mmap_region()=>call_mmap()=>xfs_file_mmap().

For us to be able to safely use the DAX per-inode flag we need to prevent
S_DAX transitions when any mappings are present, and we will do that by
looking at the address_space->i_mmap tree and returning -EBUSY if any
mappings are present.

Unfortunately at the time that the filesystem's file_operations->mmap()
entry point is called the mapping has not yet been added to the
address_space->i_mmap tree.  This means that at that point in time we
cannot determine whether or not the mapping will be set up to support DAX.

Fix this by adding a new file_operations entry called post_mmap() which is
called after the mapping has been added to the address_space->i_mmap tree.
This post_mmap() op now happens at a time when we can be sure whether the
mapping will use DAX or not, and we can set up the vma->vm_flags
appropriately.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/xfs/xfs_file.c  | 15 ++++++++++++++-
 include/linux/fs.h |  1 +
 mm/mmap.c          |  2 ++
 3 files changed, 17 insertions(+), 1 deletion(-)

Comments

Dan Williams Sept. 25, 2017, 11:38 p.m. UTC | #1
On Mon, Sep 25, 2017 at 4:14 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> When mappings are created the vma->vm_flags that they use vary based on
> whether the inode being mapped is using DAX or not.  This setup happens in
> XFS via mmap_region()=>call_mmap()=>xfs_file_mmap().
>
> For us to be able to safely use the DAX per-inode flag we need to prevent
> S_DAX transitions when any mappings are present, and we will do that by
> looking at the address_space->i_mmap tree and returning -EBUSY if any
> mappings are present.
>
> Unfortunately at the time that the filesystem's file_operations->mmap()
> entry point is called the mapping has not yet been added to the
> address_space->i_mmap tree.  This means that at that point in time we
> cannot determine whether or not the mapping will be set up to support DAX.
>
> Fix this by adding a new file_operations entry called post_mmap() which is
> called after the mapping has been added to the address_space->i_mmap tree.
> This post_mmap() op now happens at a time when we can be sure whether the
> mapping will use DAX or not, and we can set up the vma->vm_flags
> appropriately.
>
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/xfs/xfs_file.c  | 15 ++++++++++++++-
>  include/linux/fs.h |  1 +
>  mm/mmap.c          |  2 ++
>  3 files changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 2816858..9d66aaa 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1087,9 +1087,21 @@ xfs_file_mmap(
>  {
>         file_accessed(filp);
>         vma->vm_ops = &xfs_file_vm_ops;
> +       return 0;
> +}
> +
> +/* This call happens during mmap(), after the vma has been inserted into the
> + * inode->i_mapping->i_mmap tree.  At this point the decision on whether or
> + * not to use DAX for this mapping has been set and will not change for the
> + * duration of the mapping.
> + */
> +STATIC void
> +xfs_file_post_mmap(
> +       struct file     *filp,
> +       struct vm_area_struct *vma)
> +{
>         if (IS_DAX(file_inode(filp)))
>                 vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;

It's not clear to me what this is actually protecting? vma_is_dax()
returns true regardless of the vm_flags state , so what is the benefit
to delaying the vm_flags setting to ->post_mmap()?

Also, why is this a file_operation and not a vm_operation?
Christoph Hellwig Sept. 26, 2017, 6:34 a.m. UTC | #2
On Mon, Sep 25, 2017 at 05:14:03PM -0600, Ross Zwisler wrote:
> When mappings are created the vma->vm_flags that they use vary based on
> whether the inode being mapped is using DAX or not.  This setup happens in
> XFS via mmap_region()=>call_mmap()=>xfs_file_mmap().
> 
> For us to be able to safely use the DAX per-inode flag we need to prevent
> S_DAX transitions when any mappings are present, and we will do that by
> looking at the address_space->i_mmap tree and returning -EBUSY if any
> mappings are present.
> 
> Unfortunately at the time that the filesystem's file_operations->mmap()
> entry point is called the mapping has not yet been added to the
> address_space->i_mmap tree.  This means that at that point in time we
> cannot determine whether or not the mapping will be set up to support DAX.
> 
> Fix this by adding a new file_operations entry called post_mmap() which is
> called after the mapping has been added to the address_space->i_mmap tree.
> This post_mmap() op now happens at a time when we can be sure whether the
> mapping will use DAX or not, and we can set up the vma->vm_flags
> appropriately.

Just like in the read/write path we'll need a flag that is passed down
from the VM based on checking IS_DAX once and exactly once instead.
Ross Zwisler Sept. 26, 2017, 6:57 p.m. UTC | #3
On Mon, Sep 25, 2017 at 04:38:45PM -0700, Dan Williams wrote:
> On Mon, Sep 25, 2017 at 4:14 PM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > When mappings are created the vma->vm_flags that they use vary based on
> > whether the inode being mapped is using DAX or not.  This setup happens in
> > XFS via mmap_region()=>call_mmap()=>xfs_file_mmap().
> >
> > For us to be able to safely use the DAX per-inode flag we need to prevent
> > S_DAX transitions when any mappings are present, and we will do that by
> > looking at the address_space->i_mmap tree and returning -EBUSY if any
> > mappings are present.
> >
> > Unfortunately at the time that the filesystem's file_operations->mmap()
> > entry point is called the mapping has not yet been added to the
> > address_space->i_mmap tree.  This means that at that point in time we
> > cannot determine whether or not the mapping will be set up to support DAX.
> >
> > Fix this by adding a new file_operations entry called post_mmap() which is
> > called after the mapping has been added to the address_space->i_mmap tree.
> > This post_mmap() op now happens at a time when we can be sure whether the
> > mapping will use DAX or not, and we can set up the vma->vm_flags
> > appropriately.
> >
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > ---
> >  fs/xfs/xfs_file.c  | 15 ++++++++++++++-
> >  include/linux/fs.h |  1 +
> >  mm/mmap.c          |  2 ++
> >  3 files changed, 17 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 2816858..9d66aaa 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -1087,9 +1087,21 @@ xfs_file_mmap(
> >  {
> >         file_accessed(filp);
> >         vma->vm_ops = &xfs_file_vm_ops;
> > +       return 0;
> > +}
> > +
> > +/* This call happens during mmap(), after the vma has been inserted into the
> > + * inode->i_mapping->i_mmap tree.  At this point the decision on whether or
> > + * not to use DAX for this mapping has been set and will not change for the
> > + * duration of the mapping.
> > + */
> > +STATIC void
> > +xfs_file_post_mmap(
> > +       struct file     *filp,
> > +       struct vm_area_struct *vma)
> > +{
> >         if (IS_DAX(file_inode(filp)))
> >                 vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
> 
> It's not clear to me what this is actually protecting? vma_is_dax()
> returns true regardless of the vm_flags state , so what is the benefit
> to delaying the vm_flags setting to ->post_mmap()?

Right, but the point is that until the vma has been inserted into the
inode->i_mapping->i_mmap tree, the results of IS_DAX() don't matter because it
can still change.  Until this insertion happens we cannot know whether or not
we should set up the vma->vm_flags to support DAX mappings (i.e. have
VM_MIXEDMAP and VM_HUGEPAGE set).  This decision can only be made (in this
proposed scheme) *after* the inode->i_mapping->i_mmap  tree has been
populated, which means we need another call into the filesystem after this
insertion has happened.

We don't want to mess with the existing file_operations->mmap() call because
in many filesystems that does sanity checking and setup that you really want
to have happen *before* the mapping is completed and inserted into the
inode->i_mapping->i_mmap tree.

> Also, why is this a file_operation and not a vm_operation?

Because ->mmap() is also a file_operation, and this is an analogous call from
the mmap code that needs to happen at a different time.  Or are you suggesting
that file_operations->mmap() should be moved to be a vm_operation?  If not,
why would one be in one operations table and one in another?
Dan Williams Sept. 26, 2017, 7:19 p.m. UTC | #4
On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Mon, Sep 25, 2017 at 04:38:45PM -0700, Dan Williams wrote:
>> On Mon, Sep 25, 2017 at 4:14 PM, Ross Zwisler
>> <ross.zwisler@linux.intel.com> wrote:
>> > When mappings are created the vma->vm_flags that they use vary based on
>> > whether the inode being mapped is using DAX or not.  This setup happens in
>> > XFS via mmap_region()=>call_mmap()=>xfs_file_mmap().
>> >
>> > For us to be able to safely use the DAX per-inode flag we need to prevent
>> > S_DAX transitions when any mappings are present, and we will do that by
>> > looking at the address_space->i_mmap tree and returning -EBUSY if any
>> > mappings are present.
>> >
>> > Unfortunately at the time that the filesystem's file_operations->mmap()
>> > entry point is called the mapping has not yet been added to the
>> > address_space->i_mmap tree.  This means that at that point in time we
>> > cannot determine whether or not the mapping will be set up to support DAX.
>> >
>> > Fix this by adding a new file_operations entry called post_mmap() which is
>> > called after the mapping has been added to the address_space->i_mmap tree.
>> > This post_mmap() op now happens at a time when we can be sure whether the
>> > mapping will use DAX or not, and we can set up the vma->vm_flags
>> > appropriately.
>> >
>> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
>> > ---
>> >  fs/xfs/xfs_file.c  | 15 ++++++++++++++-
>> >  include/linux/fs.h |  1 +
>> >  mm/mmap.c          |  2 ++
>> >  3 files changed, 17 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>> > index 2816858..9d66aaa 100644
>> > --- a/fs/xfs/xfs_file.c
>> > +++ b/fs/xfs/xfs_file.c
>> > @@ -1087,9 +1087,21 @@ xfs_file_mmap(
>> >  {
>> >         file_accessed(filp);
>> >         vma->vm_ops = &xfs_file_vm_ops;
>> > +       return 0;
>> > +}
>> > +
>> > +/* This call happens during mmap(), after the vma has been inserted into the
>> > + * inode->i_mapping->i_mmap tree.  At this point the decision on whether or
>> > + * not to use DAX for this mapping has been set and will not change for the
>> > + * duration of the mapping.
>> > + */
>> > +STATIC void
>> > +xfs_file_post_mmap(
>> > +       struct file     *filp,
>> > +       struct vm_area_struct *vma)
>> > +{
>> >         if (IS_DAX(file_inode(filp)))
>> >                 vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
>>
>> It's not clear to me what this is actually protecting? vma_is_dax()
>> returns true regardless of the vm_flags state , so what is the benefit
>> to delaying the vm_flags setting to ->post_mmap()?
>
> Right, but the point is that until the vma has been inserted into the
> inode->i_mapping->i_mmap tree, the results of IS_DAX() don't matter because it
> can still change.  Until this insertion happens we cannot know whether or not
> we should set up the vma->vm_flags to support DAX mappings (i.e. have
> VM_MIXEDMAP and VM_HUGEPAGE set).

Those flags are not DAX flags. The side effect of these being set on
non-DAX mappings is that we effectively auto madvise(MADV_HUGEPAGE)
and enable some page-less insertion paths. Both of those are
effectively no-ops for normal mappings since normal mappings always
have an associated struct page and the THP policy these days is
usually "always".

> This decision can only be made (in this
> proposed scheme) *after* the inode->i_mapping->i_mmap  tree has been
> populated, which means we need another call into the filesystem after this
> insertion has happened.

I get that, but it seems over-engineered and something that can also
be safely cleaned up after the fact by the code path that is disabling
DAX.

> We don't want to mess with the existing file_operations->mmap() call because
> in many filesystems that does sanity checking and setup that you really want
> to have happen *before* the mapping is completed and inserted into the
> inode->i_mapping->i_mmap tree.
>
>> Also, why is this a file_operation and not a vm_operation?
>
> Because ->mmap() is also a file_operation, and this is an analogous call from
> the mmap code that needs to happen at a different time.  Or are you suggesting
> that file_operations->mmap() should be moved to be a vm_operation?  If not,
> why would one be in one operations table and one in another?

Growing something as widely used as file_operations for this one-off
fixup feels like overkill. vm_operations is not much better, but it at
least constrains the data structure growth to something closer to the
problem space.
Ross Zwisler Sept. 26, 2017, 9:06 p.m. UTC | #5
On Tue, Sep 26, 2017 at 12:19:21PM -0700, Dan Williams wrote:
> On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler
<>
> > This decision can only be made (in this
> > proposed scheme) *after* the inode->i_mapping->i_mmap  tree has been
> > populated, which means we need another call into the filesystem after this
> > insertion has happened.
> 
> I get that, but it seems over-engineered and something that can also
> be safely cleaned up after the fact by the code path that is disabling
> DAX.

I don't think you can safely clean it up after the fact because some thread
might have already called ->mmap() to set up the vma->vm_flags for their new
mapping, but they haven't added it to inode->i_mapping->i_mmap.

The inode->i_mapping->i_mmap tree is the only way (that I know of at least)
that the filesystem has any idea about about the mapping.  This is the method
by which we would try and clean up mapping flags, if we were to do so, and
it's the only way that the filesystem can know whether or not mappings exist.

The only way that I could think of to make this safely work is to have the
insertion into the inode->i_mapping->i_mmap tree be our sync point.  After
that the filesystem and the mapping code can communicate on the state of DAX,
but before that I think it's basically indeterminate.
Dan Williams Sept. 26, 2017, 9:41 p.m. UTC | #6
On Tue, Sep 26, 2017 at 2:06 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Tue, Sep 26, 2017 at 12:19:21PM -0700, Dan Williams wrote:
>> On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler
> <>
>> > This decision can only be made (in this
>> > proposed scheme) *after* the inode->i_mapping->i_mmap  tree has been
>> > populated, which means we need another call into the filesystem after this
>> > insertion has happened.
>>
>> I get that, but it seems over-engineered and something that can also
>> be safely cleaned up after the fact by the code path that is disabling
>> DAX.
>
> I don't think you can safely clean it up after the fact because some thread
> might have already called ->mmap() to set up the vma->vm_flags for their new
> mapping, but they haven't added it to inode->i_mapping->i_mmap.

If madvise(MADV_NOHUGEPAGE) can dynamically change vm_flags, then the
DAX disable path can as well. VM_MIXEDMAP looks to be a nop for normal
memory mappings.

> The inode->i_mapping->i_mmap tree is the only way (that I know of at least)
> that the filesystem has any idea about about the mapping.  This is the method
> by which we would try and clean up mapping flags, if we were to do so, and
> it's the only way that the filesystem can know whether or not mappings exist.
>
> The only way that I could think of to make this safely work is to have the
> insertion into the inode->i_mapping->i_mmap tree be our sync point.  After
> that the filesystem and the mapping code can communicate on the state of DAX,
> but before that I think it's basically indeterminate.

If we lose the race and leak VM_HUGEPAGE to a non-DAX mapping what
breaks? I'd rather be in favor of not setting VM_HUGEPAGE at all in
the ->mmap() handler and let the default THP policy take over. In
fact, see transparent_hugepage_enabled() we already auto-enable huge
page support for dax mappings regardless of VM_HUGEPAGE.
Jan Kara Sept. 27, 2017, 11:35 a.m. UTC | #7
On Tue 26-09-17 14:41:53, Dan Williams wrote:
> On Tue, Sep 26, 2017 at 2:06 PM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > On Tue, Sep 26, 2017 at 12:19:21PM -0700, Dan Williams wrote:
> >> On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler
> > <>
> >> > This decision can only be made (in this
> >> > proposed scheme) *after* the inode->i_mapping->i_mmap  tree has been
> >> > populated, which means we need another call into the filesystem after this
> >> > insertion has happened.
> >>
> >> I get that, but it seems over-engineered and something that can also
> >> be safely cleaned up after the fact by the code path that is disabling
> >> DAX.
> >
> > I don't think you can safely clean it up after the fact because some thread
> > might have already called ->mmap() to set up the vma->vm_flags for their new
> > mapping, but they haven't added it to inode->i_mapping->i_mmap.
> 
> If madvise(MADV_NOHUGEPAGE) can dynamically change vm_flags, then the
> DAX disable path can as well. VM_MIXEDMAP looks to be a nop for normal
> memory mappings.
> 
> > The inode->i_mapping->i_mmap tree is the only way (that I know of at least)
> > that the filesystem has any idea about about the mapping.  This is the method
> > by which we would try and clean up mapping flags, if we were to do so, and
> > it's the only way that the filesystem can know whether or not mappings exist.
> >
> > The only way that I could think of to make this safely work is to have the
> > insertion into the inode->i_mapping->i_mmap tree be our sync point.  After
> > that the filesystem and the mapping code can communicate on the state of DAX,
> > but before that I think it's basically indeterminate.
> 
> If we lose the race and leak VM_HUGEPAGE to a non-DAX mapping what
> breaks? I'd rather be in favor of not setting VM_HUGEPAGE at all in
> the ->mmap() handler and let the default THP policy take over. In
> fact, see transparent_hugepage_enabled() we already auto-enable huge
> page support for dax mappings regardless of VM_HUGEPAGE.

Hum, this is an interesting option. So do you suggest that filesystems
supporting DAX would always setup mappings with VM_MIXEDMAP and without
VM_HUGEPAGE and thus we'd get rid of dependency on S_DAX flag in ->mmap?
That could actually work. The only possible issue I can see is that
VM_MIXEDMAP is still slightly different from normal page mappings and it
could have some performance implications - e.g. copy_page_range() does more
work on VM_MIXEDMAP mappings but not on normal page mappings.

								Honza
Dan Williams Sept. 27, 2017, 2 p.m. UTC | #8
On Wed, Sep 27, 2017 at 4:35 AM, Jan Kara <jack@suse.cz> wrote:
> On Tue 26-09-17 14:41:53, Dan Williams wrote:
>> On Tue, Sep 26, 2017 at 2:06 PM, Ross Zwisler
>> <ross.zwisler@linux.intel.com> wrote:
>> > On Tue, Sep 26, 2017 at 12:19:21PM -0700, Dan Williams wrote:
>> >> On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler
>> > <>
>> >> > This decision can only be made (in this
>> >> > proposed scheme) *after* the inode->i_mapping->i_mmap  tree has been
>> >> > populated, which means we need another call into the filesystem after this
>> >> > insertion has happened.
>> >>
>> >> I get that, but it seems over-engineered and something that can also
>> >> be safely cleaned up after the fact by the code path that is disabling
>> >> DAX.
>> >
>> > I don't think you can safely clean it up after the fact because some thread
>> > might have already called ->mmap() to set up the vma->vm_flags for their new
>> > mapping, but they haven't added it to inode->i_mapping->i_mmap.
>>
>> If madvise(MADV_NOHUGEPAGE) can dynamically change vm_flags, then the
>> DAX disable path can as well. VM_MIXEDMAP looks to be a nop for normal
>> memory mappings.
>>
>> > The inode->i_mapping->i_mmap tree is the only way (that I know of at least)
>> > that the filesystem has any idea about about the mapping.  This is the method
>> > by which we would try and clean up mapping flags, if we were to do so, and
>> > it's the only way that the filesystem can know whether or not mappings exist.
>> >
>> > The only way that I could think of to make this safely work is to have the
>> > insertion into the inode->i_mapping->i_mmap tree be our sync point.  After
>> > that the filesystem and the mapping code can communicate on the state of DAX,
>> > but before that I think it's basically indeterminate.
>>
>> If we lose the race and leak VM_HUGEPAGE to a non-DAX mapping what
>> breaks? I'd rather be in favor of not setting VM_HUGEPAGE at all in
>> the ->mmap() handler and let the default THP policy take over. In
>> fact, see transparent_hugepage_enabled() we already auto-enable huge
>> page support for dax mappings regardless of VM_HUGEPAGE.
>
> Hum, this is an interesting option. So do you suggest that filesystems
> supporting DAX would always setup mappings with VM_MIXEDMAP and without
> VM_HUGEPAGE and thus we'd get rid of dependency on S_DAX flag in ->mmap?
> That could actually work. The only possible issue I can see is that
> VM_MIXEDMAP is still slightly different from normal page mappings and it
> could have some performance implications - e.g. copy_page_range() does more
> work on VM_MIXEDMAP mappings but not on normal page mappings.

We can also get rid of VM_MIXEDMAP if we disable DAX in the
!pfn_t_has_page() case.
Jan Kara Sept. 27, 2017, 3:07 p.m. UTC | #9
On Wed 27-09-17 07:00:53, Dan Williams wrote:
> On Wed, Sep 27, 2017 at 4:35 AM, Jan Kara <jack@suse.cz> wrote:
> > On Tue 26-09-17 14:41:53, Dan Williams wrote:
> >> On Tue, Sep 26, 2017 at 2:06 PM, Ross Zwisler
> >> <ross.zwisler@linux.intel.com> wrote:
> >> > On Tue, Sep 26, 2017 at 12:19:21PM -0700, Dan Williams wrote:
> >> >> On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler
> >> > <>
> >> >> > This decision can only be made (in this
> >> >> > proposed scheme) *after* the inode->i_mapping->i_mmap  tree has been
> >> >> > populated, which means we need another call into the filesystem after this
> >> >> > insertion has happened.
> >> >>
> >> >> I get that, but it seems over-engineered and something that can also
> >> >> be safely cleaned up after the fact by the code path that is disabling
> >> >> DAX.
> >> >
> >> > I don't think you can safely clean it up after the fact because some thread
> >> > might have already called ->mmap() to set up the vma->vm_flags for their new
> >> > mapping, but they haven't added it to inode->i_mapping->i_mmap.
> >>
> >> If madvise(MADV_NOHUGEPAGE) can dynamically change vm_flags, then the
> >> DAX disable path can as well. VM_MIXEDMAP looks to be a nop for normal
> >> memory mappings.
> >>
> >> > The inode->i_mapping->i_mmap tree is the only way (that I know of at least)
> >> > that the filesystem has any idea about about the mapping.  This is the method
> >> > by which we would try and clean up mapping flags, if we were to do so, and
> >> > it's the only way that the filesystem can know whether or not mappings exist.
> >> >
> >> > The only way that I could think of to make this safely work is to have the
> >> > insertion into the inode->i_mapping->i_mmap tree be our sync point.  After
> >> > that the filesystem and the mapping code can communicate on the state of DAX,
> >> > but before that I think it's basically indeterminate.
> >>
> >> If we lose the race and leak VM_HUGEPAGE to a non-DAX mapping what
> >> breaks? I'd rather be in favor of not setting VM_HUGEPAGE at all in
> >> the ->mmap() handler and let the default THP policy take over. In
> >> fact, see transparent_hugepage_enabled() we already auto-enable huge
> >> page support for dax mappings regardless of VM_HUGEPAGE.
> >
> > Hum, this is an interesting option. So do you suggest that filesystems
> > supporting DAX would always setup mappings with VM_MIXEDMAP and without
> > VM_HUGEPAGE and thus we'd get rid of dependency on S_DAX flag in ->mmap?
> > That could actually work. The only possible issue I can see is that
> > VM_MIXEDMAP is still slightly different from normal page mappings and it
> > could have some performance implications - e.g. copy_page_range() does more
> > work on VM_MIXEDMAP mappings but not on normal page mappings.
> 
> We can also get rid of VM_MIXEDMAP if we disable DAX in the
> !pfn_t_has_page() case.

Yeah, although it would be a pity to require struct page just to avoid
having to set VM_MIXEDMAP flag...

								Honza
Dan Williams Sept. 27, 2017, 3:36 p.m. UTC | #10
On Wed, Sep 27, 2017 at 8:07 AM, Jan Kara <jack@suse.cz> wrote:
> On Wed 27-09-17 07:00:53, Dan Williams wrote:
>> On Wed, Sep 27, 2017 at 4:35 AM, Jan Kara <jack@suse.cz> wrote:
>> > On Tue 26-09-17 14:41:53, Dan Williams wrote:
>> >> On Tue, Sep 26, 2017 at 2:06 PM, Ross Zwisler
>> >> <ross.zwisler@linux.intel.com> wrote:
>> >> > On Tue, Sep 26, 2017 at 12:19:21PM -0700, Dan Williams wrote:
>> >> >> On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler
>> >> > <>
>> >> >> > This decision can only be made (in this
>> >> >> > proposed scheme) *after* the inode->i_mapping->i_mmap  tree has been
>> >> >> > populated, which means we need another call into the filesystem after this
>> >> >> > insertion has happened.
>> >> >>
>> >> >> I get that, but it seems over-engineered and something that can also
>> >> >> be safely cleaned up after the fact by the code path that is disabling
>> >> >> DAX.
>> >> >
>> >> > I don't think you can safely clean it up after the fact because some thread
>> >> > might have already called ->mmap() to set up the vma->vm_flags for their new
>> >> > mapping, but they haven't added it to inode->i_mapping->i_mmap.
>> >>
>> >> If madvise(MADV_NOHUGEPAGE) can dynamically change vm_flags, then the
>> >> DAX disable path can as well. VM_MIXEDMAP looks to be a nop for normal
>> >> memory mappings.
>> >>
>> >> > The inode->i_mapping->i_mmap tree is the only way (that I know of at least)
>> >> > that the filesystem has any idea about about the mapping.  This is the method
>> >> > by which we would try and clean up mapping flags, if we were to do so, and
>> >> > it's the only way that the filesystem can know whether or not mappings exist.
>> >> >
>> >> > The only way that I could think of to make this safely work is to have the
>> >> > insertion into the inode->i_mapping->i_mmap tree be our sync point.  After
>> >> > that the filesystem and the mapping code can communicate on the state of DAX,
>> >> > but before that I think it's basically indeterminate.
>> >>
>> >> If we lose the race and leak VM_HUGEPAGE to a non-DAX mapping what
>> >> breaks? I'd rather be in favor of not setting VM_HUGEPAGE at all in
>> >> the ->mmap() handler and let the default THP policy take over. In
>> >> fact, see transparent_hugepage_enabled() we already auto-enable huge
>> >> page support for dax mappings regardless of VM_HUGEPAGE.
>> >
>> > Hum, this is an interesting option. So do you suggest that filesystems
>> > supporting DAX would always setup mappings with VM_MIXEDMAP and without
>> > VM_HUGEPAGE and thus we'd get rid of dependency on S_DAX flag in ->mmap?
>> > That could actually work. The only possible issue I can see is that
>> > VM_MIXEDMAP is still slightly different from normal page mappings and it
>> > could have some performance implications - e.g. copy_page_range() does more
>> > work on VM_MIXEDMAP mappings but not on normal page mappings.
>>
>> We can also get rid of VM_MIXEDMAP if we disable DAX in the
>> !pfn_t_has_page() case.
>
> Yeah, although it would be a pity to require struct page just to avoid
> having to set VM_MIXEDMAP flag...

Yes, but the real motivation is fixing all the basic things that break
without struct page, like ptrace and direct-i/o. The removal of
needing to set VM_MIXEDMAP is just a nice side effect. I'll send a
patch because DAX without pages has too many surprise failure cases.
Ross Zwisler Sept. 27, 2017, 3:39 p.m. UTC | #11
On Wed, Sep 27, 2017 at 01:35:27PM +0200, Jan Kara wrote:
> On Tue 26-09-17 14:41:53, Dan Williams wrote:
> > On Tue, Sep 26, 2017 at 2:06 PM, Ross Zwisler
> > <ross.zwisler@linux.intel.com> wrote:
> > > On Tue, Sep 26, 2017 at 12:19:21PM -0700, Dan Williams wrote:
> > >> On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler
> > > <>
> > >> > This decision can only be made (in this
> > >> > proposed scheme) *after* the inode->i_mapping->i_mmap  tree has been
> > >> > populated, which means we need another call into the filesystem after this
> > >> > insertion has happened.
> > >>
> > >> I get that, but it seems over-engineered and something that can also
> > >> be safely cleaned up after the fact by the code path that is disabling
> > >> DAX.
> > >
> > > I don't think you can safely clean it up after the fact because some thread
> > > might have already called ->mmap() to set up the vma->vm_flags for their new
> > > mapping, but they haven't added it to inode->i_mapping->i_mmap.
> > 
> > If madvise(MADV_NOHUGEPAGE) can dynamically change vm_flags, then the
> > DAX disable path can as well. VM_MIXEDMAP looks to be a nop for normal
> > memory mappings.
> > 
> > > The inode->i_mapping->i_mmap tree is the only way (that I know of at least)
> > > that the filesystem has any idea about about the mapping.  This is the method
> > > by which we would try and clean up mapping flags, if we were to do so, and
> > > it's the only way that the filesystem can know whether or not mappings exist.
> > >
> > > The only way that I could think of to make this safely work is to have the
> > > insertion into the inode->i_mapping->i_mmap tree be our sync point.  After
> > > that the filesystem and the mapping code can communicate on the state of DAX,
> > > but before that I think it's basically indeterminate.
> > 
> > If we lose the race and leak VM_HUGEPAGE to a non-DAX mapping what
> > breaks? I'd rather be in favor of not setting VM_HUGEPAGE at all in
> > the ->mmap() handler and let the default THP policy take over. In
> > fact, see transparent_hugepage_enabled() we already auto-enable huge
> > page support for dax mappings regardless of VM_HUGEPAGE.
> 
> Hum, this is an interesting option. So do you suggest that filesystems
> supporting DAX would always setup mappings with VM_MIXEDMAP and without
> VM_HUGEPAGE and thus we'd get rid of dependency on S_DAX flag in ->mmap?
> That could actually work. The only possible issue I can see is that
> VM_MIXEDMAP is still slightly different from normal page mappings and it
> could have some performance implications - e.g. copy_page_range() does more
> work on VM_MIXEDMAP mappings but not on normal page mappings.

It looks like having VM_MIXEDMAP always set for filesystems that support DAX
might affect their memory's NUMA migration in the non-DAX case? 

8e76d4e sched, numa: do not hint for NUMA balancing on VM_MIXEDMAP mappings
Dan Williams Sept. 27, 2017, 3:54 p.m. UTC | #12
On Wed, Sep 27, 2017 at 8:39 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Wed, Sep 27, 2017 at 01:35:27PM +0200, Jan Kara wrote:
[..]
>> Hum, this is an interesting option. So do you suggest that filesystems
>> supporting DAX would always setup mappings with VM_MIXEDMAP and without
>> VM_HUGEPAGE and thus we'd get rid of dependency on S_DAX flag in ->mmap?
>> That could actually work. The only possible issue I can see is that
>> VM_MIXEDMAP is still slightly different from normal page mappings and it
>> could have some performance implications - e.g. copy_page_range() does more
>> work on VM_MIXEDMAP mappings but not on normal page mappings.
>
> It looks like having VM_MIXEDMAP always set for filesystems that support DAX
> might affect their memory's NUMA migration in the non-DAX case?
>
> 8e76d4e sched, numa: do not hint for NUMA balancing on VM_MIXEDMAP mappings

Addressed separately here:

c1ef8e2c0235 mm: disable numa migration faults for dax vmas
diff mbox

Patch

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 2816858..9d66aaa 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1087,9 +1087,21 @@  xfs_file_mmap(
 {
 	file_accessed(filp);
 	vma->vm_ops = &xfs_file_vm_ops;
+	return 0;
+}
+
+/* This call happens during mmap(), after the vma has been inserted into the
+ * inode->i_mapping->i_mmap tree.  At this point the decision on whether or
+ * not to use DAX for this mapping has been set and will not change for the
+ * duration of the mapping.
+ */
+STATIC void
+xfs_file_post_mmap(
+	struct file	*filp,
+	struct vm_area_struct *vma)
+{
 	if (IS_DAX(file_inode(filp)))
 		vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
-	return 0;
 }
 
 const struct file_operations xfs_file_operations = {
@@ -1103,6 +1115,7 @@  const struct file_operations xfs_file_operations = {
 	.compat_ioctl	= xfs_file_compat_ioctl,
 #endif
 	.mmap		= xfs_file_mmap,
+	.post_mmap	= xfs_file_post_mmap,
 	.open		= xfs_file_open,
 	.release	= xfs_file_release,
 	.fsync		= xfs_file_fsync,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 339e737..7c06838 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1701,6 +1701,7 @@  struct file_operations {
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
 	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
 	int (*mmap) (struct file *, struct vm_area_struct *);
+	void (*post_mmap) (struct file *, struct vm_area_struct *);
 	int (*open) (struct inode *, struct file *);
 	int (*flush) (struct file *, fl_owner_t id);
 	int (*release) (struct inode *, struct file *);
diff --git a/mm/mmap.c b/mm/mmap.c
index 680506f..ee7458a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1711,6 +1711,8 @@  unsigned long mmap_region(struct file *file, unsigned long addr,
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 	/* Once vma denies write, undo our temporary denial count */
 	if (file) {
+		if (file->f_op->post_mmap)
+			file->f_op->post_mmap(file, vma);
 		if (vm_flags & VM_SHARED)
 			mapping_unmap_writable(file->f_mapping);
 		if (vm_flags & VM_DENYWRITE)