Message ID | 20170925231404.32723-7-ross.zwisler@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Mon, Sep 25, 2017 at 4:14 PM, Ross Zwisler <ross.zwisler@linux.intel.com> wrote: > When mappings are created the vma->vm_flags that they use vary based on > whether the inode being mapped is using DAX or not. This setup happens in > XFS via mmap_region()=>call_mmap()=>xfs_file_mmap(). > > For us to be able to safely use the DAX per-inode flag we need to prevent > S_DAX transitions when any mappings are present, and we will do that by > looking at the address_space->i_mmap tree and returning -EBUSY if any > mappings are present. > > Unfortunately at the time that the filesystem's file_operations->mmap() > entry point is called the mapping has not yet been added to the > address_space->i_mmap tree. This means that at that point in time we > cannot determine whether or not the mapping will be set up to support DAX. > > Fix this by adding a new file_operations entry called post_mmap() which is > called after the mapping has been added to the address_space->i_mmap tree. > This post_mmap() op now happens at a time when we can be sure whether the > mapping will use DAX or not, and we can set up the vma->vm_flags > appropriately. > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> > --- > fs/xfs/xfs_file.c | 15 ++++++++++++++- > include/linux/fs.h | 1 + > mm/mmap.c | 2 ++ > 3 files changed, 17 insertions(+), 1 deletion(-) > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index 2816858..9d66aaa 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -1087,9 +1087,21 @@ xfs_file_mmap( > { > file_accessed(filp); > vma->vm_ops = &xfs_file_vm_ops; > + return 0; > +} > + > +/* This call happens during mmap(), after the vma has been inserted into the > + * inode->i_mapping->i_mmap tree. At this point the decision on whether or > + * not to use DAX for this mapping has been set and will not change for the > + * duration of the mapping. > + */ > +STATIC void > +xfs_file_post_mmap( > + struct file *filp, > + struct vm_area_struct *vma) > +{ > if (IS_DAX(file_inode(filp))) > vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE; It's not clear to me what this is actually protecting? vma_is_dax() returns true regardless of the vm_flags state , so what is the benefit to delaying the vm_flags setting to ->post_mmap()? Also, why is this a file_operation and not a vm_operation?
On Mon, Sep 25, 2017 at 05:14:03PM -0600, Ross Zwisler wrote: > When mappings are created the vma->vm_flags that they use vary based on > whether the inode being mapped is using DAX or not. This setup happens in > XFS via mmap_region()=>call_mmap()=>xfs_file_mmap(). > > For us to be able to safely use the DAX per-inode flag we need to prevent > S_DAX transitions when any mappings are present, and we will do that by > looking at the address_space->i_mmap tree and returning -EBUSY if any > mappings are present. > > Unfortunately at the time that the filesystem's file_operations->mmap() > entry point is called the mapping has not yet been added to the > address_space->i_mmap tree. This means that at that point in time we > cannot determine whether or not the mapping will be set up to support DAX. > > Fix this by adding a new file_operations entry called post_mmap() which is > called after the mapping has been added to the address_space->i_mmap tree. > This post_mmap() op now happens at a time when we can be sure whether the > mapping will use DAX or not, and we can set up the vma->vm_flags > appropriately. Just like in the read/write path we'll need a flag that is passed down from the VM based on checking IS_DAX once and exactly once instead.
On Mon, Sep 25, 2017 at 04:38:45PM -0700, Dan Williams wrote: > On Mon, Sep 25, 2017 at 4:14 PM, Ross Zwisler > <ross.zwisler@linux.intel.com> wrote: > > When mappings are created the vma->vm_flags that they use vary based on > > whether the inode being mapped is using DAX or not. This setup happens in > > XFS via mmap_region()=>call_mmap()=>xfs_file_mmap(). > > > > For us to be able to safely use the DAX per-inode flag we need to prevent > > S_DAX transitions when any mappings are present, and we will do that by > > looking at the address_space->i_mmap tree and returning -EBUSY if any > > mappings are present. > > > > Unfortunately at the time that the filesystem's file_operations->mmap() > > entry point is called the mapping has not yet been added to the > > address_space->i_mmap tree. This means that at that point in time we > > cannot determine whether or not the mapping will be set up to support DAX. > > > > Fix this by adding a new file_operations entry called post_mmap() which is > > called after the mapping has been added to the address_space->i_mmap tree. > > This post_mmap() op now happens at a time when we can be sure whether the > > mapping will use DAX or not, and we can set up the vma->vm_flags > > appropriately. > > > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> > > --- > > fs/xfs/xfs_file.c | 15 ++++++++++++++- > > include/linux/fs.h | 1 + > > mm/mmap.c | 2 ++ > > 3 files changed, 17 insertions(+), 1 deletion(-) > > > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > > index 2816858..9d66aaa 100644 > > --- a/fs/xfs/xfs_file.c > > +++ b/fs/xfs/xfs_file.c > > @@ -1087,9 +1087,21 @@ xfs_file_mmap( > > { > > file_accessed(filp); > > vma->vm_ops = &xfs_file_vm_ops; > > + return 0; > > +} > > + > > +/* This call happens during mmap(), after the vma has been inserted into the > > + * inode->i_mapping->i_mmap tree. At this point the decision on whether or > > + * not to use DAX for this mapping has been set and will not change for the > > + * duration of the mapping. > > + */ > > +STATIC void > > +xfs_file_post_mmap( > > + struct file *filp, > > + struct vm_area_struct *vma) > > +{ > > if (IS_DAX(file_inode(filp))) > > vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE; > > It's not clear to me what this is actually protecting? vma_is_dax() > returns true regardless of the vm_flags state , so what is the benefit > to delaying the vm_flags setting to ->post_mmap()? Right, but the point is that until the vma has been inserted into the inode->i_mapping->i_mmap tree, the results of IS_DAX() don't matter because it can still change. Until this insertion happens we cannot know whether or not we should set up the vma->vm_flags to support DAX mappings (i.e. have VM_MIXEDMAP and VM_HUGEPAGE set). This decision can only be made (in this proposed scheme) *after* the inode->i_mapping->i_mmap tree has been populated, which means we need another call into the filesystem after this insertion has happened. We don't want to mess with the existing file_operations->mmap() call because in many filesystems that does sanity checking and setup that you really want to have happen *before* the mapping is completed and inserted into the inode->i_mapping->i_mmap tree. > Also, why is this a file_operation and not a vm_operation? Because ->mmap() is also a file_operation, and this is an analogous call from the mmap code that needs to happen at a different time. Or are you suggesting that file_operations->mmap() should be moved to be a vm_operation? If not, why would one be in one operations table and one in another?
On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler <ross.zwisler@linux.intel.com> wrote: > On Mon, Sep 25, 2017 at 04:38:45PM -0700, Dan Williams wrote: >> On Mon, Sep 25, 2017 at 4:14 PM, Ross Zwisler >> <ross.zwisler@linux.intel.com> wrote: >> > When mappings are created the vma->vm_flags that they use vary based on >> > whether the inode being mapped is using DAX or not. This setup happens in >> > XFS via mmap_region()=>call_mmap()=>xfs_file_mmap(). >> > >> > For us to be able to safely use the DAX per-inode flag we need to prevent >> > S_DAX transitions when any mappings are present, and we will do that by >> > looking at the address_space->i_mmap tree and returning -EBUSY if any >> > mappings are present. >> > >> > Unfortunately at the time that the filesystem's file_operations->mmap() >> > entry point is called the mapping has not yet been added to the >> > address_space->i_mmap tree. This means that at that point in time we >> > cannot determine whether or not the mapping will be set up to support DAX. >> > >> > Fix this by adding a new file_operations entry called post_mmap() which is >> > called after the mapping has been added to the address_space->i_mmap tree. >> > This post_mmap() op now happens at a time when we can be sure whether the >> > mapping will use DAX or not, and we can set up the vma->vm_flags >> > appropriately. >> > >> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> >> > --- >> > fs/xfs/xfs_file.c | 15 ++++++++++++++- >> > include/linux/fs.h | 1 + >> > mm/mmap.c | 2 ++ >> > 3 files changed, 17 insertions(+), 1 deletion(-) >> > >> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c >> > index 2816858..9d66aaa 100644 >> > --- a/fs/xfs/xfs_file.c >> > +++ b/fs/xfs/xfs_file.c >> > @@ -1087,9 +1087,21 @@ xfs_file_mmap( >> > { >> > file_accessed(filp); >> > vma->vm_ops = &xfs_file_vm_ops; >> > + return 0; >> > +} >> > + >> > +/* This call happens during mmap(), after the vma has been inserted into the >> > + * inode->i_mapping->i_mmap tree. At this point the decision on whether or >> > + * not to use DAX for this mapping has been set and will not change for the >> > + * duration of the mapping. >> > + */ >> > +STATIC void >> > +xfs_file_post_mmap( >> > + struct file *filp, >> > + struct vm_area_struct *vma) >> > +{ >> > if (IS_DAX(file_inode(filp))) >> > vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE; >> >> It's not clear to me what this is actually protecting? vma_is_dax() >> returns true regardless of the vm_flags state , so what is the benefit >> to delaying the vm_flags setting to ->post_mmap()? > > Right, but the point is that until the vma has been inserted into the > inode->i_mapping->i_mmap tree, the results of IS_DAX() don't matter because it > can still change. Until this insertion happens we cannot know whether or not > we should set up the vma->vm_flags to support DAX mappings (i.e. have > VM_MIXEDMAP and VM_HUGEPAGE set). Those flags are not DAX flags. The side effect of these being set on non-DAX mappings is that we effectively auto madvise(MADV_HUGEPAGE) and enable some page-less insertion paths. Both of those are effectively no-ops for normal mappings since normal mappings always have an associated struct page and the THP policy these days is usually "always". > This decision can only be made (in this > proposed scheme) *after* the inode->i_mapping->i_mmap tree has been > populated, which means we need another call into the filesystem after this > insertion has happened. I get that, but it seems over-engineered and something that can also be safely cleaned up after the fact by the code path that is disabling DAX. > We don't want to mess with the existing file_operations->mmap() call because > in many filesystems that does sanity checking and setup that you really want > to have happen *before* the mapping is completed and inserted into the > inode->i_mapping->i_mmap tree. > >> Also, why is this a file_operation and not a vm_operation? > > Because ->mmap() is also a file_operation, and this is an analogous call from > the mmap code that needs to happen at a different time. Or are you suggesting > that file_operations->mmap() should be moved to be a vm_operation? If not, > why would one be in one operations table and one in another? Growing something as widely used as file_operations for this one-off fixup feels like overkill. vm_operations is not much better, but it at least constrains the data structure growth to something closer to the problem space.
On Tue, Sep 26, 2017 at 12:19:21PM -0700, Dan Williams wrote: > On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler <> > > This decision can only be made (in this > > proposed scheme) *after* the inode->i_mapping->i_mmap tree has been > > populated, which means we need another call into the filesystem after this > > insertion has happened. > > I get that, but it seems over-engineered and something that can also > be safely cleaned up after the fact by the code path that is disabling > DAX. I don't think you can safely clean it up after the fact because some thread might have already called ->mmap() to set up the vma->vm_flags for their new mapping, but they haven't added it to inode->i_mapping->i_mmap. The inode->i_mapping->i_mmap tree is the only way (that I know of at least) that the filesystem has any idea about about the mapping. This is the method by which we would try and clean up mapping flags, if we were to do so, and it's the only way that the filesystem can know whether or not mappings exist. The only way that I could think of to make this safely work is to have the insertion into the inode->i_mapping->i_mmap tree be our sync point. After that the filesystem and the mapping code can communicate on the state of DAX, but before that I think it's basically indeterminate.
On Tue, Sep 26, 2017 at 2:06 PM, Ross Zwisler <ross.zwisler@linux.intel.com> wrote: > On Tue, Sep 26, 2017 at 12:19:21PM -0700, Dan Williams wrote: >> On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler > <> >> > This decision can only be made (in this >> > proposed scheme) *after* the inode->i_mapping->i_mmap tree has been >> > populated, which means we need another call into the filesystem after this >> > insertion has happened. >> >> I get that, but it seems over-engineered and something that can also >> be safely cleaned up after the fact by the code path that is disabling >> DAX. > > I don't think you can safely clean it up after the fact because some thread > might have already called ->mmap() to set up the vma->vm_flags for their new > mapping, but they haven't added it to inode->i_mapping->i_mmap. If madvise(MADV_NOHUGEPAGE) can dynamically change vm_flags, then the DAX disable path can as well. VM_MIXEDMAP looks to be a nop for normal memory mappings. > The inode->i_mapping->i_mmap tree is the only way (that I know of at least) > that the filesystem has any idea about about the mapping. This is the method > by which we would try and clean up mapping flags, if we were to do so, and > it's the only way that the filesystem can know whether or not mappings exist. > > The only way that I could think of to make this safely work is to have the > insertion into the inode->i_mapping->i_mmap tree be our sync point. After > that the filesystem and the mapping code can communicate on the state of DAX, > but before that I think it's basically indeterminate. If we lose the race and leak VM_HUGEPAGE to a non-DAX mapping what breaks? I'd rather be in favor of not setting VM_HUGEPAGE at all in the ->mmap() handler and let the default THP policy take over. In fact, see transparent_hugepage_enabled() we already auto-enable huge page support for dax mappings regardless of VM_HUGEPAGE.
On Tue 26-09-17 14:41:53, Dan Williams wrote: > On Tue, Sep 26, 2017 at 2:06 PM, Ross Zwisler > <ross.zwisler@linux.intel.com> wrote: > > On Tue, Sep 26, 2017 at 12:19:21PM -0700, Dan Williams wrote: > >> On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler > > <> > >> > This decision can only be made (in this > >> > proposed scheme) *after* the inode->i_mapping->i_mmap tree has been > >> > populated, which means we need another call into the filesystem after this > >> > insertion has happened. > >> > >> I get that, but it seems over-engineered and something that can also > >> be safely cleaned up after the fact by the code path that is disabling > >> DAX. > > > > I don't think you can safely clean it up after the fact because some thread > > might have already called ->mmap() to set up the vma->vm_flags for their new > > mapping, but they haven't added it to inode->i_mapping->i_mmap. > > If madvise(MADV_NOHUGEPAGE) can dynamically change vm_flags, then the > DAX disable path can as well. VM_MIXEDMAP looks to be a nop for normal > memory mappings. > > > The inode->i_mapping->i_mmap tree is the only way (that I know of at least) > > that the filesystem has any idea about about the mapping. This is the method > > by which we would try and clean up mapping flags, if we were to do so, and > > it's the only way that the filesystem can know whether or not mappings exist. > > > > The only way that I could think of to make this safely work is to have the > > insertion into the inode->i_mapping->i_mmap tree be our sync point. After > > that the filesystem and the mapping code can communicate on the state of DAX, > > but before that I think it's basically indeterminate. > > If we lose the race and leak VM_HUGEPAGE to a non-DAX mapping what > breaks? I'd rather be in favor of not setting VM_HUGEPAGE at all in > the ->mmap() handler and let the default THP policy take over. In > fact, see transparent_hugepage_enabled() we already auto-enable huge > page support for dax mappings regardless of VM_HUGEPAGE. Hum, this is an interesting option. So do you suggest that filesystems supporting DAX would always setup mappings with VM_MIXEDMAP and without VM_HUGEPAGE and thus we'd get rid of dependency on S_DAX flag in ->mmap? That could actually work. The only possible issue I can see is that VM_MIXEDMAP is still slightly different from normal page mappings and it could have some performance implications - e.g. copy_page_range() does more work on VM_MIXEDMAP mappings but not on normal page mappings. Honza
On Wed, Sep 27, 2017 at 4:35 AM, Jan Kara <jack@suse.cz> wrote: > On Tue 26-09-17 14:41:53, Dan Williams wrote: >> On Tue, Sep 26, 2017 at 2:06 PM, Ross Zwisler >> <ross.zwisler@linux.intel.com> wrote: >> > On Tue, Sep 26, 2017 at 12:19:21PM -0700, Dan Williams wrote: >> >> On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler >> > <> >> >> > This decision can only be made (in this >> >> > proposed scheme) *after* the inode->i_mapping->i_mmap tree has been >> >> > populated, which means we need another call into the filesystem after this >> >> > insertion has happened. >> >> >> >> I get that, but it seems over-engineered and something that can also >> >> be safely cleaned up after the fact by the code path that is disabling >> >> DAX. >> > >> > I don't think you can safely clean it up after the fact because some thread >> > might have already called ->mmap() to set up the vma->vm_flags for their new >> > mapping, but they haven't added it to inode->i_mapping->i_mmap. >> >> If madvise(MADV_NOHUGEPAGE) can dynamically change vm_flags, then the >> DAX disable path can as well. VM_MIXEDMAP looks to be a nop for normal >> memory mappings. >> >> > The inode->i_mapping->i_mmap tree is the only way (that I know of at least) >> > that the filesystem has any idea about about the mapping. This is the method >> > by which we would try and clean up mapping flags, if we were to do so, and >> > it's the only way that the filesystem can know whether or not mappings exist. >> > >> > The only way that I could think of to make this safely work is to have the >> > insertion into the inode->i_mapping->i_mmap tree be our sync point. After >> > that the filesystem and the mapping code can communicate on the state of DAX, >> > but before that I think it's basically indeterminate. >> >> If we lose the race and leak VM_HUGEPAGE to a non-DAX mapping what >> breaks? I'd rather be in favor of not setting VM_HUGEPAGE at all in >> the ->mmap() handler and let the default THP policy take over. In >> fact, see transparent_hugepage_enabled() we already auto-enable huge >> page support for dax mappings regardless of VM_HUGEPAGE. > > Hum, this is an interesting option. So do you suggest that filesystems > supporting DAX would always setup mappings with VM_MIXEDMAP and without > VM_HUGEPAGE and thus we'd get rid of dependency on S_DAX flag in ->mmap? > That could actually work. The only possible issue I can see is that > VM_MIXEDMAP is still slightly different from normal page mappings and it > could have some performance implications - e.g. copy_page_range() does more > work on VM_MIXEDMAP mappings but not on normal page mappings. We can also get rid of VM_MIXEDMAP if we disable DAX in the !pfn_t_has_page() case.
On Wed 27-09-17 07:00:53, Dan Williams wrote: > On Wed, Sep 27, 2017 at 4:35 AM, Jan Kara <jack@suse.cz> wrote: > > On Tue 26-09-17 14:41:53, Dan Williams wrote: > >> On Tue, Sep 26, 2017 at 2:06 PM, Ross Zwisler > >> <ross.zwisler@linux.intel.com> wrote: > >> > On Tue, Sep 26, 2017 at 12:19:21PM -0700, Dan Williams wrote: > >> >> On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler > >> > <> > >> >> > This decision can only be made (in this > >> >> > proposed scheme) *after* the inode->i_mapping->i_mmap tree has been > >> >> > populated, which means we need another call into the filesystem after this > >> >> > insertion has happened. > >> >> > >> >> I get that, but it seems over-engineered and something that can also > >> >> be safely cleaned up after the fact by the code path that is disabling > >> >> DAX. > >> > > >> > I don't think you can safely clean it up after the fact because some thread > >> > might have already called ->mmap() to set up the vma->vm_flags for their new > >> > mapping, but they haven't added it to inode->i_mapping->i_mmap. > >> > >> If madvise(MADV_NOHUGEPAGE) can dynamically change vm_flags, then the > >> DAX disable path can as well. VM_MIXEDMAP looks to be a nop for normal > >> memory mappings. > >> > >> > The inode->i_mapping->i_mmap tree is the only way (that I know of at least) > >> > that the filesystem has any idea about about the mapping. This is the method > >> > by which we would try and clean up mapping flags, if we were to do so, and > >> > it's the only way that the filesystem can know whether or not mappings exist. > >> > > >> > The only way that I could think of to make this safely work is to have the > >> > insertion into the inode->i_mapping->i_mmap tree be our sync point. After > >> > that the filesystem and the mapping code can communicate on the state of DAX, > >> > but before that I think it's basically indeterminate. > >> > >> If we lose the race and leak VM_HUGEPAGE to a non-DAX mapping what > >> breaks? I'd rather be in favor of not setting VM_HUGEPAGE at all in > >> the ->mmap() handler and let the default THP policy take over. In > >> fact, see transparent_hugepage_enabled() we already auto-enable huge > >> page support for dax mappings regardless of VM_HUGEPAGE. > > > > Hum, this is an interesting option. So do you suggest that filesystems > > supporting DAX would always setup mappings with VM_MIXEDMAP and without > > VM_HUGEPAGE and thus we'd get rid of dependency on S_DAX flag in ->mmap? > > That could actually work. The only possible issue I can see is that > > VM_MIXEDMAP is still slightly different from normal page mappings and it > > could have some performance implications - e.g. copy_page_range() does more > > work on VM_MIXEDMAP mappings but not on normal page mappings. > > We can also get rid of VM_MIXEDMAP if we disable DAX in the > !pfn_t_has_page() case. Yeah, although it would be a pity to require struct page just to avoid having to set VM_MIXEDMAP flag... Honza
On Wed, Sep 27, 2017 at 8:07 AM, Jan Kara <jack@suse.cz> wrote: > On Wed 27-09-17 07:00:53, Dan Williams wrote: >> On Wed, Sep 27, 2017 at 4:35 AM, Jan Kara <jack@suse.cz> wrote: >> > On Tue 26-09-17 14:41:53, Dan Williams wrote: >> >> On Tue, Sep 26, 2017 at 2:06 PM, Ross Zwisler >> >> <ross.zwisler@linux.intel.com> wrote: >> >> > On Tue, Sep 26, 2017 at 12:19:21PM -0700, Dan Williams wrote: >> >> >> On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler >> >> > <> >> >> >> > This decision can only be made (in this >> >> >> > proposed scheme) *after* the inode->i_mapping->i_mmap tree has been >> >> >> > populated, which means we need another call into the filesystem after this >> >> >> > insertion has happened. >> >> >> >> >> >> I get that, but it seems over-engineered and something that can also >> >> >> be safely cleaned up after the fact by the code path that is disabling >> >> >> DAX. >> >> > >> >> > I don't think you can safely clean it up after the fact because some thread >> >> > might have already called ->mmap() to set up the vma->vm_flags for their new >> >> > mapping, but they haven't added it to inode->i_mapping->i_mmap. >> >> >> >> If madvise(MADV_NOHUGEPAGE) can dynamically change vm_flags, then the >> >> DAX disable path can as well. VM_MIXEDMAP looks to be a nop for normal >> >> memory mappings. >> >> >> >> > The inode->i_mapping->i_mmap tree is the only way (that I know of at least) >> >> > that the filesystem has any idea about about the mapping. This is the method >> >> > by which we would try and clean up mapping flags, if we were to do so, and >> >> > it's the only way that the filesystem can know whether or not mappings exist. >> >> > >> >> > The only way that I could think of to make this safely work is to have the >> >> > insertion into the inode->i_mapping->i_mmap tree be our sync point. After >> >> > that the filesystem and the mapping code can communicate on the state of DAX, >> >> > but before that I think it's basically indeterminate. >> >> >> >> If we lose the race and leak VM_HUGEPAGE to a non-DAX mapping what >> >> breaks? I'd rather be in favor of not setting VM_HUGEPAGE at all in >> >> the ->mmap() handler and let the default THP policy take over. In >> >> fact, see transparent_hugepage_enabled() we already auto-enable huge >> >> page support for dax mappings regardless of VM_HUGEPAGE. >> > >> > Hum, this is an interesting option. So do you suggest that filesystems >> > supporting DAX would always setup mappings with VM_MIXEDMAP and without >> > VM_HUGEPAGE and thus we'd get rid of dependency on S_DAX flag in ->mmap? >> > That could actually work. The only possible issue I can see is that >> > VM_MIXEDMAP is still slightly different from normal page mappings and it >> > could have some performance implications - e.g. copy_page_range() does more >> > work on VM_MIXEDMAP mappings but not on normal page mappings. >> >> We can also get rid of VM_MIXEDMAP if we disable DAX in the >> !pfn_t_has_page() case. > > Yeah, although it would be a pity to require struct page just to avoid > having to set VM_MIXEDMAP flag... Yes, but the real motivation is fixing all the basic things that break without struct page, like ptrace and direct-i/o. The removal of needing to set VM_MIXEDMAP is just a nice side effect. I'll send a patch because DAX without pages has too many surprise failure cases.
On Wed, Sep 27, 2017 at 01:35:27PM +0200, Jan Kara wrote: > On Tue 26-09-17 14:41:53, Dan Williams wrote: > > On Tue, Sep 26, 2017 at 2:06 PM, Ross Zwisler > > <ross.zwisler@linux.intel.com> wrote: > > > On Tue, Sep 26, 2017 at 12:19:21PM -0700, Dan Williams wrote: > > >> On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler > > > <> > > >> > This decision can only be made (in this > > >> > proposed scheme) *after* the inode->i_mapping->i_mmap tree has been > > >> > populated, which means we need another call into the filesystem after this > > >> > insertion has happened. > > >> > > >> I get that, but it seems over-engineered and something that can also > > >> be safely cleaned up after the fact by the code path that is disabling > > >> DAX. > > > > > > I don't think you can safely clean it up after the fact because some thread > > > might have already called ->mmap() to set up the vma->vm_flags for their new > > > mapping, but they haven't added it to inode->i_mapping->i_mmap. > > > > If madvise(MADV_NOHUGEPAGE) can dynamically change vm_flags, then the > > DAX disable path can as well. VM_MIXEDMAP looks to be a nop for normal > > memory mappings. > > > > > The inode->i_mapping->i_mmap tree is the only way (that I know of at least) > > > that the filesystem has any idea about about the mapping. This is the method > > > by which we would try and clean up mapping flags, if we were to do so, and > > > it's the only way that the filesystem can know whether or not mappings exist. > > > > > > The only way that I could think of to make this safely work is to have the > > > insertion into the inode->i_mapping->i_mmap tree be our sync point. After > > > that the filesystem and the mapping code can communicate on the state of DAX, > > > but before that I think it's basically indeterminate. > > > > If we lose the race and leak VM_HUGEPAGE to a non-DAX mapping what > > breaks? I'd rather be in favor of not setting VM_HUGEPAGE at all in > > the ->mmap() handler and let the default THP policy take over. In > > fact, see transparent_hugepage_enabled() we already auto-enable huge > > page support for dax mappings regardless of VM_HUGEPAGE. > > Hum, this is an interesting option. So do you suggest that filesystems > supporting DAX would always setup mappings with VM_MIXEDMAP and without > VM_HUGEPAGE and thus we'd get rid of dependency on S_DAX flag in ->mmap? > That could actually work. The only possible issue I can see is that > VM_MIXEDMAP is still slightly different from normal page mappings and it > could have some performance implications - e.g. copy_page_range() does more > work on VM_MIXEDMAP mappings but not on normal page mappings. It looks like having VM_MIXEDMAP always set for filesystems that support DAX might affect their memory's NUMA migration in the non-DAX case? 8e76d4e sched, numa: do not hint for NUMA balancing on VM_MIXEDMAP mappings
On Wed, Sep 27, 2017 at 8:39 AM, Ross Zwisler <ross.zwisler@linux.intel.com> wrote: > On Wed, Sep 27, 2017 at 01:35:27PM +0200, Jan Kara wrote: [..] >> Hum, this is an interesting option. So do you suggest that filesystems >> supporting DAX would always setup mappings with VM_MIXEDMAP and without >> VM_HUGEPAGE and thus we'd get rid of dependency on S_DAX flag in ->mmap? >> That could actually work. The only possible issue I can see is that >> VM_MIXEDMAP is still slightly different from normal page mappings and it >> could have some performance implications - e.g. copy_page_range() does more >> work on VM_MIXEDMAP mappings but not on normal page mappings. > > It looks like having VM_MIXEDMAP always set for filesystems that support DAX > might affect their memory's NUMA migration in the non-DAX case? > > 8e76d4e sched, numa: do not hint for NUMA balancing on VM_MIXEDMAP mappings Addressed separately here: c1ef8e2c0235 mm: disable numa migration faults for dax vmas
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 2816858..9d66aaa 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1087,9 +1087,21 @@ xfs_file_mmap( { file_accessed(filp); vma->vm_ops = &xfs_file_vm_ops; + return 0; +} + +/* This call happens during mmap(), after the vma has been inserted into the + * inode->i_mapping->i_mmap tree. At this point the decision on whether or + * not to use DAX for this mapping has been set and will not change for the + * duration of the mapping. + */ +STATIC void +xfs_file_post_mmap( + struct file *filp, + struct vm_area_struct *vma) +{ if (IS_DAX(file_inode(filp))) vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE; - return 0; } const struct file_operations xfs_file_operations = { @@ -1103,6 +1115,7 @@ const struct file_operations xfs_file_operations = { .compat_ioctl = xfs_file_compat_ioctl, #endif .mmap = xfs_file_mmap, + .post_mmap = xfs_file_post_mmap, .open = xfs_file_open, .release = xfs_file_release, .fsync = xfs_file_fsync, diff --git a/include/linux/fs.h b/include/linux/fs.h index 339e737..7c06838 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1701,6 +1701,7 @@ struct file_operations { long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); long (*compat_ioctl) (struct file *, unsigned int, unsigned long); int (*mmap) (struct file *, struct vm_area_struct *); + void (*post_mmap) (struct file *, struct vm_area_struct *); int (*open) (struct inode *, struct file *); int (*flush) (struct file *, fl_owner_t id); int (*release) (struct inode *, struct file *); diff --git a/mm/mmap.c b/mm/mmap.c index 680506f..ee7458a 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1711,6 +1711,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr, vma_link(mm, vma, prev, rb_link, rb_parent); /* Once vma denies write, undo our temporary denial count */ if (file) { + if (file->f_op->post_mmap) + file->f_op->post_mmap(file, vma); if (vm_flags & VM_SHARED) mapping_unmap_writable(file->f_mapping); if (vm_flags & VM_DENYWRITE)
When mappings are created the vma->vm_flags that they use vary based on whether the inode being mapped is using DAX or not. This setup happens in XFS via mmap_region()=>call_mmap()=>xfs_file_mmap(). For us to be able to safely use the DAX per-inode flag we need to prevent S_DAX transitions when any mappings are present, and we will do that by looking at the address_space->i_mmap tree and returning -EBUSY if any mappings are present. Unfortunately at the time that the filesystem's file_operations->mmap() entry point is called the mapping has not yet been added to the address_space->i_mmap tree. This means that at that point in time we cannot determine whether or not the mapping will be set up to support DAX. Fix this by adding a new file_operations entry called post_mmap() which is called after the mapping has been added to the address_space->i_mmap tree. This post_mmap() op now happens at a time when we can be sure whether the mapping will use DAX or not, and we can set up the vma->vm_flags appropriately. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> --- fs/xfs/xfs_file.c | 15 ++++++++++++++- include/linux/fs.h | 1 + mm/mmap.c | 2 ++ 3 files changed, 17 insertions(+), 1 deletion(-)