x86 get_unmapped_area: Add PMD alignment for DAX PMD mmap

Message ID 1459951089-14911-1-git-send-email-toshi.kani@hpe.com (mailing list archive)
State New, archived

Commit Message

Kani, Toshi April 6, 2016, 1:58 p.m. UTC
When CONFIG_FS_DAX_PMD is set, DAX supports mmap() using PMD page
size.  This feature requires both the mmap() virtual address and the
FS block data (i.e. the physical address) to be aligned to the PMD
page size.  Users can pass mkfs options to make the FS align block
allocations.  However, aligning the mmap() address requires
application changes to mmap() calls, such as:

 -  /* let the kernel assign a mmap addr */
 -  mptr = mmap(NULL, fsize, PROT_READ|PROT_WRITE, FLAGS, fd, 0);

 +  /* 1. obtain a PMD-aligned virtual address */
 +  ret = posix_memalign(&mptr, PMD_SIZE, fsize);
 +  if (!ret)
 +    free(mptr);  /* 2. release the virt addr */
 +
 +  /* 3. then pass the PMD-aligned virt addr to mmap() */
 +  mptr = mmap(mptr, fsize, PROT_READ|PROT_WRITE, FLAGS, fd, 0);

These changes add an unnecessary dependency on DAX and the PMD page
size to application code.  The kernel should assign a mmap address
appropriate for the operation.

Change arch_get_unmapped_area() and arch_get_unmapped_area_topdown()
to request PMD_SIZE alignment when the request is for a DAX file and
its mapping range is large enough for using a PMD page.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/sys_x86_64.c |   14 ++++++++++++++
 1 file changed, 14 insertions(+)

Comments

Matthew Wilcox April 6, 2016, 4:50 p.m. UTC | #1
On Wed, Apr 06, 2016 at 07:58:09AM -0600, Toshi Kani wrote:
> When CONFIG_FS_DAX_PMD is set, DAX supports mmap() using PMD page
> size.  This feature relies on both mmap virtual address and FS
> block data (i.e. physical address) to be aligned by the PMD page
> size.  Users can use mkfs options to specify FS to align block
> allocations.  However, aligning mmap() address requires application
> changes to mmap() calls, such as:
> 
>  -  /* let the kernel to assign a mmap addr */
>  -  mptr = mmap(NULL, fsize, PROT_READ|PROT_WRITE, FLAGS, fd, 0);
> 
>  +  /* 1. obtain a PMD-aligned virtual address */
>  +  ret = posix_memalign(&mptr, PMD_SIZE, fsize);
>  +  if (!ret)
>  +    free(mptr);  /* 2. release the virt addr */
>  +
>  +  /* 3. then pass the PMD-aligned virt addr to mmap() */
>  +  mptr = mmap(mptr, fsize, PROT_READ|PROT_WRITE, FLAGS, fd, 0);
> 
> These changes add unnecessary dependency to DAX and PMD page size
> into application code.  The kernel should assign a mmap address
> appropriate for the operation.

I question the need for this patch.  Choosing an appropriate base address
is the least of the changes needed for an application to take advantage of
DAX.  The NVML chooses appropriate addresses and gets a properly aligned
address without any kernel code.

> Change arch_get_unmapped_area() and arch_get_unmapped_area_topdown()
> to request PMD_SIZE alignment when the request is for a DAX file and
> its mapping range is large enough for using a PMD page.

I think this is the wrong place for it, if we decide that this is the
right thing to do.  The filesystem has a get_unmapped_area() which
should be used instead.

> @@ -157,6 +157,13 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
>  		info.align_mask = get_align_mask();
>  		info.align_offset += get_align_bits();
>  	}
> +	if (filp && IS_ENABLED(CONFIG_FS_DAX_PMD) && IS_DAX(file_inode(filp))) {

And there's never a need for the IS_ENABLED.  IS_DAX() compiles to '0' if
CONFIG_FS_DAX is disabled.

And where would this end?  Would you also change this code to look for
1GB entries if CONFIG_FS_DAX_PUD is enabled?  Far better to have this
in the individual filesystem (probably calling a common helper in the DAX
code).

Kani, Toshi April 6, 2016, 5:44 p.m. UTC | #2
On Wed, 2016-04-06 at 12:50 -0400, Matthew Wilcox wrote:
> On Wed, Apr 06, 2016 at 07:58:09AM -0600, Toshi Kani wrote:
> > 
> > When CONFIG_FS_DAX_PMD is set, DAX supports mmap() using PMD page
> > size.  This feature relies on both mmap virtual address and FS
> > block data (i.e. physical address) to be aligned by the PMD page
> > size.  Users can use mkfs options to specify FS to align block
> > allocations.  However, aligning mmap() address requires application
> > changes to mmap() calls, such as:
> > 
> >  -  /* let the kernel to assign a mmap addr */
> >  -  mptr = mmap(NULL, fsize, PROT_READ|PROT_WRITE, FLAGS, fd, 0);
> > 
> >  +  /* 1. obtain a PMD-aligned virtual address */
> >  +  ret = posix_memalign(&mptr, PMD_SIZE, fsize);
> >  +  if (!ret)
> >  +    free(mptr);  /* 2. release the virt addr */
> >  +
> >  +  /* 3. then pass the PMD-aligned virt addr to mmap() */
> >  +  mptr = mmap(mptr, fsize, PROT_READ|PROT_WRITE, FLAGS, fd, 0);
> > 
> > These changes add unnecessary dependency to DAX and PMD page size
> > into application code.  The kernel should assign a mmap address
> > appropriate for the operation.
>
> I question the need for this patch.  Choosing an appropriate base address
> is the least of the changes needed for an application to take advantage
> of DAX.  

An application also needs to make sure that the given range [base,
base+size) is free in its address space.  The above example uses
posix_memalign() to find such a range, which in turn calls mmap() with a
size of (fsize + PMD_SIZE) in this case.
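
For comparison, the over-allocate-and-align idea can be done directly with mmap() in userspace. The sketch below is not taken from NVML or from the patch. It assumes a 2 MiB PMD size (x86-64 with 4 KiB base pages), and the names mmap_pmd_aligned and PMD_SIZE_ASSUMED are hypothetical. Unlike the posix_memalign()/free() sequence, it holds the reservation until the file is mapped over it with MAP_FIXED, so the address is not merely a hint.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumption: 2 MiB PMD mappings, as on x86-64 with 4 KiB base pages. */
#define PMD_SIZE_ASSUMED ((size_t)2 << 20)

/*
 * Hypothetical helper: map fsize bytes of fd (offset 0) at a PMD-aligned
 * address.  Returns MAP_FAILED on error, like mmap().
 */
static void *mmap_pmd_aligned(int fd, size_t fsize, int prot, int flags)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	size_t maplen, span, head, tail;
	uintptr_t base;
	char *resv;
	void *p;

	/* Trimming below relies on PMD_SIZE being a multiple of the page size. */
	assert(pagesz > 0 && PMD_SIZE_ASSUMED % (size_t)pagesz == 0);

	/* Round the length up to whole pages, then over-reserve by one PMD. */
	maplen = (fsize + (size_t)pagesz - 1) & ~((size_t)pagesz - 1);
	span = maplen + PMD_SIZE_ASSUMED;

	/* 1. Reserve address space; PROT_NONE keeps it inaccessible. */
	resv = mmap(NULL, span, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (resv == MAP_FAILED)
		return MAP_FAILED;

	/* 2. Round the start of the reservation up to a PMD boundary. */
	base = ((uintptr_t)resv + PMD_SIZE_ASSUMED - 1) &
	       ~(uintptr_t)(PMD_SIZE_ASSUMED - 1);

	/* 3. Map the file over the still-held reservation. */
	p = mmap((void *)base, fsize, prot, flags | MAP_FIXED, fd, 0);
	if (p == MAP_FAILED) {
		munmap(resv, span);
		return MAP_FAILED;
	}

	/* 4. Give back the unused head and tail of the reservation. */
	head = base - (uintptr_t)resv;
	tail = span - head - maplen;
	if (head)
		munmap(resv, head);
	if (tail)
		munmap((char *)base + maplen, tail);
	return p;
}
```

A caller would use it as mptr = mmap_pmd_aligned(fd, fsize, PROT_READ|PROT_WRITE, MAP_SHARED), which is still exactly the kind of application change the patch aims to make unnecessary.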

> The NVML chooses appropriate addresses and gets a properly aligned
> address without any kernel code.

An application like NVML can continue to specify a specific address to
mmap().  Most existing applications, however, do not specify an address to
mmap().  With this patch, specifying an address will remain optional.

> > Change arch_get_unmapped_area() and arch_get_unmapped_area_topdown()
> > to request PMD_SIZE alignment when the request is for a DAX file and
> > its mapping range is large enough for using a PMD page.
>
> I think this is the wrong place for it, if we decide that this is the
> right thing to do.  The filesystem has a get_unmapped_area() which
> should be used instead.

Yes, I considered adding a filesystem entry point, but decided to go this
way because:
 - arch_get_unmapped_area() and arch_get_unmapped_area_topdown() are
arch-specific code.  Therefore, this filesystem entry point will need an
arch-specific implementation.
 - There is nothing filesystem specific about requesting PMD alignment.

> > 
> > @@ -157,6 +157,13 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
> >  		info.align_mask = get_align_mask();
> >  		info.align_offset += get_align_bits();
> >  	}
> > +	if (filp && IS_ENABLED(CONFIG_FS_DAX_PMD) && IS_DAX(file_inode(filp))) {
>
> And there's never a need for the IS_ENABLED.  IS_DAX() compiles to '0' if
> CONFIG_FS_DAX is disabled.

CONFIG_FS_DAX_PMD can be disabled while CONFIG_FS_DAX is enabled.

> And where would this end?  Would you also change this code to look for
> 1GB entries if CONFIG_FS_DAX_PUD is enabled?  Far better to have this
> in the individual filesystem (probably calling a common helper in the DAX
> code).

Yes, it can be easily extended to support PUD.  This avoids another round
of application changes to align with the PUD size.

If the PUD support turns out to be filesystem specific, we may need a
capability bit in addition to CONFIG_FS_DAX_PUD.

Thanks,
-Toshi

Matthew Wilcox April 7, 2016, 5:41 p.m. UTC | #3
On Wed, Apr 06, 2016 at 11:44:32AM -0600, Toshi Kani wrote:
> > The NVML chooses appropriate addresses and gets a properly aligned
> > address without any kernel code.
> 
> An application like NVML can continue to specify a specific address to
> mmap().  Most existing applications, however, do not specify an address to
> mmap().  With this patch, specifying an address will remain optional.

The point is that this *can* be done in userspace.  You need to sell us
on the advantages of doing it in the kernel.

> > I think this is the wrong place for it, if we decide that this is the
> > right thing to do.  The filesystem has a get_unmapped_area() which
> > should be used instead.
> 
> Yes, I considered adding a filesystem entry point, but decided going this
> way because:
>  - arch_get_unmapped_area() and arch_get_unmapped_area_topdown() are arch-
> specific code.  Therefore, this filesystem entry point will need arch-
> specific implementation. 
>  - There is nothing filesystem specific about requesting PMD alignment.

See http://article.gmane.org/gmane.linux.kernel.mm/149227 for Hugh's
approach for shmem.  I strongly believe that if we're going to do this
in the kernel, we should build on this approach, and not hack something
into each architecture's generic get_unmapped_area.

Kani, Toshi April 7, 2016, 9:20 p.m. UTC | #4
On Thu, 2016-04-07 at 13:41 -0400, Matthew Wilcox wrote:
> On Wed, Apr 06, 2016 at 11:44:32AM -0600, Toshi Kani wrote:
> > > 
> > > The NVML chooses appropriate addresses and gets a properly aligned
> > > address without any kernel code.
> >
> > An application like NVML can continue to specify a specific address to
> > mmap().  Most existing applications, however, do not specify an address
> > to mmap().  With this patch, specifying an address will remain
> > optional.
>
> The point is that this *can* be done in userspace.  You need to sell us
> on the advantages of doing it in the kernel.

Sure.  As I said, the point is that we do not need to modify existing
applications to use DAX PMD mappings.

For instance, fio with "ioengine=mmap" performs I/Os with mmap().
https://github.com/caius/fio/blob/master/engines/mmap.c

With this change, unmodified fio can be used for testing with DAX PMD
mappings.  There are many examples like this, and I do not think we want to
modify all applications that we want to evaluate/test with.

> > > I think this is the wrong place for it, if we decide that this is the
> > > right thing to do.  The filesystem has a get_unmapped_area() which
> > > should be used instead.
> >
> > Yes, I considered adding a filesystem entry point, but decided going
> > this way because:
> >  - arch_get_unmapped_area() and arch_get_unmapped_area_topdown() are
> > arch-specific code.  Therefore, this filesystem entry point will need
> > arch-specific implementation. 
> >  - There is nothing filesystem specific about requesting PMD alignment.
>
> See http://article.gmane.org/gmane.linux.kernel.mm/149227 for Hugh's
> approach for shmem.  I strongly believe that if we're going to do this
> in the kernel, we should build on this approach, and not hack something
> into each architecture's generic get_unmapped_area.

Thanks for the pointer.  Yes, we can call current->mm->get_unmapped_area()
with size + PMD_SIZE, and adjust with the alignment in a filesystem entry
point.  I will update the patch with this approach.

-Toshi
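
The adjustment step described in the last reply (ask for size + PMD_SIZE, then shift the result) reduces to a small piece of address arithmetic. The sketch below is a userspace illustration only, not the updated patch. The function name pmd_adjust and its parameters are hypothetical. It assumes raw is the address returned for the padded request and off is the file offset in bytes; the result is the lowest address at or above raw that is congruent to off modulo the PMD size, so that file offset and virtual address share the same PMD alignment.

```c
#include <assert.h>

/*
 * Hypothetical helper: given the start of a region that was requested with
 * pmd_size bytes of padding, return the address inside it whose offset
 * within a PMD matches that of the file offset.
 */
static unsigned long pmd_adjust(unsigned long raw, unsigned long off,
				unsigned long pmd_size)
{
	/* The mask arithmetic below only works for a power-of-two size. */
	assert(pmd_size && (pmd_size & (pmd_size - 1)) == 0);

	/* Advance by (off - raw) mod pmd_size, which is less than pmd_size. */
	return raw + ((off - raw) & (pmd_size - 1));
}
```

Because the shift is always smaller than pmd_size, the adjusted mapping still fits inside the padded region that was originally found.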

Patch

diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 10e0272..a294c66 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -157,6 +157,13 @@  arch_get_unmapped_area(struct file *filp, unsigned long addr,
 		info.align_mask = get_align_mask();
 		info.align_offset += get_align_bits();
 	}
+	if (filp && IS_ENABLED(CONFIG_FS_DAX_PMD) && IS_DAX(file_inode(filp))) {
+		unsigned long off_end = info.align_offset + len;
+		unsigned long off_pmd = round_up(info.align_offset, PMD_SIZE);
+
+		if ((off_end > off_pmd) && ((off_end - off_pmd) >= PMD_SIZE))
+			info.align_mask |= (PMD_SIZE - 1);
+	}
 	return vm_unmapped_area(&info);
 }
 
@@ -200,6 +207,13 @@  arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 		info.align_mask = get_align_mask();
 		info.align_offset += get_align_bits();
 	}
+	if (filp && IS_ENABLED(CONFIG_FS_DAX_PMD) && IS_DAX(file_inode(filp))) {
+		unsigned long off_end = info.align_offset + len;
+		unsigned long off_pmd = round_up(info.align_offset, PMD_SIZE);
+
+		if ((off_end > off_pmd) && ((off_end - off_pmd) >= PMD_SIZE))
+			info.align_mask |= (PMD_SIZE - 1);
+	}
 	addr = vm_unmapped_area(&info);
 	if (!(addr & ~PAGE_MASK))
 		return addr;
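
The "large enough for using a PMD page" test in both hunks can be checked in isolation. The following is a userspace restatement of that condition, not kernel code. It assumes a 2 MiB PMD size, takes off as the byte offset that the patch reads from info.align_offset, and uses hypothetical names throughout.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 2 MiB PMD size, as on x86-64 with 4 KiB base pages. */
#define PMD_SIZE_SKETCH (2UL << 20)

/* Round x up to a multiple of align, which must be a power of two. */
static unsigned long round_up_pow2(unsigned long x, unsigned long align)
{
	assert(align && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/*
 * True when [off, off + len) contains at least one whole PMD-aligned,
 * PMD-sized chunk, i.e. when the patch would request PMD alignment.
 */
static bool wants_pmd_alignment(unsigned long off, unsigned long len)
{
	unsigned long off_end = off + len;
	unsigned long off_pmd = round_up_pow2(off, PMD_SIZE_SKETCH);

	return off_end > off_pmd && off_end - off_pmd >= PMD_SIZE_SKETCH;
}
```

For example, a 2 MiB mapping at offset 0 qualifies, while a 2 MiB mapping at offset 4096 does not, because only 4096 bytes remain after the first PMD boundary.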