[RFC] pmem: advertise page alignment for pmem devices supporting fsdax
diff mbox series

Message ID 20190222182008.GT6503@magnolia
State New
Headers show
Series
  • [RFC] pmem: advertise page alignment for pmem devices supporting fsdax
Related show

Commit Message

Darrick J. Wong Feb. 22, 2019, 6:20 p.m. UTC
Hi all!

Uh, we have an internal customer <cough> who's been trying out MAP_SYNC
on pmem, and they've observed that one has to do a fair amount of
legwork (in the form of mkfs.xfs parameters) to get the kernel to set up
2M PMD mappings.  They (of course) want to mmap hundreds of GB of pmem,
so the PMD mappings are much more efficient.

I started poking around w.r.t. what mkfs.xfs was doing and realized that
if the fsdax pmem device advertised iomin/ioopt of 2MB, then mkfs will
set up all the parameters automatically.  Below is my ham-handed attempt
to teach the kernel to do this.

Comments, flames, "WTF is this guy smoking?" are all welcome. :)

--D

---
Configure pmem devices to advertise the default page alignment when said
block device supports fsdax.  Certain filesystems use these iomin/ioopt
hints to try to create aligned file extents, which makes it much easier
for mmaps to take advantage of huge page table entries.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 drivers/nvdimm/pmem.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Dan Williams Feb. 22, 2019, 6:28 p.m. UTC | #1
On Fri, Feb 22, 2019 at 10:21 AM Darrick J. Wong
<darrick.wong@oracle.com> wrote:
>
> Hi all!
>
> Uh, we have an internal customer <cough> who's been trying out MAP_SYNC
> on pmem, and they've observed that one has to do a fair amount of
> legwork (in the form of mkfs.xfs parameters) to get the kernel to set up
> 2M PMD mappings.  They (of course) want to mmap hundreds of GB of pmem,
> so the PMD mappings are much more efficient.
>
> I started poking around w.r.t. what mkfs.xfs was doing and realized that
> if the fsdax pmem device advertised iomin/ioopt of 2MB, then mkfs will
> set up all the parameters automatically.  Below is my ham-handed attempt
> to teach the kernel to do this.
>
> Comments, flames, "WTF is this guy smoking?" are all welcome. :)
>
> --D
>
> ---
> Configure pmem devices to advertise the default page alignment when said
> block device supports fsdax.  Certain filesystems use these iomin/ioopt
> hints to try to create aligned file extents, which makes it much easier
> for mmaps to take advantage of huge page table entries.
>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  drivers/nvdimm/pmem.c |    5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index bc2f700feef8..3eeb9dd117d5 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -441,8 +441,11 @@ static int pmem_attach_disk(struct device *dev,
>         blk_queue_logical_block_size(q, pmem_sector_size(ndns));
>         blk_queue_max_hw_sectors(q, UINT_MAX);
>         blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
> -       if (pmem->pfn_flags & PFN_MAP)
> +       if (pmem->pfn_flags & PFN_MAP) {
>                 blk_queue_flag_set(QUEUE_FLAG_DAX, q);
> +               blk_queue_io_min(q, PFN_DEFAULT_ALIGNMENT);
> +               blk_queue_io_opt(q, PFN_DEFAULT_ALIGNMENT);

The device alignment might sometimes be bigger than this default.
Would there be any detrimental effects for filesystems if io_min and
io_opt were set to 1GB?

I'm thinking and xfs-realtime configuration might be able to support
1GB mappings in the future.
Darrick J. Wong Feb. 22, 2019, 6:45 p.m. UTC | #2
On Fri, Feb 22, 2019 at 10:28:15AM -0800, Dan Williams wrote:
> On Fri, Feb 22, 2019 at 10:21 AM Darrick J. Wong
> <darrick.wong@oracle.com> wrote:
> >
> > Hi all!
> >
> > Uh, we have an internal customer <cough> who's been trying out MAP_SYNC
> > on pmem, and they've observed that one has to do a fair amount of
> > legwork (in the form of mkfs.xfs parameters) to get the kernel to set up
> > 2M PMD mappings.  They (of course) want to mmap hundreds of GB of pmem,
> > so the PMD mappings are much more efficient.
> >
> > I started poking around w.r.t. what mkfs.xfs was doing and realized that
> > if the fsdax pmem device advertised iomin/ioopt of 2MB, then mkfs will
> > set up all the parameters automatically.  Below is my ham-handed attempt
> > to teach the kernel to do this.
> >
> > Comments, flames, "WTF is this guy smoking?" are all welcome. :)
> >
> > --D
> >
> > ---
> > Configure pmem devices to advertise the default page alignment when said
> > block device supports fsdax.  Certain filesystems use these iomin/ioopt
> > hints to try to create aligned file extents, which makes it much easier
> > for mmaps to take advantage of huge page table entries.
> >
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  drivers/nvdimm/pmem.c |    5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > index bc2f700feef8..3eeb9dd117d5 100644
> > --- a/drivers/nvdimm/pmem.c
> > +++ b/drivers/nvdimm/pmem.c
> > @@ -441,8 +441,11 @@ static int pmem_attach_disk(struct device *dev,
> >         blk_queue_logical_block_size(q, pmem_sector_size(ndns));
> >         blk_queue_max_hw_sectors(q, UINT_MAX);
> >         blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
> > -       if (pmem->pfn_flags & PFN_MAP)
> > +       if (pmem->pfn_flags & PFN_MAP) {
> >                 blk_queue_flag_set(QUEUE_FLAG_DAX, q);
> > +               blk_queue_io_min(q, PFN_DEFAULT_ALIGNMENT);
> > +               blk_queue_io_opt(q, PFN_DEFAULT_ALIGNMENT);
> 
> The device alignment might sometimes be bigger than this default.
> Would there be any detrimental effects for filesystems if io_min and
> io_opt were set to 1GB?

Hmmm, that's going to be a struggle on ext4 and the xfs data device
because we'd be preferentially skipping the 1023.8MB immediately after
each allocation group's metadata.  It already does this now with a 2MB
io hint, but losing 1.8MB here and there isn't so bad.

We'd have to study it further, though; filesystems historically have
interpreted the iomin/ioopt hints as RAID striping geometry, and I don't
think very many people set up 1GB raid stripe units.

(I doubt very many people have done 2M raid stripes either, but it seems
to work easily where we've tried it...)

> I'm thinking and xfs-realtime configuration might be able to support
> 1GB mappings in the future.

The xfs realtime device ought to be able to support 1g alignment pretty
easily though. :)

--D
Dave Chinner Feb. 22, 2019, 11:11 p.m. UTC | #3
On Fri, Feb 22, 2019 at 10:20:08AM -0800, Darrick J. Wong wrote:
> Hi all!
> 
> Uh, we have an internal customer <cough> who's been trying out MAP_SYNC
> on pmem, and they've observed that one has to do a fair amount of
> legwork (in the form of mkfs.xfs parameters) to get the kernel to set up
> 2M PMD mappings.  They (of course) want to mmap hundreds of GB of pmem,
> so the PMD mappings are much more efficient.
> 
> I started poking around w.r.t. what mkfs.xfs was doing and realized that
> if the fsdax pmem device advertised iomin/ioopt of 2MB, then mkfs will
> set up all the parameters automatically.  Below is my ham-handed attempt
> to teach the kernel to do this.

What's the before and after mkfs output?

(need to see the context that this "fixes" before I comment)

Cheers,

Dave.
Darrick J. Wong Feb. 22, 2019, 11:28 p.m. UTC | #4
On Sat, Feb 23, 2019 at 10:11:36AM +1100, Dave Chinner wrote:
> On Fri, Feb 22, 2019 at 10:20:08AM -0800, Darrick J. Wong wrote:
> > Hi all!
> > 
> > Uh, we have an internal customer <cough> who's been trying out MAP_SYNC
> > on pmem, and they've observed that one has to do a fair amount of
> > legwork (in the form of mkfs.xfs parameters) to get the kernel to set up
> > 2M PMD mappings.  They (of course) want to mmap hundreds of GB of pmem,
> > so the PMD mappings are much more efficient.
> > 
> > I started poking around w.r.t. what mkfs.xfs was doing and realized that
> > if the fsdax pmem device advertised iomin/ioopt of 2MB, then mkfs will
> > set up all the parameters automatically.  Below is my ham-handed attempt
> > to teach the kernel to do this.
> 
> What's the before and after mkfs output?
> 
> (need to see the context that this "fixes" before I comment)

Here's what we do today assuming no options and 800GB pmem devices:

# blockdev --getiomin --getioopt /dev/pmem0 /dev/pmem1
4096
0
4096
0
# mkfs.xfs -N /dev/pmem0 -r rtdev=/dev/pmem1
meta-data=/dev/pmem0             isize=512    agcount=4, agsize=52428800 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=209715200, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=102400, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =/dev/pmem1             extsz=4096   blocks=209715200, rtextents=209715200

And here's what we do to get 2M aligned mappings:

# mkfs.xfs -N /dev/pmem0 -r rtdev=/dev/pmem1,extsize=2m -d su=2m,sw=1
meta-data=/dev/pmem0             isize=512    agcount=32, agsize=6553600 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=209715200, imaxpct=25
         =                       sunit=512    swidth=512 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=102400, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =/dev/pmem1             extsz=2097152 blocks=209715200, rtextents=409600

With this patch, things change as such:

# blockdev --getiomin --getioopt /dev/pmem0 /dev/pmem1
2097152
2097152
2097152
2097152
# mkfs.xfs -N /dev/pmem0 -r rtdev=/dev/pmem1
meta-data=/dev/pmem0             isize=512    agcount=32, agsize=6553600 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=209715200, imaxpct=25
         =                       sunit=512    swidth=512 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=102400, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =/dev/pmem1             extsz=2097152 blocks=209715200, rtextents=409600

I think the only change is the agcount, which for 2M mappings probably
isn't a huge deal.  It's obviously a bigger deal for 1G pages, assuming
we decide that's even advisable.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Feb. 22, 2019, 11:30 p.m. UTC | #5
On Fri, Feb 22, 2019 at 10:45:25AM -0800, Darrick J. Wong wrote:
> On Fri, Feb 22, 2019 at 10:28:15AM -0800, Dan Williams wrote:
> > On Fri, Feb 22, 2019 at 10:21 AM Darrick J. Wong
> > <darrick.wong@oracle.com> wrote:
> > >
> > > Hi all!
> > >
> > > Uh, we have an internal customer <cough> who's been trying out MAP_SYNC
> > > on pmem, and they've observed that one has to do a fair amount of
> > > legwork (in the form of mkfs.xfs parameters) to get the kernel to set up
> > > 2M PMD mappings.  They (of course) want to mmap hundreds of GB of pmem,
> > > so the PMD mappings are much more efficient.

Are you really saying that "mkfs.xfs -d su=2MB,sw=1 <dev>" is
considered "too much legwork" to set up the filesystem for DAX and
PMD alignment?


> > > I started poking around w.r.t. what mkfs.xfs was doing and realized that
> > > if the fsdax pmem device advertised iomin/ioopt of 2MB, then mkfs will
> > > set up all the parameters automatically.  Below is my ham-handed attempt
> > > to teach the kernel to do this.

Still need extent size hints so that writes that are smaller than
the PMD size are allocated correctly aligned and sized to map to
PMDs...

> > > Comments, flames, "WTF is this guy smoking?" are all welcome. :)
> > >
> > > --D
> > >
> > > ---
> > > Configure pmem devices to advertise the default page alignment when said
> > > block device supports fsdax.  Certain filesystems use these iomin/ioopt
> > > hints to try to create aligned file extents, which makes it much easier
> > > for mmaps to take advantage of huge page table entries.
> > >
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  drivers/nvdimm/pmem.c |    5 ++++-
> > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > > index bc2f700feef8..3eeb9dd117d5 100644
> > > --- a/drivers/nvdimm/pmem.c
> > > +++ b/drivers/nvdimm/pmem.c
> > > @@ -441,8 +441,11 @@ static int pmem_attach_disk(struct device *dev,
> > >         blk_queue_logical_block_size(q, pmem_sector_size(ndns));
> > >         blk_queue_max_hw_sectors(q, UINT_MAX);
> > >         blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
> > > -       if (pmem->pfn_flags & PFN_MAP)
> > > +       if (pmem->pfn_flags & PFN_MAP) {
> > >                 blk_queue_flag_set(QUEUE_FLAG_DAX, q);
> > > +               blk_queue_io_min(q, PFN_DEFAULT_ALIGNMENT);
> > > +               blk_queue_io_opt(q, PFN_DEFAULT_ALIGNMENT);
> > 
> > The device alignment might sometimes be bigger than this default.
> > Would there be any detrimental effects for filesystems if io_min and
> > io_opt were set to 1GB?
> 
> Hmmm, that's going to be a struggle on ext4 and the xfs data device
> because we'd be preferentially skipping the 1023.8MB immediately after
> each allocation group's metadata.  It already does this now with a 2MB
> io hint, but losing 1.8MB here and there isn't so bad.
> 
> We'd have to study it further, though; filesystems historically have
> interpreted the iomin/ioopt hints as RAID striping geometry, and I don't
> think very many people set up 1GB raid stripe units.

Setting sunit=1GB is really going to cause havoc with things like
inode chunk allocation alignment, and the first write() will either
have to be >=1GB or use 1GB extent size hints to trigger alignment.
And, AFAICT, it will prevent us from doing 2MB alignment on other
files, even with 2MB extent size hints set.

IOWs, I don't think 1GB alignment is a good idea as a default.

> (I doubt very many people have done 2M raid stripes either, but it seems
> to work easily where we've tried it...)

That's been pretty common with stacked hardware raid for as long as
I've worked on XFS. e.g. a software RAID0 stripe of hardware RAID5/6
luns was pretty common with large storage arrays in HPC environments
(i.e. huge streaming read/write bandwidth). In these cases, XFS was
set up with the RAID5/6 lun width as the stripe unit (commonly 2MB
with 8+1 and 256k raid chunk size), and the RAID 0
width as the stripe width (commonly 8-16 wide spread across 8-16 FC
portsi w/ multipath) and it wasn't uncommon to see widths in the
16-32MB range.

This aligned the filesystem to the underlying RAID5/6 luns, and
allows stripe width IO to be aligned an hit every RAID5/6 lun
evenly. Ensuring applications could do this easily with large direct
IO reads and writes is where the swalloc and largeio mount
options come into their own....

> > I'm thinking and xfs-realtime configuration might be able to support
> > 1GB mappings in the future.
> 
> The xfs realtime device ought to be able to support 1g alignment pretty
> easily though. :)

Yup, but I think that's the maximum "block" size it can support and
DAX will have some serious long tail latency and CPU usage issues at
allocation time because each new 1GB "block" that is dynamically
allocated will have to be completely zeroed during the allocation
inside the page fault handler.....

Cheers,

Dave.
Darrick J. Wong Feb. 23, 2019, 12:46 a.m. UTC | #6
On Sat, Feb 23, 2019 at 10:30:38AM +1100, Dave Chinner wrote:
> On Fri, Feb 22, 2019 at 10:45:25AM -0800, Darrick J. Wong wrote:
> > On Fri, Feb 22, 2019 at 10:28:15AM -0800, Dan Williams wrote:
> > > On Fri, Feb 22, 2019 at 10:21 AM Darrick J. Wong
> > > <darrick.wong@oracle.com> wrote:
> > > >
> > > > Hi all!
> > > >
> > > > Uh, we have an internal customer <cough> who's been trying out MAP_SYNC
> > > > on pmem, and they've observed that one has to do a fair amount of
> > > > legwork (in the form of mkfs.xfs parameters) to get the kernel to set up
> > > > 2M PMD mappings.  They (of course) want to mmap hundreds of GB of pmem,
> > > > so the PMD mappings are much more efficient.
> 
> Are you really saying that "mkfs.xfs -d su=2MB,sw=1 <dev>" is
> considered "too much legwork" to set up the filesystem for DAX and
> PMD alignment?

Yes.  I mean ... userspace /can/ figure out the page sizes on arm64 &
ppc64le (or extract it from sysfs), but why not just advertise it as a
io hint on the pmem "block" device?

Hmm, now having watched various xfstests blow up because they don't
expect blocks to be larger than 64k, maybe I'll rethink this as a
default behavior. :)

> > > > I started poking around w.r.t. what mkfs.xfs was doing and realized that
> > > > if the fsdax pmem device advertised iomin/ioopt of 2MB, then mkfs will
> > > > set up all the parameters automatically.  Below is my ham-handed attempt
> > > > to teach the kernel to do this.
> 
> Still need extent size hints so that writes that are smaller than
> the PMD size are allocated correctly aligned and sized to map to
> PMDs...

I think we're generally planning to use the RT device where we can make
2M alignment mandatory, so for the data device the effectiveness of the
extent hint doesn't really matter.

> > > > Comments, flames, "WTF is this guy smoking?" are all welcome. :)
> > > >
> > > > --D
> > > >
> > > > ---
> > > > Configure pmem devices to advertise the default page alignment when said
> > > > block device supports fsdax.  Certain filesystems use these iomin/ioopt
> > > > hints to try to create aligned file extents, which makes it much easier
> > > > for mmaps to take advantage of huge page table entries.
> > > >
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > >  drivers/nvdimm/pmem.c |    5 ++++-
> > > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > > > index bc2f700feef8..3eeb9dd117d5 100644
> > > > --- a/drivers/nvdimm/pmem.c
> > > > +++ b/drivers/nvdimm/pmem.c
> > > > @@ -441,8 +441,11 @@ static int pmem_attach_disk(struct device *dev,
> > > >         blk_queue_logical_block_size(q, pmem_sector_size(ndns));
> > > >         blk_queue_max_hw_sectors(q, UINT_MAX);
> > > >         blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
> > > > -       if (pmem->pfn_flags & PFN_MAP)
> > > > +       if (pmem->pfn_flags & PFN_MAP) {
> > > >                 blk_queue_flag_set(QUEUE_FLAG_DAX, q);
> > > > +               blk_queue_io_min(q, PFN_DEFAULT_ALIGNMENT);
> > > > +               blk_queue_io_opt(q, PFN_DEFAULT_ALIGNMENT);
> > > 
> > > The device alignment might sometimes be bigger than this default.
> > > Would there be any detrimental effects for filesystems if io_min and
> > > io_opt were set to 1GB?
> > 
> > Hmmm, that's going to be a struggle on ext4 and the xfs data device
> > because we'd be preferentially skipping the 1023.8MB immediately after
> > each allocation group's metadata.  It already does this now with a 2MB
> > io hint, but losing 1.8MB here and there isn't so bad.
> > 
> > We'd have to study it further, though; filesystems historically have
> > interpreted the iomin/ioopt hints as RAID striping geometry, and I don't
> > think very many people set up 1GB raid stripe units.
> 
> Setting sunit=1GB is really going to cause havoc with things like
> inode chunk allocation alignment, and the first write() will either
> have to be >=1GB or use 1GB extent size hints to trigger alignment.
> And, AFAICT, it will prevent us from doing 2MB alignment on other
> files, even with 2MB extent size hints set.
> 
> IOWs, I don't think 1GB alignment is a good idea as a default.

<nods>

> > (I doubt very many people have done 2M raid stripes either, but it seems
> > to work easily where we've tried it...)
> 
> That's been pretty common with stacked hardware raid for as long as
> I've worked on XFS. e.g. a software RAID0 stripe of hardware RAID5/6
> luns was pretty common with large storage arrays in HPC environments
> (i.e. huge streaming read/write bandwidth). In these cases, XFS was
> set up with the RAID5/6 lun width as the stripe unit (commonly 2MB
> with 8+1 and 256k raid chunk size), and the RAID 0
> width as the stripe width (commonly 8-16 wide spread across 8-16 FC
> portsi w/ multipath) and it wasn't uncommon to see widths in the
> 16-32MB range.
> 
> This aligned the filesystem to the underlying RAID5/6 luns, and
> allows stripe width IO to be aligned an hit every RAID5/6 lun
> evenly. Ensuring applications could do this easily with large direct
> IO reads and writes is where the swalloc and largeio mount
> options come into their own....

<nod>

> > > I'm thinking and xfs-realtime configuration might be able to support
> > > 1GB mappings in the future.
> > 
> > The xfs realtime device ought to be able to support 1g alignment pretty
> > easily though. :)
> 
> Yup, but I think that's the maximum "block" size it can support and

It is; our users with 16G page size are out of luck.

> DAX will have some serious long tail latency and CPU usage issues at
> allocation time because each new 1GB "block" that is dynamically
> allocated will have to be completely zeroed during the allocation
> inside the page fault handler.....

Agreed, 1G pages are most probably too unwieldly to be worth advertising.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Patch
diff mbox series

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index bc2f700feef8..3eeb9dd117d5 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -441,8 +441,11 @@  static int pmem_attach_disk(struct device *dev,
 	blk_queue_logical_block_size(q, pmem_sector_size(ndns));
 	blk_queue_max_hw_sectors(q, UINT_MAX);
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
-	if (pmem->pfn_flags & PFN_MAP)
+	if (pmem->pfn_flags & PFN_MAP) {
 		blk_queue_flag_set(QUEUE_FLAG_DAX, q);
+		blk_queue_io_min(q, PFN_DEFAULT_ALIGNMENT);
+		blk_queue_io_opt(q, PFN_DEFAULT_ALIGNMENT);
+	}
 	q->queuedata = pmem;
 
 	disk = alloc_disk_node(0, nid);