diff mbox

libnvdimm, pmem: Change default pmem sector size to 4K

Message ID 1437519983-17306-1-git-send-email-vishal.l.verma@intel.com (mailing list archive)
State Superseded
Delegated to: Dan Williams
Headers show

Commit Message

Vishal Verma July 21, 2015, 11:06 p.m. UTC
4K is more naturally aligned w.r.t pages, allows us to receive simpler
bios, and rejects creation of filesystems with sub-4K block sizes (DAX
requires -b 4096).

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
 drivers/nvdimm/pmem.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Boaz Harrosh July 22, 2015, noon UTC | #1
On 07/22/2015 02:06 AM, Vishal Verma wrote:
> 4K is more naturally aligned w.r.t pages, allows us to receive simpler
> bios, and rejects creation of filesystems with sub-4K block sizes (DAX
> requires -b 4096).
> 
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
> ---
>  drivers/nvdimm/pmem.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index ade9eb9..7f85537 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -162,6 +162,7 @@ static int pmem_attach_disk(struct nd_namespace_common *ndns,
>  		return -ENOMEM;
>  
>  	blk_queue_make_request(pmem->pmem_queue, pmem_make_request);
> +	blk_queue_logical_block_size(pmem->pmem_queue, SZ_4K);

Did you actually test this? I think logical_block_size means the addressing
is 4k (and phys == 4k). ie address(1) means read from 4kB offset.

As opposed to logical (addressing) 512B but physical=4k ie. atomic unit
is 4k.

In current code it does not matter match but it might be confusing to
external tools.

>  	blk_queue_max_hw_sectors(pmem->pmem_queue, UINT_MAX);
>  	blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);
>  	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, pmem->pmem_queue);
> 

Thanks
Boaz
Matthew Wilcox July 22, 2015, 1:40 p.m. UTC | #2
On Tue, Jul 21, 2015 at 05:06:23PM -0600, Vishal Verma wrote:
> 4K is more naturally aligned w.r.t pages, allows us to receive simpler
> bios, and rejects creation of filesystems with sub-4K block sizes (DAX
> requires -b 4096).

Should we say 'PAGE_SIZE' instead of 'SZ_4K'?

>  	blk_queue_make_request(pmem->pmem_queue, pmem_make_request);
> +	blk_queue_logical_block_size(pmem->pmem_queue, SZ_4K);
>  	blk_queue_max_hw_sectors(pmem->pmem_queue, UINT_MAX);
Vishal Verma July 22, 2015, 5:15 p.m. UTC | #3
On Wed, 2015-07-22 at 15:00 +0300, Boaz Harrosh wrote:
> On 07/22/2015 02:06 AM, Vishal Verma wrote:
> > 4K is more naturally aligned w.r.t pages, allows us to receive 
> > simpler
> > bios, and rejects creation of filesystems with sub-4K block sizes 
> > (DAX
> > requires -b 4096).
> > 
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> > Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
> > ---
> >  drivers/nvdimm/pmem.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > index ade9eb9..7f85537 100644
> > --- a/drivers/nvdimm/pmem.c
> > +++ b/drivers/nvdimm/pmem.c
> > @@ -162,6 +162,7 @@ static int pmem_attach_disk(struct 
> > nd_namespace_common *ndns,
> >  		return -ENOMEM;
> >  
> >  	blk_queue_make_request(pmem->pmem_queue, 
> > pmem_make_request);
> > +	blk_queue_logical_block_size(pmem->pmem_queue, SZ_4K);
> 
> Did you actually test this? I think logical_block_size means the 
> addressing
> is 4k (and phys == 4k). ie address(1) means read from 4kB offset.
> 
> As opposed to logical (addressing) 512B but physical=4k ie. atomic 
> unit
> is 4k.

I ran xfstests on it, yes :)
The only problem a 4k (or PAGE_SIZE, rather) logical block size might
cause is some extra read-modify-writes in the page cache if something
is sending down sub-4K IOs. In return we get less confusion with DAX
mounts not failing.

> 
> In current code it does not matter match but it might be confusing to
> external tools.

mkfs, mount, fdisk, fio, and everything xfstests does are fine with
this - what other tools might find it confusing?

> 
> >  	blk_queue_max_hw_sectors(pmem->pmem_queue, UINT_MAX);
> >  	blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);
> >  	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, pmem
> > ->pmem_queue);
> > 
> 
> Thanks
> Boaz
>
Vishal Verma July 22, 2015, 5:16 p.m. UTC | #4
On Wed, 2015-07-22 at 09:40 -0400, Matthew Wilcox wrote:
> On Tue, Jul 21, 2015 at 05:06:23PM -0600, Vishal Verma wrote:
> > 4K is more naturally aligned w.r.t pages, allows us to receive 
> > simpler
> > bios, and rejects creation of filesystems with sub-4K block sizes 
> > (DAX
> > requires -b 4096).
> 
> Should we say 'PAGE_SIZE' instead of 'SZ_4K'?
> 


Good point, I'll send an update!
Dan Williams July 22, 2015, 5:18 p.m. UTC | #5
On Wed, Jul 22, 2015 at 10:16 AM, Verma, Vishal L
<vishal.l.verma@intel.com> wrote:
> On Wed, 2015-07-22 at 09:40 -0400, Matthew Wilcox wrote:
>> On Tue, Jul 21, 2015 at 05:06:23PM -0600, Vishal Verma wrote:
>> > 4K is more naturally aligned w.r.t pages, allows us to receive
>> > simpler
>> > bios, and rejects creation of filesystems with sub-4K block sizes
>> > (DAX
>> > requires -b 4096).
>>
>> Should we say 'PAGE_SIZE' instead of 'SZ_4K'?
>>
>
>
> Good point, I'll send an update!

Hmm, what about cases where the arch PAGE_SIZE > 4K?
Vishal Verma July 22, 2015, 5:20 p.m. UTC | #6
On Wed, 2015-07-22 at 10:18 -0700, Dan Williams wrote:
> On Wed, Jul 22, 2015 at 10:16 AM, Verma, Vishal L
> <vishal.l.verma@intel.com> wrote:
> > On Wed, 2015-07-22 at 09:40 -0400, Matthew Wilcox wrote:
> > > On Tue, Jul 21, 2015 at 05:06:23PM -0600, Vishal Verma wrote:
> > > > 4K is more naturally aligned w.r.t pages, allows us to receive
> > > > simpler
> > > > bios, and rejects creation of filesystems with sub-4K block 
> > > > sizes
> > > > (DAX
> > > > requires -b 4096).
> > > 
> > > Should we say 'PAGE_SIZE' instead of 'SZ_4K'?
> > > 
> > 
> > 
> > Good point, I'll send an update!
> 
> Hmm, what about cases where the arch PAGE_SIZE > 4K?

DAX requires fs block size == PAGE_SIZE, and if it is > 4K, shouldn't
the rest of the block stack Just Work?
Dan Williams July 22, 2015, 5:27 p.m. UTC | #7
On Wed, Jul 22, 2015 at 10:20 AM, Verma, Vishal L
<vishal.l.verma@intel.com> wrote:
> On Wed, 2015-07-22 at 10:18 -0700, Dan Williams wrote:
>> On Wed, Jul 22, 2015 at 10:16 AM, Verma, Vishal L
>> <vishal.l.verma@intel.com> wrote:
>> > On Wed, 2015-07-22 at 09:40 -0400, Matthew Wilcox wrote:
>> > > On Tue, Jul 21, 2015 at 05:06:23PM -0600, Vishal Verma wrote:
>> > > > 4K is more naturally aligned w.r.t pages, allows us to receive
>> > > > simpler
>> > > > bios, and rejects creation of filesystems with sub-4K block
>> > > > sizes
>> > > > (DAX
>> > > > requires -b 4096).
>> > >
>> > > Should we say 'PAGE_SIZE' instead of 'SZ_4K'?
>> > >
>> >
>> >
>> > Good point, I'll send an update!
>>
>> Hmm, what about cases where the arch PAGE_SIZE > 4K?
>
> DAX requires fs block size == PAGE_SIZE, and if it is > 4K, shouldn't
> the rest of the block stack Just Work?

I honestly don't know if the block layer will "just work" with a 64K
sector device.  Filesystems need special consideration for PAGE_SIZE >
fs-block-size... I wouldn't bet on it blindly working, but I haven't
looked closely.
Matthew Wilcox July 22, 2015, 5:33 p.m. UTC | #8
On Wed, Jul 22, 2015 at 10:27:46AM -0700, Dan Williams wrote:
> On Wed, Jul 22, 2015 at 10:20 AM, Verma, Vishal L
> <vishal.l.verma@intel.com> wrote:
> > DAX requires fs block size == PAGE_SIZE, and if it is > 4K, shouldn't
> > the rest of the block stack Just Work?
> 
> I honestly don't know if the block layer will "just work" with a 64K
> sector device.  Filesystems need special consideration for PAGE_SIZE >
> fs-block-size... I wouldn't bet on it blindly working, but I haven't
> looked closely.

I'm pretty sure the PowerPC people have been using 64k block size ext3/4
filesytems for a while.  I know some people used 16k block size ext3
filesystems when I was working on itanium.  The problem with them was
that you couldn't mount those filesystems on a kernel with a 4k PAGE_SIZE.

For a filesystem on DIMMs, that's less of a problem ... but if you make
PAGE_SIZE a config option (like itanium and powerpc do), then it can be
a bit of a pain.
Boaz Harrosh July 22, 2015, 6:18 p.m. UTC | #9
On 07/22/2015 08:15 PM, Verma, Vishal L wrote:
> 
> I ran xfstests on it, yes :)
> The only problem a 4k (or PAGE_SIZE, rather) logical block size might
> cause is some extra read-modify-writes in the page cache if something
> is sending down sub-4K IOs. 

Exactly this is the part I do not like. We are actually byte capable
and now we'll do all kind of crazy read-modify-write slowness for
no good reason.

> In return we get less confusion with DAX mounts not failing.
> 

So I've put a very similar patch for brd that sets the PHYS sector
and it soles all these problems all the same, with out the added
read-modify-write, and address translations at upper stack.

I actually had the same patch for pmem, out-of-tree at the time,
and Ross never took it. so it got lost.

In anyway I think both brd and pmem should have the same setting
since it is trying to solve the same problem. So what ever is decided
it should be the same at both places.

<>
> 
> mkfs, mount, fdisk, fio, and everything xfstests does are fine with
> this - what other tools might find it confusing?
> 

Right, by setting:
		blk_queue_physical_block_size(brd->brd_queue, PAGE_SIZE);

All your problems are solved and no un-needed copies are made

please see this patch:
	[c8fa317] brd: Request from fdisk 4k alignment

Thanks
Boaz
Vishal Verma July 22, 2015, 8:57 p.m. UTC | #10
On Wed, 2015-07-22 at 21:18 +0300, Boaz Harrosh wrote:
> On 07/22/2015 08:15 PM, Verma, Vishal L wrote:
> > 
> > I ran xfstests on it, yes :)
> > The only problem a 4k (or PAGE_SIZE, rather) logical block size 
> > might
> > cause is some extra read-modify-writes in the page cache if 
> > something
> > is sending down sub-4K IOs. 
> 
> Exactly this is the part I do not like. We are actually byte capable
> and now we'll do all kind of crazy read-modify-write slowness for
> no good reason.
> 
> > In return we get less confusion with DAX mounts not failing.
> > 
> 
> So I've put a very similar patch for brd that sets the PHYS sector
> and it soles all these problems all the same, with out the added
> read-modify-write, and address translations at upper stack.
> 

Note that with just the physical sector size set to 4K, mkfs only emits
a warning, but with a logical sector size of 4K, it fails[1]. I suppose
this should be OK as we don't want to force people to create DAX
capable file systems only.

	-Vishal


[1] Comparing logical & physical sector size at 4K to only physical at
4K:

Logical sector size of 4K (implies physical):

$ sudo fdisk -l /dev/pmem0

Disk /dev/pmem0: 8589 MB, 8589934592 bytes, 2097152 sectors
Units = sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

$ sudo mkfs.ext4 -b 1024 /dev/pmem0
mke2fs 1.42.9 (28-Dec-2013)
mkfs.ext4: Invalid argument while setting blocksize; too small for
device

Physical sector size of 4K (with logical = 512):

$ sudo fdisk -l /dev/pmem0

Disk /dev/pmem0: 8589 MB, 8589934592 bytes, 16777216 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

$ sudo mkfs.ext4 -b 1024 /dev/pmem0
mke2fs 1.42.9 (28-Dec-2013)
Warning: specified blocksize 1024 is less than device physical
sectorsize 4096
...
...
<succeeds>
Elliott, Robert (Server Storage) July 22, 2015, 9:41 p.m. UTC | #11
> -----Original Message-----
> From: Linux-nvdimm [mailto:linux-nvdimm-bounces@lists.01.org] On Behalf Of
> Verma, Vishal L
> Sent: Wednesday, July 22, 2015 3:58 PM
> To: linux-nvdimm@lists.01.org; boaz@plexistor.com
> Cc: hch@lst.de; Wilcox, Matthew R <matthew.r.wilcox@intel.com>
> Subject: Re: [PATCH] libnvdimm, pmem: Change default pmem sector size to
> 4K
> 
> On Wed, 2015-07-22 at 21:18 +0300, Boaz Harrosh wrote:
> > On 07/22/2015 08:15 PM, Verma, Vishal L wrote:
> > >
> > > I ran xfstests on it, yes :)
> > > The only problem a 4k (or PAGE_SIZE, rather) logical block size
> > > might
> > > cause is some extra read-modify-writes in the page cache if
> > > something
> > > is sending down sub-4K IOs.
> >
> > Exactly this is the part I do not like. We are actually byte capable
> > and now we'll do all kind of crazy read-modify-write slowness for
> > no good reason.
> >
> > > In return we get less confusion with DAX mounts not failing.
> > >
> >
> > So I've put a very similar patch for brd that sets the PHYS sector
> > and it soles all these problems all the same, with out the added
> > read-modify-write, and address translations at upper stack.

The physical block size is just advisory; software is not required
to align to it.  The device is still expected to support arbitrary
logical block accesses, performing read-modify-writes as needed.

If DAX and O_DIRECT capability requires page size (e.g., 4 KiB) 
alignment, then the only way to ensure compliance is to report that 
as the logical block size.  The open(2) documentation for O_DIRECT
says that alignment to the logical block size (BLKSSZGET ioctl) 
is all that has been required since kernel 2.6.

However, using a non-512 byte logical block size runs into a variety
of glaring and subtle software bugs that assume 512 byte sectors.
It's the same issue "Advanced Format" drives faced with 512e vs. 4Kn.
UEFI based systems have a bit more hope of working than legacy 
BIOS systems.

Even if you sidestep all the bugs, you cannot copy an image that 
contains LBA and Size values between a 512-byte device and a
non-512 byte device; they all need to be recalculated.  

For pmem, if you were using the device on a system with a 4 KiB
page size and move it to a system with a larger page size, you'd
silently render the data incorrect in its new home by assuming
the system page size is the logical block size.

Leaving the logical block size at 512 bytes avoids data compatibility
issues (like not growing "byte" to mean more than 8 bits even 
as CPUs grew to have wider registers).

We could add a new physical block size-like field that conveys
any mandatory alignment for O_DIRECT usage, but that requires
software changes too.

> Note that with just the physical sector size set to 4K, mkfs only emits
> a warning, but with a logical sector size of 4K, it fails[1]. I suppose
> this should be OK as we don't want to force people to create DAX
> capable file systems only.

That's expected - filesystems use integer values for LBAs, and cannot
point to a fraction of an LBA.  If each LBA points to 4 KiB, then
the filesystem block size (aka cluster size) needs to be an integer 
multiple of 4 KiB.
Boaz Harrosh July 23, 2015, 8:53 a.m. UTC | #12
On 07/23/2015 12:41 AM, Elliott, Robert (Server Storage) wrote:
> 
> We could add a new physical block size-like field that conveys
> any  usage, but that requires
> software changes too.
> 
>> Note that with just the physical sector size set to 4K, mkfs only emits
>> a warning, but with a logical sector size of 4K, it fails[1]. I suppose
>> this should be OK as we don't want to force people to create DAX
>> capable file systems only.
> 
> That's expected - filesystems use integer values for LBAs, and cannot
> point to a fraction of an LBA.  If each LBA points to 4 KiB, then
> the filesystem block size (aka cluster size) needs to be an integer 
> multiple of 4 KiB.
> 

Guys what are you talking about?? pmem has no such mandatory alignment for O_DIRECT
and no restriction of any alignment "what so ever". It is memory and you can
do *byte aligned* direct accesses. Please stop the nonsense. (In fact it works,
do a dax mount take any pointer open O_DIRECT and do IO, it works. Actually
any "buffered" IO is done O_DIRECT, with mmap you have byte access directly to
storage)

Nor does it have any kind of other restrictions.

The only *small* problem is the partition alignment. It must be PAGE_SIZE aligned
and that is not because of any hardware limitations, it is because of the
->direct_access() API and the way we chose to implement DAX. So the only thing
that needs to do its thing is fdisk, which does so usually. fdisk default is
1M. (And who needs partitions on pmem anyway?)

mkfs.ext4 and the "-o dax" mounting is a different story. You must not force
anyone to enable support for "-o dax". The FS on pmem is perfectly usable
for any block size only the "-o dax" will complain.

Cheers
Boaz
diff mbox

Patch

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index ade9eb9..7f85537 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -162,6 +162,7 @@  static int pmem_attach_disk(struct nd_namespace_common *ndns,
 		return -ENOMEM;
 
 	blk_queue_make_request(pmem->pmem_queue, pmem_make_request);
+	blk_queue_logical_block_size(pmem->pmem_queue, SZ_4K);
 	blk_queue_max_hw_sectors(pmem->pmem_queue, UINT_MAX);
 	blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);
 	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, pmem->pmem_queue);