diff mbox series

[01/19] dax: remove block device dependencies

Message ID 20190821175720.25901-2-vgoyal@redhat.com (mailing list archive)
State New, archived
Headers show
Series virtio-fs: Enable DAX support | expand

Commit Message

Vivek Goyal Aug. 21, 2019, 5:57 p.m. UTC
From: Stefan Hajnoczi <stefanha@redhat.com>

Although struct dax_device itself is not tied to a block device, some
DAX code assumes there is a block device.  Make block devices optional
by allowing bdev to be NULL in commonly used DAX APIs.

When there is no block device:
 * Skip the partition offset calculation in bdev_dax_pgoff()
 * Skip the blkdev_issue_zeroout() optimization

Note that more block device assumptions remain but I haven't reach those
code paths yet.

Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 drivers/dax/super.c | 3 ++-
 fs/dax.c            | 7 ++++++-
 2 files changed, 8 insertions(+), 2 deletions(-)

Comments

Christoph Hellwig Aug. 26, 2019, 11:51 a.m. UTC | #1
On Wed, Aug 21, 2019 at 01:57:02PM -0400, Vivek Goyal wrote:
> From: Stefan Hajnoczi <stefanha@redhat.com>
> 
> Although struct dax_device itself is not tied to a block device, some
> DAX code assumes there is a block device.  Make block devices optional
> by allowing bdev to be NULL in commonly used DAX APIs.
> 
> When there is no block device:
>  * Skip the partition offset calculation in bdev_dax_pgoff()
>  * Skip the blkdev_issue_zeroout() optimization
> 
> Note that more block device assumptions remain but I haven't reach those
> code paths yet.

I think this should be split into two patches.  For bdev_dax_pgoff
I'd much rather have the partition offset if there is on in the daxdev
somehow so that we can get rid of the block device entirely.

Similarly for dax_range_is_aligned I'd rather have a pure dax way
to offload zeroing rather than this bdev hack.

In the long run I'd really like to make the bdev vs daxdev in iomap a
union instead of having to carry both around.
Vivek Goyal Aug. 27, 2019, 4:38 p.m. UTC | #2
On Mon, Aug 26, 2019 at 04:51:52AM -0700, Christoph Hellwig wrote:
> On Wed, Aug 21, 2019 at 01:57:02PM -0400, Vivek Goyal wrote:
> > From: Stefan Hajnoczi <stefanha@redhat.com>
> > 
> > Although struct dax_device itself is not tied to a block device, some
> > DAX code assumes there is a block device.  Make block devices optional
> > by allowing bdev to be NULL in commonly used DAX APIs.
> > 
> > When there is no block device:
> >  * Skip the partition offset calculation in bdev_dax_pgoff()
> >  * Skip the blkdev_issue_zeroout() optimization
> > 
> > Note that more block device assumptions remain but I haven't reach those
> > code paths yet.
> 
> I think this should be split into two patches.

Hi Christoph,

Ok, will split in two patches. In fact, I think will completley drop
the second change right now as I think we might not be hitting that
path yet.

> For bdev_dax_pgoff
> I'd much rather have the partition offset if there is on in the daxdev
> somehow so that we can get rid of the block device entirely.

IIUC, there is one block_device per partition while there is only one
dax_device for the whole disk. So we can't directly move bdev logical
offset into dax_device.

We probably could put this in "iomap" and leave it to filesystems to
report offset into dax_dev in iomap that way dax generic code does not
have to deal with it. But that probably will be a bigger change.

Did I misunderstand your suggestion.

> 
> Similarly for dax_range_is_aligned I'd rather have a pure dax way
> to offload zeroing rather than this bdev hack.

Following commig introduced the change to write zeros through block
device path.

commit 4b0228fa1d753f77fe0e6cf4c41398ec77dfbd2a
Author: Vishal Verma <vishal.l.verma@intel.com>
Date:   Thu Apr 21 15:13:46 2016 -0400

 dax: for truncate/hole-punch, do zeroing through the driver if possible

IIUC, they are doing it so that they can clear gendisk->badblocks list.

So even if there is pure dax way to do it, there will have to some
involvment of block layer to clear gendisk->badblocks list.

I am not sure I fully understand your suggestion. But I am hoping its
not a must for these changes to make a progress. For now, I will drop
change to dax_range_is_aligned().

Thanks
Vivek
Christoph Hellwig Aug. 28, 2019, 6:58 a.m. UTC | #3
On Tue, Aug 27, 2019 at 12:38:28PM -0400, Vivek Goyal wrote:
> > For bdev_dax_pgoff
> > I'd much rather have the partition offset if there is on in the daxdev
> > somehow so that we can get rid of the block device entirely.
> 
> IIUC, there is one block_device per partition while there is only one
> dax_device for the whole disk. So we can't directly move bdev logical
> offset into dax_device.

Well, then we need to find a way to get partitions for dax devices,
as we really should not expect a block device hiding behind a dax
dev.  That is just a weird legacy assumption - block device need to
layer on top of the dax device optionally.

> 
> We probably could put this in "iomap" and leave it to filesystems to
> report offset into dax_dev in iomap that way dax generic code does not
> have to deal with it. But that probably will be a bigger change.

And where would the file system get that information from?

> commit 4b0228fa1d753f77fe0e6cf4c41398ec77dfbd2a
> Author: Vishal Verma <vishal.l.verma@intel.com>
> Date:   Thu Apr 21 15:13:46 2016 -0400
> 
>  dax: for truncate/hole-punch, do zeroing through the driver if possible
> 
> IIUC, they are doing it so that they can clear gendisk->badblocks list.
> 
> So even if there is pure dax way to do it, there will have to some
> involvment of block layer to clear gendisk->badblocks list.

Once again we need to move that list to the dax device, as the
assumption that there is a block device associated with the dax dev
is flawed.

> I am not sure I fully understand your suggestion. But I am hoping its
> not a must for these changes to make a progress. For now, I will drop
> change to dax_range_is_aligned().

Well, someone needs to clean this mess up, and as you have an actual
real life example of a dax dev without the block device I think the
burden naturally falls on you.
Vivek Goyal Aug. 28, 2019, 5:58 p.m. UTC | #4
On Tue, Aug 27, 2019 at 11:58:09PM -0700, Christoph Hellwig wrote:
> On Tue, Aug 27, 2019 at 12:38:28PM -0400, Vivek Goyal wrote:
> > > For bdev_dax_pgoff
> > > I'd much rather have the partition offset if there is on in the daxdev
> > > somehow so that we can get rid of the block device entirely.
> > 
> > IIUC, there is one block_device per partition while there is only one
> > dax_device for the whole disk. So we can't directly move bdev logical
> > offset into dax_device.
> 
> Well, then we need to find a way to get partitions for dax devices,
> as we really should not expect a block device hiding behind a dax
> dev.  That is just a weird legacy assumption - block device need to
> layer on top of the dax device optionally.
> 
> > 
> > We probably could put this in "iomap" and leave it to filesystems to
> > report offset into dax_dev in iomap that way dax generic code does not
> > have to deal with it. But that probably will be a bigger change.
> 
> And where would the file system get that information from?

File system knows about block device, can it just call get_start_sect()
while filling iomap->addr. And this means we don't have to have
parition information in dax device. Will something like following work?
(Just a proof of concept patch).


---
 drivers/dax/super.c |   11 +++++++++++
 fs/dax.c            |    6 +++---
 fs/ext4/inode.c     |    6 +++++-
 include/linux/dax.h |    1 +
 4 files changed, 20 insertions(+), 4 deletions(-)

Index: rhvgoyal-linux/fs/ext4/inode.c
===================================================================
--- rhvgoyal-linux.orig/fs/ext4/inode.c	2019-08-28 13:51:16.051937204 -0400
+++ rhvgoyal-linux/fs/ext4/inode.c	2019-08-28 13:51:44.453937204 -0400
@@ -3589,7 +3589,11 @@ retry:
 			WARN_ON_ONCE(1);
 			return -EIO;
 		}
-		iomap->addr = (u64)map.m_pblk << blkbits;
+		if (IS_DAX(inode))
+			iomap->addr = ((u64)map.m_pblk << blkbits) +
+				      (get_start_sect(iomap->bdev) * 512);
+		else
+			iomap->addr = (u64)map.m_pblk << blkbits;
 	}
 
 	if (map.m_flags & EXT4_MAP_NEW)
Index: rhvgoyal-linux/fs/dax.c
===================================================================
--- rhvgoyal-linux.orig/fs/dax.c	2019-08-28 13:51:16.051937204 -0400
+++ rhvgoyal-linux/fs/dax.c	2019-08-28 13:51:44.457937204 -0400
@@ -688,7 +688,7 @@ static int copy_user_dax(struct block_de
 	long rc;
 	int id;
 
-	rc = bdev_dax_pgoff(bdev, sector, size, &pgoff);
+	rc = dax_pgoff(sector, size, &pgoff);
 	if (rc)
 		return rc;
 
@@ -995,7 +995,7 @@ static int dax_iomap_pfn(struct iomap *i
 	int id, rc;
 	long length;
 
-	rc = bdev_dax_pgoff(iomap->bdev, sector, size, &pgoff);
+	rc = dax_pgoff(sector, size, &pgoff);
 	if (rc)
 		return rc;
 	id = dax_read_lock();
@@ -1137,7 +1137,7 @@ dax_iomap_actor(struct inode *inode, lof
 			break;
 		}
 
-		ret = bdev_dax_pgoff(bdev, sector, size, &pgoff);
+		ret = dax_pgoff(sector, size, &pgoff);
 		if (ret)
 			break;
 
Index: rhvgoyal-linux/drivers/dax/super.c
===================================================================
--- rhvgoyal-linux.orig/drivers/dax/super.c	2019-08-28 13:51:51.802937204 -0400
+++ rhvgoyal-linux/drivers/dax/super.c	2019-08-28 13:51:56.905937204 -0400
@@ -43,6 +43,17 @@ EXPORT_SYMBOL_GPL(dax_read_unlock);
 #ifdef CONFIG_BLOCK
 #include <linux/blkdev.h>
 
+int dax_pgoff(sector_t sector, size_t size, pgoff_t *pgoff)
+{
+	phys_addr_t phys_off = sector * 512;
+
+	if (pgoff)
+		*pgoff = PHYS_PFN(phys_off);
+	if (phys_off % PAGE_SIZE || size % PAGE_SIZE)
+		return -EINVAL;
+	return 0;
+}
+
 int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
 		pgoff_t *pgoff)
 {
Index: rhvgoyal-linux/include/linux/dax.h
===================================================================
--- rhvgoyal-linux.orig/include/linux/dax.h	2019-08-28 13:51:51.802937204 -0400
+++ rhvgoyal-linux/include/linux/dax.h	2019-08-28 13:51:56.908937204 -0400
@@ -111,6 +111,7 @@ static inline bool daxdev_mapping_suppor
 
 struct writeback_control;
 int bdev_dax_pgoff(struct block_device *, sector_t, size_t, pgoff_t *pgoff);
+int dax_pgoff(sector_t, size_t, pgoff_t *pgoff);
 #if IS_ENABLED(CONFIG_FS_DAX)
 bool __bdev_dax_supported(struct block_device *bdev, int blocksize);
 static inline bool bdev_dax_supported(struct block_device *bdev, int blocksize)
Dave Chinner Aug. 28, 2019, 10:53 p.m. UTC | #5
On Wed, Aug 28, 2019 at 01:58:43PM -0400, Vivek Goyal wrote:
> On Tue, Aug 27, 2019 at 11:58:09PM -0700, Christoph Hellwig wrote:
> > On Tue, Aug 27, 2019 at 12:38:28PM -0400, Vivek Goyal wrote:
> > > > For bdev_dax_pgoff
> > > > I'd much rather have the partition offset if there is on in the daxdev
> > > > somehow so that we can get rid of the block device entirely.
> > > 
> > > IIUC, there is one block_device per partition while there is only one
> > > dax_device for the whole disk. So we can't directly move bdev logical
> > > offset into dax_device.
> > 
> > Well, then we need to find a way to get partitions for dax devices,
> > as we really should not expect a block device hiding behind a dax
> > dev.  That is just a weird legacy assumption - block device need to
> > layer on top of the dax device optionally.
> > 
> > > 
> > > We probably could put this in "iomap" and leave it to filesystems to
> > > report offset into dax_dev in iomap that way dax generic code does not
> > > have to deal with it. But that probably will be a bigger change.
> > 
> > And where would the file system get that information from?
> 
> File system knows about block device, can it just call get_start_sect()
> while filling iomap->addr. And this means we don't have to have
> parition information in dax device. Will something like following work?
> (Just a proof of concept patch).
> 
> 
> ---
>  drivers/dax/super.c |   11 +++++++++++
>  fs/dax.c            |    6 +++---
>  fs/ext4/inode.c     |    6 +++++-
>  include/linux/dax.h |    1 +
>  4 files changed, 20 insertions(+), 4 deletions(-)
> 
> Index: rhvgoyal-linux/fs/ext4/inode.c
> ===================================================================
> --- rhvgoyal-linux.orig/fs/ext4/inode.c	2019-08-28 13:51:16.051937204 -0400
> +++ rhvgoyal-linux/fs/ext4/inode.c	2019-08-28 13:51:44.453937204 -0400
> @@ -3589,7 +3589,11 @@ retry:
>  			WARN_ON_ONCE(1);
>  			return -EIO;
>  		}
> -		iomap->addr = (u64)map.m_pblk << blkbits;
> +		if (IS_DAX(inode))
> +			iomap->addr = ((u64)map.m_pblk << blkbits) +
> +				      (get_start_sect(iomap->bdev) * 512);
> +		else
> +			iomap->addr = (u64)map.m_pblk << blkbits;

I'm not a fan of returning a physical device sector address from an
interface where ever other user/caller expects this address to be a
logical block address into the block device. It creates a landmine
in the iomap API that callers may not be aware of and that's going
to cause bugs. We're trying really hard to keep special case hacks
like this out of the iomap infrastructure, so on those grounds alone
I'd suggest this is a dead end approach.

Hence I think that if the dax device needs a physical offset from
the start of the block device the filesystem sits on, it should be
set up at dax device instantiation time and so the filesystem/bdev
never needs to be queried again for this information.

Cheers,

Dave.
Dan Williams Aug. 29, 2019, 12:04 a.m. UTC | #6
On Wed, Aug 28, 2019 at 3:53 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Aug 28, 2019 at 01:58:43PM -0400, Vivek Goyal wrote:
> > On Tue, Aug 27, 2019 at 11:58:09PM -0700, Christoph Hellwig wrote:
> > > On Tue, Aug 27, 2019 at 12:38:28PM -0400, Vivek Goyal wrote:
> > > > > For bdev_dax_pgoff
> > > > > I'd much rather have the partition offset if there is on in the daxdev
> > > > > somehow so that we can get rid of the block device entirely.
> > > >
> > > > IIUC, there is one block_device per partition while there is only one
> > > > dax_device for the whole disk. So we can't directly move bdev logical
> > > > offset into dax_device.
> > >
> > > Well, then we need to find a way to get partitions for dax devices,
> > > as we really should not expect a block device hiding behind a dax
> > > dev.  That is just a weird legacy assumption - block device need to
> > > layer on top of the dax device optionally.
> > >
> > > >
> > > > We probably could put this in "iomap" and leave it to filesystems to
> > > > report offset into dax_dev in iomap that way dax generic code does not
> > > > have to deal with it. But that probably will be a bigger change.
> > >
> > > And where would the file system get that information from?
> >
> > File system knows about block device, can it just call get_start_sect()
> > while filling iomap->addr. And this means we don't have to have
> > parition information in dax device. Will something like following work?
> > (Just a proof of concept patch).
> >
> >
> > ---
> >  drivers/dax/super.c |   11 +++++++++++
> >  fs/dax.c            |    6 +++---
> >  fs/ext4/inode.c     |    6 +++++-
> >  include/linux/dax.h |    1 +
> >  4 files changed, 20 insertions(+), 4 deletions(-)
> >
> > Index: rhvgoyal-linux/fs/ext4/inode.c
> > ===================================================================
> > --- rhvgoyal-linux.orig/fs/ext4/inode.c       2019-08-28 13:51:16.051937204 -0400
> > +++ rhvgoyal-linux/fs/ext4/inode.c    2019-08-28 13:51:44.453937204 -0400
> > @@ -3589,7 +3589,11 @@ retry:
> >                       WARN_ON_ONCE(1);
> >                       return -EIO;
> >               }
> > -             iomap->addr = (u64)map.m_pblk << blkbits;
> > +             if (IS_DAX(inode))
> > +                     iomap->addr = ((u64)map.m_pblk << blkbits) +
> > +                                   (get_start_sect(iomap->bdev) * 512);
> > +             else
> > +                     iomap->addr = (u64)map.m_pblk << blkbits;
>
> I'm not a fan of returning a physical device sector address from an
> interface where ever other user/caller expects this address to be a
> logical block address into the block device. It creates a landmine
> in the iomap API that callers may not be aware of and that's going
> to cause bugs. We're trying really hard to keep special case hacks
> like this out of the iomap infrastructure, so on those grounds alone
> I'd suggest this is a dead end approach.
>
> Hence I think that if the dax device needs a physical offset from
> the start of the block device the filesystem sits on, it should be
> set up at dax device instantiation time and so the filesystem/bdev
> never needs to be queried again for this information.
>

Agree. In retrospect it was my laziness in the dax-device
implementation to expect the block-device to be available.

It looks like fs_dax_get_by_bdev() is an intercept point where a
dax_device could be dynamically created to represent the subset range
indicated by the block-device partition. That would open up more
cleanup opportunities.
Christoph Hellwig Aug. 29, 2019, 9:32 a.m. UTC | #7
On Wed, Aug 28, 2019 at 05:04:11PM -0700, Dan Williams wrote:
> Agree. In retrospect it was my laziness in the dax-device
> implementation to expect the block-device to be available.
> 
> It looks like fs_dax_get_by_bdev() is an intercept point where a
> dax_device could be dynamically created to represent the subset range
> indicated by the block-device partition. That would open up more
> cleanup opportunities.

That seems like a decent short-term plan.  But in the long I'd just let
dax call into the partition table parser directly, as we might want to
support partitions without first having to create a block device on top
of the dax device.  Same for the badblocks handling story.
Vivek Goyal Dec. 16, 2019, 6:10 p.m. UTC | #8
On Wed, Aug 28, 2019 at 05:04:11PM -0700, Dan Williams wrote:
> On Wed, Aug 28, 2019 at 3:53 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Wed, Aug 28, 2019 at 01:58:43PM -0400, Vivek Goyal wrote:
> > > On Tue, Aug 27, 2019 at 11:58:09PM -0700, Christoph Hellwig wrote:
> > > > On Tue, Aug 27, 2019 at 12:38:28PM -0400, Vivek Goyal wrote:
> > > > > > For bdev_dax_pgoff
> > > > > > I'd much rather have the partition offset if there is on in the daxdev
> > > > > > somehow so that we can get rid of the block device entirely.
> > > > >
> > > > > IIUC, there is one block_device per partition while there is only one
> > > > > dax_device for the whole disk. So we can't directly move bdev logical
> > > > > offset into dax_device.
> > > >
> > > > Well, then we need to find a way to get partitions for dax devices,
> > > > as we really should not expect a block device hiding behind a dax
> > > > dev.  That is just a weird legacy assumption - block device need to
> > > > layer on top of the dax device optionally.
> > > >
> > > > >
> > > > > We probably could put this in "iomap" and leave it to filesystems to
> > > > > report offset into dax_dev in iomap that way dax generic code does not
> > > > > have to deal with it. But that probably will be a bigger change.
> > > >
> > > > And where would the file system get that information from?
> > >
> > > File system knows about block device, can it just call get_start_sect()
> > > while filling iomap->addr. And this means we don't have to have
> > > parition information in dax device. Will something like following work?
> > > (Just a proof of concept patch).
> > >
> > >
> > > ---
> > >  drivers/dax/super.c |   11 +++++++++++
> > >  fs/dax.c            |    6 +++---
> > >  fs/ext4/inode.c     |    6 +++++-
> > >  include/linux/dax.h |    1 +
> > >  4 files changed, 20 insertions(+), 4 deletions(-)
> > >
> > > Index: rhvgoyal-linux/fs/ext4/inode.c
> > > ===================================================================
> > > --- rhvgoyal-linux.orig/fs/ext4/inode.c       2019-08-28 13:51:16.051937204 -0400
> > > +++ rhvgoyal-linux/fs/ext4/inode.c    2019-08-28 13:51:44.453937204 -0400
> > > @@ -3589,7 +3589,11 @@ retry:
> > >                       WARN_ON_ONCE(1);
> > >                       return -EIO;
> > >               }
> > > -             iomap->addr = (u64)map.m_pblk << blkbits;
> > > +             if (IS_DAX(inode))
> > > +                     iomap->addr = ((u64)map.m_pblk << blkbits) +
> > > +                                   (get_start_sect(iomap->bdev) * 512);
> > > +             else
> > > +                     iomap->addr = (u64)map.m_pblk << blkbits;
> >
> > I'm not a fan of returning a physical device sector address from an
> > interface where ever other user/caller expects this address to be a
> > logical block address into the block device. It creates a landmine
> > in the iomap API that callers may not be aware of and that's going
> > to cause bugs. We're trying really hard to keep special case hacks
> > like this out of the iomap infrastructure, so on those grounds alone
> > I'd suggest this is a dead end approach.
> >
> > Hence I think that if the dax device needs a physical offset from
> > the start of the block device the filesystem sits on, it should be
> > set up at dax device instantiation time and so the filesystem/bdev
> > never needs to be queried again for this information.
> >
> 
> Agree. In retrospect it was my laziness in the dax-device
> implementation to expect the block-device to be available.
> 
> It looks like fs_dax_get_by_bdev() is an intercept point where a
> dax_device could be dynamically created to represent the subset range
> indicated by the block-device partition. That would open up more
> cleanup opportunities.

Hi Dan,

After a long time I got time to look at it again. Want to work on this
cleanup so that I can make progress with virtiofs DAX paches.

I am not sure I understand the requirements fully. I see that right now
dax_device is created per device and all block partitions refer to it. If
we want to create one dax_device per partition, then it looks like this
will be structured more along the lines how block layer handles disk and
partitions. (One gendisk for disk and block_devices for partitions,
including partition 0). That probably means state belong to whole device
will be in common structure say dax_device_common, and per partition state
will be in dax_device and dax_device can carry a pointer to
dax_device_common.

I am also not sure what does it mean to partition dax devices. How will
partitions be exported to user space.

Thanks
Vivek
Christoph Hellwig Jan. 7, 2020, 12:51 p.m. UTC | #9
On Mon, Dec 16, 2019 at 01:10:14PM -0500, Vivek Goyal wrote:
> > Agree. In retrospect it was my laziness in the dax-device
> > implementation to expect the block-device to be available.
> > 
> > It looks like fs_dax_get_by_bdev() is an intercept point where a
> > dax_device could be dynamically created to represent the subset range
> > indicated by the block-device partition. That would open up more
> > cleanup opportunities.
> 
> Hi Dan,
> 
> After a long time I got time to look at it again. Want to work on this
> cleanup so that I can make progress with virtiofs DAX paches.
> 
> I am not sure I understand the requirements fully. I see that right now
> dax_device is created per device and all block partitions refer to it. If
> we want to create one dax_device per partition, then it looks like this
> will be structured more along the lines how block layer handles disk and
> partitions. (One gendisk for disk and block_devices for partitions,
> including partition 0). That probably means state belong to whole device
> will be in common structure say dax_device_common, and per partition state
> will be in dax_device and dax_device can carry a pointer to
> dax_device_common.
> 
> I am also not sure what does it mean to partition dax devices. How will
> partitions be exported to user space.

Dan, last time we talked you agreed that partitioned dax devices are
rather pointless IIRC.  Should we just deprecate partitions on DAX
devices and then remove them after a cycle or two?
Dan Williams Jan. 7, 2020, 2:22 p.m. UTC | #10
On Tue, Jan 7, 2020 at 4:52 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Mon, Dec 16, 2019 at 01:10:14PM -0500, Vivek Goyal wrote:
> > > Agree. In retrospect it was my laziness in the dax-device
> > > implementation to expect the block-device to be available.
> > >
> > > It looks like fs_dax_get_by_bdev() is an intercept point where a
> > > dax_device could be dynamically created to represent the subset range
> > > indicated by the block-device partition. That would open up more
> > > cleanup opportunities.
> >
> > Hi Dan,
> >
> > After a long time I got time to look at it again. Want to work on this
> > cleanup so that I can make progress with virtiofs DAX paches.
> >
> > I am not sure I understand the requirements fully. I see that right now
> > dax_device is created per device and all block partitions refer to it. If
> > we want to create one dax_device per partition, then it looks like this
> > will be structured more along the lines how block layer handles disk and
> > partitions. (One gendisk for disk and block_devices for partitions,
> > including partition 0). That probably means state belong to whole device
> > will be in common structure say dax_device_common, and per partition state
> > will be in dax_device and dax_device can carry a pointer to
> > dax_device_common.
> >
> > I am also not sure what does it mean to partition dax devices. How will
> > partitions be exported to user space.
>
> Dan, last time we talked you agreed that partitioned dax devices are
> rather pointless IIRC.  Should we just deprecate partitions on DAX
> devices and then remove them after a cycle or two?

That does seem a better plan than trying to force partition support
where it is not needed.
Darrick J. Wong Jan. 7, 2020, 5:07 p.m. UTC | #11
On Tue, Jan 07, 2020 at 06:22:54AM -0800, Dan Williams wrote:
> On Tue, Jan 7, 2020 at 4:52 AM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Mon, Dec 16, 2019 at 01:10:14PM -0500, Vivek Goyal wrote:
> > > > Agree. In retrospect it was my laziness in the dax-device
> > > > implementation to expect the block-device to be available.
> > > >
> > > > It looks like fs_dax_get_by_bdev() is an intercept point where a
> > > > dax_device could be dynamically created to represent the subset range
> > > > indicated by the block-device partition. That would open up more
> > > > cleanup opportunities.
> > >
> > > Hi Dan,
> > >
> > > After a long time I got time to look at it again. Want to work on this
> > > cleanup so that I can make progress with virtiofs DAX paches.
> > >
> > > I am not sure I understand the requirements fully. I see that right now
> > > dax_device is created per device and all block partitions refer to it. If
> > > we want to create one dax_device per partition, then it looks like this
> > > will be structured more along the lines how block layer handles disk and
> > > partitions. (One gendisk for disk and block_devices for partitions,
> > > including partition 0). That probably means state belong to whole device
> > > will be in common structure say dax_device_common, and per partition state
> > > will be in dax_device and dax_device can carry a pointer to
> > > dax_device_common.
> > >
> > > I am also not sure what does it mean to partition dax devices. How will
> > > partitions be exported to user space.
> >
> > Dan, last time we talked you agreed that partitioned dax devices are
> > rather pointless IIRC.  Should we just deprecate partitions on DAX
> > devices and then remove them after a cycle or two?
> 
> That does seem a better plan than trying to force partition support
> where it is not needed.

Question: if one /did/ have a partitioned DAX device and used kpartx to
create dm-linear devices for each partition, will DAX still work through
that?

--D
Dan Williams Jan. 7, 2020, 5:29 p.m. UTC | #12
On Tue, Jan 7, 2020 at 9:08 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> On Tue, Jan 07, 2020 at 06:22:54AM -0800, Dan Williams wrote:
> > On Tue, Jan 7, 2020 at 4:52 AM Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > On Mon, Dec 16, 2019 at 01:10:14PM -0500, Vivek Goyal wrote:
> > > > > Agree. In retrospect it was my laziness in the dax-device
> > > > > implementation to expect the block-device to be available.
> > > > >
> > > > > It looks like fs_dax_get_by_bdev() is an intercept point where a
> > > > > dax_device could be dynamically created to represent the subset range
> > > > > indicated by the block-device partition. That would open up more
> > > > > cleanup opportunities.
> > > >
> > > > Hi Dan,
> > > >
> > > > After a long time I got time to look at it again. Want to work on this
> > > > cleanup so that I can make progress with virtiofs DAX paches.
> > > >
> > > > I am not sure I understand the requirements fully. I see that right now
> > > > dax_device is created per device and all block partitions refer to it. If
> > > > we want to create one dax_device per partition, then it looks like this
> > > > will be structured more along the lines how block layer handles disk and
> > > > partitions. (One gendisk for disk and block_devices for partitions,
> > > > including partition 0). That probably means state belong to whole device
> > > > will be in common structure say dax_device_common, and per partition state
> > > > will be in dax_device and dax_device can carry a pointer to
> > > > dax_device_common.
> > > >
> > > > I am also not sure what does it mean to partition dax devices. How will
> > > > partitions be exported to user space.
> > >
> > > Dan, last time we talked you agreed that partitioned dax devices are
> > > rather pointless IIRC.  Should we just deprecate partitions on DAX
> > > devices and then remove them after a cycle or two?
> >
> > That does seem a better plan than trying to force partition support
> > where it is not needed.
>
> Question: if one /did/ have a partitioned DAX device and used kpartx to
> create dm-linear devices for each partition, will DAX still work through
> that?

The device-mapper support will continue, but it will be limited to
whole device sub-components. I.e. you could use kpartx to carve up
/dev/pmem0 and still have dax, but not partitions of /dev/pmem0.
Vivek Goyal Jan. 7, 2020, 6:01 p.m. UTC | #13
On Tue, Jan 07, 2020 at 09:29:17AM -0800, Dan Williams wrote:
> On Tue, Jan 7, 2020 at 9:08 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> >
> > On Tue, Jan 07, 2020 at 06:22:54AM -0800, Dan Williams wrote:
> > > On Tue, Jan 7, 2020 at 4:52 AM Christoph Hellwig <hch@infradead.org> wrote:
> > > >
> > > > On Mon, Dec 16, 2019 at 01:10:14PM -0500, Vivek Goyal wrote:
> > > > > > Agree. In retrospect it was my laziness in the dax-device
> > > > > > implementation to expect the block-device to be available.
> > > > > >
> > > > > > It looks like fs_dax_get_by_bdev() is an intercept point where a
> > > > > > dax_device could be dynamically created to represent the subset range
> > > > > > indicated by the block-device partition. That would open up more
> > > > > > cleanup opportunities.
> > > > >
> > > > > Hi Dan,
> > > > >
> > > > > After a long time I got time to look at it again. Want to work on this
> > > > > cleanup so that I can make progress with virtiofs DAX paches.
> > > > >
> > > > > I am not sure I understand the requirements fully. I see that right now
> > > > > dax_device is created per device and all block partitions refer to it. If
> > > > > we want to create one dax_device per partition, then it looks like this
> > > > > will be structured more along the lines how block layer handles disk and
> > > > > partitions. (One gendisk for disk and block_devices for partitions,
> > > > > including partition 0). That probably means state belong to whole device
> > > > > will be in common structure say dax_device_common, and per partition state
> > > > > will be in dax_device and dax_device can carry a pointer to
> > > > > dax_device_common.
> > > > >
> > > > > I am also not sure what does it mean to partition dax devices. How will
> > > > > partitions be exported to user space.
> > > >
> > > > Dan, last time we talked you agreed that partitioned dax devices are
> > > > rather pointless IIRC.  Should we just deprecate partitions on DAX
> > > > devices and then remove them after a cycle or two?
> > >
> > > That does seem a better plan than trying to force partition support
> > > where it is not needed.
> >
> > Question: if one /did/ have a partitioned DAX device and used kpartx to
> > create dm-linear devices for each partition, will DAX still work through
> > that?
> 
> The device-mapper support will continue, but it will be limited to
> whole device sub-components. I.e. you could use kpartx to carve up
> /dev/pmem0 and still have dax, but not partitions of /dev/pmem0.

So we can't use fdisk/parted to partition /dev/pmem0. Given /dev/pmem0
is a block device, I thought tools will expect it to be partitioned.
Sometimes I create those partitions and use /dev/pmem0. So what's
the replacement for this. People often have tools/scripts which might
want to partition the device and these will start failing. 

IOW, I do not understand that why being able to partition /dev/pmem0
(which is a block device from user space point of view), is pointless.

Thanks
Vivek
Dan Williams Jan. 7, 2020, 6:07 p.m. UTC | #14
On Tue, Jan 7, 2020 at 10:02 AM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Tue, Jan 07, 2020 at 09:29:17AM -0800, Dan Williams wrote:
> > On Tue, Jan 7, 2020 at 9:08 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > >
> > > On Tue, Jan 07, 2020 at 06:22:54AM -0800, Dan Williams wrote:
> > > > On Tue, Jan 7, 2020 at 4:52 AM Christoph Hellwig <hch@infradead.org> wrote:
> > > > >
> > > > > On Mon, Dec 16, 2019 at 01:10:14PM -0500, Vivek Goyal wrote:
> > > > > > > Agree. In retrospect it was my laziness in the dax-device
> > > > > > > implementation to expect the block-device to be available.
> > > > > > >
> > > > > > > It looks like fs_dax_get_by_bdev() is an intercept point where a
> > > > > > > dax_device could be dynamically created to represent the subset range
> > > > > > > indicated by the block-device partition. That would open up more
> > > > > > > cleanup opportunities.
> > > > > >
> > > > > > Hi Dan,
> > > > > >
> > > > > > After a long time I got time to look at it again. Want to work on this
> > > > > > cleanup so that I can make progress with virtiofs DAX paches.
> > > > > >
> > > > > > I am not sure I understand the requirements fully. I see that right now
> > > > > > dax_device is created per device and all block partitions refer to it. If
> > > > > > we want to create one dax_device per partition, then it looks like this
> > > > > > will be structured more along the lines how block layer handles disk and
> > > > > > partitions. (One gendisk for disk and block_devices for partitions,
> > > > > > including partition 0). That probably means state belong to whole device
> > > > > > will be in common structure say dax_device_common, and per partition state
> > > > > > will be in dax_device and dax_device can carry a pointer to
> > > > > > dax_device_common.
> > > > > >
> > > > > > I am also not sure what does it mean to partition dax devices. How will
> > > > > > partitions be exported to user space.
> > > > >
> > > > > Dan, last time we talked you agreed that partitioned dax devices are
> > > > > rather pointless IIRC.  Should we just deprecate partitions on DAX
> > > > > devices and then remove them after a cycle or two?
> > > >
> > > > That does seem a better plan than trying to force partition support
> > > > where it is not needed.
> > >
> > > Question: if one /did/ have a partitioned DAX device and used kpartx to
> > > create dm-linear devices for each partition, will DAX still work through
> > > that?
> >
> > The device-mapper support will continue, but it will be limited to
> > whole device sub-components. I.e. you could use kpartx to carve up
> > /dev/pmem0 and still have dax, but not partitions of /dev/pmem0.
>
> So we can't use fdisk/parted to partition /dev/pmem0. Given /dev/pmem0
> is a block device, I thought tools will expect it to be partitioned.
> Sometimes I create those partitions and use /dev/pmem0. So what's
> the replacement for this. People often have tools/scripts which might
> want to partition the device and these will start failing.

Partitioning will still work, but dax operation will be declined and
fall back to page-cache.

> IOW, I do not understand that why being able to partition /dev/pmem0
> (which is a block device from user space point of view), is pointless.

How about s/pointless/redundant/. Persistent memory can already be
"partitioned" via namespace boundaries. Block device partitioning is
then redundant and needlessly complicates, as you have found, the
kernel implementation.

The problem will be people that were on dax+ext4 on partitions. Those
people will see a hard failure at mount whereas XFS will fallback to
page cache with a warning in the log. I think ext4 must convert to the
xfs dax handling model before partition support is dropped.
Vivek Goyal Jan. 7, 2020, 6:33 p.m. UTC | #15
On Tue, Jan 07, 2020 at 10:07:18AM -0800, Dan Williams wrote:
> On Tue, Jan 7, 2020 at 10:02 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Tue, Jan 07, 2020 at 09:29:17AM -0800, Dan Williams wrote:
> > > On Tue, Jan 7, 2020 at 9:08 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > > >
> > > > On Tue, Jan 07, 2020 at 06:22:54AM -0800, Dan Williams wrote:
> > > > > On Tue, Jan 7, 2020 at 4:52 AM Christoph Hellwig <hch@infradead.org> wrote:
> > > > > >
> > > > > > On Mon, Dec 16, 2019 at 01:10:14PM -0500, Vivek Goyal wrote:
> > > > > > > > Agree. In retrospect it was my laziness in the dax-device
> > > > > > > > implementation to expect the block-device to be available.
> > > > > > > >
> > > > > > > > It looks like fs_dax_get_by_bdev() is an intercept point where a
> > > > > > > > dax_device could be dynamically created to represent the subset range
> > > > > > > > indicated by the block-device partition. That would open up more
> > > > > > > > cleanup opportunities.
> > > > > > >
> > > > > > > Hi Dan,
> > > > > > >
> > > > > > > After a long time I got time to look at it again. Want to work on this
> > > > > > > cleanup so that I can make progress with virtiofs DAX paches.
> > > > > > >
> > > > > > > I am not sure I understand the requirements fully. I see that right now
> > > > > > > dax_device is created per device and all block partitions refer to it. If
> > > > > > > we want to create one dax_device per partition, then it looks like this
> > > > > > > will be structured more along the lines how block layer handles disk and
> > > > > > > partitions. (One gendisk for disk and block_devices for partitions,
> > > > > > > including partition 0). That probably means state belong to whole device
> > > > > > > will be in common structure say dax_device_common, and per partition state
> > > > > > > will be in dax_device and dax_device can carry a pointer to
> > > > > > > dax_device_common.
> > > > > > >
> > > > > > > I am also not sure what does it mean to partition dax devices. How will
> > > > > > > partitions be exported to user space.
> > > > > >
> > > > > > Dan, last time we talked you agreed that partitioned dax devices are
> > > > > > rather pointless IIRC.  Should we just deprecate partitions on DAX
> > > > > > devices and then remove them after a cycle or two?
> > > > >
> > > > > That does seem a better plan than trying to force partition support
> > > > > where it is not needed.
> > > >
> > > > Question: if one /did/ have a partitioned DAX device and used kpartx to
> > > > create dm-linear devices for each partition, will DAX still work through
> > > > that?
> > >
> > > The device-mapper support will continue, but it will be limited to
> > > whole device sub-components. I.e. you could use kpartx to carve up
> > > /dev/pmem0 and still have dax, but not partitions of /dev/pmem0.
> >
> > So we can't use fdisk/parted to partition /dev/pmem0. Given /dev/pmem0
> > is a block device, I thought tools will expect it to be partitioned.
> > Sometimes I create those partitions and use /dev/pmem0. So what's
> > the replacement for this. People often have tools/scripts which might
> > want to partition the device and these will start failing.
> 
> Partitioning will still work, but dax operation will be declined and
> fall back to page-cache.

Ok, so if I mount /dev/pmem0p1 with dax enabled, that might fail or
filesystem will fall back to using page cache. (But dax will not be
enabled).

> 
> > IOW, I do not understand that why being able to partition /dev/pmem0
> > (which is a block device from user space point of view), is pointless.
> 
> How about s/pointless/redundant/. Persistent memory can already be
> "partitioned" via namespace boundaries.

But that's an entirely different way of partitioning. To me being able
to use block devices (with dax capability) in same way as any other
block device makes sense.

> Block device partitioning is
> then redundant and needlessly complicates, as you have found, the
> kernel implementation.

It does complicate kernel implementation. Is it too hard to solve the
problem in kernel.

W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where
dax code refers back to block device to figure out partition offset in
dax device. If we create a dax object corresponding to "struct block_device"
and store sector offset in that, then we could pass that object to dax
code and not worry about referring back to bdev. I have written some
proof of concept code and called that object "dax_handle". I can post
that code if there is interest.

IMHO, it feels useful to be able to partition and use a dax capable
block device in same way as non-dax block device. It will be really
odd to think that if filesystem is on /dev/pmem0p1, then dax can't
be enabled but if filesystem is on /dev/mapper/pmem0p1, then dax
will work.

Thanks
Vivek

> 
> The problem will be people that were on dax+ext4 on partitions. Those
> people will see a hard failure at mount whereas XFS will fallback to
> page cache with a warning in the log. I think ext4 must convert to the
> xfs dax handling model before partition support is dropped.
>
Dan Williams Jan. 7, 2020, 6:49 p.m. UTC | #16
On Tue, Jan 7, 2020 at 10:33 AM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Tue, Jan 07, 2020 at 10:07:18AM -0800, Dan Williams wrote:
> > On Tue, Jan 7, 2020 at 10:02 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Tue, Jan 07, 2020 at 09:29:17AM -0800, Dan Williams wrote:
> > > > On Tue, Jan 7, 2020 at 9:08 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > > > >
> > > > > On Tue, Jan 07, 2020 at 06:22:54AM -0800, Dan Williams wrote:
> > > > > > On Tue, Jan 7, 2020 at 4:52 AM Christoph Hellwig <hch@infradead.org> wrote:
> > > > > > >
> > > > > > > On Mon, Dec 16, 2019 at 01:10:14PM -0500, Vivek Goyal wrote:
> > > > > > > > > Agree. In retrospect it was my laziness in the dax-device
> > > > > > > > > implementation to expect the block-device to be available.
> > > > > > > > >
> > > > > > > > > It looks like fs_dax_get_by_bdev() is an intercept point where a
> > > > > > > > > dax_device could be dynamically created to represent the subset range
> > > > > > > > > indicated by the block-device partition. That would open up more
> > > > > > > > > cleanup opportunities.
> > > > > > > >
> > > > > > > > Hi Dan,
> > > > > > > >
> > > > > > > > After a long time I got time to look at it again. Want to work on this
> > > > > > > > cleanup so that I can make progress with virtiofs DAX paches.
> > > > > > > >
> > > > > > > > I am not sure I understand the requirements fully. I see that right now
> > > > > > > > dax_device is created per device and all block partitions refer to it. If
> > > > > > > > we want to create one dax_device per partition, then it looks like this
> > > > > > > > will be structured more along the lines how block layer handles disk and
> > > > > > > > partitions. (One gendisk for disk and block_devices for partitions,
> > > > > > > > including partition 0). That probably means state belong to whole device
> > > > > > > > will be in common structure say dax_device_common, and per partition state
> > > > > > > > will be in dax_device and dax_device can carry a pointer to
> > > > > > > > dax_device_common.
> > > > > > > >
> > > > > > > > I am also not sure what does it mean to partition dax devices. How will
> > > > > > > > partitions be exported to user space.
> > > > > > >
> > > > > > > Dan, last time we talked you agreed that partitioned dax devices are
> > > > > > > rather pointless IIRC.  Should we just deprecate partitions on DAX
> > > > > > > devices and then remove them after a cycle or two?
> > > > > >
> > > > > > That does seem a better plan than trying to force partition support
> > > > > > where it is not needed.
> > > > >
> > > > > Question: if one /did/ have a partitioned DAX device and used kpartx to
> > > > > create dm-linear devices for each partition, will DAX still work through
> > > > > that?
> > > >
> > > > The device-mapper support will continue, but it will be limited to
> > > > whole device sub-components. I.e. you could use kpartx to carve up
> > > > /dev/pmem0 and still have dax, but not partitions of /dev/pmem0.
> > >
> > > So we can't use fdisk/parted to partition /dev/pmem0. Given /dev/pmem0
> > > is a block device, I thought tools will expect it to be partitioned.
> > > Sometimes I create those partitions and use /dev/pmem0. So what's
> > > the replacement for this. People often have tools/scripts which might
> > > want to partition the device and these will start failing.
> >
> > Partitioning will still work, but dax operation will be declined and
> > fall back to page-cache.
>
> Ok, so if I mount /dev/pmem0p1 with dax enabled, that might fail or
> filesystem will fall back to using page cache. (But dax will not be
> enabled).
>
> >
> > > IOW, I do not understand that why being able to partition /dev/pmem0
> > > (which is a block device from user space point of view), is pointless.
> >
> > How about s/pointless/redundant/. Persistent memory can already be
> > "partitioned" via namespace boundaries.
>
> But that's an entirely different way of partitioning. To me being able
> to use block devices (with dax capability) in same way as any other
> block device makes sense.
>
> > Block device partitioning is
> > then redundant and needlessly complicates, as you have found, the
> > kernel implementation.
>
> It does complicate kernel implementation. Is it too hard to solve the
> problem in kernel.
>
> W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where
> dax code refers back to block device to figure out partition offset in
> dax device. If we create a dax object corresponding to "struct block_device"
> and store sector offset in that, then we could pass that object to dax
> code and not worry about referring back to bdev. I have written some
> proof of concept code and called that object "dax_handle". I can post
> that code if there is interest.

I don't think it's worth it in the end especially considering
filesystems are looking to operate on /dev/dax devices directly and
remove block entanglements entirely.

> IMHO, it feels useful to be able to partition and use a dax capable
> block device in same way as non-dax block device. It will be really
> odd to think that if filesystem is on /dev/pmem0p1, then dax can't
> be enabled but if filesystem is on /dev/mapper/pmem0p1, then dax
> will work.

That can already happen today. If you do not properly align the
partition then dax operations will be disabled. This proposal just
extends that existing failure domain to make all partitions fail to
support dax.
Darrick J. Wong Jan. 7, 2020, 7:02 p.m. UTC | #17
On Tue, Jan 07, 2020 at 10:49:55AM -0800, Dan Williams wrote:
> On Tue, Jan 7, 2020 at 10:33 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Tue, Jan 07, 2020 at 10:07:18AM -0800, Dan Williams wrote:
> > > On Tue, Jan 7, 2020 at 10:02 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > >
> > > > On Tue, Jan 07, 2020 at 09:29:17AM -0800, Dan Williams wrote:
> > > > > On Tue, Jan 7, 2020 at 9:08 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > > > > >
> > > > > > On Tue, Jan 07, 2020 at 06:22:54AM -0800, Dan Williams wrote:
> > > > > > > On Tue, Jan 7, 2020 at 4:52 AM Christoph Hellwig <hch@infradead.org> wrote:
> > > > > > > >
> > > > > > > > On Mon, Dec 16, 2019 at 01:10:14PM -0500, Vivek Goyal wrote:
> > > > > > > > > > Agree. In retrospect it was my laziness in the dax-device
> > > > > > > > > > implementation to expect the block-device to be available.
> > > > > > > > > >
> > > > > > > > > > It looks like fs_dax_get_by_bdev() is an intercept point where a
> > > > > > > > > > dax_device could be dynamically created to represent the subset range
> > > > > > > > > > indicated by the block-device partition. That would open up more
> > > > > > > > > > cleanup opportunities.
> > > > > > > > >
> > > > > > > > > Hi Dan,
> > > > > > > > >
> > > > > > > > > After a long time I got time to look at it again. Want to work on this
> > > > > > > > > cleanup so that I can make progress with virtiofs DAX paches.
> > > > > > > > >
> > > > > > > > > I am not sure I understand the requirements fully. I see that right now
> > > > > > > > > dax_device is created per device and all block partitions refer to it. If
> > > > > > > > > we want to create one dax_device per partition, then it looks like this
> > > > > > > > > will be structured more along the lines how block layer handles disk and
> > > > > > > > > partitions. (One gendisk for disk and block_devices for partitions,
> > > > > > > > > including partition 0). That probably means state belong to whole device
> > > > > > > > > will be in common structure say dax_device_common, and per partition state
> > > > > > > > > will be in dax_device and dax_device can carry a pointer to
> > > > > > > > > dax_device_common.
> > > > > > > > >
> > > > > > > > > I am also not sure what does it mean to partition dax devices. How will
> > > > > > > > > partitions be exported to user space.
> > > > > > > >
> > > > > > > > Dan, last time we talked you agreed that partitioned dax devices are
> > > > > > > > rather pointless IIRC.  Should we just deprecate partitions on DAX
> > > > > > > > devices and then remove them after a cycle or two?
> > > > > > >
> > > > > > > That does seem a better plan than trying to force partition support
> > > > > > > where it is not needed.
> > > > > >
> > > > > > Question: if one /did/ have a partitioned DAX device and used kpartx to
> > > > > > create dm-linear devices for each partition, will DAX still work through
> > > > > > that?
> > > > >
> > > > > The device-mapper support will continue, but it will be limited to
> > > > > whole device sub-components. I.e. you could use kpartx to carve up
> > > > > /dev/pmem0 and still have dax, but not partitions of /dev/pmem0.
> > > >
> > > > So we can't use fdisk/parted to partition /dev/pmem0. Given /dev/pmem0
> > > > is a block device, I thought tools will expect it to be partitioned.
> > > > Sometimes I create those partitions and use /dev/pmem0. So what's
> > > > the replacement for this. People often have tools/scripts which might
> > > > want to partition the device and these will start failing.
> > >
> > > Partitioning will still work, but dax operation will be declined and
> > > fall back to page-cache.
> >
> > Ok, so if I mount /dev/pmem0p1 with dax enabled, that might fail or
> > filesystem will fall back to using page cache. (But dax will not be
> > enabled).
> >
> > >
> > > > IOW, I do not understand that why being able to partition /dev/pmem0
> > > > (which is a block device from user space point of view), is pointless.
> > >
> > > How about s/pointless/redundant/. Persistent memory can already be
> > > "partitioned" via namespace boundaries.
> >
> > But that's an entirely different way of partitioning. To me being able
> > to use block devices (with dax capability) in same way as any other
> > block device makes sense.
> >
> > > Block device partitioning is
> > > then redundant and needlessly complicates, as you have found, the
> > > kernel implementation.
> >
> > It does complicate kernel implementation. Is it too hard to solve the
> > problem in kernel.
> >
> > W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where
> > dax code refers back to block device to figure out partition offset in
> > dax device. If we create a dax object corresponding to "struct block_device"
> > and store sector offset in that, then we could pass that object to dax
> > code and not worry about referring back to bdev. I have written some
> > proof of concept code and called that object "dax_handle". I can post
> > that code if there is interest.
> 
> I don't think it's worth it in the end especially considering
> filesystems are looking to operate on /dev/dax devices directly and
> remove block entanglements entirely.
> 
> > IMHO, it feels useful to be able to partition and use a dax capable
> > block device in same way as non-dax block device. It will be really
> > odd to think that if filesystem is on /dev/pmem0p1, then dax can't
> > be enabled but if filesystem is on /dev/mapper/pmem0p1, then dax
> > will work.
> 
> That can already happen today. If you do not properly align the
> partition then dax operations will be disabled.

Er... is this conversation getting confused?  I was talking about
kpartx's /dev/mapper/pmem0p1 being a straight replacement for the kernel
creating /dev/pmem0p1.  I thnk Vivek was complaining about the
inconsistent behavior between the two, even if the partition is aligned
properly.

I'm not sure how alignment leaked in here?

> This proposal just
> extends that existing failure domain to make all partitions fail to
> support dax.

Oh, wait.  You're proposing that "partitions of pmem devices don't
support DAX", not "the kernel will not create partitions for pmem
devices".

Yeah, that would be inconsistent and weird.  I'd say deprecate the
kernel automounting partitions, but I guess it already does that, and
removing it would break /something/.  I guess you could put
"/dev/pmemXpY" on the deprecation schedule.

--D
Dan Williams Jan. 7, 2020, 7:46 p.m. UTC | #18
On Tue, Jan 7, 2020 at 11:03 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
[..]
> > That can already happen today. If you do not properly align the
> > partition then dax operations will be disabled.
>
> Er... is this conversation getting confused?  I was talking about
> kpartx's /dev/mapper/pmem0p1 being a straight replacement for the kernel
> creating /dev/pmem0p1.  I thnk Vivek was complaining about the
> inconsistent behavior between the two, even if the partition is aligned
> properly.
>
> I'm not sure how alignment leaked in here?

Oh, whoops, I was jumping to the mismatch between host device and
partition and whether we had precedent to fail to support dax on the
partition when the base block device does support it.

But yes, the mismatch between kpartx and native partitions is weird.
That said kpartx is there to add partition support where the kernel
for whatever reason fails to, or chooses not to, and dax is looking
like such a place.

> > This proposal just
> > extends that existing failure domain to make all partitions fail to
> > support dax.
>
> Oh, wait.  You're proposing that "partitions of pmem devices don't
> support DAX", not "the kernel will not create partitions for pmem
> devices".
>
> Yeah, that would be inconsistent and weird.

More weird than the current constraints?

> I'd say deprecate the
> kernel automounting partitions, but I guess it already does that, and

Ok, now I don't know why automounting is leaking into this discussion?

> removing it would break /something/.

Yes, the breakage risk is anyone that was using ext4 mount failure as
a dax capability detector.

> I guess you could put
> "/dev/pmemXpY" on the deprecation schedule.

...but why deprecate /dev/pmemXpY partitions altogether? If someone
doesn't care about dax then they can do all the legacy block things.
If they do care about dax then work with whole device namespaces.

The proposal is to detect dax on partitions and warn people to move to
kpartx. Let the core fs/dax implementation continue to shed block
dependencies.
Dan Williams Jan. 7, 2020, 11:38 p.m. UTC | #19
On Tue, Jan 7, 2020 at 11:46 AM Dan Williams <dan.j.williams@intel.com> wrote:
[..]
> > I'd say deprecate the
> > kernel automounting partitions, but I guess it already does that, and
>
> Ok, now I don't know why automounting is leaking into this discussion?
>
> > removing it would break /something/.
>
> Yes, the breakage risk is anyone that was using ext4 mount failure as
> a dax capability detector.
>
> > I guess you could put
> > "/dev/pmemXpY" on the deprecation schedule.
>
> ...but why deprecate /dev/pmemXpY partitions altogether? If someone
> doesn't care about dax then they can do all the legacy block things.
> If they do care about dax then work with whole device namespaces.

Circling back on this point now that I understand what you meant by
automount. It would need to be a full deprecation of /dev/pmemXpY
devices if kpartx dax support is going to fully take over for people
that want to use disk partition tables instead of EFI Namespace Labels
to carve up pmem.
Jan Kara Jan. 9, 2020, 11:24 a.m. UTC | #20
On Tue 07-01-20 10:49:55, Dan Williams wrote:
> On Tue, Jan 7, 2020 at 10:33 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where
> > dax code refers back to block device to figure out partition offset in
> > dax device. If we create a dax object corresponding to "struct block_device"
> > and store sector offset in that, then we could pass that object to dax
> > code and not worry about referring back to bdev. I have written some
> > proof of concept code and called that object "dax_handle". I can post
> > that code if there is interest.
> 
> I don't think it's worth it in the end especially considering
> filesystems are looking to operate on /dev/dax devices directly and
> remove block entanglements entirely.
> 
> > IMHO, it feels useful to be able to partition and use a dax capable
> > block device in same way as non-dax block device. It will be really
> > odd to think that if filesystem is on /dev/pmem0p1, then dax can't
> > be enabled but if filesystem is on /dev/mapper/pmem0p1, then dax
> > will work.
> 
> That can already happen today. If you do not properly align the
> partition then dax operations will be disabled. This proposal just
> extends that existing failure domain to make all partitions fail to
> support dax.

Well, I have some sympathy with the sysadmin that has /dev/pmem0 device,
decides to create partitions on it for whatever (possibly misguided)
reason and then ponders why the hell DAX is not working? And PAGE_SIZE
partition alignment is so obvious and widespread that I don't count it as a
realistic error case sysadmins would be pondering about currently.

So I'd find two options reasonably consistent:
1) Keep status quo where partitions are created and support DAX.
2) Stop partition creation altogether, if anyones wants to split pmem
device further, he can use dm-linear for that (i.e., kpartx).

But I'm not sure if the ship hasn't already sailed for option 2) to be
feasible without angry users and Linus reverting the change.

								Honza
Dan Williams Jan. 9, 2020, 8:03 p.m. UTC | #21
On Thu, Jan 9, 2020 at 3:27 AM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 07-01-20 10:49:55, Dan Williams wrote:
> > On Tue, Jan 7, 2020 at 10:33 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where
> > > dax code refers back to block device to figure out partition offset in
> > > dax device. If we create a dax object corresponding to "struct block_device"
> > > and store sector offset in that, then we could pass that object to dax
> > > code and not worry about referring back to bdev. I have written some
> > > proof of concept code and called that object "dax_handle". I can post
> > > that code if there is interest.
> >
> > I don't think it's worth it in the end especially considering
> > filesystems are looking to operate on /dev/dax devices directly and
> > remove block entanglements entirely.
> >
> > > IMHO, it feels useful to be able to partition and use a dax capable
> > > block device in same way as non-dax block device. It will be really
> > > odd to think that if filesystem is on /dev/pmem0p1, then dax can't
> > > be enabled but if filesystem is on /dev/mapper/pmem0p1, then dax
> > > will work.
> >
> > That can already happen today. If you do not properly align the
> > partition then dax operations will be disabled. This proposal just
> > extends that existing failure domain to make all partitions fail to
> > support dax.
>
> Well, I have some sympathy with the sysadmin that has /dev/pmem0 device,
> decides to create partitions on it for whatever (possibly misguided)
> reason and then ponders why the hell DAX is not working? And PAGE_SIZE
> partition alignment is so obvious and widespread that I don't count it as a
> realistic error case sysadmins would be pondering about currently.
>
> So I'd find two options reasonably consistent:
> 1) Keep status quo where partitions are created and support DAX.
> 2) Stop partition creation altogether, if anyones wants to split pmem
> device further, he can use dm-linear for that (i.e., kpartx).
>
> But I'm not sure if the ship hasn't already sailed for option 2) to be
> feasible without angry users and Linus reverting the change.

Christoph? I feel myself leaning more and more to the "keep pmem
partitions" camp.

I don't see "drop partition support" effort ending well given the long
standing "ext4 fails to mount when dax is not available" precedent.

I think the next least bad option is to have a dax_get_by_host()
variant that passes an offset and length pair rather than requiring a
later bdev_dax_pgoff() to recall the offset. This also prevents
needing to add another dax-device object representation.
Christoph Hellwig Jan. 10, 2020, 12:36 p.m. UTC | #22
On Thu, Jan 09, 2020 at 12:03:01PM -0800, Dan Williams wrote:
> > So I'd find two options reasonably consistent:
> > 1) Keep status quo where partitions are created and support DAX.
> > 2) Stop partition creation altogether, if anyones wants to split pmem
> > device further, he can use dm-linear for that (i.e., kpartx).
> >
> > But I'm not sure if the ship hasn't already sailed for option 2) to be
> > feasible without angry users and Linus reverting the change.
> 
> Christoph? I feel myself leaning more and more to the "keep pmem
> partitions" camp.
> 
> I don't see "drop partition support" effort ending well given the long
> standing "ext4 fails to mount when dax is not available" precedent.

Do we have any evidence of existing setups with DAX and partitions?
Can we just throw in a patch to reject that case for now before actually
removing the code and see if anyone screams.  And fix ext4 up while
we are at it.

> I think the next least bad option is to have a dax_get_by_host()
> variant that passes an offset and length pair rather than requiring a
> later bdev_dax_pgoff() to recall the offset. This also prevents
> needing to add another dax-device object representation.

IFF we have to keep partition support, yes.  But keeping it just seems
like a really bad idea.
Vivek Goyal Jan. 14, 2020, 8:31 p.m. UTC | #23
On Thu, Jan 09, 2020 at 12:03:01PM -0800, Dan Williams wrote:
> On Thu, Jan 9, 2020 at 3:27 AM Jan Kara <jack@suse.cz> wrote:
> >
> > On Tue 07-01-20 10:49:55, Dan Williams wrote:
> > > On Tue, Jan 7, 2020 at 10:33 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where
> > > > dax code refers back to block device to figure out partition offset in
> > > > dax device. If we create a dax object corresponding to "struct block_device"
> > > > and store sector offset in that, then we could pass that object to dax
> > > > code and not worry about referring back to bdev. I have written some
> > > > proof of concept code and called that object "dax_handle". I can post
> > > > that code if there is interest.
> > >
> > > I don't think it's worth it in the end especially considering
> > > filesystems are looking to operate on /dev/dax devices directly and
> > > remove block entanglements entirely.
> > >
> > > > IMHO, it feels useful to be able to partition and use a dax capable
> > > > block device in same way as non-dax block device. It will be really
> > > > odd to think that if filesystem is on /dev/pmem0p1, then dax can't
> > > > be enabled but if filesystem is on /dev/mapper/pmem0p1, then dax
> > > > will work.
> > >
> > > That can already happen today. If you do not properly align the
> > > partition then dax operations will be disabled. This proposal just
> > > extends that existing failure domain to make all partitions fail to
> > > support dax.
> >
> > Well, I have some sympathy with the sysadmin that has /dev/pmem0 device,
> > decides to create partitions on it for whatever (possibly misguided)
> > reason and then ponders why the hell DAX is not working? And PAGE_SIZE
> > partition alignment is so obvious and widespread that I don't count it as a
> > realistic error case sysadmins would be pondering about currently.
> >
> > So I'd find two options reasonably consistent:
> > 1) Keep status quo where partitions are created and support DAX.
> > 2) Stop partition creation altogether, if anyones wants to split pmem
> > device further, he can use dm-linear for that (i.e., kpartx).
> >
> > But I'm not sure if the ship hasn't already sailed for option 2) to be
> > feasible without angry users and Linus reverting the change.
> 
> Christoph? I feel myself leaning more and more to the "keep pmem
> partitions" camp.
> 
> I don't see "drop partition support" effort ending well given the long
> standing "ext4 fails to mount when dax is not available" precedent.
> 
> I think the next least bad option is to have a dax_get_by_host()
> variant that passes an offset and length pair rather than requiring a
> later bdev_dax_pgoff() to recall the offset. This also prevents
> needing to add another dax-device object representation.

I am wondering what's the conclusion on this. I want to this to make
progress in some direction so that I can make progress on virtiofs DAX
support.

Thanks
Vivek
Dan Williams Jan. 14, 2020, 8:39 p.m. UTC | #24
On Tue, Jan 14, 2020 at 12:31 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Thu, Jan 09, 2020 at 12:03:01PM -0800, Dan Williams wrote:
> > On Thu, Jan 9, 2020 at 3:27 AM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Tue 07-01-20 10:49:55, Dan Williams wrote:
> > > > On Tue, Jan 7, 2020 at 10:33 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > > W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where
> > > > > dax code refers back to block device to figure out partition offset in
> > > > > dax device. If we create a dax object corresponding to "struct block_device"
> > > > > and store sector offset in that, then we could pass that object to dax
> > > > > code and not worry about referring back to bdev. I have written some
> > > > > proof of concept code and called that object "dax_handle". I can post
> > > > > that code if there is interest.
> > > >
> > > > I don't think it's worth it in the end especially considering
> > > > filesystems are looking to operate on /dev/dax devices directly and
> > > > remove block entanglements entirely.
> > > >
> > > > > IMHO, it feels useful to be able to partition and use a dax capable
> > > > > block device in same way as non-dax block device. It will be really
> > > > > odd to think that if filesystem is on /dev/pmem0p1, then dax can't
> > > > > be enabled but if filesystem is on /dev/mapper/pmem0p1, then dax
> > > > > will work.
> > > >
> > > > That can already happen today. If you do not properly align the
> > > > partition then dax operations will be disabled. This proposal just
> > > > extends that existing failure domain to make all partitions fail to
> > > > support dax.
> > >
> > > Well, I have some sympathy with the sysadmin that has /dev/pmem0 device,
> > > decides to create partitions on it for whatever (possibly misguided)
> > > reason and then ponders why the hell DAX is not working? And PAGE_SIZE
> > > partition alignment is so obvious and widespread that I don't count it as a
> > > realistic error case sysadmins would be pondering about currently.
> > >
> > > So I'd find two options reasonably consistent:
> > > 1) Keep status quo where partitions are created and support DAX.
> > > 2) Stop partition creation altogether, if anyones wants to split pmem
> > > device further, he can use dm-linear for that (i.e., kpartx).
> > >
> > > But I'm not sure if the ship hasn't already sailed for option 2) to be
> > > feasible without angry users and Linus reverting the change.
> >
> > Christoph? I feel myself leaning more and more to the "keep pmem
> > partitions" camp.
> >
> > I don't see "drop partition support" effort ending well given the long
> > standing "ext4 fails to mount when dax is not available" precedent.
> >
> > I think the next least bad option is to have a dax_get_by_host()
> > variant that passes an offset and length pair rather than requiring a
> > later bdev_dax_pgoff() to recall the offset. This also prevents
> > needing to add another dax-device object representation.
>
> I am wondering what's the conclusion on this. I want to this to make
> progress in some direction so that I can make progress on virtiofs DAX
> support.

I think we should at least try to delete the partition support and see
if anyone screams. Have a module option to revert the behavior so
people are not stuck waiting for the revert to land, but if it stays
quiet then we're in a better place with that support pushed out of the
dax core.
Vivek Goyal Jan. 14, 2020, 9:28 p.m. UTC | #25
On Tue, Jan 14, 2020 at 12:39:00PM -0800, Dan Williams wrote:
> On Tue, Jan 14, 2020 at 12:31 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Thu, Jan 09, 2020 at 12:03:01PM -0800, Dan Williams wrote:
> > > On Thu, Jan 9, 2020 at 3:27 AM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > On Tue 07-01-20 10:49:55, Dan Williams wrote:
> > > > > On Tue, Jan 7, 2020 at 10:33 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > > > W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where
> > > > > > dax code refers back to block device to figure out partition offset in
> > > > > > dax device. If we create a dax object corresponding to "struct block_device"
> > > > > > and store sector offset in that, then we could pass that object to dax
> > > > > > code and not worry about referring back to bdev. I have written some
> > > > > > proof of concept code and called that object "dax_handle". I can post
> > > > > > that code if there is interest.
> > > > >
> > > > > I don't think it's worth it in the end especially considering
> > > > > filesystems are looking to operate on /dev/dax devices directly and
> > > > > remove block entanglements entirely.
> > > > >
> > > > > > IMHO, it feels useful to be able to partition and use a dax capable
> > > > > > block device in same way as non-dax block device. It will be really
> > > > > > odd to think that if filesystem is on /dev/pmem0p1, then dax can't
> > > > > > be enabled but if filesystem is on /dev/mapper/pmem0p1, then dax
> > > > > > will work.
> > > > >
> > > > > That can already happen today. If you do not properly align the
> > > > > partition then dax operations will be disabled. This proposal just
> > > > > extends that existing failure domain to make all partitions fail to
> > > > > support dax.
> > > >
> > > > Well, I have some sympathy with the sysadmin that has /dev/pmem0 device,
> > > > decides to create partitions on it for whatever (possibly misguided)
> > > > reason and then ponders why the hell DAX is not working? And PAGE_SIZE
> > > > partition alignment is so obvious and widespread that I don't count it as a
> > > > realistic error case sysadmins would be pondering about currently.
> > > >
> > > > So I'd find two options reasonably consistent:
> > > > 1) Keep status quo where partitions are created and support DAX.
> > > > 2) Stop partition creation altogether, if anyones wants to split pmem
> > > > device further, he can use dm-linear for that (i.e., kpartx).
> > > >
> > > > But I'm not sure if the ship hasn't already sailed for option 2) to be
> > > > feasible without angry users and Linus reverting the change.
> > >
> > > Christoph? I feel myself leaning more and more to the "keep pmem
> > > partitions" camp.
> > >
> > > I don't see "drop partition support" effort ending well given the long
> > > standing "ext4 fails to mount when dax is not available" precedent.
> > >
> > > I think the next least bad option is to have a dax_get_by_host()
> > > variant that passes an offset and length pair rather than requiring a
> > > later bdev_dax_pgoff() to recall the offset. This also prevents
> > > needing to add another dax-device object representation.
> >
> > I am wondering what's the conclusion on this. I want to this to make
> > progress in some direction so that I can make progress on virtiofs DAX
> > support.
> 
> I think we should at least try to delete the partition support and see
> if anyone screams. Have a module option to revert the behavior so
> people are not stuck waiting for the revert to land, but if it stays
> quiet then we're in a better place with that support pushed out of the
> dax core.

Hi Dan,

So basically keep partition support code just that disable it by default
and it is enabled by some knob say kernel command line option/module
option.

At what point of time will we remove that code completely. I mean what
if people scream after two kernel releases, after we have removed the
code.

Also, from distribution's perspective, we might not hear from our
customers for a very long time (till we backport that code in to
existing releases or release this new code in next major release). From
that view point I will not like to break existing user visible behavior.

How bad it is to keep partition support around. To me it feels reasonaly
simple where we just have to store offset into dax device into another
dax object and pass that object around (instead of dax_device). If that's
the case, I am not sure why to even venture into a direction where some
user's setup might be broken.

Also from an application perspective, /dev/pmem is a block device, so it
should behave like a block device, (including kernel partition table support).
From that view, dax looks like just an additional feature of that device
which can be enabled by passing option "-o dax".

IOW, can we reconsider the idea of not supporting kernel partition tables
for dax capable block devices. I can only see downsides of removing kernel
partition table support and only upside seems to be little cleanup of dax
core code.

Thanks
Vivek
Dan Williams Jan. 14, 2020, 10:23 p.m. UTC | #26
On Tue, Jan 14, 2020 at 1:28 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Tue, Jan 14, 2020 at 12:39:00PM -0800, Dan Williams wrote:
> > On Tue, Jan 14, 2020 at 12:31 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Thu, Jan 09, 2020 at 12:03:01PM -0800, Dan Williams wrote:
> > > > On Thu, Jan 9, 2020 at 3:27 AM Jan Kara <jack@suse.cz> wrote:
> > > > >
> > > > > On Tue 07-01-20 10:49:55, Dan Williams wrote:
> > > > > > On Tue, Jan 7, 2020 at 10:33 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > > > > W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where
> > > > > > > dax code refers back to block device to figure out partition offset in
> > > > > > > dax device. If we create a dax object corresponding to "struct block_device"
> > > > > > > and store sector offset in that, then we could pass that object to dax
> > > > > > > code and not worry about referring back to bdev. I have written some
> > > > > > > proof of concept code and called that object "dax_handle". I can post
> > > > > > > that code if there is interest.
> > > > > >
> > > > > > I don't think it's worth it in the end especially considering
> > > > > > filesystems are looking to operate on /dev/dax devices directly and
> > > > > > remove block entanglements entirely.
> > > > > >
> > > > > > > IMHO, it feels useful to be able to partition and use a dax capable
> > > > > > > block device in same way as non-dax block device. It will be really
> > > > > > > odd to think that if filesystem is on /dev/pmem0p1, then dax can't
> > > > > > > be enabled but if filesystem is on /dev/mapper/pmem0p1, then dax
> > > > > > > will work.
> > > > > >
> > > > > > That can already happen today. If you do not properly align the
> > > > > > partition then dax operations will be disabled. This proposal just
> > > > > > extends that existing failure domain to make all partitions fail to
> > > > > > support dax.
> > > > >
> > > > > Well, I have some sympathy with the sysadmin that has /dev/pmem0 device,
> > > > > decides to create partitions on it for whatever (possibly misguided)
> > > > > reason and then ponders why the hell DAX is not working? And PAGE_SIZE
> > > > > partition alignment is so obvious and widespread that I don't count it as a
> > > > > realistic error case sysadmins would be pondering about currently.
> > > > >
> > > > > So I'd find two options reasonably consistent:
> > > > > 1) Keep status quo where partitions are created and support DAX.
> > > > > 2) Stop partition creation altogether, if anyones wants to split pmem
> > > > > device further, he can use dm-linear for that (i.e., kpartx).
> > > > >
> > > > > But I'm not sure if the ship hasn't already sailed for option 2) to be
> > > > > feasible without angry users and Linus reverting the change.
> > > >
> > > > Christoph? I feel myself leaning more and more to the "keep pmem
> > > > partitions" camp.
> > > >
> > > > I don't see "drop partition support" effort ending well given the long
> > > > standing "ext4 fails to mount when dax is not available" precedent.
> > > >
> > > > I think the next least bad option is to have a dax_get_by_host()
> > > > variant that passes an offset and length pair rather than requiring a
> > > > later bdev_dax_pgoff() to recall the offset. This also prevents
> > > > needing to add another dax-device object representation.
> > >
> > > I am wondering what's the conclusion on this. I want to this to make
> > > progress in some direction so that I can make progress on virtiofs DAX
> > > support.
> >
> > I think we should at least try to delete the partition support and see
> > if anyone screams. Have a module option to revert the behavior so
> > people are not stuck waiting for the revert to land, but if it stays
> > quiet then we're in a better place with that support pushed out of the
> > dax core.
>
> Hi Dan,
>
> So basically keep partition support code just that disable it by default
> and it is enabled by some knob say kernel command line option/module
> option.

Yes.

> At what point of time will we remove that code completely. I mean what
> if people scream after two kernel releases, after we have removed the
> code.

I'd follow the typical timelines of Documentation/ABI/obsolete which
is a year or more.

>
> Also, from distribution's perspective, we might not hear from our
> customers for a very long time (till we backport that code in to
> existing releases or release this new code in next major release). From
> that view point I will not like to break existing user visible behavior.
>
> How bad it is to keep partition support around. To me it feels reasonaly
> simple where we just have to store offset into dax device into another
> dax object:

If we end up keeping partition support, we're not adding another object.

> and pass that object around (instead of dax_device). If that's
> the case, I am not sure why to even venture into a direction where some
> user's setup might be broken.

It was a mistake to support them. If that mistake can be undone
without breaking existing deployments the code base is better off
without the concept.

> Also from an application perspective, /dev/pmem is a block device, so it
> should behave like a block device, (including kernel partition table support).
> From that view, dax looks like just an additional feature of that device
> which can be enabled by passing option "-o dax".

dax via block devices was a crutch that we leaned on too heavily, and
the implementation has slowly been moving away from it ever since.

> IOW, can we reconsider the idea of not supporting kernel partition tables
> for dax capable block devices. I can only see downsides of removing kernel
> partition table support and only upside seems to be little cleanup of dax
> core code.

Can you help find end users that depend on it? Even the Red Hat
installation guide example shows mounting on pmem0 directly. [1]

My primary concern is people that might be booting from pmem as boot
support requires an EFI partition table, and initramfs images would
need to be respun to move to kpartx.

[1]: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/storage_administration_guide/index#Configuring-Persistent-Memory-for-File-System-Direct-Access-DAX
Jan Kara Jan. 15, 2020, 9:03 a.m. UTC | #27
On Tue 14-01-20 16:28:05, Vivek Goyal wrote:
> On Tue, Jan 14, 2020 at 12:39:00PM -0800, Dan Williams wrote:
> > I think we should at least try to delete the partition support and see
> > if anyone screams. Have a module option to revert the behavior so
> > people are not stuck waiting for the revert to land, but if it stays
> > quiet then we're in a better place with that support pushed out of the
> > dax core.
> 
> Hi Dan,
> 
> So basically keep partition support code just that disable it by default
> and it is enabled by some knob say kernel command line option/module
> option.
> 
> At what point of time will we remove that code completely. I mean what
> if people scream after two kernel releases, after we have removed the
> code.
> 
> Also, from distribution's perspective, we might not hear from our
> customers for a very long time (till we backport that code in to
> existing releases or release this new code in next major release). From
> that view point I will not like to break existing user visible behavior.
> 
> How bad it is to keep partition support around. To me it feels reasonaly
> simple where we just have to store offset into dax device into another
> dax object and pass that object around (instead of dax_device). If that's
> the case, I am not sure why to even venture into a direction where some
> user's setup might be broken.
> 
> Also from an application perspective, /dev/pmem is a block device, so it
> should behave like a block device, (including kernel partition table support).
> From that view, dax looks like just an additional feature of that device
> which can be enabled by passing option "-o dax".

Well, not all block devices are partitionable. For example cdroms are
standard block devices but partitioning does not run for them. Similarly
device mapper devices are block devices but not partitioned. So there is
some precedens in not doing partitioning for some types of block devices.

For the rest I agree that kernels where pmem devices are partitionable have
shipped in enterprise distros and are going to be supported (and used) for
5-10 years before users decide to move on to something newer - at which
point we'll only find out whether someone used the feature or not. So
deprecation is going to be somewhat interesting. On the other hand clever
udev rule that detects partition table on pmem device and uses kpartx to
partition these devices (like what happens e.g. for dm-multipath devices)
could possibly be used as a replacement for kernel support so there's a way
out of this...

So I don't care too deeply about what the decision is going to be.

								Honza
Vivek Goyal Jan. 15, 2020, 7:56 p.m. UTC | #28
On Tue, Jan 14, 2020 at 02:23:04PM -0800, Dan Williams wrote:
> On Tue, Jan 14, 2020 at 1:28 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Tue, Jan 14, 2020 at 12:39:00PM -0800, Dan Williams wrote:
> > > On Tue, Jan 14, 2020 at 12:31 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > >
> > > > On Thu, Jan 09, 2020 at 12:03:01PM -0800, Dan Williams wrote:
> > > > > On Thu, Jan 9, 2020 at 3:27 AM Jan Kara <jack@suse.cz> wrote:
> > > > > >
> > > > > > On Tue 07-01-20 10:49:55, Dan Williams wrote:
> > > > > > > On Tue, Jan 7, 2020 at 10:33 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > > > > > W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where
> > > > > > > > dax code refers back to block device to figure out partition offset in
> > > > > > > > dax device. If we create a dax object corresponding to "struct block_device"
> > > > > > > > and store sector offset in that, then we could pass that object to dax
> > > > > > > > code and not worry about referring back to bdev. I have written some
> > > > > > > > proof of concept code and called that object "dax_handle". I can post
> > > > > > > > that code if there is interest.
> > > > > > >
> > > > > > > I don't think it's worth it in the end especially considering
> > > > > > > filesystems are looking to operate on /dev/dax devices directly and
> > > > > > > remove block entanglements entirely.
> > > > > > >
> > > > > > > > IMHO, it feels useful to be able to partition and use a dax capable
> > > > > > > > block device in same way as non-dax block device. It will be really
> > > > > > > > odd to think that if filesystem is on /dev/pmem0p1, then dax can't
> > > > > > > > be enabled but if filesystem is on /dev/mapper/pmem0p1, then dax
> > > > > > > > will work.
> > > > > > >
> > > > > > > That can already happen today. If you do not properly align the
> > > > > > > partition then dax operations will be disabled. This proposal just
> > > > > > > extends that existing failure domain to make all partitions fail to
> > > > > > > support dax.
> > > > > >
> > > > > > Well, I have some sympathy with the sysadmin that has /dev/pmem0 device,
> > > > > > decides to create partitions on it for whatever (possibly misguided)
> > > > > > reason and then ponders why the hell DAX is not working? And PAGE_SIZE
> > > > > > partition alignment is so obvious and widespread that I don't count it as a
> > > > > > realistic error case sysadmins would be pondering about currently.
> > > > > >
> > > > > > So I'd find two options reasonably consistent:
> > > > > > 1) Keep status quo where partitions are created and support DAX.
> > > > > > 2) Stop partition creation altogether, if anyones wants to split pmem
> > > > > > device further, he can use dm-linear for that (i.e., kpartx).
> > > > > >
> > > > > > But I'm not sure if the ship hasn't already sailed for option 2) to be
> > > > > > feasible without angry users and Linus reverting the change.
> > > > >
> > > > > Christoph? I feel myself leaning more and more to the "keep pmem
> > > > > partitions" camp.
> > > > >
> > > > > I don't see "drop partition support" effort ending well given the long
> > > > > standing "ext4 fails to mount when dax is not available" precedent.
> > > > >
> > > > > I think the next least bad option is to have a dax_get_by_host()
> > > > > variant that passes an offset and length pair rather than requiring a
> > > > > later bdev_dax_pgoff() to recall the offset. This also prevents
> > > > > needing to add another dax-device object representation.
> > > >
> > > > I am wondering what's the conclusion on this. I want to this to make
> > > > progress in some direction so that I can make progress on virtiofs DAX
> > > > support.
> > >
> > > I think we should at least try to delete the partition support and see
> > > if anyone screams. Have a module option to revert the behavior so
> > > people are not stuck waiting for the revert to land, but if it stays
> > > quiet then we're in a better place with that support pushed out of the
> > > dax core.
> >
> > Hi Dan,
> >
> > So basically keep partition support code just that disable it by default
> > and it is enabled by some knob say kernel command line option/module
> > option.
> 
> Yes.
> 
> > At what point of time will we remove that code completely. I mean what
> > if people scream after two kernel releases, after we have removed the
> > code.
> 
> I'd follow the typical timelines of Documentation/ABI/obsolete which
> is a year or more.
> 
> >
> > Also, from distribution's perspective, we might not hear from our
> > customers for a very long time (till we backport that code in to
> > existing releases or release this new code in next major release). From
> > that view point I will not like to break existing user visible behavior.
> >
> > How bad it is to keep partition support around. To me it feels reasonaly
> > simple where we just have to store offset into dax device into another
> > dax object:
> 
> If we end up keeping partition support, we're not adding another object.
> 
> > and pass that object around (instead of dax_device). If that's
> > the case, I am not sure why to even venture into a direction where some
> > user's setup might be broken.
> 
> It was a mistake to support them. If that mistake can be undone
> without breaking existing deployments the code base is better off
> without the concept.
> 
> > Also from an application perspective, /dev/pmem is a block device, so it
> > should behave like a block device, (including kernel partition table support).
> > From that view, dax looks like just an additional feature of that device
> > which can be enabled by passing option "-o dax".
> 
> dax via block devices was a crutch that we leaned on too heavily, and
> the implementation has slowly been moving away from it ever since.
> 
> > IOW, can we reconsider the idea of not supporting kernel partition tables
> > for dax capable block devices. I can only see downsides of removing kernel
> > partition table support and only upside seems to be little cleanup of dax
> > core code.
> 
> Can you help find end users that depend on it?

I can't think of a real user at this point of time. Just that I am
concerned, once the change goes in, somebody will get affected at later
point of time and comes out complainig and this change will be seen as
breaking user space and hence regression.

> Even the Red Hat
> installation guide example shows mounting on pmem0 directly. [1]

Below that example it also says.

"When creating partitions on a pmem device to be used for direct access,
partitions must be aligned on page boundaries. On the Intel 64 and AMD64
architecture, at least 4KiB alignment for the start and end of the
partition, but 2MiB is the preferred alignment. By default, the parted
tool aligns partitions on 1MiB boundaries. For the first partition,
specify 2MiB as the start of the partition. If the size of the partition
is a multiple of 2MiB, all other partitions are also aligned."

So documentation is clearly saying dax will work with partitions as well.
And some user might decide to just do that.

Thanks
Vivek
Dan Williams Jan. 15, 2020, 8:17 p.m. UTC | #29
On Wed, Jan 15, 2020 at 11:56 AM Vivek Goyal <vgoyal@redhat.com> wrote:
[..]
> > Even the Red Hat
> > installation guide example shows mounting on pmem0 directly. [1]
>
> Below that example it also says.
>
> "When creating partitions on a pmem device to be used for direct access,
> partitions must be aligned on page boundaries. On the Intel 64 and AMD64
> architecture, at least 4KiB alignment for the start and end of the
> partition, but 2MiB is the preferred alignment. By default, the parted
> tool aligns partitions on 1MiB boundaries. For the first partition,
> specify 2MiB as the start of the partition. If the size of the partition
> is a multiple of 2MiB, all other partitions are also aligned."
>
> So documentation is clearly saying dax will work with partitions as well.
> And some user might decide to just do that.

Yes, of course but my point is that it was ambiguous.

I'm going to take a look at how hard it would be to develop a kpartx
fallback in udev. If that can live across the driver transition then
maybe this can be a non-event for end users that already have that
udev update deployed.
Jeff Moyer Jan. 15, 2020, 9:08 p.m. UTC | #30
Hi, Dan,

Dan Williams <dan.j.williams@intel.com> writes:

> I'm going to take a look at how hard it would be to develop a kpartx
> fallback in udev. If that can live across the driver transition then
> maybe this can be a non-event for end users that already have that
> udev update deployed.

I just wanted to remind you that label-less dimms still exist, and are
still being shipped.  For those devices, the only way to subdivide the
storage is via partitioning.

-Jeff
Dan Williams Jan. 16, 2020, 6:09 p.m. UTC | #31
On Wed, Jan 15, 2020 at 1:08 PM Jeff Moyer <jmoyer@redhat.com> wrote:
>
> Hi, Dan,
>
> Dan Williams <dan.j.williams@intel.com> writes:
>
> > I'm going to take a look at how hard it would be to develop a kpartx
> > fallback in udev. If that can live across the driver transition then
> > maybe this can be a non-event for end users that already have that
> > udev update deployed.
>
> I just wanted to remind you that label-less dimms still exist, and are
> still being shipped.  For those devices, the only way to subdivide the
> storage is via partitioning.

True, but if kpartx + udev can make this transparent then I don't
think users lose any functionality. They just gain a device-mapper
dependency.
Vivek Goyal Jan. 16, 2020, 6:39 p.m. UTC | #32
On Thu, Jan 16, 2020 at 10:09:46AM -0800, Dan Williams wrote:
> On Wed, Jan 15, 2020 at 1:08 PM Jeff Moyer <jmoyer@redhat.com> wrote:
> >
> > Hi, Dan,
> >
> > Dan Williams <dan.j.williams@intel.com> writes:
> >
> > > I'm going to take a look at how hard it would be to develop a kpartx
> > > fallback in udev. If that can live across the driver transition then
> > > maybe this can be a non-event for end users that already have that
> > > udev update deployed.
> >
> > I just wanted to remind you that label-less dimms still exist, and are
> > still being shipped.  For those devices, the only way to subdivide the
> > storage is via partitioning.
> 
> True, but if kpartx + udev can make this transparent then I don't
> think users lose any functionality. They just gain a device-mapper
> dependency.

So udev rules will trigger when a /dev/pmemX device shows up and run
kpartx which in turn will create dm-linear devices and device nodes
will show up in /dev/mapper/pmemXpY.

IOW, /dev/pmemXpY device nodes will be gone. So if any of the scripts or
systemd unit files are depenent on /dev/pmemXpY, these will still be
broken out of the box and will have to be modified to use device nodes
in /dev/mapper/ directory instead. Do I understand it right, Or I missed
the idea completely.

Vivek
Dan Williams Jan. 16, 2020, 7:09 p.m. UTC | #33
On Thu, Jan 16, 2020 at 10:39 AM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Thu, Jan 16, 2020 at 10:09:46AM -0800, Dan Williams wrote:
> > On Wed, Jan 15, 2020 at 1:08 PM Jeff Moyer <jmoyer@redhat.com> wrote:
> > >
> > > Hi, Dan,
> > >
> > > Dan Williams <dan.j.williams@intel.com> writes:
> > >
> > > > I'm going to take a look at how hard it would be to develop a kpartx
> > > > fallback in udev. If that can live across the driver transition then
> > > > maybe this can be a non-event for end users that already have that
> > > > udev update deployed.
> > >
> > > I just wanted to remind you that label-less dimms still exist, and are
> > > still being shipped.  For those devices, the only way to subdivide the
> > > storage is via partitioning.
> >
> > True, but if kpartx + udev can make this transparent then I don't
> > think users lose any functionality. They just gain a device-mapper
> > dependency.
>
> So udev rules will trigger when a /dev/pmemX device shows up and run
> kpartx which in turn will create dm-linear devices and device nodes
> will show up in /dev/mapper/pmemXpY.
>
> IOW, /dev/pmemXpY device nodes will be gone. So if any of the scripts or
> systemd unit files are depenent on /dev/pmemXpY, these will still be
> broken out of the box and will have to be modified to use device nodes
> in /dev/mapper/ directory instead. Do I understand it right, Or I missed
> the idea completely.

No, I'd write the udev rule to create links from /dev/pmemXpY to the
/dev/mapper device, and that rule would be gated by a new pmem device
attribute to trigger when kpartx needs to run vs the kernel native
partitions.
Vivek Goyal Jan. 16, 2020, 7:23 p.m. UTC | #34
On Thu, Jan 16, 2020 at 11:09:00AM -0800, Dan Williams wrote:

[..]
> > > True, but if kpartx + udev can make this transparent then I don't
> > > think users lose any functionality. They just gain a device-mapper
> > > dependency.
> >
> > So udev rules will trigger when a /dev/pmemX device shows up and run
> > kpartx which in turn will create dm-linear devices and device nodes
> > will show up in /dev/mapper/pmemXpY.
> >
> > IOW, /dev/pmemXpY device nodes will be gone. So if any of the scripts or
> > systemd unit files are depenent on /dev/pmemXpY, these will still be
> > broken out of the box and will have to be modified to use device nodes
> > in /dev/mapper/ directory instead. Do I understand it right, Or I missed
> > the idea completely.
> 
> No, I'd write the udev rule to create links from /dev/pmemXpY to the
> /dev/mapper device, and that rule would be gated by a new pmem device
> attribute to trigger when kpartx needs to run vs the kernel native
> partitions.

Got it. This sounds much better.

Vivek
Vivek Goyal Feb. 11, 2020, 5:33 p.m. UTC | #35
On Thu, Jan 16, 2020 at 10:09:46AM -0800, Dan Williams wrote:
> On Wed, Jan 15, 2020 at 1:08 PM Jeff Moyer <jmoyer@redhat.com> wrote:
> >
> > Hi, Dan,
> >
> > Dan Williams <dan.j.williams@intel.com> writes:
> >
> > > I'm going to take a look at how hard it would be to develop a kpartx
> > > fallback in udev. If that can live across the driver transition then
> > > maybe this can be a non-event for end users that already have that
> > > udev update deployed.
> >
> > I just wanted to remind you that label-less dimms still exist, and are
> > still being shipped.  For those devices, the only way to subdivide the
> > storage is via partitioning.
> 
> True, but if kpartx + udev can make this transparent then I don't
> think users lose any functionality. They just gain a device-mapper
> dependency.

Hi Dan,

Are you planning to look into making this work?

We can easily disable partition scanning by specifying gendisk
GENHD_FL_NO_PART_SCAN flag. But what about partition additiona path,
ioctl(BLKPG_ADD_PARTITION). That does not seem to do any checks whether
block device supports in kernel partitions or not. 

So kernel partitions (hence /dev/pmemXpY) objects are created anyway and
this will conflict with all the new planned udev rules.

If you block ioctl(BLKPG_ADD_PARTITION), then user space tools like
parted and fdisk started breaking when trying to create a partition
on /dev/pmeme0. IIUC, we have to allow partition table creation on
/dev/pmem0 so that later kpartx can parse it and create dm-linear
partitions.

Thanks
Vivek
diff mbox series

Patch

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 26a654dbc69a..3cbc97f3e653 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -46,7 +46,8 @@  EXPORT_SYMBOL_GPL(dax_read_unlock);
 int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
 		pgoff_t *pgoff)
 {
-	phys_addr_t phys_off = (get_start_sect(bdev) + sector) * 512;
+	sector_t start_sect = bdev ? get_start_sect(bdev) : 0;
+	phys_addr_t phys_off = (start_sect + sector) * 512;
 
 	if (pgoff)
 		*pgoff = PHYS_PFN(phys_off);
diff --git a/fs/dax.c b/fs/dax.c
index 6bf81f931de3..a11147bbaf9e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1046,7 +1046,12 @@  static vm_fault_t dax_load_hole(struct xa_state *xas,
 static bool dax_range_is_aligned(struct block_device *bdev,
 				 unsigned int offset, unsigned int length)
 {
-	unsigned short sector_size = bdev_logical_block_size(bdev);
+	unsigned short sector_size;
+
+	if (!bdev)
+		return false;
+
+	sector_size = bdev_logical_block_size(bdev);
 
 	if (!IS_ALIGNED(offset, sector_size))
 		return false;