Message ID | 20190821175720.25901-2-vgoyal@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | virtio-fs: Enable DAX support | expand |
On Wed, Aug 21, 2019 at 01:57:02PM -0400, Vivek Goyal wrote: > From: Stefan Hajnoczi <stefanha@redhat.com> > > Although struct dax_device itself is not tied to a block device, some > DAX code assumes there is a block device. Make block devices optional > by allowing bdev to be NULL in commonly used DAX APIs. > > When there is no block device: > * Skip the partition offset calculation in bdev_dax_pgoff() > * Skip the blkdev_issue_zeroout() optimization > > Note that more block device assumptions remain but I haven't reach those > code paths yet. I think this should be split into two patches. For bdev_dax_pgoff I'd much rather have the partition offset if there is on in the daxdev somehow so that we can get rid of the block device entirely. Similarly for dax_range_is_aligned I'd rather have a pure dax way to offload zeroing rather than this bdev hack. In the long run I'd really like to make the bdev vs daxdev in iomap a union instead of having to carry both around.
On Mon, Aug 26, 2019 at 04:51:52AM -0700, Christoph Hellwig wrote: > On Wed, Aug 21, 2019 at 01:57:02PM -0400, Vivek Goyal wrote: > > From: Stefan Hajnoczi <stefanha@redhat.com> > > > > Although struct dax_device itself is not tied to a block device, some > > DAX code assumes there is a block device. Make block devices optional > > by allowing bdev to be NULL in commonly used DAX APIs. > > > > When there is no block device: > > * Skip the partition offset calculation in bdev_dax_pgoff() > > * Skip the blkdev_issue_zeroout() optimization > > > > Note that more block device assumptions remain but I haven't reach those > > code paths yet. > > I think this should be split into two patches. Hi Christoph, Ok, will split in two patches. In fact, I think will completley drop the second change right now as I think we might not be hitting that path yet. > For bdev_dax_pgoff > I'd much rather have the partition offset if there is on in the daxdev > somehow so that we can get rid of the block device entirely. IIUC, there is one block_device per partition while there is only one dax_device for the whole disk. So we can't directly move bdev logical offset into dax_device. We probably could put this in "iomap" and leave it to filesystems to report offset into dax_dev in iomap that way dax generic code does not have to deal with it. But that probably will be a bigger change. Did I misunderstand your suggestion. > > Similarly for dax_range_is_aligned I'd rather have a pure dax way > to offload zeroing rather than this bdev hack. Following commig introduced the change to write zeros through block device path. commit 4b0228fa1d753f77fe0e6cf4c41398ec77dfbd2a Author: Vishal Verma <vishal.l.verma@intel.com> Date: Thu Apr 21 15:13:46 2016 -0400 dax: for truncate/hole-punch, do zeroing through the driver if possible IIUC, they are doing it so that they can clear gendisk->badblocks list. So even if there is pure dax way to do it, there will have to some involvment of block layer to clear gendisk->badblocks list. I am not sure I fully understand your suggestion. But I am hoping its not a must for these changes to make a progress. For now, I will drop change to dax_range_is_aligned(). Thanks Vivek
On Tue, Aug 27, 2019 at 12:38:28PM -0400, Vivek Goyal wrote: > > For bdev_dax_pgoff > > I'd much rather have the partition offset if there is on in the daxdev > > somehow so that we can get rid of the block device entirely. > > IIUC, there is one block_device per partition while there is only one > dax_device for the whole disk. So we can't directly move bdev logical > offset into dax_device. Well, then we need to find a way to get partitions for dax devices, as we really should not expect a block device hiding behind a dax dev. That is just a weird legacy assumption - block device need to layer on top of the dax device optionally. > > We probably could put this in "iomap" and leave it to filesystems to > report offset into dax_dev in iomap that way dax generic code does not > have to deal with it. But that probably will be a bigger change. And where would the file system get that information from? > commit 4b0228fa1d753f77fe0e6cf4c41398ec77dfbd2a > Author: Vishal Verma <vishal.l.verma@intel.com> > Date: Thu Apr 21 15:13:46 2016 -0400 > > dax: for truncate/hole-punch, do zeroing through the driver if possible > > IIUC, they are doing it so that they can clear gendisk->badblocks list. > > So even if there is pure dax way to do it, there will have to some > involvment of block layer to clear gendisk->badblocks list. Once again we need to move that list to the dax device, as the assumption that there is a block device associated with the dax dev is flawed. > I am not sure I fully understand your suggestion. But I am hoping its > not a must for these changes to make a progress. For now, I will drop > change to dax_range_is_aligned(). Well, someone needs to clean this mess up, and as you have an actual real life example of a dax dev without the block device I think the burden naturally falls on you.
On Tue, Aug 27, 2019 at 11:58:09PM -0700, Christoph Hellwig wrote: > On Tue, Aug 27, 2019 at 12:38:28PM -0400, Vivek Goyal wrote: > > > For bdev_dax_pgoff > > > I'd much rather have the partition offset if there is on in the daxdev > > > somehow so that we can get rid of the block device entirely. > > > > IIUC, there is one block_device per partition while there is only one > > dax_device for the whole disk. So we can't directly move bdev logical > > offset into dax_device. > > Well, then we need to find a way to get partitions for dax devices, > as we really should not expect a block device hiding behind a dax > dev. That is just a weird legacy assumption - block device need to > layer on top of the dax device optionally. > > > > > We probably could put this in "iomap" and leave it to filesystems to > > report offset into dax_dev in iomap that way dax generic code does not > > have to deal with it. But that probably will be a bigger change. > > And where would the file system get that information from? File system knows about block device, can it just call get_start_sect() while filling iomap->addr. And this means we don't have to have parition information in dax device. Will something like following work? (Just a proof of concept patch). --- drivers/dax/super.c | 11 +++++++++++ fs/dax.c | 6 +++--- fs/ext4/inode.c | 6 +++++- include/linux/dax.h | 1 + 4 files changed, 20 insertions(+), 4 deletions(-) Index: rhvgoyal-linux/fs/ext4/inode.c =================================================================== --- rhvgoyal-linux.orig/fs/ext4/inode.c 2019-08-28 13:51:16.051937204 -0400 +++ rhvgoyal-linux/fs/ext4/inode.c 2019-08-28 13:51:44.453937204 -0400 @@ -3589,7 +3589,11 @@ retry: WARN_ON_ONCE(1); return -EIO; } - iomap->addr = (u64)map.m_pblk << blkbits; + if (IS_DAX(inode)) + iomap->addr = ((u64)map.m_pblk << blkbits) + + (get_start_sect(iomap->bdev) * 512); + else + iomap->addr = (u64)map.m_pblk << blkbits; } if (map.m_flags & EXT4_MAP_NEW) Index: rhvgoyal-linux/fs/dax.c =================================================================== --- rhvgoyal-linux.orig/fs/dax.c 2019-08-28 13:51:16.051937204 -0400 +++ rhvgoyal-linux/fs/dax.c 2019-08-28 13:51:44.457937204 -0400 @@ -688,7 +688,7 @@ static int copy_user_dax(struct block_de long rc; int id; - rc = bdev_dax_pgoff(bdev, sector, size, &pgoff); + rc = dax_pgoff(sector, size, &pgoff); if (rc) return rc; @@ -995,7 +995,7 @@ static int dax_iomap_pfn(struct iomap *i int id, rc; long length; - rc = bdev_dax_pgoff(iomap->bdev, sector, size, &pgoff); + rc = dax_pgoff(sector, size, &pgoff); if (rc) return rc; id = dax_read_lock(); @@ -1137,7 +1137,7 @@ dax_iomap_actor(struct inode *inode, lof break; } - ret = bdev_dax_pgoff(bdev, sector, size, &pgoff); + ret = dax_pgoff(sector, size, &pgoff); if (ret) break; Index: rhvgoyal-linux/drivers/dax/super.c =================================================================== --- rhvgoyal-linux.orig/drivers/dax/super.c 2019-08-28 13:51:51.802937204 -0400 +++ rhvgoyal-linux/drivers/dax/super.c 2019-08-28 13:51:56.905937204 -0400 @@ -43,6 +43,17 @@ EXPORT_SYMBOL_GPL(dax_read_unlock); #ifdef CONFIG_BLOCK #include <linux/blkdev.h> +int dax_pgoff(sector_t sector, size_t size, pgoff_t *pgoff) +{ + phys_addr_t phys_off = sector * 512; + + if (pgoff) + *pgoff = PHYS_PFN(phys_off); + if (phys_off % PAGE_SIZE || size % PAGE_SIZE) + return -EINVAL; + return 0; +} + int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size, pgoff_t *pgoff) { Index: rhvgoyal-linux/include/linux/dax.h =================================================================== --- rhvgoyal-linux.orig/include/linux/dax.h 2019-08-28 13:51:51.802937204 -0400 +++ rhvgoyal-linux/include/linux/dax.h 2019-08-28 13:51:56.908937204 -0400 @@ -111,6 +111,7 @@ static inline bool daxdev_mapping_suppor struct writeback_control; int bdev_dax_pgoff(struct block_device *, sector_t, size_t, pgoff_t *pgoff); +int dax_pgoff(sector_t, size_t, pgoff_t *pgoff); #if IS_ENABLED(CONFIG_FS_DAX) bool __bdev_dax_supported(struct block_device *bdev, int blocksize); static inline bool bdev_dax_supported(struct block_device *bdev, int blocksize)
On Wed, Aug 28, 2019 at 01:58:43PM -0400, Vivek Goyal wrote: > On Tue, Aug 27, 2019 at 11:58:09PM -0700, Christoph Hellwig wrote: > > On Tue, Aug 27, 2019 at 12:38:28PM -0400, Vivek Goyal wrote: > > > > For bdev_dax_pgoff > > > > I'd much rather have the partition offset if there is on in the daxdev > > > > somehow so that we can get rid of the block device entirely. > > > > > > IIUC, there is one block_device per partition while there is only one > > > dax_device for the whole disk. So we can't directly move bdev logical > > > offset into dax_device. > > > > Well, then we need to find a way to get partitions for dax devices, > > as we really should not expect a block device hiding behind a dax > > dev. That is just a weird legacy assumption - block device need to > > layer on top of the dax device optionally. > > > > > > > > We probably could put this in "iomap" and leave it to filesystems to > > > report offset into dax_dev in iomap that way dax generic code does not > > > have to deal with it. But that probably will be a bigger change. > > > > And where would the file system get that information from? > > File system knows about block device, can it just call get_start_sect() > while filling iomap->addr. And this means we don't have to have > parition information in dax device. Will something like following work? > (Just a proof of concept patch). > > > --- > drivers/dax/super.c | 11 +++++++++++ > fs/dax.c | 6 +++--- > fs/ext4/inode.c | 6 +++++- > include/linux/dax.h | 1 + > 4 files changed, 20 insertions(+), 4 deletions(-) > > Index: rhvgoyal-linux/fs/ext4/inode.c > =================================================================== > --- rhvgoyal-linux.orig/fs/ext4/inode.c 2019-08-28 13:51:16.051937204 -0400 > +++ rhvgoyal-linux/fs/ext4/inode.c 2019-08-28 13:51:44.453937204 -0400 > @@ -3589,7 +3589,11 @@ retry: > WARN_ON_ONCE(1); > return -EIO; > } > - iomap->addr = (u64)map.m_pblk << blkbits; > + if (IS_DAX(inode)) > + iomap->addr = ((u64)map.m_pblk << blkbits) + > + (get_start_sect(iomap->bdev) * 512); > + else > + iomap->addr = (u64)map.m_pblk << blkbits; I'm not a fan of returning a physical device sector address from an interface where ever other user/caller expects this address to be a logical block address into the block device. It creates a landmine in the iomap API that callers may not be aware of and that's going to cause bugs. We're trying really hard to keep special case hacks like this out of the iomap infrastructure, so on those grounds alone I'd suggest this is a dead end approach. Hence I think that if the dax device needs a physical offset from the start of the block device the filesystem sits on, it should be set up at dax device instantiation time and so the filesystem/bdev never needs to be queried again for this information. Cheers, Dave.
On Wed, Aug 28, 2019 at 3:53 PM Dave Chinner <david@fromorbit.com> wrote: > > On Wed, Aug 28, 2019 at 01:58:43PM -0400, Vivek Goyal wrote: > > On Tue, Aug 27, 2019 at 11:58:09PM -0700, Christoph Hellwig wrote: > > > On Tue, Aug 27, 2019 at 12:38:28PM -0400, Vivek Goyal wrote: > > > > > For bdev_dax_pgoff > > > > > I'd much rather have the partition offset if there is on in the daxdev > > > > > somehow so that we can get rid of the block device entirely. > > > > > > > > IIUC, there is one block_device per partition while there is only one > > > > dax_device for the whole disk. So we can't directly move bdev logical > > > > offset into dax_device. > > > > > > Well, then we need to find a way to get partitions for dax devices, > > > as we really should not expect a block device hiding behind a dax > > > dev. That is just a weird legacy assumption - block device need to > > > layer on top of the dax device optionally. > > > > > > > > > > > We probably could put this in "iomap" and leave it to filesystems to > > > > report offset into dax_dev in iomap that way dax generic code does not > > > > have to deal with it. But that probably will be a bigger change. > > > > > > And where would the file system get that information from? > > > > File system knows about block device, can it just call get_start_sect() > > while filling iomap->addr. And this means we don't have to have > > parition information in dax device. Will something like following work? > > (Just a proof of concept patch). > > > > > > --- > > drivers/dax/super.c | 11 +++++++++++ > > fs/dax.c | 6 +++--- > > fs/ext4/inode.c | 6 +++++- > > include/linux/dax.h | 1 + > > 4 files changed, 20 insertions(+), 4 deletions(-) > > > > Index: rhvgoyal-linux/fs/ext4/inode.c > > =================================================================== > > --- rhvgoyal-linux.orig/fs/ext4/inode.c 2019-08-28 13:51:16.051937204 -0400 > > +++ rhvgoyal-linux/fs/ext4/inode.c 2019-08-28 13:51:44.453937204 -0400 > > @@ -3589,7 +3589,11 @@ retry: > > WARN_ON_ONCE(1); > > return -EIO; > > } > > - iomap->addr = (u64)map.m_pblk << blkbits; > > + if (IS_DAX(inode)) > > + iomap->addr = ((u64)map.m_pblk << blkbits) + > > + (get_start_sect(iomap->bdev) * 512); > > + else > > + iomap->addr = (u64)map.m_pblk << blkbits; > > I'm not a fan of returning a physical device sector address from an > interface where ever other user/caller expects this address to be a > logical block address into the block device. It creates a landmine > in the iomap API that callers may not be aware of and that's going > to cause bugs. We're trying really hard to keep special case hacks > like this out of the iomap infrastructure, so on those grounds alone > I'd suggest this is a dead end approach. > > Hence I think that if the dax device needs a physical offset from > the start of the block device the filesystem sits on, it should be > set up at dax device instantiation time and so the filesystem/bdev > never needs to be queried again for this information. > Agree. In retrospect it was my laziness in the dax-device implementation to expect the block-device to be available. It looks like fs_dax_get_by_bdev() is an intercept point where a dax_device could be dynamically created to represent the subset range indicated by the block-device partition. That would open up more cleanup opportunities.
On Wed, Aug 28, 2019 at 05:04:11PM -0700, Dan Williams wrote: > Agree. In retrospect it was my laziness in the dax-device > implementation to expect the block-device to be available. > > It looks like fs_dax_get_by_bdev() is an intercept point where a > dax_device could be dynamically created to represent the subset range > indicated by the block-device partition. That would open up more > cleanup opportunities. That seems like a decent short-term plan. But in the long I'd just let dax call into the partition table parser directly, as we might want to support partitions without first having to create a block device on top of the dax device. Same for the badblocks handling story.
On Wed, Aug 28, 2019 at 05:04:11PM -0700, Dan Williams wrote: > On Wed, Aug 28, 2019 at 3:53 PM Dave Chinner <david@fromorbit.com> wrote: > > > > On Wed, Aug 28, 2019 at 01:58:43PM -0400, Vivek Goyal wrote: > > > On Tue, Aug 27, 2019 at 11:58:09PM -0700, Christoph Hellwig wrote: > > > > On Tue, Aug 27, 2019 at 12:38:28PM -0400, Vivek Goyal wrote: > > > > > > For bdev_dax_pgoff > > > > > > I'd much rather have the partition offset if there is on in the daxdev > > > > > > somehow so that we can get rid of the block device entirely. > > > > > > > > > > IIUC, there is one block_device per partition while there is only one > > > > > dax_device for the whole disk. So we can't directly move bdev logical > > > > > offset into dax_device. > > > > > > > > Well, then we need to find a way to get partitions for dax devices, > > > > as we really should not expect a block device hiding behind a dax > > > > dev. That is just a weird legacy assumption - block device need to > > > > layer on top of the dax device optionally. > > > > > > > > > > > > > > We probably could put this in "iomap" and leave it to filesystems to > > > > > report offset into dax_dev in iomap that way dax generic code does not > > > > > have to deal with it. But that probably will be a bigger change. > > > > > > > > And where would the file system get that information from? > > > > > > File system knows about block device, can it just call get_start_sect() > > > while filling iomap->addr. And this means we don't have to have > > > parition information in dax device. Will something like following work? > > > (Just a proof of concept patch). > > > > > > > > > --- > > > drivers/dax/super.c | 11 +++++++++++ > > > fs/dax.c | 6 +++--- > > > fs/ext4/inode.c | 6 +++++- > > > include/linux/dax.h | 1 + > > > 4 files changed, 20 insertions(+), 4 deletions(-) > > > > > > Index: rhvgoyal-linux/fs/ext4/inode.c > > > =================================================================== > > > --- rhvgoyal-linux.orig/fs/ext4/inode.c 2019-08-28 13:51:16.051937204 -0400 > > > +++ rhvgoyal-linux/fs/ext4/inode.c 2019-08-28 13:51:44.453937204 -0400 > > > @@ -3589,7 +3589,11 @@ retry: > > > WARN_ON_ONCE(1); > > > return -EIO; > > > } > > > - iomap->addr = (u64)map.m_pblk << blkbits; > > > + if (IS_DAX(inode)) > > > + iomap->addr = ((u64)map.m_pblk << blkbits) + > > > + (get_start_sect(iomap->bdev) * 512); > > > + else > > > + iomap->addr = (u64)map.m_pblk << blkbits; > > > > I'm not a fan of returning a physical device sector address from an > > interface where ever other user/caller expects this address to be a > > logical block address into the block device. It creates a landmine > > in the iomap API that callers may not be aware of and that's going > > to cause bugs. We're trying really hard to keep special case hacks > > like this out of the iomap infrastructure, so on those grounds alone > > I'd suggest this is a dead end approach. > > > > Hence I think that if the dax device needs a physical offset from > > the start of the block device the filesystem sits on, it should be > > set up at dax device instantiation time and so the filesystem/bdev > > never needs to be queried again for this information. > > > > Agree. In retrospect it was my laziness in the dax-device > implementation to expect the block-device to be available. > > It looks like fs_dax_get_by_bdev() is an intercept point where a > dax_device could be dynamically created to represent the subset range > indicated by the block-device partition. That would open up more > cleanup opportunities. Hi Dan, After a long time I got time to look at it again. Want to work on this cleanup so that I can make progress with virtiofs DAX paches. I am not sure I understand the requirements fully. I see that right now dax_device is created per device and all block partitions refer to it. If we want to create one dax_device per partition, then it looks like this will be structured more along the lines how block layer handles disk and partitions. (One gendisk for disk and block_devices for partitions, including partition 0). That probably means state belong to whole device will be in common structure say dax_device_common, and per partition state will be in dax_device and dax_device can carry a pointer to dax_device_common. I am also not sure what does it mean to partition dax devices. How will partitions be exported to user space. Thanks Vivek
On Mon, Dec 16, 2019 at 01:10:14PM -0500, Vivek Goyal wrote: > > Agree. In retrospect it was my laziness in the dax-device > > implementation to expect the block-device to be available. > > > > It looks like fs_dax_get_by_bdev() is an intercept point where a > > dax_device could be dynamically created to represent the subset range > > indicated by the block-device partition. That would open up more > > cleanup opportunities. > > Hi Dan, > > After a long time I got time to look at it again. Want to work on this > cleanup so that I can make progress with virtiofs DAX paches. > > I am not sure I understand the requirements fully. I see that right now > dax_device is created per device and all block partitions refer to it. If > we want to create one dax_device per partition, then it looks like this > will be structured more along the lines how block layer handles disk and > partitions. (One gendisk for disk and block_devices for partitions, > including partition 0). That probably means state belong to whole device > will be in common structure say dax_device_common, and per partition state > will be in dax_device and dax_device can carry a pointer to > dax_device_common. > > I am also not sure what does it mean to partition dax devices. How will > partitions be exported to user space. Dan, last time we talked you agreed that partitioned dax devices are rather pointless IIRC. Should we just deprecate partitions on DAX devices and then remove them after a cycle or two?
On Tue, Jan 7, 2020 at 4:52 AM Christoph Hellwig <hch@infradead.org> wrote: > > On Mon, Dec 16, 2019 at 01:10:14PM -0500, Vivek Goyal wrote: > > > Agree. In retrospect it was my laziness in the dax-device > > > implementation to expect the block-device to be available. > > > > > > It looks like fs_dax_get_by_bdev() is an intercept point where a > > > dax_device could be dynamically created to represent the subset range > > > indicated by the block-device partition. That would open up more > > > cleanup opportunities. > > > > Hi Dan, > > > > After a long time I got time to look at it again. Want to work on this > > cleanup so that I can make progress with virtiofs DAX paches. > > > > I am not sure I understand the requirements fully. I see that right now > > dax_device is created per device and all block partitions refer to it. If > > we want to create one dax_device per partition, then it looks like this > > will be structured more along the lines how block layer handles disk and > > partitions. (One gendisk for disk and block_devices for partitions, > > including partition 0). That probably means state belong to whole device > > will be in common structure say dax_device_common, and per partition state > > will be in dax_device and dax_device can carry a pointer to > > dax_device_common. > > > > I am also not sure what does it mean to partition dax devices. How will > > partitions be exported to user space. > > Dan, last time we talked you agreed that partitioned dax devices are > rather pointless IIRC. Should we just deprecate partitions on DAX > devices and then remove them after a cycle or two? That does seem a better plan than trying to force partition support where it is not needed.
On Tue, Jan 07, 2020 at 06:22:54AM -0800, Dan Williams wrote: > On Tue, Jan 7, 2020 at 4:52 AM Christoph Hellwig <hch@infradead.org> wrote: > > > > On Mon, Dec 16, 2019 at 01:10:14PM -0500, Vivek Goyal wrote: > > > > Agree. In retrospect it was my laziness in the dax-device > > > > implementation to expect the block-device to be available. > > > > > > > > It looks like fs_dax_get_by_bdev() is an intercept point where a > > > > dax_device could be dynamically created to represent the subset range > > > > indicated by the block-device partition. That would open up more > > > > cleanup opportunities. > > > > > > Hi Dan, > > > > > > After a long time I got time to look at it again. Want to work on this > > > cleanup so that I can make progress with virtiofs DAX paches. > > > > > > I am not sure I understand the requirements fully. I see that right now > > > dax_device is created per device and all block partitions refer to it. If > > > we want to create one dax_device per partition, then it looks like this > > > will be structured more along the lines how block layer handles disk and > > > partitions. (One gendisk for disk and block_devices for partitions, > > > including partition 0). That probably means state belong to whole device > > > will be in common structure say dax_device_common, and per partition state > > > will be in dax_device and dax_device can carry a pointer to > > > dax_device_common. > > > > > > I am also not sure what does it mean to partition dax devices. How will > > > partitions be exported to user space. > > > > Dan, last time we talked you agreed that partitioned dax devices are > > rather pointless IIRC. Should we just deprecate partitions on DAX > > devices and then remove them after a cycle or two? > > That does seem a better plan than trying to force partition support > where it is not needed. Question: if one /did/ have a partitioned DAX device and used kpartx to create dm-linear devices for each partition, will DAX still work through that? --D
On Tue, Jan 7, 2020 at 9:08 AM Darrick J. Wong <darrick.wong@oracle.com> wrote: > > On Tue, Jan 07, 2020 at 06:22:54AM -0800, Dan Williams wrote: > > On Tue, Jan 7, 2020 at 4:52 AM Christoph Hellwig <hch@infradead.org> wrote: > > > > > > On Mon, Dec 16, 2019 at 01:10:14PM -0500, Vivek Goyal wrote: > > > > > Agree. In retrospect it was my laziness in the dax-device > > > > > implementation to expect the block-device to be available. > > > > > > > > > > It looks like fs_dax_get_by_bdev() is an intercept point where a > > > > > dax_device could be dynamically created to represent the subset range > > > > > indicated by the block-device partition. That would open up more > > > > > cleanup opportunities. > > > > > > > > Hi Dan, > > > > > > > > After a long time I got time to look at it again. Want to work on this > > > > cleanup so that I can make progress with virtiofs DAX paches. > > > > > > > > I am not sure I understand the requirements fully. I see that right now > > > > dax_device is created per device and all block partitions refer to it. If > > > > we want to create one dax_device per partition, then it looks like this > > > > will be structured more along the lines how block layer handles disk and > > > > partitions. (One gendisk for disk and block_devices for partitions, > > > > including partition 0). That probably means state belong to whole device > > > > will be in common structure say dax_device_common, and per partition state > > > > will be in dax_device and dax_device can carry a pointer to > > > > dax_device_common. > > > > > > > > I am also not sure what does it mean to partition dax devices. How will > > > > partitions be exported to user space. > > > > > > Dan, last time we talked you agreed that partitioned dax devices are > > > rather pointless IIRC. Should we just deprecate partitions on DAX > > > devices and then remove them after a cycle or two? > > > > That does seem a better plan than trying to force partition support > > where it is not needed. > > Question: if one /did/ have a partitioned DAX device and used kpartx to > create dm-linear devices for each partition, will DAX still work through > that? The device-mapper support will continue, but it will be limited to whole device sub-components. I.e. you could use kpartx to carve up /dev/pmem0 and still have dax, but not partitions of /dev/pmem0.
On Tue, Jan 07, 2020 at 09:29:17AM -0800, Dan Williams wrote: > On Tue, Jan 7, 2020 at 9:08 AM Darrick J. Wong <darrick.wong@oracle.com> wrote: > > > > On Tue, Jan 07, 2020 at 06:22:54AM -0800, Dan Williams wrote: > > > On Tue, Jan 7, 2020 at 4:52 AM Christoph Hellwig <hch@infradead.org> wrote: > > > > > > > > On Mon, Dec 16, 2019 at 01:10:14PM -0500, Vivek Goyal wrote: > > > > > > Agree. In retrospect it was my laziness in the dax-device > > > > > > implementation to expect the block-device to be available. > > > > > > > > > > > > It looks like fs_dax_get_by_bdev() is an intercept point where a > > > > > > dax_device could be dynamically created to represent the subset range > > > > > > indicated by the block-device partition. That would open up more > > > > > > cleanup opportunities. > > > > > > > > > > Hi Dan, > > > > > > > > > > After a long time I got time to look at it again. Want to work on this > > > > > cleanup so that I can make progress with virtiofs DAX paches. > > > > > > > > > > I am not sure I understand the requirements fully. I see that right now > > > > > dax_device is created per device and all block partitions refer to it. If > > > > > we want to create one dax_device per partition, then it looks like this > > > > > will be structured more along the lines how block layer handles disk and > > > > > partitions. (One gendisk for disk and block_devices for partitions, > > > > > including partition 0). That probably means state belong to whole device > > > > > will be in common structure say dax_device_common, and per partition state > > > > > will be in dax_device and dax_device can carry a pointer to > > > > > dax_device_common. > > > > > > > > > > I am also not sure what does it mean to partition dax devices. How will > > > > > partitions be exported to user space. > > > > > > > > Dan, last time we talked you agreed that partitioned dax devices are > > > > rather pointless IIRC. Should we just deprecate partitions on DAX > > > > devices and then remove them after a cycle or two? > > > > > > That does seem a better plan than trying to force partition support > > > where it is not needed. > > > > Question: if one /did/ have a partitioned DAX device and used kpartx to > > create dm-linear devices for each partition, will DAX still work through > > that? > > The device-mapper support will continue, but it will be limited to > whole device sub-components. I.e. you could use kpartx to carve up > /dev/pmem0 and still have dax, but not partitions of /dev/pmem0. So we can't use fdisk/parted to partition /dev/pmem0. Given /dev/pmem0 is a block device, I thought tools will expect it to be partitioned. Sometimes I create those partitions and use /dev/pmem0. So what's the replacement for this. People often have tools/scripts which might want to partition the device and these will start failing. IOW, I do not understand that why being able to partition /dev/pmem0 (which is a block device from user space point of view), is pointless. Thanks Vivek
On Tue, Jan 7, 2020 at 10:02 AM Vivek Goyal <vgoyal@redhat.com> wrote: > > On Tue, Jan 07, 2020 at 09:29:17AM -0800, Dan Williams wrote: > > On Tue, Jan 7, 2020 at 9:08 AM Darrick J. Wong <darrick.wong@oracle.com> wrote: > > > > > > On Tue, Jan 07, 2020 at 06:22:54AM -0800, Dan Williams wrote: > > > > On Tue, Jan 7, 2020 at 4:52 AM Christoph Hellwig <hch@infradead.org> wrote: > > > > > > > > > > On Mon, Dec 16, 2019 at 01:10:14PM -0500, Vivek Goyal wrote: > > > > > > > Agree. In retrospect it was my laziness in the dax-device > > > > > > > implementation to expect the block-device to be available. > > > > > > > > > > > > > > It looks like fs_dax_get_by_bdev() is an intercept point where a > > > > > > > dax_device could be dynamically created to represent the subset range > > > > > > > indicated by the block-device partition. That would open up more > > > > > > > cleanup opportunities. > > > > > > > > > > > > Hi Dan, > > > > > > > > > > > > After a long time I got time to look at it again. Want to work on this > > > > > > cleanup so that I can make progress with virtiofs DAX paches. > > > > > > > > > > > > I am not sure I understand the requirements fully. I see that right now > > > > > > dax_device is created per device and all block partitions refer to it. If > > > > > > we want to create one dax_device per partition, then it looks like this > > > > > > will be structured more along the lines how block layer handles disk and > > > > > > partitions. (One gendisk for disk and block_devices for partitions, > > > > > > including partition 0). That probably means state belong to whole device > > > > > > will be in common structure say dax_device_common, and per partition state > > > > > > will be in dax_device and dax_device can carry a pointer to > > > > > > dax_device_common. > > > > > > > > > > > > I am also not sure what does it mean to partition dax devices. How will > > > > > > partitions be exported to user space. > > > > > > > > > > Dan, last time we talked you agreed that partitioned dax devices are > > > > > rather pointless IIRC. Should we just deprecate partitions on DAX > > > > > devices and then remove them after a cycle or two? > > > > > > > > That does seem a better plan than trying to force partition support > > > > where it is not needed. > > > > > > Question: if one /did/ have a partitioned DAX device and used kpartx to > > > create dm-linear devices for each partition, will DAX still work through > > > that? > > > > The device-mapper support will continue, but it will be limited to > > whole device sub-components. I.e. you could use kpartx to carve up > > /dev/pmem0 and still have dax, but not partitions of /dev/pmem0. > > So we can't use fdisk/parted to partition /dev/pmem0. Given /dev/pmem0 > is a block device, I thought tools will expect it to be partitioned. > Sometimes I create those partitions and use /dev/pmem0. So what's > the replacement for this. People often have tools/scripts which might > want to partition the device and these will start failing. Partitioning will still work, but dax operation will be declined and fall back to page-cache. > IOW, I do not understand that why being able to partition /dev/pmem0 > (which is a block device from user space point of view), is pointless. How about s/pointless/redundant/. Persistent memory can already be "partitioned" via namespace boundaries. Block device partitioning is then redundant and needlessly complicates, as you have found, the kernel implementation. The problem will be people that were on dax+ext4 on partitions. Those people will see a hard failure at mount whereas XFS will fallback to page cache with a warning in the log. I think ext4 must convert to the xfs dax handling model before partition support is dropped.
On Tue, Jan 07, 2020 at 10:07:18AM -0800, Dan Williams wrote: > On Tue, Jan 7, 2020 at 10:02 AM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > On Tue, Jan 07, 2020 at 09:29:17AM -0800, Dan Williams wrote: > > > On Tue, Jan 7, 2020 at 9:08 AM Darrick J. Wong <darrick.wong@oracle.com> wrote: > > > > > > > > On Tue, Jan 07, 2020 at 06:22:54AM -0800, Dan Williams wrote: > > > > > On Tue, Jan 7, 2020 at 4:52 AM Christoph Hellwig <hch@infradead.org> wrote: > > > > > > > > > > > > On Mon, Dec 16, 2019 at 01:10:14PM -0500, Vivek Goyal wrote: > > > > > > > > Agree. In retrospect it was my laziness in the dax-device > > > > > > > > implementation to expect the block-device to be available. > > > > > > > > > > > > > > > > It looks like fs_dax_get_by_bdev() is an intercept point where a > > > > > > > > dax_device could be dynamically created to represent the subset range > > > > > > > > indicated by the block-device partition. That would open up more > > > > > > > > cleanup opportunities. > > > > > > > > > > > > > > Hi Dan, > > > > > > > > > > > > > > After a long time I got time to look at it again. Want to work on this > > > > > > > cleanup so that I can make progress with virtiofs DAX paches. > > > > > > > > > > > > > > I am not sure I understand the requirements fully. I see that right now > > > > > > > dax_device is created per device and all block partitions refer to it. If > > > > > > > we want to create one dax_device per partition, then it looks like this > > > > > > > will be structured more along the lines how block layer handles disk and > > > > > > > partitions. (One gendisk for disk and block_devices for partitions, > > > > > > > including partition 0). That probably means state belong to whole device > > > > > > > will be in common structure say dax_device_common, and per partition state > > > > > > > will be in dax_device and dax_device can carry a pointer to > > > > > > > dax_device_common. > > > > > > > > > > > > > > I am also not sure what does it mean to partition dax devices. How will > > > > > > > partitions be exported to user space. > > > > > > > > > > > > Dan, last time we talked you agreed that partitioned dax devices are > > > > > > rather pointless IIRC. Should we just deprecate partitions on DAX > > > > > > devices and then remove them after a cycle or two? > > > > > > > > > > That does seem a better plan than trying to force partition support > > > > > where it is not needed. > > > > > > > > Question: if one /did/ have a partitioned DAX device and used kpartx to > > > > create dm-linear devices for each partition, will DAX still work through > > > > that? > > > > > > The device-mapper support will continue, but it will be limited to > > > whole device sub-components. I.e. you could use kpartx to carve up > > > /dev/pmem0 and still have dax, but not partitions of /dev/pmem0. > > > > So we can't use fdisk/parted to partition /dev/pmem0. Given /dev/pmem0 > > is a block device, I thought tools will expect it to be partitioned. > > Sometimes I create those partitions and use /dev/pmem0. So what's > > the replacement for this. People often have tools/scripts which might > > want to partition the device and these will start failing. > > Partitioning will still work, but dax operation will be declined and > fall back to page-cache. Ok, so if I mount /dev/pmem0p1 with dax enabled, that might fail or filesystem will fall back to using page cache. (But dax will not be enabled). > > > IOW, I do not understand that why being able to partition /dev/pmem0 > > (which is a block device from user space point of view), is pointless. > > How about s/pointless/redundant/. Persistent memory can already be > "partitioned" via namespace boundaries. But that's an entirely different way of partitioning. To me being able to use block devices (with dax capability) in same way as any other block device makes sense. > Block device partitioning is > then redundant and needlessly complicates, as you have found, the > kernel implementation. It does complicate kernel implementation. Is it too hard to solve the problem in kernel. W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where dax code refers back to block device to figure out partition offset in dax device. If we create a dax object corresponding to "struct block_device" and store sector offset in that, then we could pass that object to dax code and not worry about referring back to bdev. I have written some proof of concept code and called that object "dax_handle". I can post that code if there is interest. IMHO, it feels useful to be able to partition and use a dax capable block device in same way as non-dax block device. It will be really odd to think that if filesystem is on /dev/pmem0p1, then dax can't be enabled but if filesystem is on /dev/mapper/pmem0p1, then dax will work. Thanks Vivek > > The problem will be people that were on dax+ext4 on partitions. Those > people will see a hard failure at mount whereas XFS will fallback to > page cache with a warning in the log. I think ext4 must convert to the > xfs dax handling model before partition support is dropped. >
On Tue, Jan 7, 2020 at 10:33 AM Vivek Goyal <vgoyal@redhat.com> wrote: > > On Tue, Jan 07, 2020 at 10:07:18AM -0800, Dan Williams wrote: > > On Tue, Jan 7, 2020 at 10:02 AM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > > > On Tue, Jan 07, 2020 at 09:29:17AM -0800, Dan Williams wrote: > > > > On Tue, Jan 7, 2020 at 9:08 AM Darrick J. Wong <darrick.wong@oracle.com> wrote: > > > > > > > > > > On Tue, Jan 07, 2020 at 06:22:54AM -0800, Dan Williams wrote: > > > > > > On Tue, Jan 7, 2020 at 4:52 AM Christoph Hellwig <hch@infradead.org> wrote: > > > > > > > > > > > > > > On Mon, Dec 16, 2019 at 01:10:14PM -0500, Vivek Goyal wrote: > > > > > > > > > Agree. In retrospect it was my laziness in the dax-device > > > > > > > > > implementation to expect the block-device to be available. > > > > > > > > > > > > > > > > > > It looks like fs_dax_get_by_bdev() is an intercept point where a > > > > > > > > > dax_device could be dynamically created to represent the subset range > > > > > > > > > indicated by the block-device partition. That would open up more > > > > > > > > > cleanup opportunities. > > > > > > > > > > > > > > > > Hi Dan, > > > > > > > > > > > > > > > > After a long time I got time to look at it again. Want to work on this > > > > > > > > cleanup so that I can make progress with virtiofs DAX paches. > > > > > > > > > > > > > > > > I am not sure I understand the requirements fully. I see that right now > > > > > > > > dax_device is created per device and all block partitions refer to it. If > > > > > > > > we want to create one dax_device per partition, then it looks like this > > > > > > > > will be structured more along the lines how block layer handles disk and > > > > > > > > partitions. (One gendisk for disk and block_devices for partitions, > > > > > > > > including partition 0). That probably means state belong to whole device > > > > > > > > will be in common structure say dax_device_common, and per partition state > > > > > > > > will be in dax_device and dax_device can carry a pointer to > > > > > > > > dax_device_common. > > > > > > > > > > > > > > > > I am also not sure what does it mean to partition dax devices. How will > > > > > > > > partitions be exported to user space. > > > > > > > > > > > > > > Dan, last time we talked you agreed that partitioned dax devices are > > > > > > > rather pointless IIRC. Should we just deprecate partitions on DAX > > > > > > > devices and then remove them after a cycle or two? > > > > > > > > > > > > That does seem a better plan than trying to force partition support > > > > > > where it is not needed. > > > > > > > > > > Question: if one /did/ have a partitioned DAX device and used kpartx to > > > > > create dm-linear devices for each partition, will DAX still work through > > > > > that? > > > > > > > > The device-mapper support will continue, but it will be limited to > > > > whole device sub-components. I.e. you could use kpartx to carve up > > > > /dev/pmem0 and still have dax, but not partitions of /dev/pmem0. > > > > > > So we can't use fdisk/parted to partition /dev/pmem0. Given /dev/pmem0 > > > is a block device, I thought tools will expect it to be partitioned. > > > Sometimes I create those partitions and use /dev/pmem0. So what's > > > the replacement for this. People often have tools/scripts which might > > > want to partition the device and these will start failing. > > > > Partitioning will still work, but dax operation will be declined and > > fall back to page-cache. > > Ok, so if I mount /dev/pmem0p1 with dax enabled, that might fail or > filesystem will fall back to using page cache. (But dax will not be > enabled). > > > > > > IOW, I do not understand that why being able to partition /dev/pmem0 > > > (which is a block device from user space point of view), is pointless. > > > > How about s/pointless/redundant/. Persistent memory can already be > > "partitioned" via namespace boundaries. > > But that's an entirely different way of partitioning. To me being able > to use block devices (with dax capability) in same way as any other > block device makes sense. > > > Block device partitioning is > > then redundant and needlessly complicates, as you have found, the > > kernel implementation. > > It does complicate kernel implementation. Is it too hard to solve the > problem in kernel. > > W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where > dax code refers back to block device to figure out partition offset in > dax device. If we create a dax object corresponding to "struct block_device" > and store sector offset in that, then we could pass that object to dax > code and not worry about referring back to bdev. I have written some > proof of concept code and called that object "dax_handle". I can post > that code if there is interest. I don't think it's worth it in the end especially considering filesystems are looking to operate on /dev/dax devices directly and remove block entanglements entirely. > IMHO, it feels useful to be able to partition and use a dax capable > block device in same way as non-dax block device. It will be really > odd to think that if filesystem is on /dev/pmem0p1, then dax can't > be enabled but if filesystem is on /dev/mapper/pmem0p1, then dax > will work. That can already happen today. If you do not properly align the partition then dax operations will be disabled. This proposal just extends that existing failure domain to make all partitions fail to support dax.
On Tue, Jan 07, 2020 at 10:49:55AM -0800, Dan Williams wrote: > On Tue, Jan 7, 2020 at 10:33 AM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > On Tue, Jan 07, 2020 at 10:07:18AM -0800, Dan Williams wrote: > > > On Tue, Jan 7, 2020 at 10:02 AM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > > > > > On Tue, Jan 07, 2020 at 09:29:17AM -0800, Dan Williams wrote: > > > > > On Tue, Jan 7, 2020 at 9:08 AM Darrick J. Wong <darrick.wong@oracle.com> wrote: > > > > > > > > > > > > On Tue, Jan 07, 2020 at 06:22:54AM -0800, Dan Williams wrote: > > > > > > > On Tue, Jan 7, 2020 at 4:52 AM Christoph Hellwig <hch@infradead.org> wrote: > > > > > > > > > > > > > > > > On Mon, Dec 16, 2019 at 01:10:14PM -0500, Vivek Goyal wrote: > > > > > > > > > > Agree. In retrospect it was my laziness in the dax-device > > > > > > > > > > implementation to expect the block-device to be available. > > > > > > > > > > > > > > > > > > > > It looks like fs_dax_get_by_bdev() is an intercept point where a > > > > > > > > > > dax_device could be dynamically created to represent the subset range > > > > > > > > > > indicated by the block-device partition. That would open up more > > > > > > > > > > cleanup opportunities. > > > > > > > > > > > > > > > > > > Hi Dan, > > > > > > > > > > > > > > > > > > After a long time I got time to look at it again. Want to work on this > > > > > > > > > cleanup so that I can make progress with virtiofs DAX paches. > > > > > > > > > > > > > > > > > > I am not sure I understand the requirements fully. I see that right now > > > > > > > > > dax_device is created per device and all block partitions refer to it. If > > > > > > > > > we want to create one dax_device per partition, then it looks like this > > > > > > > > > will be structured more along the lines how block layer handles disk and > > > > > > > > > partitions. (One gendisk for disk and block_devices for partitions, > > > > > > > > > including partition 0). That probably means state belong to whole device > > > > > > > > > will be in common structure say dax_device_common, and per partition state > > > > > > > > > will be in dax_device and dax_device can carry a pointer to > > > > > > > > > dax_device_common. > > > > > > > > > > > > > > > > > > I am also not sure what does it mean to partition dax devices. How will > > > > > > > > > partitions be exported to user space. > > > > > > > > > > > > > > > > Dan, last time we talked you agreed that partitioned dax devices are > > > > > > > > rather pointless IIRC. Should we just deprecate partitions on DAX > > > > > > > > devices and then remove them after a cycle or two? > > > > > > > > > > > > > > That does seem a better plan than trying to force partition support > > > > > > > where it is not needed. > > > > > > > > > > > > Question: if one /did/ have a partitioned DAX device and used kpartx to > > > > > > create dm-linear devices for each partition, will DAX still work through > > > > > > that? > > > > > > > > > > The device-mapper support will continue, but it will be limited to > > > > > whole device sub-components. I.e. you could use kpartx to carve up > > > > > /dev/pmem0 and still have dax, but not partitions of /dev/pmem0. > > > > > > > > So we can't use fdisk/parted to partition /dev/pmem0. Given /dev/pmem0 > > > > is a block device, I thought tools will expect it to be partitioned. > > > > Sometimes I create those partitions and use /dev/pmem0. So what's > > > > the replacement for this. People often have tools/scripts which might > > > > want to partition the device and these will start failing. > > > > > > Partitioning will still work, but dax operation will be declined and > > > fall back to page-cache. > > > > Ok, so if I mount /dev/pmem0p1 with dax enabled, that might fail or > > filesystem will fall back to using page cache. (But dax will not be > > enabled). > > > > > > > > > IOW, I do not understand that why being able to partition /dev/pmem0 > > > > (which is a block device from user space point of view), is pointless. > > > > > > How about s/pointless/redundant/. Persistent memory can already be > > > "partitioned" via namespace boundaries. > > > > But that's an entirely different way of partitioning. To me being able > > to use block devices (with dax capability) in same way as any other > > block device makes sense. > > > > > Block device partitioning is > > > then redundant and needlessly complicates, as you have found, the > > > kernel implementation. > > > > It does complicate kernel implementation. Is it too hard to solve the > > problem in kernel. > > > > W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where > > dax code refers back to block device to figure out partition offset in > > dax device. If we create a dax object corresponding to "struct block_device" > > and store sector offset in that, then we could pass that object to dax > > code and not worry about referring back to bdev. I have written some > > proof of concept code and called that object "dax_handle". I can post > > that code if there is interest. > > I don't think it's worth it in the end especially considering > filesystems are looking to operate on /dev/dax devices directly and > remove block entanglements entirely. > > > IMHO, it feels useful to be able to partition and use a dax capable > > block device in same way as non-dax block device. It will be really > > odd to think that if filesystem is on /dev/pmem0p1, then dax can't > > be enabled but if filesystem is on /dev/mapper/pmem0p1, then dax > > will work. > > That can already happen today. If you do not properly align the > partition then dax operations will be disabled. Er... is this conversation getting confused? I was talking about kpartx's /dev/mapper/pmem0p1 being a straight replacement for the kernel creating /dev/pmem0p1. I thnk Vivek was complaining about the inconsistent behavior between the two, even if the partition is aligned properly. I'm not sure how alignment leaked in here? > This proposal just > extends that existing failure domain to make all partitions fail to > support dax. Oh, wait. You're proposing that "partitions of pmem devices don't support DAX", not "the kernel will not create partitions for pmem devices". Yeah, that would be inconsistent and weird. I'd say deprecate the kernel automounting partitions, but I guess it already does that, and removing it would break /something/. I guess you could put "/dev/pmemXpY" on the deprecation schedule. --D
On Tue, Jan 7, 2020 at 11:03 AM Darrick J. Wong <darrick.wong@oracle.com> wrote: [..] > > That can already happen today. If you do not properly align the > > partition then dax operations will be disabled. > > Er... is this conversation getting confused? I was talking about > kpartx's /dev/mapper/pmem0p1 being a straight replacement for the kernel > creating /dev/pmem0p1. I thnk Vivek was complaining about the > inconsistent behavior between the two, even if the partition is aligned > properly. > > I'm not sure how alignment leaked in here? Oh, whoops, I was jumping to the mismatch between host device and partition and whether we had precedent to fail to support dax on the partition when the base block device does support it. But yes, the mismatch between kpartx and native partitions is weird. That said kpartx is there to add partition support where the kernel for whatever reason fails to, or chooses not to, and dax is looking like such a place. > > This proposal just > > extends that existing failure domain to make all partitions fail to > > support dax. > > Oh, wait. You're proposing that "partitions of pmem devices don't > support DAX", not "the kernel will not create partitions for pmem > devices". > > Yeah, that would be inconsistent and weird. More weird than the current constraints? > I'd say deprecate the > kernel automounting partitions, but I guess it already does that, and Ok, now I don't know why automounting is leaking into this discussion? > removing it would break /something/. Yes, the breakage risk is anyone that was using ext4 mount failure as a dax capability detector. > I guess you could put > "/dev/pmemXpY" on the deprecation schedule. ...but why deprecate /dev/pmemXpY partitions altogether? If someone doesn't care about dax then they can do all the legacy block things. If they do care about dax then work with whole device namespaces. The proposal is to detect dax on partitions and warn people to move to kpartx. Let the core fs/dax implementation continue to shed block dependencies.
On Tue, Jan 7, 2020 at 11:46 AM Dan Williams <dan.j.williams@intel.com> wrote: [..] > > I'd say deprecate the > > kernel automounting partitions, but I guess it already does that, and > > Ok, now I don't know why automounting is leaking into this discussion? > > > removing it would break /something/. > > Yes, the breakage risk is anyone that was using ext4 mount failure as > a dax capability detector. > > > I guess you could put > > "/dev/pmemXpY" on the deprecation schedule. > > ...but why deprecate /dev/pmemXpY partitions altogether? If someone > doesn't care about dax then they can do all the legacy block things. > If they do care about dax then work with whole device namespaces. Circling back on this point now that I understand what you meant by automount. It would need to be a full deprecation of /dev/pmemXpY devices if kpartx dax support is going to fully take over for people that want to use disk partition tables instead of EFI Namespace Labels to carve up pmem.
On Tue 07-01-20 10:49:55, Dan Williams wrote: > On Tue, Jan 7, 2020 at 10:33 AM Vivek Goyal <vgoyal@redhat.com> wrote: > > W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where > > dax code refers back to block device to figure out partition offset in > > dax device. If we create a dax object corresponding to "struct block_device" > > and store sector offset in that, then we could pass that object to dax > > code and not worry about referring back to bdev. I have written some > > proof of concept code and called that object "dax_handle". I can post > > that code if there is interest. > > I don't think it's worth it in the end especially considering > filesystems are looking to operate on /dev/dax devices directly and > remove block entanglements entirely. > > > IMHO, it feels useful to be able to partition and use a dax capable > > block device in same way as non-dax block device. It will be really > > odd to think that if filesystem is on /dev/pmem0p1, then dax can't > > be enabled but if filesystem is on /dev/mapper/pmem0p1, then dax > > will work. > > That can already happen today. If you do not properly align the > partition then dax operations will be disabled. This proposal just > extends that existing failure domain to make all partitions fail to > support dax. Well, I have some sympathy with the sysadmin that has /dev/pmem0 device, decides to create partitions on it for whatever (possibly misguided) reason and then ponders why the hell DAX is not working? And PAGE_SIZE partition alignment is so obvious and widespread that I don't count it as a realistic error case sysadmins would be pondering about currently. So I'd find two options reasonably consistent: 1) Keep status quo where partitions are created and support DAX. 2) Stop partition creation altogether, if anyones wants to split pmem device further, he can use dm-linear for that (i.e., kpartx). But I'm not sure if the ship hasn't already sailed for option 2) to be feasible without angry users and Linus reverting the change. Honza
On Thu, Jan 9, 2020 at 3:27 AM Jan Kara <jack@suse.cz> wrote: > > On Tue 07-01-20 10:49:55, Dan Williams wrote: > > On Tue, Jan 7, 2020 at 10:33 AM Vivek Goyal <vgoyal@redhat.com> wrote: > > > W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where > > > dax code refers back to block device to figure out partition offset in > > > dax device. If we create a dax object corresponding to "struct block_device" > > > and store sector offset in that, then we could pass that object to dax > > > code and not worry about referring back to bdev. I have written some > > > proof of concept code and called that object "dax_handle". I can post > > > that code if there is interest. > > > > I don't think it's worth it in the end especially considering > > filesystems are looking to operate on /dev/dax devices directly and > > remove block entanglements entirely. > > > > > IMHO, it feels useful to be able to partition and use a dax capable > > > block device in same way as non-dax block device. It will be really > > > odd to think that if filesystem is on /dev/pmem0p1, then dax can't > > > be enabled but if filesystem is on /dev/mapper/pmem0p1, then dax > > > will work. > > > > That can already happen today. If you do not properly align the > > partition then dax operations will be disabled. This proposal just > > extends that existing failure domain to make all partitions fail to > > support dax. > > Well, I have some sympathy with the sysadmin that has /dev/pmem0 device, > decides to create partitions on it for whatever (possibly misguided) > reason and then ponders why the hell DAX is not working? And PAGE_SIZE > partition alignment is so obvious and widespread that I don't count it as a > realistic error case sysadmins would be pondering about currently. > > So I'd find two options reasonably consistent: > 1) Keep status quo where partitions are created and support DAX. > 2) Stop partition creation altogether, if anyones wants to split pmem > device further, he can use dm-linear for that (i.e., kpartx). > > But I'm not sure if the ship hasn't already sailed for option 2) to be > feasible without angry users and Linus reverting the change. Christoph? I feel myself leaning more and more to the "keep pmem partitions" camp. I don't see "drop partition support" effort ending well given the long standing "ext4 fails to mount when dax is not available" precedent. I think the next least bad option is to have a dax_get_by_host() variant that passes an offset and length pair rather than requiring a later bdev_dax_pgoff() to recall the offset. This also prevents needing to add another dax-device object representation.
On Thu, Jan 09, 2020 at 12:03:01PM -0800, Dan Williams wrote: > > So I'd find two options reasonably consistent: > > 1) Keep status quo where partitions are created and support DAX. > > 2) Stop partition creation altogether, if anyones wants to split pmem > > device further, he can use dm-linear for that (i.e., kpartx). > > > > But I'm not sure if the ship hasn't already sailed for option 2) to be > > feasible without angry users and Linus reverting the change. > > Christoph? I feel myself leaning more and more to the "keep pmem > partitions" camp. > > I don't see "drop partition support" effort ending well given the long > standing "ext4 fails to mount when dax is not available" precedent. Do we have any evidence of existing setups with DAX and partitions? Can we just throw in a patch to reject that case for now before actually removing the code and see if anyone screams. And fix ext4 up while we are at it. > I think the next least bad option is to have a dax_get_by_host() > variant that passes an offset and length pair rather than requiring a > later bdev_dax_pgoff() to recall the offset. This also prevents > needing to add another dax-device object representation. IFF we have to keep partition support, yes. But keeping it just seems like a really bad idea.
On Thu, Jan 09, 2020 at 12:03:01PM -0800, Dan Williams wrote: > On Thu, Jan 9, 2020 at 3:27 AM Jan Kara <jack@suse.cz> wrote: > > > > On Tue 07-01-20 10:49:55, Dan Williams wrote: > > > On Tue, Jan 7, 2020 at 10:33 AM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where > > > > dax code refers back to block device to figure out partition offset in > > > > dax device. If we create a dax object corresponding to "struct block_device" > > > > and store sector offset in that, then we could pass that object to dax > > > > code and not worry about referring back to bdev. I have written some > > > > proof of concept code and called that object "dax_handle". I can post > > > > that code if there is interest. > > > > > > I don't think it's worth it in the end especially considering > > > filesystems are looking to operate on /dev/dax devices directly and > > > remove block entanglements entirely. > > > > > > > IMHO, it feels useful to be able to partition and use a dax capable > > > > block device in same way as non-dax block device. It will be really > > > > odd to think that if filesystem is on /dev/pmem0p1, then dax can't > > > > be enabled but if filesystem is on /dev/mapper/pmem0p1, then dax > > > > will work. > > > > > > That can already happen today. If you do not properly align the > > > partition then dax operations will be disabled. This proposal just > > > extends that existing failure domain to make all partitions fail to > > > support dax. > > > > Well, I have some sympathy with the sysadmin that has /dev/pmem0 device, > > decides to create partitions on it for whatever (possibly misguided) > > reason and then ponders why the hell DAX is not working? And PAGE_SIZE > > partition alignment is so obvious and widespread that I don't count it as a > > realistic error case sysadmins would be pondering about currently. > > > > So I'd find two options reasonably consistent: > > 1) Keep status quo where partitions are created and support DAX. > > 2) Stop partition creation altogether, if anyones wants to split pmem > > device further, he can use dm-linear for that (i.e., kpartx). > > > > But I'm not sure if the ship hasn't already sailed for option 2) to be > > feasible without angry users and Linus reverting the change. > > Christoph? I feel myself leaning more and more to the "keep pmem > partitions" camp. > > I don't see "drop partition support" effort ending well given the long > standing "ext4 fails to mount when dax is not available" precedent. > > I think the next least bad option is to have a dax_get_by_host() > variant that passes an offset and length pair rather than requiring a > later bdev_dax_pgoff() to recall the offset. This also prevents > needing to add another dax-device object representation. I am wondering what's the conclusion on this. I want to this to make progress in some direction so that I can make progress on virtiofs DAX support. Thanks Vivek
On Tue, Jan 14, 2020 at 12:31 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > On Thu, Jan 09, 2020 at 12:03:01PM -0800, Dan Williams wrote: > > On Thu, Jan 9, 2020 at 3:27 AM Jan Kara <jack@suse.cz> wrote: > > > > > > On Tue 07-01-20 10:49:55, Dan Williams wrote: > > > > On Tue, Jan 7, 2020 at 10:33 AM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > > W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where > > > > > dax code refers back to block device to figure out partition offset in > > > > > dax device. If we create a dax object corresponding to "struct block_device" > > > > > and store sector offset in that, then we could pass that object to dax > > > > > code and not worry about referring back to bdev. I have written some > > > > > proof of concept code and called that object "dax_handle". I can post > > > > > that code if there is interest. > > > > > > > > I don't think it's worth it in the end especially considering > > > > filesystems are looking to operate on /dev/dax devices directly and > > > > remove block entanglements entirely. > > > > > > > > > IMHO, it feels useful to be able to partition and use a dax capable > > > > > block device in same way as non-dax block device. It will be really > > > > > odd to think that if filesystem is on /dev/pmem0p1, then dax can't > > > > > be enabled but if filesystem is on /dev/mapper/pmem0p1, then dax > > > > > will work. > > > > > > > > That can already happen today. If you do not properly align the > > > > partition then dax operations will be disabled. This proposal just > > > > extends that existing failure domain to make all partitions fail to > > > > support dax. > > > > > > Well, I have some sympathy with the sysadmin that has /dev/pmem0 device, > > > decides to create partitions on it for whatever (possibly misguided) > > > reason and then ponders why the hell DAX is not working? And PAGE_SIZE > > > partition alignment is so obvious and widespread that I don't count it as a > > > realistic error case sysadmins would be pondering about currently. > > > > > > So I'd find two options reasonably consistent: > > > 1) Keep status quo where partitions are created and support DAX. > > > 2) Stop partition creation altogether, if anyones wants to split pmem > > > device further, he can use dm-linear for that (i.e., kpartx). > > > > > > But I'm not sure if the ship hasn't already sailed for option 2) to be > > > feasible without angry users and Linus reverting the change. > > > > Christoph? I feel myself leaning more and more to the "keep pmem > > partitions" camp. > > > > I don't see "drop partition support" effort ending well given the long > > standing "ext4 fails to mount when dax is not available" precedent. > > > > I think the next least bad option is to have a dax_get_by_host() > > variant that passes an offset and length pair rather than requiring a > > later bdev_dax_pgoff() to recall the offset. This also prevents > > needing to add another dax-device object representation. > > I am wondering what's the conclusion on this. I want to this to make > progress in some direction so that I can make progress on virtiofs DAX > support. I think we should at least try to delete the partition support and see if anyone screams. Have a module option to revert the behavior so people are not stuck waiting for the revert to land, but if it stays quiet then we're in a better place with that support pushed out of the dax core.
On Tue, Jan 14, 2020 at 12:39:00PM -0800, Dan Williams wrote: > On Tue, Jan 14, 2020 at 12:31 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > On Thu, Jan 09, 2020 at 12:03:01PM -0800, Dan Williams wrote: > > > On Thu, Jan 9, 2020 at 3:27 AM Jan Kara <jack@suse.cz> wrote: > > > > > > > > On Tue 07-01-20 10:49:55, Dan Williams wrote: > > > > > On Tue, Jan 7, 2020 at 10:33 AM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > > > W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where > > > > > > dax code refers back to block device to figure out partition offset in > > > > > > dax device. If we create a dax object corresponding to "struct block_device" > > > > > > and store sector offset in that, then we could pass that object to dax > > > > > > code and not worry about referring back to bdev. I have written some > > > > > > proof of concept code and called that object "dax_handle". I can post > > > > > > that code if there is interest. > > > > > > > > > > I don't think it's worth it in the end especially considering > > > > > filesystems are looking to operate on /dev/dax devices directly and > > > > > remove block entanglements entirely. > > > > > > > > > > > IMHO, it feels useful to be able to partition and use a dax capable > > > > > > block device in same way as non-dax block device. It will be really > > > > > > odd to think that if filesystem is on /dev/pmem0p1, then dax can't > > > > > > be enabled but if filesystem is on /dev/mapper/pmem0p1, then dax > > > > > > will work. > > > > > > > > > > That can already happen today. If you do not properly align the > > > > > partition then dax operations will be disabled. This proposal just > > > > > extends that existing failure domain to make all partitions fail to > > > > > support dax. > > > > > > > > Well, I have some sympathy with the sysadmin that has /dev/pmem0 device, > > > > decides to create partitions on it for whatever (possibly misguided) > > > > reason and then ponders why the hell DAX is not working? And PAGE_SIZE > > > > partition alignment is so obvious and widespread that I don't count it as a > > > > realistic error case sysadmins would be pondering about currently. > > > > > > > > So I'd find two options reasonably consistent: > > > > 1) Keep status quo where partitions are created and support DAX. > > > > 2) Stop partition creation altogether, if anyones wants to split pmem > > > > device further, he can use dm-linear for that (i.e., kpartx). > > > > > > > > But I'm not sure if the ship hasn't already sailed for option 2) to be > > > > feasible without angry users and Linus reverting the change. > > > > > > Christoph? I feel myself leaning more and more to the "keep pmem > > > partitions" camp. > > > > > > I don't see "drop partition support" effort ending well given the long > > > standing "ext4 fails to mount when dax is not available" precedent. > > > > > > I think the next least bad option is to have a dax_get_by_host() > > > variant that passes an offset and length pair rather than requiring a > > > later bdev_dax_pgoff() to recall the offset. This also prevents > > > needing to add another dax-device object representation. > > > > I am wondering what's the conclusion on this. I want to this to make > > progress in some direction so that I can make progress on virtiofs DAX > > support. > > I think we should at least try to delete the partition support and see > if anyone screams. Have a module option to revert the behavior so > people are not stuck waiting for the revert to land, but if it stays > quiet then we're in a better place with that support pushed out of the > dax core. Hi Dan, So basically keep partition support code just that disable it by default and it is enabled by some knob say kernel command line option/module option. At what point of time will we remove that code completely. I mean what if people scream after two kernel releases, after we have removed the code. Also, from distribution's perspective, we might not hear from our customers for a very long time (till we backport that code in to existing releases or release this new code in next major release). From that view point I will not like to break existing user visible behavior. How bad it is to keep partition support around. To me it feels reasonaly simple where we just have to store offset into dax device into another dax object and pass that object around (instead of dax_device). If that's the case, I am not sure why to even venture into a direction where some user's setup might be broken. Also from an application perspective, /dev/pmem is a block device, so it should behave like a block device, (including kernel partition table support). From that view, dax looks like just an additional feature of that device which can be enabled by passing option "-o dax". IOW, can we reconsider the idea of not supporting kernel partition tables for dax capable block devices. I can only see downsides of removing kernel partition table support and only upside seems to be little cleanup of dax core code. Thanks Vivek
On Tue, Jan 14, 2020 at 1:28 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > On Tue, Jan 14, 2020 at 12:39:00PM -0800, Dan Williams wrote: > > On Tue, Jan 14, 2020 at 12:31 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > > > On Thu, Jan 09, 2020 at 12:03:01PM -0800, Dan Williams wrote: > > > > On Thu, Jan 9, 2020 at 3:27 AM Jan Kara <jack@suse.cz> wrote: > > > > > > > > > > On Tue 07-01-20 10:49:55, Dan Williams wrote: > > > > > > On Tue, Jan 7, 2020 at 10:33 AM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > > > > W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where > > > > > > > dax code refers back to block device to figure out partition offset in > > > > > > > dax device. If we create a dax object corresponding to "struct block_device" > > > > > > > and store sector offset in that, then we could pass that object to dax > > > > > > > code and not worry about referring back to bdev. I have written some > > > > > > > proof of concept code and called that object "dax_handle". I can post > > > > > > > that code if there is interest. > > > > > > > > > > > > I don't think it's worth it in the end especially considering > > > > > > filesystems are looking to operate on /dev/dax devices directly and > > > > > > remove block entanglements entirely. > > > > > > > > > > > > > IMHO, it feels useful to be able to partition and use a dax capable > > > > > > > block device in same way as non-dax block device. It will be really > > > > > > > odd to think that if filesystem is on /dev/pmem0p1, then dax can't > > > > > > > be enabled but if filesystem is on /dev/mapper/pmem0p1, then dax > > > > > > > will work. > > > > > > > > > > > > That can already happen today. If you do not properly align the > > > > > > partition then dax operations will be disabled. This proposal just > > > > > > extends that existing failure domain to make all partitions fail to > > > > > > support dax. > > > > > > > > > > Well, I have some sympathy with the sysadmin that has /dev/pmem0 device, > > > > > decides to create partitions on it for whatever (possibly misguided) > > > > > reason and then ponders why the hell DAX is not working? And PAGE_SIZE > > > > > partition alignment is so obvious and widespread that I don't count it as a > > > > > realistic error case sysadmins would be pondering about currently. > > > > > > > > > > So I'd find two options reasonably consistent: > > > > > 1) Keep status quo where partitions are created and support DAX. > > > > > 2) Stop partition creation altogether, if anyones wants to split pmem > > > > > device further, he can use dm-linear for that (i.e., kpartx). > > > > > > > > > > But I'm not sure if the ship hasn't already sailed for option 2) to be > > > > > feasible without angry users and Linus reverting the change. > > > > > > > > Christoph? I feel myself leaning more and more to the "keep pmem > > > > partitions" camp. > > > > > > > > I don't see "drop partition support" effort ending well given the long > > > > standing "ext4 fails to mount when dax is not available" precedent. > > > > > > > > I think the next least bad option is to have a dax_get_by_host() > > > > variant that passes an offset and length pair rather than requiring a > > > > later bdev_dax_pgoff() to recall the offset. This also prevents > > > > needing to add another dax-device object representation. > > > > > > I am wondering what's the conclusion on this. I want to this to make > > > progress in some direction so that I can make progress on virtiofs DAX > > > support. > > > > I think we should at least try to delete the partition support and see > > if anyone screams. Have a module option to revert the behavior so > > people are not stuck waiting for the revert to land, but if it stays > > quiet then we're in a better place with that support pushed out of the > > dax core. > > Hi Dan, > > So basically keep partition support code just that disable it by default > and it is enabled by some knob say kernel command line option/module > option. Yes. > At what point of time will we remove that code completely. I mean what > if people scream after two kernel releases, after we have removed the > code. I'd follow the typical timelines of Documentation/ABI/obsolete which is a year or more. > > Also, from distribution's perspective, we might not hear from our > customers for a very long time (till we backport that code in to > existing releases or release this new code in next major release). From > that view point I will not like to break existing user visible behavior. > > How bad it is to keep partition support around. To me it feels reasonaly > simple where we just have to store offset into dax device into another > dax object: If we end up keeping partition support, we're not adding another object. > and pass that object around (instead of dax_device). If that's > the case, I am not sure why to even venture into a direction where some > user's setup might be broken. It was a mistake to support them. If that mistake can be undone without breaking existing deployments the code base is better off without the concept. > Also from an application perspective, /dev/pmem is a block device, so it > should behave like a block device, (including kernel partition table support). > From that view, dax looks like just an additional feature of that device > which can be enabled by passing option "-o dax". dax via block devices was a crutch that we leaned on too heavily, and the implementation has slowly been moving away from it ever since. > IOW, can we reconsider the idea of not supporting kernel partition tables > for dax capable block devices. I can only see downsides of removing kernel > partition table support and only upside seems to be little cleanup of dax > core code. Can you help find end users that depend on it? Even the Red Hat installation guide example shows mounting on pmem0 directly. [1] My primary concern is people that might be booting from pmem as boot support requires an EFI partition table, and initramfs images would need to be respun to move to kpartx. [1]: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/storage_administration_guide/index#Configuring-Persistent-Memory-for-File-System-Direct-Access-DAX
On Tue 14-01-20 16:28:05, Vivek Goyal wrote: > On Tue, Jan 14, 2020 at 12:39:00PM -0800, Dan Williams wrote: > > I think we should at least try to delete the partition support and see > > if anyone screams. Have a module option to revert the behavior so > > people are not stuck waiting for the revert to land, but if it stays > > quiet then we're in a better place with that support pushed out of the > > dax core. > > Hi Dan, > > So basically keep partition support code just that disable it by default > and it is enabled by some knob say kernel command line option/module > option. > > At what point of time will we remove that code completely. I mean what > if people scream after two kernel releases, after we have removed the > code. > > Also, from distribution's perspective, we might not hear from our > customers for a very long time (till we backport that code in to > existing releases or release this new code in next major release). From > that view point I will not like to break existing user visible behavior. > > How bad it is to keep partition support around. To me it feels reasonaly > simple where we just have to store offset into dax device into another > dax object and pass that object around (instead of dax_device). If that's > the case, I am not sure why to even venture into a direction where some > user's setup might be broken. > > Also from an application perspective, /dev/pmem is a block device, so it > should behave like a block device, (including kernel partition table support). > From that view, dax looks like just an additional feature of that device > which can be enabled by passing option "-o dax". Well, not all block devices are partitionable. For example cdroms are standard block devices but partitioning does not run for them. Similarly device mapper devices are block devices but not partitioned. So there is some precedens in not doing partitioning for some types of block devices. For the rest I agree that kernels where pmem devices are partitionable have shipped in enterprise distros and are going to be supported (and used) for 5-10 years before users decide to move on to something newer - at which point we'll only find out whether someone used the feature or not. So deprecation is going to be somewhat interesting. On the other hand clever udev rule that detects partition table on pmem device and uses kpartx to partition these devices (like what happens e.g. for dm-multipath devices) could possibly be used as a replacement for kernel support so there's a way out of this... So I don't care too deeply about what the decision is going to be. Honza
On Tue, Jan 14, 2020 at 02:23:04PM -0800, Dan Williams wrote: > On Tue, Jan 14, 2020 at 1:28 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > On Tue, Jan 14, 2020 at 12:39:00PM -0800, Dan Williams wrote: > > > On Tue, Jan 14, 2020 at 12:31 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > > > > > On Thu, Jan 09, 2020 at 12:03:01PM -0800, Dan Williams wrote: > > > > > On Thu, Jan 9, 2020 at 3:27 AM Jan Kara <jack@suse.cz> wrote: > > > > > > > > > > > > On Tue 07-01-20 10:49:55, Dan Williams wrote: > > > > > > > On Tue, Jan 7, 2020 at 10:33 AM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > > > > > W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where > > > > > > > > dax code refers back to block device to figure out partition offset in > > > > > > > > dax device. If we create a dax object corresponding to "struct block_device" > > > > > > > > and store sector offset in that, then we could pass that object to dax > > > > > > > > code and not worry about referring back to bdev. I have written some > > > > > > > > proof of concept code and called that object "dax_handle". I can post > > > > > > > > that code if there is interest. > > > > > > > > > > > > > > I don't think it's worth it in the end especially considering > > > > > > > filesystems are looking to operate on /dev/dax devices directly and > > > > > > > remove block entanglements entirely. > > > > > > > > > > > > > > > IMHO, it feels useful to be able to partition and use a dax capable > > > > > > > > block device in same way as non-dax block device. It will be really > > > > > > > > odd to think that if filesystem is on /dev/pmem0p1, then dax can't > > > > > > > > be enabled but if filesystem is on /dev/mapper/pmem0p1, then dax > > > > > > > > will work. > > > > > > > > > > > > > > That can already happen today. If you do not properly align the > > > > > > > partition then dax operations will be disabled. This proposal just > > > > > > > extends that existing failure domain to make all partitions fail to > > > > > > > support dax. > > > > > > > > > > > > Well, I have some sympathy with the sysadmin that has /dev/pmem0 device, > > > > > > decides to create partitions on it for whatever (possibly misguided) > > > > > > reason and then ponders why the hell DAX is not working? And PAGE_SIZE > > > > > > partition alignment is so obvious and widespread that I don't count it as a > > > > > > realistic error case sysadmins would be pondering about currently. > > > > > > > > > > > > So I'd find two options reasonably consistent: > > > > > > 1) Keep status quo where partitions are created and support DAX. > > > > > > 2) Stop partition creation altogether, if anyones wants to split pmem > > > > > > device further, he can use dm-linear for that (i.e., kpartx). > > > > > > > > > > > > But I'm not sure if the ship hasn't already sailed for option 2) to be > > > > > > feasible without angry users and Linus reverting the change. > > > > > > > > > > Christoph? I feel myself leaning more and more to the "keep pmem > > > > > partitions" camp. > > > > > > > > > > I don't see "drop partition support" effort ending well given the long > > > > > standing "ext4 fails to mount when dax is not available" precedent. > > > > > > > > > > I think the next least bad option is to have a dax_get_by_host() > > > > > variant that passes an offset and length pair rather than requiring a > > > > > later bdev_dax_pgoff() to recall the offset. This also prevents > > > > > needing to add another dax-device object representation. > > > > > > > > I am wondering what's the conclusion on this. I want to this to make > > > > progress in some direction so that I can make progress on virtiofs DAX > > > > support. > > > > > > I think we should at least try to delete the partition support and see > > > if anyone screams. Have a module option to revert the behavior so > > > people are not stuck waiting for the revert to land, but if it stays > > > quiet then we're in a better place with that support pushed out of the > > > dax core. > > > > Hi Dan, > > > > So basically keep partition support code just that disable it by default > > and it is enabled by some knob say kernel command line option/module > > option. > > Yes. > > > At what point of time will we remove that code completely. I mean what > > if people scream after two kernel releases, after we have removed the > > code. > > I'd follow the typical timelines of Documentation/ABI/obsolete which > is a year or more. > > > > > Also, from distribution's perspective, we might not hear from our > > customers for a very long time (till we backport that code in to > > existing releases or release this new code in next major release). From > > that view point I will not like to break existing user visible behavior. > > > > How bad it is to keep partition support around. To me it feels reasonaly > > simple where we just have to store offset into dax device into another > > dax object: > > If we end up keeping partition support, we're not adding another object. > > > and pass that object around (instead of dax_device). If that's > > the case, I am not sure why to even venture into a direction where some > > user's setup might be broken. > > It was a mistake to support them. If that mistake can be undone > without breaking existing deployments the code base is better off > without the concept. > > > Also from an application perspective, /dev/pmem is a block device, so it > > should behave like a block device, (including kernel partition table support). > > From that view, dax looks like just an additional feature of that device > > which can be enabled by passing option "-o dax". > > dax via block devices was a crutch that we leaned on too heavily, and > the implementation has slowly been moving away from it ever since. > > > IOW, can we reconsider the idea of not supporting kernel partition tables > > for dax capable block devices. I can only see downsides of removing kernel > > partition table support and only upside seems to be little cleanup of dax > > core code. > > Can you help find end users that depend on it? I can't think of a real user at this point of time. Just that I am concerned, once the change goes in, somebody will get affected at later point of time and comes out complainig and this change will be seen as breaking user space and hence regression. > Even the Red Hat > installation guide example shows mounting on pmem0 directly. [1] Below that example it also says. "When creating partitions on a pmem device to be used for direct access, partitions must be aligned on page boundaries. On the Intel 64 and AMD64 architecture, at least 4KiB alignment for the start and end of the partition, but 2MiB is the preferred alignment. By default, the parted tool aligns partitions on 1MiB boundaries. For the first partition, specify 2MiB as the start of the partition. If the size of the partition is a multiple of 2MiB, all other partitions are also aligned." So documentation is clearly saying dax will work with partitions as well. And some user might decide to just do that. Thanks Vivek
On Wed, Jan 15, 2020 at 11:56 AM Vivek Goyal <vgoyal@redhat.com> wrote: [..] > > Even the Red Hat > > installation guide example shows mounting on pmem0 directly. [1] > > Below that example it also says. > > "When creating partitions on a pmem device to be used for direct access, > partitions must be aligned on page boundaries. On the Intel 64 and AMD64 > architecture, at least 4KiB alignment for the start and end of the > partition, but 2MiB is the preferred alignment. By default, the parted > tool aligns partitions on 1MiB boundaries. For the first partition, > specify 2MiB as the start of the partition. If the size of the partition > is a multiple of 2MiB, all other partitions are also aligned." > > So documentation is clearly saying dax will work with partitions as well. > And some user might decide to just do that. Yes, of course but my point is that it was ambiguous. I'm going to take a look at how hard it would be to develop a kpartx fallback in udev. If that can live across the driver transition then maybe this can be a non-event for end users that already have that udev update deployed.
Hi, Dan, Dan Williams <dan.j.williams@intel.com> writes: > I'm going to take a look at how hard it would be to develop a kpartx > fallback in udev. If that can live across the driver transition then > maybe this can be a non-event for end users that already have that > udev update deployed. I just wanted to remind you that label-less dimms still exist, and are still being shipped. For those devices, the only way to subdivide the storage is via partitioning. -Jeff
On Wed, Jan 15, 2020 at 1:08 PM Jeff Moyer <jmoyer@redhat.com> wrote: > > Hi, Dan, > > Dan Williams <dan.j.williams@intel.com> writes: > > > I'm going to take a look at how hard it would be to develop a kpartx > > fallback in udev. If that can live across the driver transition then > > maybe this can be a non-event for end users that already have that > > udev update deployed. > > I just wanted to remind you that label-less dimms still exist, and are > still being shipped. For those devices, the only way to subdivide the > storage is via partitioning. True, but if kpartx + udev can make this transparent then I don't think users lose any functionality. They just gain a device-mapper dependency.
On Thu, Jan 16, 2020 at 10:09:46AM -0800, Dan Williams wrote: > On Wed, Jan 15, 2020 at 1:08 PM Jeff Moyer <jmoyer@redhat.com> wrote: > > > > Hi, Dan, > > > > Dan Williams <dan.j.williams@intel.com> writes: > > > > > I'm going to take a look at how hard it would be to develop a kpartx > > > fallback in udev. If that can live across the driver transition then > > > maybe this can be a non-event for end users that already have that > > > udev update deployed. > > > > I just wanted to remind you that label-less dimms still exist, and are > > still being shipped. For those devices, the only way to subdivide the > > storage is via partitioning. > > True, but if kpartx + udev can make this transparent then I don't > think users lose any functionality. They just gain a device-mapper > dependency. So udev rules will trigger when a /dev/pmemX device shows up and run kpartx which in turn will create dm-linear devices and device nodes will show up in /dev/mapper/pmemXpY. IOW, /dev/pmemXpY device nodes will be gone. So if any of the scripts or systemd unit files are depenent on /dev/pmemXpY, these will still be broken out of the box and will have to be modified to use device nodes in /dev/mapper/ directory instead. Do I understand it right, Or I missed the idea completely. Vivek
On Thu, Jan 16, 2020 at 10:39 AM Vivek Goyal <vgoyal@redhat.com> wrote: > > On Thu, Jan 16, 2020 at 10:09:46AM -0800, Dan Williams wrote: > > On Wed, Jan 15, 2020 at 1:08 PM Jeff Moyer <jmoyer@redhat.com> wrote: > > > > > > Hi, Dan, > > > > > > Dan Williams <dan.j.williams@intel.com> writes: > > > > > > > I'm going to take a look at how hard it would be to develop a kpartx > > > > fallback in udev. If that can live across the driver transition then > > > > maybe this can be a non-event for end users that already have that > > > > udev update deployed. > > > > > > I just wanted to remind you that label-less dimms still exist, and are > > > still being shipped. For those devices, the only way to subdivide the > > > storage is via partitioning. > > > > True, but if kpartx + udev can make this transparent then I don't > > think users lose any functionality. They just gain a device-mapper > > dependency. > > So udev rules will trigger when a /dev/pmemX device shows up and run > kpartx which in turn will create dm-linear devices and device nodes > will show up in /dev/mapper/pmemXpY. > > IOW, /dev/pmemXpY device nodes will be gone. So if any of the scripts or > systemd unit files are depenent on /dev/pmemXpY, these will still be > broken out of the box and will have to be modified to use device nodes > in /dev/mapper/ directory instead. Do I understand it right, Or I missed > the idea completely. No, I'd write the udev rule to create links from /dev/pmemXpY to the /dev/mapper device, and that rule would be gated by a new pmem device attribute to trigger when kpartx needs to run vs the kernel native partitions.
On Thu, Jan 16, 2020 at 11:09:00AM -0800, Dan Williams wrote: [..] > > > True, but if kpartx + udev can make this transparent then I don't > > > think users lose any functionality. They just gain a device-mapper > > > dependency. > > > > So udev rules will trigger when a /dev/pmemX device shows up and run > > kpartx which in turn will create dm-linear devices and device nodes > > will show up in /dev/mapper/pmemXpY. > > > > IOW, /dev/pmemXpY device nodes will be gone. So if any of the scripts or > > systemd unit files are depenent on /dev/pmemXpY, these will still be > > broken out of the box and will have to be modified to use device nodes > > in /dev/mapper/ directory instead. Do I understand it right, Or I missed > > the idea completely. > > No, I'd write the udev rule to create links from /dev/pmemXpY to the > /dev/mapper device, and that rule would be gated by a new pmem device > attribute to trigger when kpartx needs to run vs the kernel native > partitions. Got it. This sounds much better. Vivek
On Thu, Jan 16, 2020 at 10:09:46AM -0800, Dan Williams wrote: > On Wed, Jan 15, 2020 at 1:08 PM Jeff Moyer <jmoyer@redhat.com> wrote: > > > > Hi, Dan, > > > > Dan Williams <dan.j.williams@intel.com> writes: > > > > > I'm going to take a look at how hard it would be to develop a kpartx > > > fallback in udev. If that can live across the driver transition then > > > maybe this can be a non-event for end users that already have that > > > udev update deployed. > > > > I just wanted to remind you that label-less dimms still exist, and are > > still being shipped. For those devices, the only way to subdivide the > > storage is via partitioning. > > True, but if kpartx + udev can make this transparent then I don't > think users lose any functionality. They just gain a device-mapper > dependency. Hi Dan, Are you planning to look into making this work? We can easily disable partition scanning by specifying gendisk GENHD_FL_NO_PART_SCAN flag. But what about partition additiona path, ioctl(BLKPG_ADD_PARTITION). That does not seem to do any checks whether block device supports in kernel partitions or not. So kernel partitions (hence /dev/pmemXpY) objects are created anyway and this will conflict with all the new planned udev rules. If you block ioctl(BLKPG_ADD_PARTITION), then user space tools like parted and fdisk started breaking when trying to create a partition on /dev/pmeme0. IIUC, we have to allow partition table creation on /dev/pmem0 so that later kpartx can parse it and create dm-linear partitions. Thanks Vivek
diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 26a654dbc69a..3cbc97f3e653 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -46,7 +46,8 @@ EXPORT_SYMBOL_GPL(dax_read_unlock); int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size, pgoff_t *pgoff) { - phys_addr_t phys_off = (get_start_sect(bdev) + sector) * 512; + sector_t start_sect = bdev ? get_start_sect(bdev) : 0; + phys_addr_t phys_off = (start_sect + sector) * 512; if (pgoff) *pgoff = PHYS_PFN(phys_off); diff --git a/fs/dax.c b/fs/dax.c index 6bf81f931de3..a11147bbaf9e 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -1046,7 +1046,12 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, static bool dax_range_is_aligned(struct block_device *bdev, unsigned int offset, unsigned int length) { - unsigned short sector_size = bdev_logical_block_size(bdev); + unsigned short sector_size; + + if (!bdev) + return false; + + sector_size = bdev_logical_block_size(bdev); if (!IS_ALIGNED(offset, sector_size)) return false;