| Message ID | 1454009704-25959-2-git-send-email-ross.zwisler@linux.intel.com (mailing list archive) |
|---|---|
| State | New, archived |
| Headers | show |
On Thu, Jan 28, 2016 at 11:35 AM, Ross Zwisler <ross.zwisler@linux.intel.com> wrote:
> There are a number of places in dax.c that look up the struct block_device
> associated with an inode.  Previously this was done by just using
> inode->i_sb->s_bdev.  This is correct for inodes that exist within the
> filesystems supported by DAX (ext2, ext4 & XFS), but when running DAX
> against raw block devices this value is NULL.  This causes NULL pointer
> dereferences when these block_device pointers are used.
>
> Instead, for raw block devices we need to look up the struct block_device
> using I_BDEV().  This patch fixes all the block_device lookups in dax.c so
> that they work properly for both filesystems and raw block devices.
>
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

It's a bit odd to check if it is a raw device inode in dax_clear_blocks(),
since there's no use case to clear blocks in that case, but I can't think
of a better alternative.

Acked-by: Dan Williams <dan.j.williams@intel.com>
On Thu, Jan 28, 2016 at 12:35:04PM -0700, Ross Zwisler wrote:
> There are a number of places in dax.c that look up the struct block_device
> associated with an inode.  Previously this was done by just using
> inode->i_sb->s_bdev.  This is correct for inodes that exist within the
> filesystems supported by DAX (ext2, ext4 & XFS), but when running DAX
> against raw block devices this value is NULL.  This causes NULL pointer
> dereferences when these block_device pointers are used.

It's also wrong for an XFS file system with a RT device..

> +#define DAX_BDEV(inode)	(S_ISBLK(inode->i_mode) ? I_BDEV(inode) \
> +		: inode->i_sb->s_bdev)

.. but this isn't going to fix it.  You must use a bdev returned by
get_blocks or a similar file system method.
On Thu, Jan 28, 2016 at 01:38:58PM -0800, Christoph Hellwig wrote:
> On Thu, Jan 28, 2016 at 12:35:04PM -0700, Ross Zwisler wrote:
> > There are a number of places in dax.c that look up the struct block_device
> > associated with an inode. [...]
>
> It's also wrong for an XFS file system with a RT device..
>
> > +#define DAX_BDEV(inode)	(S_ISBLK(inode->i_mode) ? I_BDEV(inode) \
> > +		: inode->i_sb->s_bdev)
>
> .. but this isn't going to fix it.  You must use a bdev returned by
> get_blocks or a similar file system method.

I guess I need to go off and understand if we can have DAX mappings on
such a device.  If we can, we may have a problem - we can get the
block_device from get_block() in the I/O path and the various fault
paths, but we don't have access to get_block() when flushing via
dax_writeback_mapping_range().  We avoid needing it in the normal case
by storing the sector results from get_block() in the radix tree.

/me is off to play with RT devices...
On Fri, Jan 29, 2016 at 11:28:15AM -0700, Ross Zwisler wrote:
> On Thu, Jan 28, 2016 at 01:38:58PM -0800, Christoph Hellwig wrote:
> > [...]
> > .. but this isn't going to fix it.  You must use a bdev returned by
> > get_blocks or a similar file system method.
>
> I guess I need to go off and understand if we can have DAX mappings on
> such a device. [...]
>
> /me is off to play with RT devices...

Well, RT devices are completely broken as far as I can see.  I've
reported the breakage to the XFS list.  Anything I do that triggers an
RT block allocation in XFS causes a lockdep splat + a kernel BUG - I've
tried regular pwrite(), xfs_rtcp and mmap() + write to address.  Not a
new bug either - it happens just the same with v4.4.  It happens with
both PMEM and BRD, and has no relationship to whether I'm using DAX or
not.

Does it work for this patch to go in as-is, since it fixes an immediate
OOPS with raw block devices + DAX, and when RT devices are alive again
I'll figure out how to make them work too?
On Fri, Jan 29, 2016 at 3:34 PM, Ross Zwisler <ross.zwisler@linux.intel.com> wrote:
> On Fri, Jan 29, 2016 at 11:28:15AM -0700, Ross Zwisler wrote:
>> [...]
>
> Well, RT devices are completely broken as far as I can see. [...]
>
> Does it work for this patch to go in as-is since it fixes an immediate OOPS
> with raw block devices + DAX, and when RT devices are alive again I'll figure
> out how to make them work too?

Can we step back and be clear about which lookups should be coming from
get_blocks()?  Which ones are critical vs ones we just opportunistically
look up for a debug print?  Right now xfs and ext4 are basically
disagreeing on whether get_blocks() reliably sets ->bh_bdev, and
checking for a raw block-device inode in dax_clear_blocks() does not
make sense.  So this all seems a bit confused.
On Fri, Jan 29, 2016 at 11:28:15AM -0700, Ross Zwisler wrote:
> I guess I need to go off and understand if we can have DAX mappings on
> such a device.  If we can, we may have a problem - we can get the
> block_device from get_block() in the I/O path and the various fault
> paths, but we don't have access to get_block() when flushing via
> dax_writeback_mapping_range().  We avoid needing it in the normal case
> by storing the sector results from get_block() in the radix tree.

I think we're doing it wrong by storing the sector in the radix tree;
we'd really need to store both the sector and the bdev, which is too
much data.

If we store the PFN of the underlying page instead, we don't have this
problem.  Instead, we have a different problem: the device going away
under us.  I'm trying to find the code which tears down PTEs when the
device goes away, and I'm not seeing it.  What do we do about user
mappings of the device?
On Fri, Jan 29, 2016 at 9:28 PM, Matthew Wilcox <willy@linux.intel.com> wrote:
> On Fri, Jan 29, 2016 at 11:28:15AM -0700, Ross Zwisler wrote:
>> [...]
>
> I think we're doing it wrong by storing the sector in the radix tree; we'd
> really need to store both the sector and the bdev which is too much data.
>
> If we store the PFN of the underlying page instead, we don't have this
> problem.  Instead, we have a different problem; of the device going
> away under us.  I'm trying to find the code which tears down PTEs when
> the device goes away, and I'm not seeing it.  What do we do about user
> mappings of the device?

I deferred the dax tear down code until next cycle as Al rightly pointed
out some needed re-works:

https://lists.01.org/pipermail/linux-nvdimm/2016-January/003995.html
On Fri, Jan 29, 2016 at 10:01 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Fri, Jan 29, 2016 at 9:28 PM, Matthew Wilcox <willy@linux.intel.com> wrote:
>> [...]
>> What do we do about user mappings of the device?
>
> I deferred the dax tear down code until next cycle as Al rightly
> pointed out some needed re-works:
>
> https://lists.01.org/pipermail/linux-nvdimm/2016-January/003995.html

If you store sectors in the radix tree and the device gets removed, you
still have to unmap user mappings of PFNs.  So why is device removal
harder with the PFN vs the bdev+sector radix entry?  Either way you need
a list of PFNs and their corresponding PTEs, right?

And are we just talking graceful removal?  Any plans for device
failures?
On Fri, Jan 29, 2016 at 10:01:13PM -0800, Dan Williams wrote:
> On Fri, Jan 29, 2016 at 9:28 PM, Matthew Wilcox <willy@linux.intel.com> wrote:
> > If we store the PFN of the underlying page instead, we don't have this
> > problem. [...]
>
> I deferred the dax tear down code until next cycle as Al rightly
> pointed out some needed re-works:
>
> https://lists.01.org/pipermail/linux-nvdimm/2016-January/003995.html

Thanks; I eventually found it in my email somewhere over the Pacific.

I did probably 70% of the work needed to switch the radix tree over to
storing PFNs instead of sectors.  It seems viable, though it's a big
change from where we are today:

 fs/dax.c                   | 415 +++++++++++++++++++++++----------------------
 include/linux/dax.h        |   3 +-
 include/linux/pfn_t.h      |  33 +++-
 include/linux/radix-tree.h |   9 -
 4 files changed, 236 insertions(+), 224 deletions(-)

I'll try and get that finished off this week.

One concrete and easily-separable piece is that dax_clear_blocks() has
the wrong signature.  It currently takes an inode & block as parameters;
it has no way of finding out the correct block device.  Its only two
callers are filesystems (ext2 and xfs).  Those filesystems should be
passing the block_device instead of the inode.  But without the inode,
we can't convert a block number to a sector number, so we also need to
pass the sector number, not the block number.  It still has type
sector_t, annoyingly.

@@ -63,12 +238,11 @@ static void dax_unmap_atomic(struct block_device *bdev,
  * and hence this means the stack from this point must follow GFP_NOFS
  * semantics for all operations.
  */
-int dax_clear_blocks(struct inode *inode, sector_t block, long _size)
+int dax_clear_blocks(struct block_device *bdev, sector_t sector, long size)
 {
-	struct block_device *bdev = inode->i_sb->s_bdev;
 	struct blk_dax_ctl dax = {
-		.sector = block << (inode->i_blkbits - 9),
-		.size = _size,
+		.sector = sector,
+		.size = size,
 	};

 	might_sleep();

but I haven't looked at doing the conversion of xfs or ext2 to use that
new interface.
> On Jan 30, 2016, at 7:32 PM, Matthew Wilcox <willy@linux.intel.com> wrote:
>
>> On Fri, Jan 29, 2016 at 10:01:13PM -0800, Dan Williams wrote:
>> [...]
>
> Thanks; I eventually found it in my email somewhere over the Pacific.
>
> I did probably 70% of the work needed to switch the radix tree over to
> storing PFNs instead of sectors.  It seems viable, though it's a big
> change from where we are today:

At one point I had kaddrs in the radix tree, so I could just pull the
addresses out and flush them.  That would save us a pfn -> kaddr
conversion before the flush.

Is there a reason to store pfns instead of kaddrs in the radix tree?

> [diffstat and dax_clear_blocks() signature change snipped]
>
> but I haven't looked at doing the conversion of xfs or ext2 to use that
> new interface.
On Sat, Jan 30, 2016 at 11:12:12PM -0700, Ross Zwisler wrote:
> > I did probably 70% of the work needed to switch the radix tree over to
> > storing PFNs instead of sectors.  It seems viable, though it's a big
> > change from where we are today:
>
> At one point I had kaddrs in the radix tree, so I could just pull the
> addresses out and flush them.  That would save us a pfn -> kaddr
> conversion before the flush.
>
> Is there a reason to store pfns instead of kaddrs in the radix tree?

Once ARM, MIPS and SPARC get supported, they're going to need temporary
kernel addresses assigned to PFNs rather than permanent ones.  Also,
it'll be easier for teardown to delete PFNs associated with a particular
device than kaddrs associated with a particular device.  And it lets us
support more persistent memory on a 32-bit machine (also on a 64-bit
machine, but that's mostly theoretical).

+/*
+ * DAX uses the 'exceptional' entries to store PFNs in the radix tree.
+ * Bit 0 is clear (the radix tree uses this for its own purposes).  Bit
+ * 1 is set (to indicate an exceptional entry).  Bits 2 & 3 are PFN_DEV
+ * and PFN_MAP.  The top two bits denote the size of the entry (PTE, PMD,
+ * PUD, one reserved).  That leaves us 26 bits on 32-bit systems and 58
+ * bits on 64-bit systems, able to address 256GB and 1024EB respectively.
+ */

It's also pretty cheap to look up the kaddr from the pfn, at least on
64-bit architectures without cache aliasing problems:

+static void *dax_map_pfn(pfn_t pfn, unsigned long index)
+{
+	preempt_disable();
+	pagefault_disable();
+	return pfn_to_kaddr(pfn_t_to_pfn(pfn));
+}
+
+static void dax_unmap_pfn(void *vaddr)
+{
+	pagefault_enable();
+	preempt_enable();
+}

32-bit x86 is going to want to do something similar to
iomap_atomic_prot_pfn().  ARM/SPARC/MIPS will want something in the
kmap_atomic family.
On Sun, Jan 31, 2016 at 2:55 AM, Matthew Wilcox <willy@linux.intel.com> wrote:
> On Sat, Jan 30, 2016 at 11:12:12PM -0700, Ross Zwisler wrote:
>> [...]
>> Is there a reason to store pfns instead of kaddrs in the radix tree?
>
> Once ARM, MIPS and SPARC get supported, they're going to need temporary
> kernel addresses assigned to PFNs rather than permanent ones. [...]
>
> +static void *dax_map_pfn(pfn_t pfn, unsigned long index)
> +{
> +	preempt_disable();
> +	pagefault_disable();
> +	return pfn_to_kaddr(pfn_t_to_pfn(pfn));

pfn_to_kaddr() assumes persistent memory is direct mapped, which is not
always the case.
On Sun, Jan 31, 2016 at 08:38:20AM -0800, Dan Williams wrote:
> On Sun, Jan 31, 2016 at 2:55 AM, Matthew Wilcox <willy@linux.intel.com> wrote:
> > [...]
> >
> > +static void *dax_map_pfn(pfn_t pfn, unsigned long index)
> > +{
> > +	preempt_disable();
> > +	pagefault_disable();
> > +	return pfn_to_kaddr(pfn_t_to_pfn(pfn));
>
> pfn_to_kaddr() assumes persistent memory is direct mapped which is not
> always the case.

Yes.  This is just the default implementation of dax_map_pfn() which
works for most situations.  We can introduce more complex
implementations of dax_map_pfn() as necessary.  You make another
excellent point for why we should store PFNs in the radix tree instead
of kaddrs :-)

One option that I've been looking at (primarily for x86-32) is having an
rbtree of PFN ranges that drivers add to when they register persistent
memory.  That would let us use the io_mapping_create_wc() /
io_mapping_map_atomic_wc() API.  But having great support for persistent
memory with 32-bit x86 kernels is very very low on my priority list.
On Sun, Jan 31, 2016 at 10:07 AM, Matthew Wilcox <willy@linux.intel.com> wrote:
> On Sun, Jan 31, 2016 at 08:38:20AM -0800, Dan Williams wrote:
>> [...]
>> pfn_to_kaddr() assumes persistent memory is direct mapped which is not
>> always the case.
>
> Yes.  This is just the default implementation of dax_map_pfn() which works
> for most situations.  We can introduce more complex implementations of
> dax_map_pfn() as necessary.  You make another excellent point for why
> we should store PFNs in the radix tree instead of kaddrs :-)

How much complexity do we want to add in support of an fsync/msync
mechanism that is not the recommended way to use DAX?
On Sun, Jan 31, 2016 at 10:18:46AM -0800, Dan Williams wrote:
> On Sun, Jan 31, 2016 at 10:07 AM, Matthew Wilcox <willy@linux.intel.com> wrote:
> > Yes.  This is just the default implementation of dax_map_pfn() which works
> > for most situations.  We can introduce more complex implementations of
> > dax_map_pfn() as necessary.
>
> How much complexity do we want to add in support of an fsync/msync
> mechanism that is not the recommended way to use DAX?

It actually makes the dax_io path much, much simpler.  And it's not
primarily about fixing fsync/msync.  It also makes the fault path
cheaper in the case where we're refaulting a page that's already been
faulted by another process (or was previously faulted by this process
and now needs to be faulted at a different address).

And it fixes the problem with filesystems that use multiple
block_devices.  It also makes DAX much less reliant on buffer heads,
which is good for the problem that Jared raised where he doesn't have a
block_device in an embedded system.
On Sun, Jan 31, 2016 at 10:27 AM, Matthew Wilcox <willy@linux.intel.com> wrote:
> On Sun, Jan 31, 2016 at 10:18:46AM -0800, Dan Williams wrote:
>> How much complexity do we want to add in support of an fsync/msync
>> mechanism that is not the recommended way to use DAX?
>
> It actually makes the dax_io path much, much simpler. [...]
>
> And it fixes the problem with filesystems that use multiple block_devices.
> It also makes DAX much less reliant on buffer heads, which is good for
> the problem that Jared raised where he doesn't have a block_device in
> an embedded system.

Oh, I thought we were talking about what goes in the radix.  Sure,
de-emphasizing the usage of a block_device throughout the dax
implementation is interesting.  It also has some synergy with the LSF/MM
topic I'm writing up, "pmem as storage device vs pmem as memory".
On Sun, Jan 31, 2016 at 10:07 AM, Matthew Wilcox <willy@linux.intel.com> wrote:
> One option that I've been looking at (primarily for x86-32) is
> having an rbtree of PFN ranges that drivers add to when they register
> persistent memory.

On this specific point, we do already have find_dev_pagemap().
On Fri, Jan 29, 2016 at 04:34:30PM -0700, Ross Zwisler wrote:
> [...]
>
> Well, RT devices are completely broken as far as I can see.  I've reported the
> breakage to the XFS list.  Anything I do that triggers a RT block allocation
> in XFS causes a lockdep splat + a kernel BUG - I've tried regular pwrite(),

Set CONFIG_XFS_DEBUG=n (an assert failure that can be ignored is causing
the BUG, and lockdep simply has an annotation problem) and it should
work.

Cheers,

Dave.
On Sat 30-01-16 00:28:33, Matthew Wilcox wrote:
> On Fri, Jan 29, 2016 at 11:28:15AM -0700, Ross Zwisler wrote:
>> [...]
>
> I think we're doing it wrong by storing the sector in the radix tree; we'd
> really need to store both the sector and the bdev which is too much data.
>
> If we store the PFN of the underlying page instead, we don't have this
> problem. [...]  What do we do about user mappings of the device?

So I don't have a strong opinion on whether storing the PFN or the
sector is better.  Maybe the PFN is somewhat more generic, but OTOH
turning DAX off for special cases like inodes on XFS RT devices would
IMHO be fine.

I'm somewhat concerned that there are several things in flight (the page
fault rework, invalidation on device removal, the issues with DAX access
to block devices Ross found), and this is IMHO the smallest trouble we
have, while changing it seems relatively invasive.  So could we settle
the fault code and similar stuff first and look into this somewhat
later?  Frankly, I have some trouble following how all the pieces are
going to fit together, and I'm afraid we'll introduce some non-trivial
bugs when several fundamental things are in flux in parallel.

Honza
On Mon, Feb 01, 2016 at 03:51:47PM +0100, Jan Kara wrote:
> On Sat 30-01-16 00:28:33, Matthew Wilcox wrote:
> > I think we're doing it wrong by storing the sector in the radix tree; we'd
> > really need to store both the sector and the bdev which is too much data.
>
> So I don't have a strong opinion whether storing PFN or sector is better.
> Maybe PFN is somewhat more generic but OTOH turning DAX off for special
> cases like inodes on XFS RT devices would be IMHO fine.

I'm not sure that's such a great idea.  RT devices might be the best way
to get aligned pages on storage.

> I'm somewhat concerned that there are several things in flight (page fault
> rework, invalidation on device removal, issues with DAX access to block
> devices Ross found) and this is IMHO the smallest trouble we have and
> changing this seems relatively invasive.

This isn't the minimal change to convert us from storing sectors to
storing PFNs.  This is a wholesale rework based around using the page
cache radix tree as the primary data structure instead of buffer heads.

> So could we settle the fault code and similar stuff first and look into
> this somewhat later?

We can, of course, do a much smaller patch that will use the radix tree
much less centrally.  And that might be the right way to go for 4.5.
With the extra couple of months we'll have, I hope that this redesign
can be the basis for the DAX code in 4.6.
On Mon, Feb 01, 2016 at 03:51:47PM +0100, Jan Kara wrote: > On Sat 30-01-16 00:28:33, Matthew Wilcox wrote: > > On Fri, Jan 29, 2016 at 11:28:15AM -0700, Ross Zwisler wrote: > > > I guess I need to go off and understand if we can have DAX mappings on such a > > > device. If we can, we may have a problem - we can get the block_device from > > > get_block() in I/O path and the various fault paths, but we don't have access > > > to get_block() when flushing via dax_writeback_mapping_range(). We avoid > > > needing it the normal case by storing the sector results from get_block() in > > > the radix tree. > > > > I think we're doing it wrong by storing the sector in the radix tree; we'd > > really need to store both the sector and the bdev which is too much data. > > > > If we store the PFN of the underlying page instead, we don't have this > > problem. Instead, we have a different problem; of the device going > > away under us. I'm trying to find the code which tears down PTEs when > > the device goes away, and I'm not seeing it. What do we do about user > > mappings of the device? > > So I don't have a strong opinion whether storing PFN or sector is better. > Maybe PFN is somewhat more generic but OTOH turning DAX off for special > cases like inodes on XFS RT devices would be IMHO fine. We need to support alternate devices. There is a strong case for using the XFS RT device with DAX, especially for applications that know they are going to always use large/huge/giant pages to access their data files. The XFS RT device can guarantee allocation is always aligned to large/huge/giant page constraints right up to ENOSPC and throughout the production life of the filesystem. We have no other filesystem capable of providing such guarantees, which means the XFS RT device is uniquely suited to certain applications with DAX...
> I'm somewhat concerned that there are several things in flight (page fault > rework, invalidation on device removal, issues with DAX access to block > devices Ross found) and this is IMHO the smallest trouble we have and changing > this seems relatively invasive. So could we settle the fault code and > similar stuff first and look into this somewhat later? Because frankly I > have some trouble following how all the pieces are going to fit together > and I'm afraid we'll introduce some non-trivial bugs when several > fundamental things are in flux in parallel. Yup, there's way too many balls in the air at the moment. Cheers, Dave.
On Thu, Jan 28, 2016 at 01:38:58PM -0800, Christoph Hellwig wrote: > On Thu, Jan 28, 2016 at 12:35:04PM -0700, Ross Zwisler wrote: > > There are a number of places in dax.c that look up the struct block_device > > associated with an inode. Previously this was done by just using > > inode->i_sb->s_bdev. This is correct for inodes that exist within the > > filesystems supported by DAX (ext2, ext4 & XFS), but when running DAX > > against raw block devices this value is NULL. This causes NULL pointer > > dereferences when these block_device pointers are used. > > It's also wrong for an XFS file system with a RT device.. > > > +#define DAX_BDEV(inode) (S_ISBLK(inode->i_mode) ? I_BDEV(inode) \ > > + : inode->i_sb->s_bdev) > > .. but this isn't going to fix it. You must use a bdev returned by > get_blocks or a similar file system method. Jan & Dave, Before I start in on a solution to this issue I just wanted to confirm that DAX can rely on the fact that the filesystem's get_block() call will reliably set bh->b_bdev for non-error returns. From this conversation between Jan & Dave: https://lkml.org/lkml/2016/1/7/723 " > No. The real problem is a long-standing abuse of struct buffer_head to be > used for passing block mapping information (it's on my todo list to remove > that at least from DAX code and use cleaner block mapping interface but > first I want basic DAX functionality to settle down to avoid unnecessary > conflicts). Filesystem is not supposed to touch bh->b_bdev. That has not been true for a long, long time. e.g. XFS always rewrites bh->b_bdev in get_blocks because the file may not reside on the primary block device of the filesystem. i.e.: /* * If this is a realtime file, data may be on a different device. * to that pointed to from the buffer_head b_bdev currently. */ bh_result->b_bdev = xfs_find_bdev_for_inode(inode); > If you need > that filled in, set it yourself in before passing bh to the block mapping > function. 
That may be true, but we cannot assume that the bdev coming back out of get_block is the same one that was passed in. " It sounds like this is always true for XFS, and from looking at the ext4 code I think this is true there as well because bh->b_bdev is set in ext4_dax_mmap_get_block() via map_bh(). Relying on the bh->b_bdev returned by get_block() is correct, yea?
On Mon, Feb 1, 2016 at 1:47 PM, Dave Chinner <david@fromorbit.com> wrote: > On Mon, Feb 01, 2016 at 03:51:47PM +0100, Jan Kara wrote: >> On Sat 30-01-16 00:28:33, Matthew Wilcox wrote: >> > On Fri, Jan 29, 2016 at 11:28:15AM -0700, Ross Zwisler wrote: >> > > I guess I need to go off and understand if we can have DAX mappings on such a >> > > device. If we can, we may have a problem - we can get the block_device from >> > > get_block() in I/O path and the various fault paths, but we don't have access >> > > to get_block() when flushing via dax_writeback_mapping_range(). We avoid >> > > needing it the normal case by storing the sector results from get_block() in >> > > the radix tree. >> > >> > I think we're doing it wrong by storing the sector in the radix tree; we'd >> > really need to store both the sector and the bdev which is too much data. >> > >> > If we store the PFN of the underlying page instead, we don't have this >> > problem. Instead, we have a different problem; of the device going >> > away under us. I'm trying to find the code which tears down PTEs when >> > the device goes away, and I'm not seeing it. What do we do about user >> > mappings of the device? >> >> So I don't have a strong opinion whether storing PFN or sector is better. >> Maybe PFN is somewhat more generic but OTOH turning DAX off for special >> cases like inodes on XFS RT devices would be IMHO fine. >> >> We need to support alternate devices. Using NOR Flash on embedded devices to free up RAM was historically one of the more prevalent real-world uses of the old filemap_xip.c code, although those users never made it to mainline. So I spent some time last week trying to figure out how to make a subset of DAX not depend on CONFIG_BLOCK. It was a very frustrating and unfruitful experience.
I discarded my main conclusion as impractical, but now that I see the difficulty DAX faces in dealing with "alternate devices" especially some of the crazy stuff btrfs can do, I wonder if it's not so crazy after all. Let's stop calling bdev_direct_access() directly from DAX. Let the filesystems do it. Sure we could enable a generic_dax_direct_access() helper for the filesystems that only support single devices to make it easy. But XFS and btrfs, for example, have to do the work of figuring out what bdev is required and then calling bdev_direct_access(). My reasoning is that the filesystem knows how to map inodes and offsets to devices and sectors, no matter how complex that is. It would even enable a filesystem to intelligently use a mix of direct_access and regular block devices down the road. Of course it would also make the block-less solution doable. Good idea? Stupid idea?
On Mon, Feb 1, 2016 at 10:06 PM, Jared Hulbert <jaredeh@gmail.com> wrote: > On Mon, Feb 1, 2016 at 1:47 PM, Dave Chinner <david@fromorbit.com> wrote: >> On Mon, Feb 01, 2016 at 03:51:47PM +0100, Jan Kara wrote: >>> On Sat 30-01-16 00:28:33, Matthew Wilcox wrote: >>> > On Fri, Jan 29, 2016 at 11:28:15AM -0700, Ross Zwisler wrote: >>> > > I guess I need to go off and understand if we can have DAX mappings on such a >>> > > device. If we can, we may have a problem - we can get the block_device from >>> > > get_block() in I/O path and the various fault paths, but we don't have access >>> > > to get_block() when flushing via dax_writeback_mapping_range(). We avoid >>> > > needing it the normal case by storing the sector results from get_block() in >>> > > the radix tree. >>> > >>> > I think we're doing it wrong by storing the sector in the radix tree; we'd >>> > really need to store both the sector and the bdev which is too much data. >>> > >>> > If we store the PFN of the underlying page instead, we don't have this >>> > problem. Instead, we have a different problem; of the device going >>> > away under us. I'm trying to find the code which tears down PTEs when >>> > the device goes away, and I'm not seeing it. What do we do about user >>> > mappings of the device? >>> >>> So I don't have a strong opinion whether storing PFN or sector is better. >>> Maybe PFN is somewhat more generic but OTOH turning DAX off for special >>> cases like inodes on XFS RT devices would be IMHO fine. >> >> We need to support alternate devices. > > Embedded devices trying to use NOR Flash to free up RAM was > historically one of the more prevalent real world uses of the old > filemap_xip.c code although the users never made it to mainline. So I > spent some time last week trying to figure out how to make a subset of > DAX not depend on CONFIG_BLOCK. It was a very frustrating and > unfruitful experience. 
I discarded my main conclusion as impractical, > but now that I see the difficultly DAX faces in dealing with > "alternate devices" especially some of the crazy stuff btrfs can do, I > wonder if it's not so crazy after all. > > Lets stop calling bdev_direct_access() directly from DAX. Let the > filesystems do it. > > Sure we could enable generic_dax_direct_access() helper for the > filesystems that only support single devices to make it easy. But XFS > and btrfs for example, have to do the work of figuring out what bdev > is required and then calling bdev_direct_access(). > > My reasoning is that the filesystem knows how to map inodes and > offsets to devices and sectors, no matter how complex that is. It > would even enable a filesystem to intelligently use a mix of > direct_access and regular block devices down the road. Of course it > would also make the block-less solution doable. > > Good idea? Stupid idea? The CONFIG_BLOCK=y case isn't going anywhere, so if anything it seems the CONFIG_BLOCK=n is an incremental feature in its own right. What driver and what filesystem are looking to enable this XIP support in?
On Mon, Feb 01, 2016 at 05:02:12PM -0700, Ross Zwisler wrote:
> Relying on the bh->b_bdev returned by get_block() is correct, yea?
IMO, yes.
Cheers,
Dave.
On Mon, Feb 1, 2016 at 10:46 PM, Dan Williams <dan.j.williams@intel.com> wrote: > On Mon, Feb 1, 2016 at 10:06 PM, Jared Hulbert <jaredeh@gmail.com> wrote: >> On Mon, Feb 1, 2016 at 1:47 PM, Dave Chinner <david@fromorbit.com> wrote: >>> On Mon, Feb 01, 2016 at 03:51:47PM +0100, Jan Kara wrote: >>>> On Sat 30-01-16 00:28:33, Matthew Wilcox wrote: >>>> > On Fri, Jan 29, 2016 at 11:28:15AM -0700, Ross Zwisler wrote: >>>> > > I guess I need to go off and understand if we can have DAX mappings on such a >>>> > > device. If we can, we may have a problem - we can get the block_device from >>>> > > get_block() in I/O path and the various fault paths, but we don't have access >>>> > > to get_block() when flushing via dax_writeback_mapping_range(). We avoid >>>> > > needing it the normal case by storing the sector results from get_block() in >>>> > > the radix tree. >>>> > >>>> > I think we're doing it wrong by storing the sector in the radix tree; we'd >>>> > really need to store both the sector and the bdev which is too much data. >>>> > >>>> > If we store the PFN of the underlying page instead, we don't have this >>>> > problem. Instead, we have a different problem; of the device going >>>> > away under us. I'm trying to find the code which tears down PTEs when >>>> > the device goes away, and I'm not seeing it. What do we do about user >>>> > mappings of the device? >>>> >>>> So I don't have a strong opinion whether storing PFN or sector is better. >>>> Maybe PFN is somewhat more generic but OTOH turning DAX off for special >>>> cases like inodes on XFS RT devices would be IMHO fine. >>> >>> We need to support alternate devices. >> >> Embedded devices trying to use NOR Flash to free up RAM was >> historically one of the more prevalent real world uses of the old >> filemap_xip.c code although the users never made it to mainline. So I >> spent some time last week trying to figure out how to make a subset of >> DAX not depend on CONFIG_BLOCK. 
It was a very frustrating and >> unfruitful experience. I discarded my main conclusion as impractical, >> but now that I see the difficultly DAX faces in dealing with >> "alternate devices" especially some of the crazy stuff btrfs can do, I >> wonder if it's not so crazy after all. >> >> Lets stop calling bdev_direct_access() directly from DAX. Let the >> filesystems do it. >> >> Sure we could enable generic_dax_direct_access() helper for the >> filesystems that only support single devices to make it easy. But XFS >> and btrfs for example, have to do the work of figuring out what bdev >> is required and then calling bdev_direct_access(). >> >> My reasoning is that the filesystem knows how to map inodes and >> offsets to devices and sectors, no matter how complex that is. It >> would even enable a filesystem to intelligently use a mix of >> direct_access and regular block devices down the road. Of course it >> would also make the block-less solution doable. >> >> Good idea? Stupid idea? > > The CONFIG_BLOCK=y case isn't going anywhere, so if anything it seems > the CONFIG_BLOCK=n is an incremental feature in its own right. What > driver and what filesystem are looking to enable this XIP support in? Well... CONFIG_BLOCK was not required with filemap_xip.c for a decade, so this CONFIG_BLOCK dependency is a result of an incremental feature, from a certain point of view ;) The obvious 'driver' is physical RAM without a particular driver. Remember please I'm talking about embedded. RAM measured in MiB and funky one-off hardware etc. In the embedded world there are lots of device-specific ways that persistent memory has been supported without the new fancypants NFIT and Intel instructions, so frankly they don't fit in the PMEM stuff. Maybe they could be supported in PMEM but not without effort to bring embedded players to the table. The other drivers are the MTD drivers, probably as read-only for now.
But the paradigm there isn't so different from what PMEM looks like with asymmetric read/write capabilities. The filesystem I'm concerned with is AXFS (https://www.kernel.org/doc/ols/2008/ols2008v1-pages-211-218.pdf), which I've been planning on trying to merge again due to a recent resurgence of interest. The device model for AXFS is... weird. It can use one or two devices at a time of any mix of NOR MTD, NAND MTD, block, and unmanaged physical memory. It's a terribly useful model for embedded. Anyway, AXFS is read-only, so hacking in a read-only dax_fault_nodev() and dax_file_read() would work fine; it looks easy enough. But... it would be cool if similar small embedded-focused RW filesystems were enabled. I don't expect you to taint DAX with design requirements for this stuff that it wasn't built for; nobody ends up happy in that case. However, if enabling the filesystem to manage the bdev_direct_access() interactions solves some of the "alternate device" problems you are discussing here, then there is a chance we can accommodate both. Sometimes that works. So... forget CONFIG_BLOCK=n entirely; I didn't want that to be the focus anyway. Does it help to support the weirder XFS and btrfs device models to enable the filesystem to handle the bdev_direct_access() stuff?
On Mon 01-02-16 17:02:12, Ross Zwisler wrote: > On Thu, Jan 28, 2016 at 01:38:58PM -0800, Christoph Hellwig wrote: > > On Thu, Jan 28, 2016 at 12:35:04PM -0700, Ross Zwisler wrote: > > > There are a number of places in dax.c that look up the struct block_device > > > associated with an inode. Previously this was done by just using > > > inode->i_sb->s_bdev. This is correct for inodes that exist within the > > > filesystems supported by DAX (ext2, ext4 & XFS), but when running DAX > > > against raw block devices this value is NULL. This causes NULL pointer > > > dereferences when these block_device pointers are used. > > > > It's also wrong for an XFS file system with a RT device.. > > > > > +#define DAX_BDEV(inode) (S_ISBLK(inode->i_mode) ? I_BDEV(inode) \ > > > + : inode->i_sb->s_bdev) > > > > .. but this isn't going to fix it. You must use a bdev returned by > > get_blocks or a similar file system method. > > Jan & Dave, > > Before I start in on a solution to this issue I just wanted to confirm that > DAX can rely on the fact that the filesystem's get_block() call will reliably > set bh->b_bdev for non-error returns. From this conversation between Jan & > Dave: > > https://lkml.org/lkml/2016/1/7/723 > > " > > No. The real problem is a long-standing abuse of struct buffer_head to be > > used for passing block mapping information (it's on my todo list to remove > > that at least from DAX code and use cleaner block mapping interface but > > first I want basic DAX functionality to settle down to avoid unnecessary > > conflicts). Filesystem is not supposed to touch bh->b_bdev. > > That has not been true for a long, long time. e.g. XFS always > rewrites bh->b_bdev in get_blocks because the file may not reside on > the primary block device of the filesystem. i.e.: > > /* > * If this is a realtime file, data may be on a different device. > * to that pointed to from the buffer_head b_bdev currently. 
> */ > bh_result->b_bdev = xfs_find_bdev_for_inode(inode); > > If you need > > that filled in, set it yourself in before passing bh to the block mapping > > function. > > That may be true, but we cannot assume that the bdev coming back > out of get_block is the same one that was passed in. > " > > It sounds like this is always true for XFS, and from looking at the ext4 code > I think this is true there as well because bh->b_bdev is set in > ext4_dax_mmap_get_block() via map_bh(). > > Relying on the bh->b_bdev returned by get_block() is correct, yea? Yeah, sorry, I was confused. If the result is a mapped block (i.e. return value of get_block callback is > 0), ext4 also sets bh->b_bdev via map_bh() as you correctly point out. If the result is a hole or error, ext4 doesn't set bh->b_bdev at all. So you can rely on bh->b_bdev. Honza
On Tue 02-02-16 08:47:30, Dave Chinner wrote: > On Mon, Feb 01, 2016 at 03:51:47PM +0100, Jan Kara wrote: > > On Sat 30-01-16 00:28:33, Matthew Wilcox wrote: > > > On Fri, Jan 29, 2016 at 11:28:15AM -0700, Ross Zwisler wrote: > > > > I guess I need to go off and understand if we can have DAX mappings on such a > > > > device. If we can, we may have a problem - we can get the block_device from > > > > get_block() in I/O path and the various fault paths, but we don't have access > > > > to get_block() when flushing via dax_writeback_mapping_range(). We avoid > > > > needing it the normal case by storing the sector results from get_block() in > > > > the radix tree. > > > > > > I think we're doing it wrong by storing the sector in the radix tree; we'd > > > really need to store both the sector and the bdev which is too much data. > > > > > > If we store the PFN of the underlying page instead, we don't have this > > > problem. Instead, we have a different problem; of the device going > > > away under us. I'm trying to find the code which tears down PTEs when > > > the device goes away, and I'm not seeing it. What do we do about user > > > mappings of the device? > > > > So I don't have a strong opinion whether storing PFN or sector is better. > > Maybe PFN is somewhat more generic but OTOH turning DAX off for special > > cases like inodes on XFS RT devices would be IMHO fine. > > We need to support alternate devices. > > There is a strong case for using the XFS RT device with DAX, > especially for applications that know they are going to always use > large/huge/giant pages to access their data files. The XFS RT device > can guarantee allocation is always aligned to large/huge/giant page > constraints right up to ENOSPC and throughout the production life of > the filesystem. We have no other filesystem capable of providing > such guarantees, which means the XFS RT device is uniquely suited to > certain aplications with DAX... I see, thanks for explanation. 
So I'm OK with changing what is stored in the radix tree to accommodate this use case, but my reservation that we IMHO have other more pressing things to fix remains... Honza
On Tue, Feb 2, 2016 at 3:17 AM, Jan Kara <jack@suse.cz> wrote: [..] > I see, thanks for explanation. So I'm OK with changing what is stored in > the radix tree to accommodate this use case but my reservation that we IHMO > have other more pressing things to fix remains... We don't need pfns in the radix to support XFS RT configurations. Just call get_blocks() again and use the sector, or am I missing something?
On Tue 02-02-16 08:33:56, Dan Williams wrote: > On Tue, Feb 2, 2016 at 3:17 AM, Jan Kara <jack@suse.cz> wrote: > [..] > > I see, thanks for explanation. So I'm OK with changing what is stored in > > the radix tree to accommodate this use case but my reservation that we IHMO > > have other more pressing things to fix remains... > > We don't need pfns in the radix to support XFS RT configurations. > Just call get_blocks() again and use the sector, or am I missing > something? You are correct. But if you decide to pay the cost of an additional get_block() call, you only need the dirty tag in the radix tree and nothing else. So my understanding was that the whole point of the games with the radix tree is avoiding these extra get_block() calls for fsync(). Honza
On Tue, Feb 2, 2016 at 12:05 AM, Jared Hulbert <jaredeh@gmail.com> wrote: [..] > Well... as CONFIG_BLOCK was not required with filemap_xip.c for a > decade. This CONFIG_BLOCK dependency is a result of an incremental > feature from a certain point of view ;) > > The obvious 'driver' is physical RAM without a particular driver. > Remember please I'm talking about embedded. RAM measured in MiB and > funky one off hardware etc. In the embedded world there are lots of > ways that persistent memory has been supported in device specific ways > without the new fancypants NFIT and Intel instructions,so frankly > they don't fit in the PMEM stuff. Maybe they could be supported in > PMEM but not without effort to bring embedded players to the table. Not sure what you're trying to say here. An ACPI NFIT only feeds the generic libnvdimm device model. You don't need NFIT to get pmem. > The other drivers are the MTD drivers, probably as read-only for now. > But the paradigm there isn't so different from what PMEM looks like > with asymmetric read/write capabilities. > > The filesystem I'm concerned with is AXFS > (https://www.kernel.org/doc/ols/2008/ols2008v1-pages-211-218.pdf). > Which I've been planning on trying to merge again due to a recent > resurgence of interest. The device model for AXFS is... weird. It > can use one or two devices at a time of any mix of NOR MTD, NAND MTD, > block, and unmanaged physical memory. It's a terribly useful model > for embedded. Anyway AXFS is readonly so hacking in a read only > dax_fault_nodev() and dax_file_read() would work fine, looks easy > enough. But... it would be cool if similar small embedded focused RW > filesystems were enabled. Are those also out of tree? > I don't expect you to taint DAX with design requirements for this > stuff that it wasn't built for, nobody ends up happy in that case. 
> However, if enabling the filesystem to manage the bdev_direct_access() > interactions solves some of the "alternate device" problems you are > discussing here, then there is a chance we can accommodate both. > Sometimes that works. > > So... Forget CONFIG_BLOCK=n entirely I didn't want that to be the > focus anyway. Does it help to support the weirder XFS and btrfs > device models to enable the filesystem to handle the > bdev_direct_access() stuff? It's not clear that it does. We just clarified with xfs and ext4 that we can rely on get_blocks(). That solves the immediate concern with multi-device filesystems.
On Tue, Feb 2, 2016 at 8:46 AM, Jan Kara <jack@suse.cz> wrote: > On Tue 02-02-16 08:33:56, Dan Williams wrote: >> On Tue, Feb 2, 2016 at 3:17 AM, Jan Kara <jack@suse.cz> wrote: >> [..] >> > I see, thanks for explanation. So I'm OK with changing what is stored in >> > the radix tree to accommodate this use case but my reservation that we IHMO >> > have other more pressing things to fix remains... >> >> We don't need pfns in the radix to support XFS RT configurations. >> Just call get_blocks() again and use the sector, or am I missing >> something? > > You are correct. But if you decide to pay the cost of additional > get_block() call, you only need the dirty tag in the radix tree and nothing > else. So my understanding was that the whole point of games with radix tree > is avoiding this extra get_block() calls for fsync(). > DAX-fsync() is already a potentially expensive operation to cover data durability guarantees for DAX-unaware applications. A DAX-aware application is going to skip fsync, and the get_blocks() cost, to do cache management itself. Willy pointed out some other potential benefits, assuming a suitable replacement for the protections afforded by the block-device percpu_ref counter can be found. However, optimizing for the DAX-unaware-application case seems the wrong motivation to me.
On Tue, Feb 02, 2016 at 09:10:24AM -0800, Dan Williams wrote: > On Tue, Feb 2, 2016 at 8:46 AM, Jan Kara <jack@suse.cz> wrote: > > On Tue 02-02-16 08:33:56, Dan Williams wrote: > >> On Tue, Feb 2, 2016 at 3:17 AM, Jan Kara <jack@suse.cz> wrote: > >> [..] > >> > I see, thanks for explanation. So I'm OK with changing what is stored in > >> > the radix tree to accommodate this use case but my reservation that we IHMO > >> > have other more pressing things to fix remains... > >> > >> We don't need pfns in the radix to support XFS RT configurations. > >> Just call get_blocks() again and use the sector, or am I missing > >> something? > > > > You are correct. But if you decide to pay the cost of additional > > get_block() call, you only need the dirty tag in the radix tree and nothing > > else. So my understanding was that the whole point of games with radix tree > > is avoiding this extra get_block() calls for fsync(). > > > > DAX-fsync() is already a potentially expensive operation to cover data > durability guarantees for DAX-unaware applications. A DAX-aware > application is going to skip fsync, and the get_blocks() cost, to do > cache management itself. > > Willy pointed out some other potential benefits, assuming a suitable > replacement for the protections afforded by the block-device > percpu_ref counter can be found. However, optimizing for the > DAX-unaware-application case seems the wrong motivation to me. Oh, no, the primary issue with calling get_block() in the fsync path isn't performance. It's that we don't have any idea what get_block() function to call. The fault handler calls all come from the filesystem directly, so they can easily give us an appropriate get_block() function pointer. But the dax_writeback_mapping_range() calls come from the generic code in mm/filemap.c, and don't know what get_block() to pass in. 
During one iteration I had the calls to dax_writeback_mapping_range() happening in the filesystem fsync code so that it could pass in get_block(), but Dave Chinner pointed out that this misses other paths in the filesystem that need to have things flushed via a call to filemap_write_and_wait_range(). In yet another iteration of this series I tried adding get_block() to struct inode_operations so that I could access it from what is now dax_writeback_mapping_range(), but this was shot down as well.
On Tue, Feb 2, 2016 at 9:34 AM, Ross Zwisler <ross.zwisler@linux.intel.com> wrote: > On Tue, Feb 02, 2016 at 09:10:24AM -0800, Dan Williams wrote: >> On Tue, Feb 2, 2016 at 8:46 AM, Jan Kara <jack@suse.cz> wrote: >> > On Tue 02-02-16 08:33:56, Dan Williams wrote: >> >> On Tue, Feb 2, 2016 at 3:17 AM, Jan Kara <jack@suse.cz> wrote: >> >> [..] >> >> > I see, thanks for explanation. So I'm OK with changing what is stored in >> >> > the radix tree to accommodate this use case but my reservation that we IHMO >> >> > have other more pressing things to fix remains... >> >> >> >> We don't need pfns in the radix to support XFS RT configurations. >> >> Just call get_blocks() again and use the sector, or am I missing >> >> something? >> > >> > You are correct. But if you decide to pay the cost of additional >> > get_block() call, you only need the dirty tag in the radix tree and nothing >> > else. So my understanding was that the whole point of games with radix tree >> > is avoiding this extra get_block() calls for fsync(). >> > >> >> DAX-fsync() is already a potentially expensive operation to cover data >> durability guarantees for DAX-unaware applications. A DAX-aware >> application is going to skip fsync, and the get_blocks() cost, to do >> cache management itself. >> >> Willy pointed out some other potential benefits, assuming a suitable >> replacement for the protections afforded by the block-device >> percpu_ref counter can be found. However, optimizing for the >> DAX-unaware-application case seems the wrong motivation to me. > > Oh, no, the primary issue with calling get_block() in the fsync path isn't > performance. It's that we don't have any idea what get_block() function to > call. > > The fault handler calls all come from the filesystem directly, so they can > easily give us an appropriate get_block() function pointer. But the > dax_writeback_mapping_range() calls come from the generic code in > mm/filemap.c, and don't know what get_block() to pass in. 
> > During one iteration I had the calls to dax_writeback_mapping_range() > happening in the filesystem fsync code so that it could pass in get_block(), > but Dave Chinner pointed out that this misses other paths in the filesystem > that need to have things flushed via a call to filemap_write_and_wait_range(). > > In yet another iteration of this series I tried adding get_block() to struct > inode_operations so that I could access it from what is now > dax_writeback_mapping_range(), but this was shot down as well. Ugh, and we can't trigger it from where a filesystem normally syncs a block device, because we lose track of the inode radix information at that level. What about a super_operation? That seems the right level, given we're currently doing: inode->i_sb->s_bdev ...it does not seem terrible to instead do: inode->i_sb->s_op->get_block()
On Tue, Feb 2, 2016 at 9:46 AM, Dan Williams <dan.j.williams@intel.com> wrote: > On Tue, Feb 2, 2016 at 9:34 AM, Ross Zwisler > <ross.zwisler@linux.intel.com> wrote: >> On Tue, Feb 02, 2016 at 09:10:24AM -0800, Dan Williams wrote: >>> On Tue, Feb 2, 2016 at 8:46 AM, Jan Kara <jack@suse.cz> wrote: >>> > On Tue 02-02-16 08:33:56, Dan Williams wrote: >>> >> On Tue, Feb 2, 2016 at 3:17 AM, Jan Kara <jack@suse.cz> wrote: >>> >> [..] >>> >> > I see, thanks for explanation. So I'm OK with changing what is stored in >>> >> > the radix tree to accommodate this use case but my reservation that we IHMO >>> >> > have other more pressing things to fix remains... >>> >> >>> >> We don't need pfns in the radix to support XFS RT configurations. >>> >> Just call get_blocks() again and use the sector, or am I missing >>> >> something? >>> > >>> > You are correct. But if you decide to pay the cost of additional >>> > get_block() call, you only need the dirty tag in the radix tree and nothing >>> > else. So my understanding was that the whole point of games with radix tree >>> > is avoiding this extra get_block() calls for fsync(). >>> > >>> >>> DAX-fsync() is already a potentially expensive operation to cover data >>> durability guarantees for DAX-unaware applications. A DAX-aware >>> application is going to skip fsync, and the get_blocks() cost, to do >>> cache management itself. >>> >>> Willy pointed out some other potential benefits, assuming a suitable >>> replacement for the protections afforded by the block-device >>> percpu_ref counter can be found. However, optimizing for the >>> DAX-unaware-application case seems the wrong motivation to me. >> >> Oh, no, the primary issue with calling get_block() in the fsync path isn't >> performance. It's that we don't have any idea what get_block() function to >> call. >> >> The fault handler calls all come from the filesystem directly, so they can >> easily give us an appropriate get_block() function pointer. 
But the >> dax_writeback_mapping_range() calls come from the generic code in >> mm/filemap.c, and don't know what get_block() to pass in. >> >> During one iteration I had the calls to dax_writeback_mapping_range() >> happening in the filesystem fsync code so that it could pass in get_block(), >> but Dave Chinner pointed out that this misses other paths in the filesystem >> that need to have things flushed via a call to filemap_write_and_wait_range(). >> >> In yet another iteration of this series I tried adding get_block() to struct >> inode_operations so that I could access it from what is now >> dax_writeback_mapping_range(), but this was shot down as well. > > Ugh, and we can't trigger it from where a filesystem normally syncs a > block device, becauDid you tryse we lose track of the inode radix [ sorry, copy paste error ] block device, because we lose track of the inode radix > information at that level. > > What a about a super_operation? That seems the right level, given > we're currently doing: > > inode->i_sb->s_bdev > > ...it does not seem terrible to instead do: > > inode->i_sb->s_op->get_block()
On Tue, Feb 02, 2016 at 09:47:37AM -0800, Dan Williams wrote: > On Tue, Feb 2, 2016 at 9:46 AM, Dan Williams <dan.j.williams@intel.com> wrote: > > On Tue, Feb 2, 2016 at 9:34 AM, Ross Zwisler > > <ross.zwisler@linux.intel.com> wrote: > >> On Tue, Feb 02, 2016 at 09:10:24AM -0800, Dan Williams wrote: > >>> On Tue, Feb 2, 2016 at 8:46 AM, Jan Kara <jack@suse.cz> wrote: > >>> > On Tue 02-02-16 08:33:56, Dan Williams wrote: > >>> >> On Tue, Feb 2, 2016 at 3:17 AM, Jan Kara <jack@suse.cz> wrote: > >>> >> [..] > >>> >> > I see, thanks for explanation. So I'm OK with changing what is stored in > >>> >> > the radix tree to accommodate this use case but my reservation that we IHMO > >>> >> > have other more pressing things to fix remains... > >>> >> > >>> >> We don't need pfns in the radix to support XFS RT configurations. > >>> >> Just call get_blocks() again and use the sector, or am I missing > >>> >> something? > >>> > > >>> > You are correct. But if you decide to pay the cost of additional > >>> > get_block() call, you only need the dirty tag in the radix tree and nothing > >>> > else. So my understanding was that the whole point of games with radix tree > >>> > is avoiding this extra get_block() calls for fsync(). > >>> > > >>> > >>> DAX-fsync() is already a potentially expensive operation to cover data > >>> durability guarantees for DAX-unaware applications. A DAX-aware > >>> application is going to skip fsync, and the get_blocks() cost, to do > >>> cache management itself. > >>> > >>> Willy pointed out some other potential benefits, assuming a suitable > >>> replacement for the protections afforded by the block-device > >>> percpu_ref counter can be found. However, optimizing for the > >>> DAX-unaware-application case seems the wrong motivation to me. > >> > >> Oh, no, the primary issue with calling get_block() in the fsync path isn't > >> performance. It's that we don't have any idea what get_block() function to > >> call. 
> >> > >> The fault handler calls all come from the filesystem directly, so they can > >> easily give us an appropriate get_block() function pointer. But the > >> dax_writeback_mapping_range() calls come from the generic code in > >> mm/filemap.c, and don't know what get_block() to pass in. > >> > >> During one iteration I had the calls to dax_writeback_mapping_range() > >> happening in the filesystem fsync code so that it could pass in get_block(), > >> but Dave Chinner pointed out that this misses other paths in the filesystem > >> that need to have things flushed via a call to filemap_write_and_wait_range(). > >> > >> In yet another iteration of this series I tried adding get_block() to struct > >> inode_operations so that I could access it from what is now > >> dax_writeback_mapping_range(), but this was shot down as well. > > > > Ugh, and we can't trigger it from where a filesystem normally syncs a > > block device, becauDid you tryse we lose track of the inode radix > > [ sorry, copy paste error ] > > block device, because we lose track of the inode radix > > > information at that level. > > > > What a about a super_operation? That seems the right level, given > > we're currently doing: > > > > inode->i_sb->s_bdev > > > > ...it does not seem terrible to instead do: > > > > inode->i_sb->s_op->get_block() This seems promising. I'll try and code it up.
On Tue, Feb 02, 2016 at 12:17:23PM +0100, Jan Kara wrote: > On Tue 02-02-16 08:47:30, Dave Chinner wrote: > > On Mon, Feb 01, 2016 at 03:51:47PM +0100, Jan Kara wrote: > > > On Sat 30-01-16 00:28:33, Matthew Wilcox wrote: > > > > On Fri, Jan 29, 2016 at 11:28:15AM -0700, Ross Zwisler wrote: > > > > > I guess I need to go off and understand if we can have DAX mappings on such a > > > > > device. If we can, we may have a problem - we can get the block_device from > > > > > get_block() in I/O path and the various fault paths, but we don't have access > > > > > to get_block() when flushing via dax_writeback_mapping_range(). We avoid > > > > > needing it the normal case by storing the sector results from get_block() in > > > > > the radix tree. > > > > > > > > I think we're doing it wrong by storing the sector in the radix tree; we'd > > > > really need to store both the sector and the bdev which is too much data. > > > > > > > > If we store the PFN of the underlying page instead, we don't have this > > > > problem. Instead, we have a different problem; of the device going > > > > away under us. I'm trying to find the code which tears down PTEs when > > > > the device goes away, and I'm not seeing it. What do we do about user > > > > mappings of the device? > > > > > > So I don't have a strong opinion whether storing PFN or sector is better. > > > Maybe PFN is somewhat more generic but OTOH turning DAX off for special > > > cases like inodes on XFS RT devices would be IMHO fine. > > > > We need to support alternate devices. > > > > There is a strong case for using the XFS RT device with DAX, > > especially for applications that know they are going to always use > > large/huge/giant pages to access their data files. The XFS RT device > > can guarantee allocation is always aligned to large/huge/giant page > > constraints right up to ENOSPC and throughout the production life of > > the filesystem. 
We have no other filesystem capable of providing > > such guarantees, which means the XFS RT device is uniquely suited to > > certain aplications with DAX... > > I see, thanks for explanation. So I'm OK with changing what is stored in > the radix tree to accommodate this use case but my reservation that we IHMO > have other more pressing things to fix remains... IMO this is pretty pressing - without it neither XFS RT devices nor DAX raw block devices work. The case has been made above for XFS RT devices, and with DAX raw block devices we really need a fix because the current code will cause a kernel BUG when a user tries to fsync/msync a raw block device mmap(). This is especially bad because, unlike with filesystems where you mount with the dax mount option, there is no opt-in step for raw block devices. This has to be fixed - it seems like we either figure out how to fix DAX fsync, or we have to disable DAX on raw block devices for a kernel cycle. I'm hoping for the former. :)
On Tue, Feb 02, 2016 at 09:46:21AM -0800, Dan Williams wrote: > What a about a super_operation? That seems the right level, given > we're currently doing: > > inode->i_sb->s_bdev > > ...it does not seem terrible to instead do: > > inode->i_sb->s_op->get_block() The point is that filesystems have lots of different get_block operations, and the right one to use depends not just on the inode, but also upon what VFS function is being called, and in some filesystems the phase of the moon, or the file open flags (so even inode->i_ops->get_block is wrong; file->f_ops->get_block would be better, but of course we've lost that by the point we're doing writeback). I now realise that basing DAX around get_block & buffer_heads was a mistake. I think the Right Solution (not for 4.5) is to ask filesystems to populate the radix tree. A flow somewhat like this: 1. VFS or VM calls filesystem (eg ->fault()) 2. Filesystem calls DAX (eg dax_fault()) 3. DAX looks in radix tree, finds no information. 4. DAX calls (NEW!) mapping->a_ops->populate_pfns 5. Filesystem looks up its internal data structure (eg extent tree) and calls dax_create_pfns() (see giant patch from yesterday, only instead of passing a get_block_t, the filesystem has already filled in a bh which describes the entire extent that this access happens to land in). 6. DAX continues to take care of calling bdev_direct_access() from dax_create_pfns(). After we have that step done, we can look at what it would take to avoid calling bdev_direct_access for non-block-based filesystems. That looks to me like just calling dax_add_pfn_entries() from their ->populate_pfns implementation. And we introduce a CONFIG_BLOCK ifdef around dax_create_pfns(), dax_clear_blocks() and dax_zero_page_range(). Or maybe modify dax_zero_page_range() to use the radix tree as above, since it's probably a useful helper function. 
Once we have buffer_head usage confined to a fairly small part of DAX, we can look at replacing it with a more appropriate data structure with better-defined contents.
On Tue, Feb 02, 2016 at 11:41:34AM -0700, Ross Zwisler wrote: > On Tue, Feb 02, 2016 at 12:17:23PM +0100, Jan Kara wrote: > > On Tue 02-02-16 08:47:30, Dave Chinner wrote: > > > On Mon, Feb 01, 2016 at 03:51:47PM +0100, Jan Kara wrote: > > > > On Sat 30-01-16 00:28:33, Matthew Wilcox wrote: > > > > > On Fri, Jan 29, 2016 at 11:28:15AM -0700, Ross Zwisler wrote: > > > > > > I guess I need to go off and understand if we can have DAX mappings on such a > > > > > > device. If we can, we may have a problem - we can get the block_device from > > > > > > get_block() in I/O path and the various fault paths, but we don't have access > > > > > > to get_block() when flushing via dax_writeback_mapping_range(). We avoid > > > > > > needing it the normal case by storing the sector results from get_block() in > > > > > > the radix tree. > > > > > > > > > > I think we're doing it wrong by storing the sector in the radix tree; we'd > > > > > really need to store both the sector and the bdev which is too much data. > > > > > > > > > > If we store the PFN of the underlying page instead, we don't have this > > > > > problem. Instead, we have a different problem; of the device going > > > > > away under us. I'm trying to find the code which tears down PTEs when > > > > > the device goes away, and I'm not seeing it. What do we do about user > > > > > mappings of the device? > > > > > > > > So I don't have a strong opinion whether storing PFN or sector is better. > > > > Maybe PFN is somewhat more generic but OTOH turning DAX off for special > > > > cases like inodes on XFS RT devices would be IMHO fine. > > > > > > We need to support alternate devices. > > > > > > There is a strong case for using the XFS RT device with DAX, > > > especially for applications that know they are going to always use > > > large/huge/giant pages to access their data files. 
The XFS RT device > > > can guarantee allocation is always aligned to large/huge/giant page > > > constraints right up to ENOSPC and throughout the production life of > > > the filesystem. We have no other filesystem capable of providing > > > such guarantees, which means the XFS RT device is uniquely suited to > > > certain aplications with DAX... > > > > I see, thanks for explanation. So I'm OK with changing what is stored in > > the radix tree to accommodate this use case but my reservation that we IHMO > > have other more pressing things to fix remains... > > IMO this is pretty pressing - without it neither XFS RT devices nor DAX raw > block devices work. The case has been made above for XFS RT devices, and with > DAX raw block devices we really need a fix because the current code will cause > a kernel BUG when a user tries to fsync/msync a raw block device mmap(). This > is especially bad because, unlike with filesystems where you mount with the > dax mount option, there is no opt-in step for raw block devices. > > This has to be fixed - it seems like we either figure out how to fix DAX > fsync, or we have to disable DAX on raw block devices for a kernel cycle. I'm > hoping for the former. :) Well, I guess a third option would be to keep DAX raw block device support in and just take this patch as a temporary fix: https://lkml.org/lkml/2016/1/28/679 This would leave XFS RT broken, though, so we may want to explicitly disable DAX + XFS RT configs for now, but at least we wouldn't have the raw block device kernel BUG.
On Tue, Feb 2, 2016 at 10:46 AM, Matthew Wilcox <willy@linux.intel.com> wrote: > On Tue, Feb 02, 2016 at 09:46:21AM -0800, Dan Williams wrote: >> What a about a super_operation? That seems the right level, given >> we're currently doing: >> >> inode->i_sb->s_bdev >> >> ...it does not seem terrible to instead do: >> >> inode->i_sb->s_op->get_block() > > The point is that filesystems have lots of different get_block operations, > and the right one to use depends not just on the inode, but also upon > what VFS function is being called, and in some filesystems the phase > of the moon, or the file open flags (so even inode->i_ops->get_block is > wrong; file->f_ops->get_block would be better, but of course we've lost > that by the point we're doing writeback). True, but in this case we're just trying to resolve the bdev for an inode / sector combination to already-allocated storage. So get_block() is a misnomer; this is just get_bdev() to resolve a super_block-inode+sector tuple to a bdev for cases when s_bdev is the wrong answer.
On Tue, Feb 02, 2016 at 10:59:41AM -0800, Dan Williams wrote: > True, but in this case we're just trying to resolve the bdev for a > inode / sector combination to already allocated storage. So > get_block() is a misnomer, this is just get_bdev() to resolve a > super_block-inode+sector tuple to a bdev for cases when s_bdev is the > wrong answer. Then let's call it that. i_ops->get_bdev(struct inode *inode, pgoff_t index) And if it doesn't exist, use i_sb->s_bdev.
On Tue, Feb 2, 2016 at 8:51 AM, Dan Williams <dan.j.williams@intel.com> wrote: > On Tue, Feb 2, 2016 at 12:05 AM, Jared Hulbert <jaredeh@gmail.com> wrote: > [..] >> Well... as CONFIG_BLOCK was not required with filemap_xip.c for a >> decade. This CONFIG_BLOCK dependency is a result of an incremental >> feature from a certain point of view ;) >> >> The obvious 'driver' is physical RAM without a particular driver. >> Remember please I'm talking about embedded. RAM measured in MiB and >> funky one off hardware etc. In the embedded world there are lots of >> ways that persistent memory has been supported in device specific ways >> without the new fancypants NFIT and Intel instructions,so frankly >> they don't fit in the PMEM stuff. Maybe they could be supported in >> PMEM but not without effort to bring embedded players to the table. > > Not sure what you're trying to say here. An ACPI NFIT only feeds the > generic libnvdimm device model. You don't need NFIT to get pmem. Right... I'm just not seeing how the libnvdimm device model fits, is relevant, or useful to a persistent SRAM in embedded. Therefore I don't see how some of these users would have a driver. >> The other drivers are the MTD drivers, probably as read-only for now. >> But the paradigm there isn't so different from what PMEM looks like >> with asymmetric read/write capabilities. >> >> The filesystem I'm concerned with is AXFS >> (https://www.kernel.org/doc/ols/2008/ols2008v1-pages-211-218.pdf). >> Which I've been planning on trying to merge again due to a recent >> resurgence of interest. The device model for AXFS is... weird. It >> can use one or two devices at a time of any mix of NOR MTD, NAND MTD, >> block, and unmanaged physical memory. It's a terribly useful model >> for embedded. Anyway AXFS is readonly so hacking in a read only >> dax_fault_nodev() and dax_file_read() would work fine, looks easy >> enough. But... it would be cool if similar small embedded focused RW >> filesystems were enabled.
> > Are those also out of tree? Of course. Merging embedded filesystems is a little like merging regular filesystems, except 98% of the reviewers don't want it merged. >> I don't expect you to taint DAX with design requirements for this >> stuff that it wasn't built for, nobody ends up happy in that case. >> However, if enabling the filesystem to manage the bdev_direct_access() >> interactions solves some of the "alternate device" problems you are >> discussing here, then there is a chance we can accommodate both. >> Sometimes that works. >> >> So... Forget CONFIG_BLOCK=n entirely I didn't want that to be the >> focus anyway. Does it help to support the weirder XFS and btrfs >> device models to enable the filesystem to handle the >> bdev_direct_access() stuff? > > It's not clear that it does. We just clarified with xfs and ext4 that > we can rely on get_blocks(). That solves the immediate concern with > multi-device filesystems. IMO you're making DAX more complex by overly coupling to the bdev and I think it could bite you later. I submit this rework of the radix tree and confusion about where to get the real bdev as evidence. I'm guessing that it won't be the last time. It's unnecessary to couple it like this, and in fact is not how the vfs has been layered in the past. The trouble with vfs work has been that it straddles the line between mm and block; unfortunately that line is a dark chasm with ill-defined boundaries. DAX is even more exciting because it's trying to duct tape the filesystem even closer to the mm system; one could argue it's actually in some respects enabling the filesystem to bypass the mm code. On top of that DAX is designed to enable block-based filesystems to use RAM-like devices. Bolting the block device interface on to NVDIMM is a brilliant hack and the right design choice, but it's still a hack.
The upside is it enables the reuse of all this glorious legacy filesystem code which does a pretty amazing job of handling what the pmem device applications need considering they were designed to manage data on platters of slow spinning rust. What would DAX look like if developed with a filesystem purpose-built for pmem? To look at the downside, consider dax_fault(). It's called on a fault to a user memory map and uses the filesystem's get_block() to look up a sector so you can ask a block device to convert it to an address on a DIMM. Come on, that's awkward. Everything around dax_fault() is dripping with memory semantic interfaces, the dax_fault() calls are fundamentally about memory, the pmem calls are memory, the hardware is memory, and yet it directly calls bdev_direct_access(). It's out of place. The legacy vfs/mm code didn't have this layering problem either. Even filemap_fault() that dax_fault() is modeled after doesn't call any bdev methods directly; when it needs something it asks the filesystem with a ->readpage(). The precedent is that you ask the filesystem for what you need. Look at the get_bdev() thing you've concluded you need. It _almost_ makes my point. I just happen to be of the opinion that you don't actually want or need the bdev, you want the pfn/kaddr so you can flush or map or memcpy().
On Tue, Feb 02, 2016 at 01:46:06PM -0800, Jared Hulbert wrote: > On Tue, Feb 2, 2016 at 8:51 AM, Dan Williams <dan.j.williams@intel.com> wrote: > >> The filesystem I'm concerned with is AXFS > >> (https://www.kernel.org/doc/ols/2008/ols2008v1-pages-211-218.pdf). > >> Which I've been planning on trying to merge again due to a recent > >> resurgence of interest. The device model for AXFS is... weird. It > >> can use one or two devices at a time of any mix of NOR MTD, NAND MTD, > >> block, and unmanaged physical memory. It's a terribly useful model > >> for embedded. Anyway AXFS is readonly so hacking in a read only > >> dax_fault_nodev() and dax_file_read() would work fine, looks easy > >> enough. But... it would be cool if similar small embedded focused RW > >> filesystems were enabled. > > > > Are those also out of tree? > > Of course. Merging embedded filesystems is little merging regular > filesystems except 98% of you reviewers don't want it merged. You should at least be able to get it into staging these days. I mean, look at some of the junk that's in staging ... and I don't think AXFS was nearly as bad. > IMO you're making DAX more complex by overly coupling to the bdev and > I think it could bite you later. I submit this rework of the radix > tree and confusion about where to get the real bdev as evidence. I'm > guessing that it won't be the last time. It's unnecessary to couple > it like this, and in fact is not how the vfs has been layered in the > past. Huh? The rework to use the radix tree for PFNs was done with one eye firmly on your usage case. Just because I had to thread the get_block interface through it for the moment doesn't mean that I didn't have the "how do we get rid of get_block entirely" question on my mind. Using get_block seemed like the right idea three years ago. I didn't know just how fundamentally ext4 and XFS disagree on how it should be used. > To look at the the downside consider dax_fault(). 
Its called on a > fault to a user memory map, uses the filesystems get_block() to lookup > a sector so you can ask a block device to convert it to an address on > a DIMM. Come on, that's awkward. Everything around dax_fault() is > dripping with memory semantic interfaces, the dax_fault() call are > fundamentally about memory, the pmem calls are memory, the hardware is > memory, and yet it directly calls bdev_direct_access(). It's out of > place. What was out of place was the old 'get_xip_mem' in address_space operations. Returning a kernel virtual address and a PFN from a filesystem operation? That looks awful. All the other operations deal in struct pages, file offsets and occasionally sectors. Of course, we don't have a struct page, so a pfn makes sense, but the kernel virtual address being returned was a gargantuan layering problem. > The legacy vfs/mm code didn't have this layering problem either. Even > filemap_fault() that dax_fault() is modeled after doesn't call any > bdev methods directly, when it needs something it asks the filesystem > with a ->readpage(). The precedence is that you ask the filesystem > for what you need. Look at the get_bdev() thing you've concluded you > need. It _almost_ makes my point. I just happen to be of the opinion > that you don't actually want or need the bdev, you want the pfn/kaddr > so you can flush or map or memcpy(). You want the pfn. The device driver doesn't have enough information to give you a (coherent with userspace) kaddr. That's what (some future arch-specific implementation of) dax_map_pfn() is for. That's why it takes 'index' as a parameter, so you can calculate where it'll be mapped in userspace, and determine an appropriate kernel virtual address to use for it.
On Tue, Feb 2, 2016 at 4:34 PM, Matthew Wilcox <willy@linux.intel.com> wrote: > On Tue, Feb 02, 2016 at 01:46:06PM -0800, Jared Hulbert wrote: >> On Tue, Feb 2, 2016 at 8:51 AM, Dan Williams <dan.j.williams@intel.com> wrote: >> >> The filesystem I'm concerned with is AXFS >> >> (https://www.kernel.org/doc/ols/2008/ols2008v1-pages-211-218.pdf). >> >> Which I've been planning on trying to merge again due to a recent >> >> resurgence of interest. The device model for AXFS is... weird. It >> >> can use one or two devices at a time of any mix of NOR MTD, NAND MTD, >> >> block, and unmanaged physical memory. It's a terribly useful model >> >> for embedded. Anyway AXFS is readonly so hacking in a read only >> >> dax_fault_nodev() and dax_file_read() would work fine, looks easy >> >> enough. But... it would be cool if similar small embedded focused RW >> >> filesystems were enabled. >> > >> > Are those also out of tree? >> >> Of course. Merging embedded filesystems is little merging regular >> filesystems except 98% of you reviewers don't want it merged. > > You should at least be able to get it into staging these days. I mean, > look at some of the junk that's in staging ... and I don't think AXFS was > nearly as bad. Thanks....? ;) >> IMO you're making DAX more complex by overly coupling to the bdev and >> I think it could bite you later. I submit this rework of the radix >> tree and confusion about where to get the real bdev as evidence. I'm >> guessing that it won't be the last time. It's unnecessary to couple >> it like this, and in fact is not how the vfs has been layered in the >> past. > > Huh? The rework to use the radix tree for PFNs was done with one eye > firmly on your usage case. Just because I had to thread the get_block > interface through it for the moment doesn't mean that I didn't have > the "how do we get rid of get_block entirely" question on my mind. Oh yeah. I think we're on the same page. But I'm not sure Dan is. I get the need to phase this in too. 
> Using get_block seemed like the right idea three years ago. I didn't > know just how fundamentally ext4 and XFS disagree on how it should be > used. Sure. I can see that. >> To look at the the downside consider dax_fault(). Its called on a >> fault to a user memory map, uses the filesystems get_block() to lookup >> a sector so you can ask a block device to convert it to an address on >> a DIMM. Come on, that's awkward. Everything around dax_fault() is >> dripping with memory semantic interfaces, the dax_fault() call are >> fundamentally about memory, the pmem calls are memory, the hardware is >> memory, and yet it directly calls bdev_direct_access(). It's out of >> place. > > What was out of place was the old 'get_xip_mem' in address_space > operations. Returning a kernel virtual address and a PFN from a > filesystem operation? That looks awful. Yes. Yes it does! But at least my big hack was just one line. ;) Nobody really even seemed to notice at the time. > All the other operations deal > in struct pages, file offsets and occasionally sectors. Of course, we > don't have a struct page, so a pfn makes sense, but the kernel virtual > address being returned was a gargantuan layering problem. Well yes, but it was an expedient hack. >> The legacy vfs/mm code didn't have this layering problem either. Even >> filemap_fault() that dax_fault() is modeled after doesn't call any >> bdev methods directly, when it needs something it asks the filesystem >> with a ->readpage(). The precedence is that you ask the filesystem >> for what you need. Look at the get_bdev() thing you've concluded you >> need. It _almost_ makes my point. I just happen to be of the opinion >> that you don't actually want or need the bdev, you want the pfn/kaddr >> so you can flush or map or memcpy(). > > You want the pfn. The device driver doesn't have enough information to > give you a (coherent with userspace) kaddr. That's what (some future > arch-specific implementation of) dax_map_pfn() is for. 
That's why it > takes 'index' as a parameter, so you can calculate where it'll be mapped > in userspace, and determine an appropriate kernel virtual address to > use for it. Oh.... I think I'm just beginning to catch your vision for dax_map_pfn(). I still don't get why we can't just do semi-arch specific flushing instead of the alignment thing. But that just might be epic ignorance on my part. Either way flush or magic alignments dax_(un)map_pfn() would handle it, right?
On Tue 02-02-16 10:34:56, Ross Zwisler wrote: > On Tue, Feb 02, 2016 at 09:10:24AM -0800, Dan Williams wrote: > > On Tue, Feb 2, 2016 at 8:46 AM, Jan Kara <jack@suse.cz> wrote: > > > On Tue 02-02-16 08:33:56, Dan Williams wrote: > > >> On Tue, Feb 2, 2016 at 3:17 AM, Jan Kara <jack@suse.cz> wrote: > > >> [..] > > >> > I see, thanks for explanation. So I'm OK with changing what is stored in > > >> > the radix tree to accommodate this use case but my reservation that we IHMO > > >> > have other more pressing things to fix remains... > > >> > > >> We don't need pfns in the radix to support XFS RT configurations. > > >> Just call get_blocks() again and use the sector, or am I missing > > >> something? > > > > > > You are correct. But if you decide to pay the cost of additional > > > get_block() call, you only need the dirty tag in the radix tree and nothing > > > else. So my understanding was that the whole point of games with radix tree > > > is avoiding this extra get_block() calls for fsync(). > > > > > > > DAX-fsync() is already a potentially expensive operation to cover data > > durability guarantees for DAX-unaware applications. A DAX-aware > > application is going to skip fsync, and the get_blocks() cost, to do > > cache management itself. > > > > Willy pointed out some other potential benefits, assuming a suitable > > replacement for the protections afforded by the block-device > > percpu_ref counter can be found. However, optimizing for the > > DAX-unaware-application case seems the wrong motivation to me. > > Oh, no, the primary issue with calling get_block() in the fsync path isn't > performance. It's that we don't have any idea what get_block() function to > call. > > The fault handler calls all come from the filesystem directly, so they can > easily give us an appropriate get_block() function pointer. But the > dax_writeback_mapping_range() calls come from the generic code in > mm/filemap.c, and don't know what get_block() to pass in. 
> > During one iteration I had the calls to dax_writeback_mapping_range() > happening in the filesystem fsync code so that it could pass in get_block(), > but Dave Chinner pointed out that this misses other paths in the filesystem > that need to have things flushed via a call to filemap_write_and_wait_range(). Let's clear this up a bit: The problem with using ->fsync() method is that it doesn't get called for sync(2). We could use ->sync_fs() to flush caches in case of sync(2) (that's what's happening for normal storage) but the problem with PMEM is that "flush all cached data" operation effectively means iterate through all modified pages and we didn't want to implement this for DAX fsync code. So we have decided to do cache flushing for DAX at a different point - mark inodes which may have writes cached as dirty and use writeback code for the cache flushing. But looking at it now, we have actually chosen a wrong place to do the flushing in the writeback path - note that sync(2) writes data via __writeback_single_inode() -> do_writepages() and thus doesn't even get to filemap_write_and_wait(). So revisiting the decision I see two options: 1) Move the DAX flushing code from filemap_write_and_wait() into ->writepages() fs callback. There the filesystem can provide all the information it needs including bdev, get_block callback, or whatever. 2) Back out even further and implement own tracking and iteration of inodes to write. So far I still think 2) is not worth the complexity (although it would bring DAX code closer to how things behave with standard storage) so I would go for 1). Honza
On Tue 02-02-16 13:46:43, Matthew Wilcox wrote: > On Tue, Feb 02, 2016 at 09:46:21AM -0800, Dan Williams wrote: > > What a about a super_operation? That seems the right level, given > > we're currently doing: > > > > inode->i_sb->s_bdev > > > > ...it does not seem terrible to instead do: > > > > inode->i_sb->s_op->get_block() > > The point is that filesystems have lots of different get_block operations, > and the right one to use depends not just on the inode, but also upon > what VFS function is being called, and in some filesystems the phase > of the moon, or the file open flags (so even inode->i_ops->get_block is > wrong; file->f_ops->get_block would be better, but of course we've lost > that by the point we're doing writeback). See what I wrote to Ross. I think this particular issue needs to be solved by moving the flushing to the ->writepages() callback. > I now realise that basing DAX around get_block & buffer_heads was a mistake. > I think the Right Solution (not for 4.5) is to ask filesystems to populate > the radix tree. A flow somewhat like this: > > 1. VFS or VM calls filesystem (eg ->fault()) > 2. Filesystem calls DAX (eg dax_fault()) > 3. DAX looks in radix tree, finds no information. > 4. DAX calls (NEW!) mapping->a_ops->populate_pfns > 5. Filesystem looks up its internal data structure (eg extent tree) and > calls dax_create_pfns() (see giant patch from yesterday, only instead of > passing a get_block_t, the filesystem has already filled in a bh which > describes the entire extent that this access happens to land in). > 6. DAX continues to take care of calling bdev_direct_access() from > dax_create_pfns(). So I don't think that ->populate_pfns() is the right interface because it doesn't really tell the filesystem what you want to do. It is essentially like the get_blocks() callback, only you additionally ask the fs to fill in the mapping information in the radix tree. 
So it has the same problems as get_blocks() callback in inode_operations (or superblock_operations, aops, or anywhere else). History has taught us (there was get_blocks() callback in inode operations in ancient times ;) that fs really needs to know wider context to decide how exactly to fulfill the request. I don't see anything obviously wrong with using radix tree as a primary source of mapping information for DAX (after all we do that for page cache as well where the mapping information is attached to pages in the radix tree). But this seems independent of the get_blocks() vs something else discussion. And if your problem is with vaguely defined meaning of buffer_head flags returned from get_blocks() callback, using the iomap interface (which XFS currently uses for pNFS) would solve that. Honza
On Wed, Feb 03, 2016 at 11:46:11AM +0100, Jan Kara wrote: > On Tue 02-02-16 10:34:56, Ross Zwisler wrote: <> > > Oh, no, the primary issue with calling get_block() in the fsync path isn't > > performance. It's that we don't have any idea what get_block() function to > > call. > > > > The fault handler calls all come from the filesystem directly, so they can > > easily give us an appropriate get_block() function pointer. But the > > dax_writeback_mapping_range() calls come from the generic code in > > mm/filemap.c, and don't know what get_block() to pass in. > > > > During one iteration I had the calls to dax_writeback_mapping_range() > > happening in the filesystem fsync code so that it could pass in get_block(), > > but Dave Chinner pointed out that this misses other paths in the filesystem > > that need to have things flushed via a call to filemap_write_and_wait_range(). > > Let's clear this up a bit: The problem with using ->fsync() method is that > it doesn't get called for sync(2). Ugh, yep. We need to correct this. > We could use ->sync_fs() to flush caches > in case of sync(2) (that's what's happening for normal storage) but the > problem with PMEM is that "flush all cached data" operation effectively > means iterate through all modified pages and we didn't want to implement > this for DAX fsync code. I'm not sure what you mean by this - this is exactly what we implemented for the DAX fsync code? We iterate through all dirty pages, as recorded in the radix tree, and flush them to media. It seems like we could and should do the same thing for sync(2) and syncfs()? > So we have decided to do cache flushing for DAX at a different point - mark > inodes which may have writes cached as dirty and use writeback code for the > cache flushing. 
But looking at it now, we have actually chosen a wrong > place to do the flushing in the writeback path - note that sync(2) writes > data via __writeback_single_inode() -> do_writepages() and thus doesn't > even get to filemap_write_and_wait(). True - but I think we have to basically add another case for sync() regardless of what we do, since AFAICS the fsync() and sync() paths never intersect. So, if we call into the DAX code at the filesystem level we'll need to do it in both ->fsync and ->sync_fs/->writepages, and if we do it in common MM code we'll need to do it in filemap_write_and_wait_range() and do_writepages() or similar. Here is the comment from Dave Chinner that had me move to having the calls to dax_writeback_mapping_range() into the generic mm code: > Lastly, this flushing really needs to be inside > filemap_write_and_wait_range(), because we call the writeback code > from many more places than just fsync to ensure ordering of various > operations such that files are in known state before proceeding > (e.g. hole punch). https://lkml.org/lkml/2015/11/16/847 Both ext4 & XFS call filemap_write_and_wait_range() outside of ->fsync for hole punch, truncate, and block relocation (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()). I think having DAX special case code sprinkled over all these call sites seems like an indication that we've made a bad choice, and that we should centralize in the mm layer with filemap_write_and_wait_range() and do_writepages(). OTOH, I think that it might be true that we don't actually need to cover all these non-fsync/sync cases. In the page cache case when we have dirty data in the page cache, that data will be actively lost if we evict a dirty page cache page without flushing it to media first. 
For DAX, though, the data will remain consistent with the physical address to which it was written regardless of whether it's in the processor cache or not - really the only reason I see to flush is in response to a fsync/msync/sync/syncfs so that our data is durable on media in case of a power loss. The case where we could throw dirty data out of the page cache and essentially lose writes simply doesn't exist. Does this sound correct, or am I missing something? > So revisiting the decision I see two options: > > 1) Move the DAX flushing code from filemap_write_and_wait() into > ->writepages() fs callback. There the filesystem can provide all the > information it needs including bdev, get_block callback, or whatever. This seems fine as long as we add it to ->fsync as well since ->writepages is never called in that path, and as long as we are okay with skipping DAX writebacks on hole punch, truncate, and block relocation. > 2) Back out even further and implement own tracking and iteration of inodes > to write. > > So far I still think 2) is not worth the complexity (although it would > bring DAX code closer to how things behave with standard storage) so I > would go for 1).
On Wed 03-02-16 13:13:28, Ross Zwisler wrote: > On Wed, Feb 03, 2016 at 11:46:11AM +0100, Jan Kara wrote: > > On Tue 02-02-16 10:34:56, Ross Zwisler wrote: > <> > > > Oh, no, the primary issue with calling get_block() in the fsync path isn't > > > performance. It's that we don't have any idea what get_block() function to > > > call. > > > > > > The fault handler calls all come from the filesystem directly, so they can > > > easily give us an appropriate get_block() function pointer. But the > > > dax_writeback_mapping_range() calls come from the generic code in > > > mm/filemap.c, and don't know what get_block() to pass in. > > > > > > During one iteration I had the calls to dax_writeback_mapping_range() > > > happening in the filesystem fsync code so that it could pass in get_block(), > > > but Dave Chinner pointed out that this misses other paths in the filesystem > > > that need to have things flushed via a call to filemap_write_and_wait_range(). > > > > Let's clear this up a bit: The problem with using ->fsync() method is that > > it doesn't get called for sync(2). > > Ugh, yep. We need to correct this. > > > We could use ->sync_fs() to flush caches > > in case of sync(2) (that's what's happening for normal storage) but the > > problem with PMEM is that "flush all cached data" operation effectively > > means iterate through all modified pages and we didn't want to implement > > this for DAX fsync code. > > I'm not sure what you mean by this - this is exactly what we implemented for > the DAX fsync code? We iterate through all dirty pages, as recorded in the > radix tree, and flush them to media. By "iterate through all modified pages" I mean "iterate through all inodes with modified pages and for each inode iterate through all modified pages". 
The "iterate through all inodes" part is actually the one which adds to complexity - either you resort to iterating over all inodes attached to sb->s_inodes_list which is unnecessarily slow (there can be milions of inodes there and you may need to flush only a couple of them) or you have to somehow track which inodes need flushing and so you have another inode list to manage. By marking inodes that need flushing dirty and treating indexes to flush as dirty pages, we can use all the writeback & dirty inode tracking machinery to track and iterate through dirty inodes for us. > It seems like we could and should do the same thing for sync(2) and syncfs()? > > > So we have decided to do cache flushing for DAX at a different point - mark > > inodes which may have writes cached as dirty and use writeback code for the > > cache flushing. But looking at it now, we have actually chosen a wrong > > place to do the flushing in the writeback path - note that sync(2) writes > > data via __writeback_single_inode() -> do_writepages() and thus doesn't > > even get to filemap_write_and_wait(). > > True - but I think we have to basically add another case for sync() > regardless of what we do, since AFAICS the fsync() and sync() paths never > intersect. So, if we call into the DAX code at the filesystem level > we'll need to do it in both ->fsync and ->sync_fs/->writepages, and if we > do it in common MM code we'll need to do it in > filemap_write_and_wait_range() and do_writepages() or similar. Both fsync(2) and sync(2) paths end up in do_writepages() and consequently ->writepages() callback. That's why I suggested moving calls to DAX code into ->writepages(). The only trouble is that sometimes we have a performance optimization checks for (mapping->nrpages > 0) which get in the way. 
We need to update those to: (mapping->nrpages > 0 || (IS_DAX(inode) && mapping->nrexceptional > 0)) Probably we should introduce a helper function for this check so that people adding new checks won't forget about DAX. > Here is the comment from Dave Chinner that had me move to having the calls to > dax_writeback_mapping_range() into the generic mm code: > > > Lastly, this flushing really needs to be inside > > filemap_write_and_wait_range(), because we call the writeback code > > from many more places than just fsync to ensure ordering of various > > operations such that files are in known state before proceeding > > (e.g. hole punch). > > https://lkml.org/lkml/2015/11/16/847 > > Both ext4 & XFS call filemap_write_and_wait_range() outside of ->fsync for > hole punch, truncate, and block relocation (xfs_shift_file_space() && > ext4_collapse_range()/ext4_insert_range()). > > I think having DAX special case code sprinkled over all these call sites seems > like an indication that we've made a bad choice, and that we should centralize > in the mm layer with filemap_write_and_wait_range() and do_writepages(). > > OTOH, I think that it might be true that we don't actually need to cover all > these non-fsync/sync cases. In the page cache case when we have dirty data in > the page cache, that data will be actively lost if we evict a dirty page cache > page without flushing it to media first. For DAX, though, the data will > remain consistent with the physical address to which it was written regardless > of whether it's in the processor cache or not - really the only reason I see > to flush is in response to a fsync/msync/sync/syncfs so that our data is > durable on media in case of a power loss. The case where we could throw dirty > data out of the page cache and essentially lose writes simply doesn't exist. > > Does this sound correct, or am I missing something? Yes, you are right. filemap_write_and_wait_range() actually doesn't guarantee data durability. 
That function only means all dirty data has been sent to storage and the storage has acknowledged them. This is noop for PMEM. So we are perfectly fine ignoring calls to filemap_write_and_wait_range(). What guarantees data durability are only ->fsync() and ->sync_fs() calls. But some code could get upset by seeing that filemap_write_and_wait_range() didn't actually get rid of dirty pages (in some special cases like inode eviction or similar). That's why I'd choose one of the two options for consistency: 1) Treat inode indexes to flush as close to dirty pages as we can - this means inode is dirty with all the tracking associated with it, radix tree entries have dirty tag, we get rid of these in ->writepages(). We are close to this currently. 2) Completely avoid the dirty tracking and writeback code and reimplement everything in DAX code. Because some hybrid between these is IMHO bound to provoke weird (and very hard to find) bugs. > > So revisiting the decision I see two options: > > > > 1) Move the DAX flushing code from filemap_write_and_wait() into > > ->writepages() fs callback. There the filesystem can provide all the > > information it needs including bdev, get_block callback, or whatever. > > This seems fine as long as we add it to ->fsync as well since ->writepages is > never called in that path, and as long as we are okay with skipping DAX > writebacks on hole punch, truncate, and block relocation. Look at ext4_sync_file() -> filemap_write_and_wait_range() -> __filemap_fdatawrite_range() -> do_writepages(). Except those nrpages > 0 checks which would need to be changed. > > 2) Back out even further and implement own tracking and iteration of inodes > > to write. > > > > So far I still think 2) is not worth the complexity (although it would > > bring DAX code closer to how things behave with standard storage) so I > > would go for 1). Honza
On Wed, Feb 03, 2016 at 11:46:11AM +0100, Jan Kara wrote: > On Tue 02-02-16 10:34:56, Ross Zwisler wrote: > > On Tue, Feb 02, 2016 at 09:10:24AM -0800, Dan Williams wrote: > > > On Tue, Feb 2, 2016 at 8:46 AM, Jan Kara <jack@suse.cz> wrote: > > > > On Tue 02-02-16 08:33:56, Dan Williams wrote: > > > >> On Tue, Feb 2, 2016 at 3:17 AM, Jan Kara <jack@suse.cz> wrote: > > > >> [..] > > > >> > I see, thanks for explanation. So I'm OK with changing what is stored in > > > >> > the radix tree to accommodate this use case but my reservation that we IHMO > > > >> > have other more pressing things to fix remains... > > > >> > > > >> We don't need pfns in the radix to support XFS RT configurations. > > > >> Just call get_blocks() again and use the sector, or am I missing > > > >> something? > > > > > > > > You are correct. But if you decide to pay the cost of additional > > > > get_block() call, you only need the dirty tag in the radix tree and nothing > > > > else. So my understanding was that the whole point of games with radix tree > > > > is avoiding this extra get_block() calls for fsync(). > > > > > > > > > > DAX-fsync() is already a potentially expensive operation to cover data > > > durability guarantees for DAX-unaware applications. A DAX-aware > > > application is going to skip fsync, and the get_blocks() cost, to do > > > cache management itself. > > > > > > Willy pointed out some other potential benefits, assuming a suitable > > > replacement for the protections afforded by the block-device > > > percpu_ref counter can be found. However, optimizing for the > > > DAX-unaware-application case seems the wrong motivation to me. > > > > Oh, no, the primary issue with calling get_block() in the fsync path isn't > > performance. It's that we don't have any idea what get_block() function to > > call. > > > > The fault handler calls all come from the filesystem directly, so they can > > easily give us an appropriate get_block() function pointer. 
But the > > dax_writeback_mapping_range() calls come from the generic code in > > mm/filemap.c, and don't know what get_block() to pass in. > > > > During one iteration I had the calls to dax_writeback_mapping_range() > > happening in the filesystem fsync code so that it could pass in get_block(), > > but Dave Chinner pointed out that this misses other paths in the filesystem > > that need to have things flushed via a call to filemap_write_and_wait_range(). > > Let's clear this up a bit: The problem with using ->fsync() method is that > it doesn't get called for sync(2). We could use ->sync_fs() to flush caches > in case of sync(2) (that's what's happening for normal storage) but the > problem with PMEM is that "flush all cached data" operation effectively > means iterate through all modified pages and we didn't want to implement > this for DAX fsync code. > > So we have decided to do cache flushing for DAX at a different point - mark > inodes which may have writes cached as dirty and use writeback code for the > cache flushing. But looking at it now, we have actually chosen a wrong > place to do the flushing in the writeback path - note that sync(2) writes > data via __writeback_single_inode() -> do_writepages() and thus doesn't > even get to filemap_write_and_wait(). > > So revisiting the decision I see two options: > > 1) Move the DAX flushing code from filemap_write_and_wait() into > ->writepages() fs callback. There the filesystem can provide all the > information it needs including bdev, get_block callback, or whatever. > > 2) Back out even further and implement own tracking and iteration of inodes > to write. > > So far I still think 2) is not worth the complexity (although it would > bring DAX code closer to how things behave with standard storage) so I > would go for 1). Jan, just to clarify, are you proposing this change for v4.5 in the remaining RCs as an alternative to the get_bdev() patch? 
https://lkml.org/lkml/2016/2/2/941 Or can we move forward with get_bdev(), and try and figure out this new way of calling fsync/msync for v4.6? My main concern here is that changing how the DAX sync code gets called will affect all three filesystems as well as MM, and that it might be too much for RC inclusion...
On Thu 04-02-16 12:56:19, Ross Zwisler wrote: > On Wed, Feb 03, 2016 at 11:46:11AM +0100, Jan Kara wrote: > > On Tue 02-02-16 10:34:56, Ross Zwisler wrote: > > > On Tue, Feb 02, 2016 at 09:10:24AM -0800, Dan Williams wrote: > > > > On Tue, Feb 2, 2016 at 8:46 AM, Jan Kara <jack@suse.cz> wrote: > > > > > On Tue 02-02-16 08:33:56, Dan Williams wrote: > > > > >> On Tue, Feb 2, 2016 at 3:17 AM, Jan Kara <jack@suse.cz> wrote: > > > > >> [..] > > > > >> > I see, thanks for explanation. So I'm OK with changing what is stored in > > > > >> > the radix tree to accommodate this use case but my reservation that we IHMO > > > > >> > have other more pressing things to fix remains... > > > > >> > > > > >> We don't need pfns in the radix to support XFS RT configurations. > > > > >> Just call get_blocks() again and use the sector, or am I missing > > > > >> something? > > > > > > > > > > You are correct. But if you decide to pay the cost of additional > > > > > get_block() call, you only need the dirty tag in the radix tree and nothing > > > > > else. So my understanding was that the whole point of games with radix tree > > > > > is avoiding this extra get_block() calls for fsync(). > > > > > > > > > > > > > DAX-fsync() is already a potentially expensive operation to cover data > > > > durability guarantees for DAX-unaware applications. A DAX-aware > > > > application is going to skip fsync, and the get_blocks() cost, to do > > > > cache management itself. > > > > > > > > Willy pointed out some other potential benefits, assuming a suitable > > > > replacement for the protections afforded by the block-device > > > > percpu_ref counter can be found. However, optimizing for the > > > > DAX-unaware-application case seems the wrong motivation to me. > > > > > > Oh, no, the primary issue with calling get_block() in the fsync path isn't > > > performance. It's that we don't have any idea what get_block() function to > > > call. 
> > > > > > The fault handler calls all come from the filesystem directly, so they can > > > easily give us an appropriate get_block() function pointer. But the > > > dax_writeback_mapping_range() calls come from the generic code in > > > mm/filemap.c, and don't know what get_block() to pass in. > > > > > > During one iteration I had the calls to dax_writeback_mapping_range() > > > happening in the filesystem fsync code so that it could pass in get_block(), > > > but Dave Chinner pointed out that this misses other paths in the filesystem > > > that need to have things flushed via a call to filemap_write_and_wait_range(). > > > > Let's clear this up a bit: The problem with using ->fsync() method is that > > it doesn't get called for sync(2). We could use ->sync_fs() to flush caches > > in case of sync(2) (that's what's happening for normal storage) but the > > problem with PMEM is that "flush all cached data" operation effectively > > means iterate through all modified pages and we didn't want to implement > > this for DAX fsync code. > > > > So we have decided to do cache flushing for DAX at a different point - mark > > inodes which may have writes cached as dirty and use writeback code for the > > cache flushing. But looking at it now, we have actually chosen a wrong > > place to do the flushing in the writeback path - note that sync(2) writes > > data via __writeback_single_inode() -> do_writepages() and thus doesn't > > even get to filemap_write_and_wait(). > > > > So revisiting the decision I see two options: > > > > 1) Move the DAX flushing code from filemap_write_and_wait() into > > ->writepages() fs callback. There the filesystem can provide all the > > information it needs including bdev, get_block callback, or whatever. > > > > 2) Back out even further and implement own tracking and iteration of inodes > > to write. 
> > > > So far I still think 2) is not worth the complexity (although it would > > bring DAX code closer to how things behave with standard storage) so I > > would go for 1). > > Jan, just to clarify, are you proposing this change for v4.5 in the remaining > RCs as an alternative to the get_bdev() patch? > > https://lkml.org/lkml/2016/2/2/941 Yes, because I don't think anything like ->get_bdev() is needed at all. Look: dax_do_io(), __dax_fault(), __dax_pmd_fault(), dax_zero_page_range() don't really need bdev - we have agreed that get_block() fills that in just fine. dax_clear_blocks() has IMO just the wrong signature - it should take bdev and not inode as an argument. Because combination inode + bdev sector doesn't really make much sense. dax_writeback_mapping_range() is the only remaining offender and it can easily take bdev as an argument when called from ->writepages(). > Or can we move forward with get_bdev(), and try and figure out this new way of > calling fsync/msync for v4.6? My main concern here is that changing how the > DAX sync code gets called will affect all three filesystems as well as MM, and > that it might be too much for RC inclusion... I think changes aren't very intrusive so we can feed them in during RC phase and frankly, you have to move to using ->writepages() anyway to make sync(2) work reliably. Honza
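Following Jan's reasoning — callers that already know the right block device (the filesystem, via ->writepages()) should pass it in explicitly instead of DAX guessing it from the inode — the interfaces he mentions might end up with signatures along these lines. These prototypes are a sketch of the proposal, not the actual merged code; the stand-in `sector_t` typedef and argument order are assumptions.

```
#include <stdint.h>

/* Stand-ins; the real kernel types are far richer. */
typedef uint64_t sector_t;
struct block_device;
struct address_space;
struct writeback_control;

/* dax_clear_blocks() keyed by bdev + sector rather than an inode. */
int dax_clear_blocks(struct block_device *bdev, sector_t sector, long size);

/* Writeback takes the bdev from the filesystem's ->writepages(). */
int dax_writeback_mapping_range(struct address_space *mapping,
				struct block_device *bdev,
				struct writeback_control *wbc);
```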
On Thu, Feb 04, 2016 at 09:29:57PM +0100, Jan Kara wrote: > On Thu 04-02-16 12:56:19, Ross Zwisler wrote: > > On Wed, Feb 03, 2016 at 11:46:11AM +0100, Jan Kara wrote: <> > > > Let's clear this up a bit: The problem with using ->fsync() method is that > > > it doesn't get called for sync(2). We could use ->sync_fs() to flush caches > > > in case of sync(2) (that's what's happening for normal storage) but the > > > problem with PMEM is that "flush all cached data" operation effectively > > > means iterate through all modified pages and we didn't want to implement > > > this for DAX fsync code. > > > > > > So we have decided to do cache flushing for DAX at a different point - mark > > > inodes which may have writes cached as dirty and use writeback code for the > > > cache flushing. But looking at it now, we have actually chosen a wrong > > > place to do the flushing in the writeback path - note that sync(2) writes > > > data via __writeback_single_inode() -> do_writepages() and thus doesn't > > > even get to filemap_write_and_wait(). > > > > > > So revisiting the decision I see two options: > > > > > > 1) Move the DAX flushing code from filemap_write_and_wait() into > > > ->writepages() fs callback. There the filesystem can provide all the > > > information it needs including bdev, get_block callback, or whatever. > > > > > > 2) Back out even further and implement own tracking and iteration of inodes > > > to write. > > > > > > So far I still think 2) is not worth the complexity (although it would > > > bring DAX code closer to how things behave with standard storage) so I > > > would go for 1). > > > > Jan, just to clarify, are you proposing this change for v4.5 in the remaining > > RCs as an alternative to the get_bdev() patch? > > > > https://lkml.org/lkml/2016/2/2/941 > > Yes, because I don't think anything like ->get_bdev() is needed at all. 
> Look: dax_do_io(), __dax_fault(), __dax_pmd_fault(), dax_zero_page_range() > don't really need bdev - we have agreed that get_block() fills that in just > fine. > > dax_clear_blocks() has IMO just the wrong signature - it should take bdev > and not inode as an argument. Because combination inode + bdev sector > doesn't really make much sense. > > dax_writeback_mapping_range() is the only remaining offender and it can > easily take bdev as an argument when called from ->writepages(). > > > Or can we move forward with get_bdev(), and try and figure out this new way of > > calling fsync/msync for v4.6? My main concern here is that changing how the > > DAX sync code gets called will affect all three filesystems as well as MM, and > > that it might be too much for RC inclusion... > > I think changes aren't very intrusive so we can feed them in during RC > phase and frankly, you have to move to using ->writepages() anyway to make > sync(2) work reliably. Okay, sounds good. I'll send it out once I've got it working & tested.
On Thu, Feb 04, 2016 at 10:15:58AM +0100, Jan Kara wrote: <> > Yes, you are right. filemap_write_and_wait_range() actually doesn't > guarantee data durability. That function only means all dirty data has been > sent to storage and the storage has acknowledged them. This is noop for > PMEM. So we are perfectly fine ignoring calls to > filemap_write_and_wait_range(). What guarantees data durability are only > ->fsync() and ->sync_fs() calls. But some code could get upset by seeing > that filemap_write_and_wait_range() didn't actually get rid of dirty pages > (in some special cases like inode eviction or similar). That's why I'd > choose one of the two options for consistency: > > 1) Treat inode indexes to flush as close to dirty pages as we can - this > means inode is dirty with all the tracking associated with it, radix tree > entries have dirty tag, we get rid of these in ->writepages(). We are close > to this currently. I think we're actually pretty far from this, at least for v4.5. The issue is that I don't think we can safely clear radix tree dirty entries during the DAX code that is called via ->writepages(). To do this correctly we would need to also mark the PTE as clean so that when the userspace process next writes to their mmap mapping we would get a new fault to make the page writable. This would allow us to re-dirty the DAX entry in the radix tree. I implemented code to do this in v2 of my set, but ripped it out in v3: https://lkml.org/lkml/2015/11/13/759 (DAX fsync v2) https://lkml.org/lkml/2015/12/8/583 (DAX fsync v3) The race that compelled this removal is described here: https://lists.01.org/pipermail/linux-nvdimm/2016-January/004057.html (sorry for all the links) Anyway, for v4.5 I think whatever solution we come up with must be okay with an ever growing list of dirty radix tree entries, as we currently have. Are you aware of a reason why this won't work, or was the cleaning of the radix tree entries just a good goal to have? 
(And I agree it is a good goal, I just don't know how to do it safely.) > 2) Completely avoid the dirty tracking and writeback code and reimplement > everything in DAX code. > > Because some hybrid between these is IMHO bound to provoke weird (and very > hard to find) bugs. > > > > So revisiting the decision I see two options: > > > > > > 1) Move the DAX flushing code from filemap_write_and_wait() into > > > ->writepages() fs callback. There the filesystem can provide all the > > > information it needs including bdev, get_block callback, or whatever. > > > > This seems fine as long as we add it to ->fsync as well since ->writepages is > > never called in that path, and as long as we are okay with skipping DAX > > writebacks on hole punch, truncate, and block relocation. > > Look at ext4_sync_file() -> filemap_write_and_wait_range() -> > __filemap_fdatawrite_range() -> do_writepages(). Except those nrpages > 0 > checks which would need to be changed. Ah, cool, I missed this path. Thank you for setting me straight.
On Thu, Feb 04, 2016 at 09:29:57PM +0100, Jan Kara wrote: > I think changes aren't very intrusive so we can feed them in during RC > phase and frankly, you have to move to using ->writepages() anyway to make > sync(2) work reliably. I've been looking into this a bit more, and I don't think we actually want to have DAX flushing in response to sync(2) or syncfs(2). Here's some text from the BUGS section of the sync(2) man page: BUGS According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but may return before the actual writing is done. However, since version 1.3.20 Linux does actually wait. (This still does not guarantee data integrity: modern disks have large caches.) Based on this I don't believe that it is a requirement that sync and syncfs actually flush the data durably to media before they return - they just need to make sure it has been sent to the device, which is always true for all writes to PMEM via DAX, even without any calls to dax_writeback_mapping_range(). The fsync(2) man page, on the other hand, *does* require that a call to fsync() will result in the data being durable on the media before the system call returns. From fsync(2): fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even after the system crashed or was rebooted. This includes writing through or flushing a disk cache if present. I think we are doing the right thing if we only call dax_writeback_mapping_range() in response to fsync() and msync(), but skip it in response to sync() and syncfs(). Practically this also simplifies things a lot because we don't need to worry about cleaning the DAX inodes. 
With my naive implementation where I was calling dax_writeback_mapping_range() in ext4_writepages(), we were flushing all DAX pages that had ever been dirtied every 5 seconds in response to a periodic filesystem sync(), which I'm sure is untenable. Please let me know if anyone disagrees with any of the above. Based on this, I'm off to move the calls to dax_writeback_mapping_range() back into the filesystem specific fsync()/msync() paths so that we can provide an appropriate block device.
On Thu, Feb 04, 2016 at 10:15:58AM +0100, Jan Kara wrote: > On Wed 03-02-16 13:13:28, Ross Zwisler wrote: > > Here is the comment from Dave Chinner that had me move to having the calls to > > dax_writeback_mapping_range() into the generic mm code: > > > > > Lastly, this flushing really needs to be inside > > > filemap_write_and_wait_range(), because we call the writeback code > > > from many more places than just fsync to ensure ordering of various > > > operations such that files are in known state before proceeding > > > (e.g. hole punch). > > https://lkml.org/lkml/2015/11/16/847 ..... > > > So revisiting the decision I see two options: > > > > > > 1) Move the DAX flushing code from filemap_write_and_wait() into > > > ->writepages() fs callback. There the filesystem can provide all the > > > information it needs including bdev, get_block callback, or whatever. > > > > This seems fine as long as we add it to ->fsync as well since ->writepages is > > never called in that path, and as long as we are okay with skipping DAX > > writebacks on hole punch, truncate, and block relocation. > > Look at ext4_sync_file() -> filemap_write_and_wait_range() -> > __filemap_fdatawrite_range() -> do_writepages(). Except those nrpages > 0 > checks which would need to be changed. Just to be clear: this is pretty much what I was implying was necessary when I said that the DAX flushing needed to be "inside filemap_write_and_wait_range". And that's what I thought Ross was planning on doing after that round of discussion. i.e. Ross said: "If the race described above isn't an issue then I agree moving this call out of the filesystems and down into the generic page writeback code is probably the right thing to do." https://lkml.org/lkml/2015/11/17/718 Realistically, what Jan is saying in this thread is exactly what I said we needed to do way back when I first pointed out that fsync was broken and dirty tracking in the mapping radix tree was still needed for fsync to work effectively. 
Clearly, I haven't been following recent developments in this patchset
as closely as I should have - I did not think that such micro-management
was going to be necessary after the discussion that was had back in
November.

Cheers,

Dave.
On Fri, Feb 05, 2016 at 03:25:00PM -0700, Ross Zwisler wrote:
> On Thu, Feb 04, 2016 at 09:29:57PM +0100, Jan Kara wrote:
> > I think changes aren't very intrusive so we can feed them in during RC
> > phase and frankly, you have to move to using ->writepages() anyway to make
> > sync(2) work reliably.
>
> I've been looking into this a bit more, and I don't think we actually want to
> have DAX flushing in response to sync(2) or syncfs(2). Here's some text from
> the BUGS section of the sync(2) man page:
>
>     BUGS
>         According to the standard specification (e.g., POSIX.1-2001), sync()
>         schedules the writes, but may return before the actual writing
>         is done. However, since version 1.3.20 Linux does actually wait.
>         (This still does not guarantee data integrity: modern disks have
>         large caches.)
>
> Based on this I don't believe that it is a requirement that sync and syncfs
> actually flush the data durably to media before they return - they just need
> to make sure it has been sent to the device, which is always true for all
> writes to PMEM via DAX, even without any calls to
> dax_writeback_mapping_range().

That's an assumption we've already pointed out as being platform
dependent, not to mention also being undesirable from a performance
point of view (writes are 10x slower into pmem than into the page
cache using the same physical memory!).

Further, the ordering constraints of modern filesystems mean that
they guarantee durability of data written back when sync() is run.
i.e. ext4, XFS, btrfs, etc all ensure that sync guarantees data
integrity is maintained across all the data and metadata written
back during sync().

e.g. for XFS we do file size update transactions at IO completion.
sync() triggers writeback of data, then runs a transaction that
modifies the file size so that the data is valid on disk. We
absolutely need to ensure that this transaction is durable before
sync() returns, otherwise we lose that data if a failure occurs
immediately after sync() returns because the size update is not on
disk.

Users are right to complain when data written before a sync() call
is made is not accessible after a crash/reboot because we failed
to make it durable. That's why ->sync_fs(wait) is called at the end
of the sync() implementation - it enables filesystems to ensure all
data and metadata written during the sync processing is on durable
storage.

IOWs, we can't language-lawyer or weasel-word our way out of
providing durability guarantees for sync().

Cheers,

Dave.
On Sun, Feb 07, 2016 at 10:15:12AM +1100, Dave Chinner wrote:
> On Thu, Feb 04, 2016 at 10:15:58AM +0100, Jan Kara wrote:
> > On Wed 03-02-16 13:13:28, Ross Zwisler wrote:
> > > Here is the comment from Dave Chinner that had me move to having the calls to
> > > dax_writeback_mapping_range() into the generic mm code:
> > >
> > > > Lastly, this flushing really needs to be inside
> > > > filemap_write_and_wait_range(), because we call the writeback code
> > > > from many more places than just fsync to ensure ordering of various
> > > > operations such that files are in known state before proceeding
> > > > (e.g. hole punch).
> > > https://lkml.org/lkml/2015/11/16/847
> .....
> > > > So revisiting the decision I see two options:
> > > >
> > > > 1) Move the DAX flushing code from filemap_write_and_wait() into
> > > > ->writepages() fs callback. There the filesystem can provide all the
> > > > information it needs including bdev, get_block callback, or whatever.
> > >
> > > This seems fine as long as we add it to ->fsync as well since ->writepages is
> > > never called in that path, and as long as we are okay with skipping DAX
> > > writebacks on hole punch, truncate, and block relocation.
> >
> > Look at ext4_sync_file() -> filemap_write_and_wait_range() ->
> > __filemap_fdatawrite_range() -> do_writepages(). Except those nrpages > 0
> > checks which would need to be changed.
>
> Just to be clear: this is pretty much what I was implying was
> necessary when I said that the DAX flushing needed to be "inside
> filemap_write_and_wait_range". And that's what I thought Ross was
> planning on doing after that round of discussion. i.e. Ross said:
>
> "If the race described above isn't an issue then I agree moving this
> call out of the filesystems and down into the generic page writeback
> code is probably the right thing to do."
>
> https://lkml.org/lkml/2015/11/17/718
>
> Realistically, what Jan is saying in this thread is exactly what I
> said we needed to do way back when I first pointed out that fsync
> was broken and dirty tracking in the mapping radix tree was still
> needed for fsync to work effectively.

Here, let me try and quickly summarize what is going on.

1) The DAX fsync set was merged into v4.5-rc1, it does use the radix tree
for tracking dirty PTE and PMD pages, and we do currently call into the DAX
sync code via filemap_write_and_wait_range() as you initially suggested.

2) During testing of raw block devices + DAX I noticed that the struct
block_device that we were using for DAX operations was incorrect. For the
fault handlers, etc. we can just get the correct bdev via get_block(), which
is passed in as a function pointer, but for the flushing code we don't have
access to get_block(). This is also an issue for XFS real-time devices,
whenever we get those working.

In short, somehow we need to get dax_writeback_mapping_range() a valid bdev.
Right now it is called via filemap_write_and_wait_range(), which can't
provide either the bdev or a get_block() function pointer.

So, our options seem to be:

a) Move the calls to dax_writeback_mapping_range() into the filesystems
(what Jan is suggesting, i.e. ->writepages())

b) Keep the calls to dax_writeback_mapping_range() in the mm code, and
provide a generic way to ask a filesystem for an inode's bdev. I did a
version of this using a superblock operation here:
https://lkml.org/lkml/2016/2/2/941

3) During the review and discussion for the above problems, Jan noticed that
the flushing code wasn't being called for sync() and syncfs(). Clearly from
your other response (https://lkml.org/lkml/2016/2/6/168) you think this is
incorrect.

Regardless, the above issue remains - dax_writeback_mapping_range() needs a
bdev. Do we move the calls into the filesystem so the fs can provide a bdev,
or do we create a generic method for DAX to ask the fs for the correct bdev
for an inode?
On Sun, Feb 07, 2016 at 10:40:53AM +1100, Dave Chinner wrote:
> On Fri, Feb 05, 2016 at 03:25:00PM -0700, Ross Zwisler wrote:
> > On Thu, Feb 04, 2016 at 09:29:57PM +0100, Jan Kara wrote:
> > > I think changes aren't very intrusive so we can feed them in during RC
> > > phase and frankly, you have to move to using ->writepages() anyway to make
> > > sync(2) work reliably.
> >
> > I've been looking into this a bit more, and I don't think we actually want to
> > have DAX flushing in response to sync(2) or syncfs(2). Here's some text from
> > the BUGS section of the sync(2) man page:
> >
> >     BUGS
> >         According to the standard specification (e.g., POSIX.1-2001), sync()
> >         schedules the writes, but may return before the actual writing
> >         is done. However, since version 1.3.20 Linux does actually wait.
> >         (This still does not guarantee data integrity: modern disks have
> >         large caches.)
> >
> > Based on this I don't believe that it is a requirement that sync and syncfs
> > actually flush the data durably to media before they return - they just need
> > to make sure it has been sent to the device, which is always true for all
> > writes to PMEM via DAX, even without any calls to
> > dax_writeback_mapping_range().
>
> That's an assumption we've already pointed out as being platform
> dependent, not to mention also being undesirable from a performance
> point of view (writes are 10x slower into pmem than into the page
> cache using the same physical memory!).
>
> Further, the ordering constraints of modern filesystems mean that
> they guarantee durability of data written back when sync() is run.
> i.e. ext4, XFS, btrfs, etc all ensure that sync guarantees data
> integrity is maintained across all the data and metadata written
> back during sync().
>
> e.g. for XFS we do file size update transactions at IO completion.
> sync() triggers writeback of data, then runs a transaction that
> modifies the file size so that the data is valid on disk. We
> absolutely need to ensure that this transaction is durable before
> sync() returns, otherwise we lose that data if a failure occurs
> immediately after sync() returns because the size update is not on
> disk.
>
> Users are right to complain when data written before a sync() call
> is made is not accessible after a crash/reboot because we failed
> to make it durable. That's why ->sync_fs(wait) is called at the end
> of the sync() implementation - it enables filesystems to ensure all
> data and metadata written during the sync processing is on durable
> storage.
>
> IOWs, we can't language-lawyer or weasel-word our way out of
> providing durability guarantees for sync().

To be clear, we are only talking about user data that was mmaped. All I/Os
initiated by the filesystem for metadata, and all user issued I/Os will be
durable on media before the I/O is completed. Nothing we do or don't do for
fsync, msync, sync or syncfs() will break filesystem metadata guarantees.

I think the question then is: "Do users currently rely on data written to an
mmap to be durable on media after a sync() or syncfs(), in spite of the BUGS
section quoted above?"

If the answer for this is "yes", then I agree that we need to enhance the
current DAX fsync code to cover the sync use case.

However, as stated previously I don't think that it is as easy as just moving
the call to dax_writeback_mapping_range() to the sync() call path. On my
system sync() is called every 5 seconds, and flushing every page of every DAX
mmap that has ever been dirtied every 5 seconds is unreasonable.

If we need to support the sync() use case with our DAX mmap flushing code, I
think the only way to do that is to clear out radix tree entries as we flush
them, so that the sync calls that happen every 5 seconds are only flushing
new writes.

I think we're actually pretty far from this, at least for v4.5. The issue is
that I don't think we can safely clear radix tree dirty entries during the
DAX code that is called via ->writepages(). To do this correctly we would
need to also mark the PTE as clean so that when the userspace process next
writes to their mmap mapping we would get a new fault to make the page
writable. This would allow us to re-dirty the DAX entry in the radix tree.

I implemented code to do this in v2 of my set, but ripped it out in v3:

https://lkml.org/lkml/2015/11/13/759 (DAX fsync v2)
https://lkml.org/lkml/2015/12/8/583 (DAX fsync v3)

The race that compelled this removal is described here:

https://lists.01.org/pipermail/linux-nvdimm/2016-January/004057.html

If this is really where we need to be, that's fine, we'll have to figure out
some way to defeat the race and get this code into v4.6.
On Fri, Feb 05, 2016 at 03:25:00PM -0700, Ross Zwisler wrote:
>     According to the standard specification (e.g., POSIX.1-2001), sync()
>     schedules the writes, but may return before the actual writing
>     is done. However, since version 1.3.20 Linux does actually wait.
>     (This still does not guarantee data integrity: modern disks have large
>     caches.)
>
> Based on this I don't believe that it is a requirement that sync and syncfs
> actually flush the data durably to media before they return - they just need
> to make sure it has been sent to the device, which is always true for all
> writes to PMEM via DAX, even without any calls to
> dax_writeback_mapping_range().

For Linux there is a requirement to not return before the data is on disk,
and our users rely on it. The man page is rather confusing, and I'll see
if I can fix it up.
On Sat 06-02-16 23:43:49, Ross Zwisler wrote:
> On Sun, Feb 07, 2016 at 10:40:53AM +1100, Dave Chinner wrote:
> > On Fri, Feb 05, 2016 at 03:25:00PM -0700, Ross Zwisler wrote:
> > > On Thu, Feb 04, 2016 at 09:29:57PM +0100, Jan Kara wrote:
> > > > I think changes aren't very intrusive so we can feed them in during RC
> > > > phase and frankly, you have to move to using ->writepages() anyway to make
> > > > sync(2) work reliably.
> > >
> > > I've been looking into this a bit more, and I don't think we actually want to
> > > have DAX flushing in response to sync(2) or syncfs(2). Here's some text from
> > > the BUGS section of the sync(2) man page:
> > >
> > >     BUGS
> > >         According to the standard specification (e.g., POSIX.1-2001), sync()
> > >         schedules the writes, but may return before the actual writing
> > >         is done. However, since version 1.3.20 Linux does actually wait.
> > >         (This still does not guarantee data integrity: modern disks have
> > >         large caches.)
> > >
> > > Based on this I don't believe that it is a requirement that sync and syncfs
> > > actually flush the data durably to media before they return - they just need
> > > to make sure it has been sent to the device, which is always true for all
> > > writes to PMEM via DAX, even without any calls to
> > > dax_writeback_mapping_range().
> >
> > That's an assumption we've already pointed out as being platform
> > dependent, not to mention also being undesirable from a performance
> > point of view (writes are 10x slower into pmem than into the page
> > cache using the same physical memory!).
> >
> > Further, the ordering constraints of modern filesystems mean that
> > they guarantee durability of data written back when sync() is run.
> > i.e. ext4, XFS, btrfs, etc all ensure that sync guarantees data
> > integrity is maintained across all the data and metadata written
> > back during sync().
> >
> > e.g. for XFS we do file size update transactions at IO completion.
> > sync() triggers writeback of data, then runs a transaction that
> > modifies the file size so that the data is valid on disk. We
> > absolutely need to ensure that this transaction is durable before
> > sync() returns, otherwise we lose that data if a failure occurs
> > immediately after sync() returns because the size update is not on
> > disk.
> >
> > Users are right to complain when data written before a sync() call
> > is made is not accessible after a crash/reboot because we failed
> > to make it durable. That's why ->sync_fs(wait) is called at the end
> > of the sync() implementation - it enables filesystems to ensure all
> > data and metadata written during the sync processing is on durable
> > storage.
> >
> > IOWs, we can't language-lawyer or weasel-word our way out of
> > providing durability guarantees for sync().
>
> To be clear, we are only talking about user data that was mmaped. All I/Os
> initiated by the filesystem for metadata, and all user issued I/Os will be
> durable on media before the I/O is completed. Nothing we do or don't do for
> fsync, msync, sync or syncfs() will break filesystem metadata guarantees.

Yes, but be careful here: So far ext4 (and AFAIK XFS as well) have depended
on the fact that blkdev_issue_flush() (or a WRITE_FLUSH request) will flush
all the volatile caches for the storage. We do such calls / writes from
transaction commit code to make sure that all metadata *and data* writes
reach permanent storage before we make some metadata changes visible. This
is necessary so that we don't expose uninitialized block contents after a
power failure (think of making an i_size increase / block allocation visible
after power failure without the data being durable).

For PMEM, we ignore blkdev_issue_flush() / WRITE_FLUSH requests on the
grounds that writes are either done bypassing caches (the case for metadata
IO and IO done via dax_do_io()). Currently we are fine even for mmap because
both ext4 & XFS zero out allocated blocks using non-cached writes, so even
though the latest data needn't be on persistent storage we won't expose
stale data. But it is a catch that may hit us in future. So from this POV
flushing caches from ->writepages() would also make PMEM give guarantees to
the fs more similar to those of ordinary block storage.

> I think the question then is: "Do users currently rely on data written to an
> mmap to be durable on media after a sync() or syncfs(), in spite of the BUGS
> section quoted above?"
>
> If the answer for this is "yes", then I agree that we need to enhance the
> current DAX fsync code to cover the sync use case.

IMO the answer is yes.

> However, as stated previously I don't think that it is as easy as just moving
> the call to dax_writeback_mapping_range() to the sync() call path. On my
> system sync() is called every 5 seconds, and flushing every page of every DAX
> mmap that has ever been dirtied every 5 seconds is unreasonable.

It is not sync() that is called but background writeback. But you are
correct that ->writepages() will get called for a dirty inode every 5
seconds.

> If we need to support the sync() use case with our DAX mmap flushing code, I
> think the only way to do that is to clear out radix tree entries as we flush
> them, so that the sync calls that happen every 5 seconds are only flushing
> new writes.

I agree, and I forgot about the issue that we currently don't clear out
dirty radix tree entries for DAX. I also agree that clearing of written
radix tree entries needs to be implemented before flushing caches from
->writepages() is practical.

> I think we're actually pretty far from this, at least for v4.5. The issue is
> that I don't think we can safely clear radix tree dirty entries during the
> DAX code that is called via ->writepages(). To do this correctly we would
> need to also mark the PTE as clean so that when the userspace process next
> writes to their mmap mapping we would get a new fault to make the page
> writable. This would allow us to re-dirty the DAX entry in the radix tree.

Yup, so I agree that clearing of radix tree entries probably is not 4.5
material.

> I implemented code to do this in v2 of my set, but ripped it out in v3:
>
> https://lkml.org/lkml/2015/11/13/759 (DAX fsync v2)
> https://lkml.org/lkml/2015/12/8/583 (DAX fsync v3)
>
> The race that compelled this removal is described here:
>
> https://lists.01.org/pipermail/linux-nvdimm/2016-January/004057.html
>
> If this is really where we need to be, that's fine, we'll have to figure out
> some way to defeat the race and get this code into v4.6.

I think we need to find a way to safely clear radix tree entries anyway.
And I agree that without page lock it is not easy. I see two possibilities
here:

1) Protect dax_writeback_mapping_range() with exclusive
EXT4_I(inode)->i_mmap_sem and provide a new callback called from
wp_page_reuse() to mark the index dirty in the radix tree (that would take
shared i_mmap_sem). This is relatively expensive but it should work. Or

2) We could devote one bit in the radix tree exceptional entry as a bitlock
equivalent to the page lock. This will likely scale better but it is more
intrusive.

Honza
On Sun, Feb 07, 2016 at 12:38:14AM -0800, Christoph Hellwig wrote:
> On Fri, Feb 05, 2016 at 03:25:00PM -0700, Ross Zwisler wrote:
> >     According to the standard specification (e.g., POSIX.1-2001), sync()
> >     schedules the writes, but may return before the actual writing
> >     is done. However, since version 1.3.20 Linux does actually wait.
> >     (This still does not guarantee data integrity: modern disks have large
> >     caches.)
> >
> > Based on this I don't believe that it is a requirement that sync and syncfs
> > actually flush the data durably to media before they return - they just need
> > to make sure it has been sent to the device, which is always true for all
> > writes to PMEM via DAX, even without any calls to
> > dax_writeback_mapping_range().
>
> For Linux there is a requirement to not return before the data is on disk,
> and our users rely on it. The man page is rather confusing, and I'll see
> if I can fix it up.

Okay, thank you for the clarification. :)
diff --git a/fs/dax.c b/fs/dax.c
index 4fd6b0c..e60a5a7 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -32,6 +32,9 @@
 #include <linux/pfn_t.h>
 #include <linux/sizes.h>
 
+#define DAX_BDEV(inode)	(S_ISBLK(inode->i_mode) ? I_BDEV(inode) \
+		: inode->i_sb->s_bdev)
+
 static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
 {
 	struct request_queue *q = bdev->bd_queue;
@@ -65,7 +68,7 @@ static void dax_unmap_atomic(struct block_device *bdev,
  */
 int dax_clear_blocks(struct inode *inode, sector_t block, long _size)
 {
-	struct block_device *bdev = inode->i_sb->s_bdev;
+	struct block_device *bdev = DAX_BDEV(inode);
 	struct blk_dax_ctl dax = {
 		.sector = block << (inode->i_blkbits - 9),
 		.size = _size,
@@ -246,7 +249,7 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
 	loff_t end = pos + iov_iter_count(iter);
 
 	memset(&bh, 0, sizeof(bh));
-	bh.b_bdev = inode->i_sb->s_bdev;
+	bh.b_bdev = DAX_BDEV(inode);
 
 	if ((flags & DIO_LOCKING) && iov_iter_rw(iter) == READ) {
 		struct address_space *mapping = inode->i_mapping;
@@ -468,7 +471,7 @@ int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
 		loff_t end)
 {
 	struct inode *inode = mapping->host;
-	struct block_device *bdev = inode->i_sb->s_bdev;
+	struct block_device *bdev = DAX_BDEV(inode);
 	pgoff_t start_index, end_index, pmd_index;
 	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
@@ -608,7 +611,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 
 	memset(&bh, 0, sizeof(bh));
 	block = (sector_t)vmf->pgoff << (PAGE_SHIFT - blkbits);
-	bh.b_bdev = inode->i_sb->s_bdev;
+	bh.b_bdev = DAX_BDEV(inode);
 	bh.b_size = PAGE_SIZE;
 
 repeat:
@@ -827,7 +830,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	}
 
 	memset(&bh, 0, sizeof(bh));
-	bh.b_bdev = inode->i_sb->s_bdev;
+	bh.b_bdev = DAX_BDEV(inode);
 
 	block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
 	bh.b_size = PMD_SIZE;
@@ -1080,7 +1083,7 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
 	BUG_ON((offset + length) > PAGE_CACHE_SIZE);
 
 	memset(&bh, 0, sizeof(bh));
-	bh.b_bdev = inode->i_sb->s_bdev;
+	bh.b_bdev = DAX_BDEV(inode);
 	bh.b_size = PAGE_CACHE_SIZE;
 	err = get_block(inode, index, &bh, 0);
 	if (err < 0)
There are a number of places in dax.c that look up the struct block_device
associated with an inode. Previously this was done by just using
inode->i_sb->s_bdev. This is correct for inodes that exist within the
filesystems supported by DAX (ext2, ext4 & XFS), but when running DAX
against raw block devices this value is NULL. This causes NULL pointer
dereferences when these block_device pointers are used.

Instead, for raw block devices we need to look up the struct block_device
using I_BDEV(). This patch fixes all the block_device lookups in dax.c so
that they work properly for both filesystems and raw block devices.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)