Message ID | 20190326190301.32365-2-rgoldwyn@suse.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [01/15] btrfs: create a mount option for dax | expand |
On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote: > This sets S_DAX in inode->i_flags, which can be used with > IS_DAX(). > > The dax option is restricted to non multi-device mounts. > dax interacts with the device directly instead of using bio, so > all bio-hooks which we use for multi-device cannot be performed > here. While regular read/writes could be manipulated with > RAID0/1, mmap() is still an issue. > > Auto-setting free space tree, because dealing with free space > inode (specifically readpages) is a nightmare. > Auto-setting nodatasum because we don't get callback for writing > checksums after mmap()s. Congratulations on getting the bear to dance. But why? To me, the point of btrfs is all the cool stuff it does with built-in checksumming and snapshots and RAID and so on. DAX doesn't let you do any of that, so why would somebody want to use btrfs to manage DAX?
On 12:10 26/03, Matthew Wilcox wrote: > On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote: > > This sets S_DAX in inode->i_flags, which can be used with > > IS_DAX(). > > > > The dax option is restricted to non multi-device mounts. > > dax interacts with the device directly instead of using bio, so > > all bio-hooks which we use for multi-device cannot be performed > > here. While regular read/writes could be manipulated with > > RAID0/1, mmap() is still an issue. > > > > Auto-setting free space tree, because dealing with free space > > inode (specifically readpages) is a nightmare. > > Auto-setting nodatasum because we don't get callback for writing > > checksums after mmap()s. > > Congratulations on getting the bear to dance. But why? > Why not ? ;) > To me, the point of btrfs is all the cool stuff it does with built-in > checksumming and snapshots and RAID and so on. DAX doesn't let you do > any of that, so why would somebody want to use btrfs to manage DAX? > There are users who are asking for advantages of dax on btrfs. I have tried to make it work with snapshots in this series. Checksumming should be possible, but would require some more hacks. I am looking into it. multi-device is an issue for mmap() and I don't think we can work around it. I agree there is a price to pay to use dax, but I am sure the users would know about that.
On Wed, Mar 27, 2019 at 06:00:52AM -0500, Goldwyn Rodrigues wrote: > On 12:10 26/03, Matthew Wilcox wrote: > > On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote: > > > This sets S_DAX in inode->i_flags, which can be used with > > > IS_DAX(). > > > > > > The dax option is restricted to non multi-device mounts. > > > dax interacts with the device directly instead of using bio, so > > > all bio-hooks which we use for multi-device cannot be performed > > > here. While regular read/writes could be manipulated with > > > RAID0/1, mmap() is still an issue. > > > > > > Auto-setting free space tree, because dealing with free space > > > inode (specifically readpages) is a nightmare. > > > Auto-setting nodatasum because we don't get callback for writing > > > checksums after mmap()s. > > > > Congratulations on getting the bear to dance. But why? > > Why not ? ;) 18 files changed, 662 insertions(+), 77 deletions(-) I want to know what advantage we're getting for that. > > To me, the point of btrfs is all the cool stuff it does with built-in > > checksumming and snapshots and RAID and so on. DAX doesn't let you do > > any of that, so why would somebody want to use btrfs to manage DAX? > > There are users who are asking for advantages of dax on btrfs. There are also people asking for perpetual motion machines. Do these users understand the tradeoffs? > I have tried to make it work with snapshots in this series. > Checksumming should be possible, but would require some more hacks. I am > looking into it. > multi-device is an issue for mmap() and I don't think we can work around > it. > > I agree there is a price to pay to use dax, but I am sure the users > would know about that. I really doubt it, to be honest.
On 5:00 27/03, Matthew Wilcox wrote: > On Wed, Mar 27, 2019 at 06:00:52AM -0500, Goldwyn Rodrigues wrote: > > On 12:10 26/03, Matthew Wilcox wrote: > > > On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote: > > > > This sets S_DAX in inode->i_flags, which can be used with > > > > IS_DAX(). > > > > > > > > The dax option is restricted to non multi-device mounts. > > > > dax interacts with the device directly instead of using bio, so > > > > all bio-hooks which we use for multi-device cannot be performed > > > > here. While regular read/writes could be manipulated with > > > > RAID0/1, mmap() is still an issue. > > > > > > > > Auto-setting free space tree, because dealing with free space > > > > inode (specifically readpages) is a nightmare. > > > > Auto-setting nodatasum because we don't get callback for writing > > > > checksums after mmap()s. > > > > > > Congratulations on getting the bear to dance. But why? > > > > Why not ? ;) > > 18 files changed, 662 insertions(+), 77 deletions(-) > > I want to know what advantage we're getting for that. Direct device mmap'd files, faster access, lower pagecache usage... all the advantages you would get from a dax device. > > > > To me, the point of btrfs is all the cool stuff it does with built-in > > > checksumming and snapshots and RAID and so on. DAX doesn't let you do > > > any of that, so why would somebody want to use btrfs to manage DAX? > > > > There are users who are asking for advantages of dax on btrfs. > > There are also people asking for perpetual motion machines. Do these > users understand the tradeoffs? The only major feature they would not be able to use is multi-device, which will be relayed at mount time. > > > I have tried to make it work with snapshots in this series. > > Checksumming should be possible, but would require some more hacks. I am > > looking into it. > > multi-device is an issue for mmap() and I don't think we can work around > > it. > > > > I agree there is a price to pay to use dax, but I am sure the users > > would know about that. > > I really doubt it, to be honest. Care to elaborate?
On Tue, Mar 26, 2019 at 12:10:01PM -0700, Matthew Wilcox wrote: > On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote: > > The dax option is restricted to non multi-device mounts. > > dax interacts with the device directly instead of using bio, so > > all bio-hooks which we use for multi-device cannot be performed > > here. While regular read/writes could be manipulated with > > RAID0/1, mmap() is still an issue. > > > > Auto-setting free space tree, because dealing with free space > > inode (specifically readpages) is a nightmare. > > Auto-setting nodatasum because we don't get callback for writing > > checksums after mmap()s. > > Congratulations on getting the bear to dance. But why? > > To me, the point of btrfs is all the cool stuff it does with built-in > checksumming and snapshots and RAID and so on. DAX doesn't let you do > any of that, so why would somebody want to use btrfs to manage DAX? If I read this correctly (I merely glanced at it), this patchset _does_ provide the full snapshot functionality. This is something other filesystems don't allow: ext4 has no CoW at all, and IIRC on XFS reflinks and DAX are mutually exclusive. Obviously, the usual btrfs way of CoWing every write would remove all (write) upsides of DAX, thus NOCOW (ie, CoW once) is the way to go: a page fault should happen no more than once per page per snapshot. On the other hand, checksumming seems useless to me. Data corruption can happen either in transit or at rest. For at rest, disks already have their own checksums -- and [NV]DIMMs have ECC. On the other hand, the majority of the time when someone seeks help on the btrfs mailing list, it turns out to be a matter of bad RAM, bad motherboard or bad cabling. This doesn't apply to pmem. The usual path is: CPU |<--->memory | SATA controller | (SATA cable) | disk The data goes to memory (very unlikely to to remain in the cache before getting checksummed), then has to travel all the way down. On the other hand, the path on pmem is: CPU |---->memory So the data written by userspace goes to memory... and that's it. As for multi-device, at least single block groups would be very nice (to have a filesystem than spans regions) and easyish to implement, while RAID0 might spoil hugepage fun but may still be straightforward. Meow!
On 5:00 27/03, Matthew Wilcox wrote: > On Wed, Mar 27, 2019 at 06:00:52AM -0500, Goldwyn Rodrigues wrote: > > On 12:10 26/03, Matthew Wilcox wrote: > > > On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote: > > > > This sets S_DAX in inode->i_flags, which can be used with > > > > IS_DAX(). > > > > > > > > The dax option is restricted to non multi-device mounts. > > > > dax interacts with the device directly instead of using bio, so > > > > all bio-hooks which we use for multi-device cannot be performed > > > > here. While regular read/writes could be manipulated with > > > > RAID0/1, mmap() is still an issue. > > > > > > > > Auto-setting free space tree, because dealing with free space > > > > inode (specifically readpages) is a nightmare. > > > > Auto-setting nodatasum because we don't get callback for writing > > > > checksums after mmap()s. > > > > > > Congratulations on getting the bear to dance. But why? > > > > Why not ? ;) > > 18 files changed, 662 insertions(+), 77 deletions(-) > > I want to know what advantage we're getting for that. So the prime advantages are btrfs filesystem snapshots on the dax device. Leveraging btrfs features, reflinks works just as well. (In addition to all the advantages dax has to offer)
On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote: > From: Goldwyn Rodrigues <rgoldwyn@suse.com> > > This sets S_DAX in inode->i_flags, which can be used with > IS_DAX(). I haven't followed the dax developments recently, IIRC the mount option 'dax' was meant to be a temporary solution. I don't know if this has stuck, or is discouraged to be added to new code. The replacement is the per-inode xflag. > The dax option is restricted to non multi-device mounts. > dax interacts with the device directly instead of using bio, so > all bio-hooks which we use for multi-device cannot be performed > here. While regular read/writes could be manipulated with > RAID0/1, mmap() is still an issue. > > Auto-setting free space tree, because dealing with free space > inode (specifically readpages) is a nightmare. > Auto-setting nodatasum because we don't get callback for writing > checksums after mmap()s. As discussed before, mandating free-space-tree for dax is ok and even more that it does not complicate the space cache v1 code. > Store the dax_device in fs_info which will be used in iomap code. > Question: Since we have only one dax device, I thought fs_info is > the best place. However, should it moved to btrfs_device? At this point it's probably ok to store it in fs_info as the multi-device support does not exist. If this changes, then the device is the right place. As Adam mentioned, support for the 'single' profile would be nice. It's the simplest profile that over multiple devices only splits the logical address space (of btrfs) into more chunks. However I don't know if the dax side does require something that makes this impossible to coexist.
On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote: > From: Goldwyn Rodrigues <rgoldwyn@suse.com> > > This sets S_DAX in inode->i_flags, which can be used with > IS_DAX(). > > The dax option is restricted to non multi-device mounts. > dax interacts with the device directly instead of using bio, so > all bio-hooks which we use for multi-device cannot be performed > here. While regular read/writes could be manipulated with > RAID0/1, mmap() is still an issue. I'm looking for other features that would not work with dax. Disabling multiple devices strikes out device add, device delete and device replace. To be verified: - balance - at least profile changes must be forbidden - defrag - compression - as it needs COW, it won't work on the nodatacow files, thus seems to be ok - resize/grow - this could work without changes, no block relocation is needed, only new structures created - resize/shrink - depends on balance, the block groups from the removed area need to be relocated - swapfiles - I'm sure somebody will try that; I haven't seen any IS_DAX checks in the swapfile activation code
On Thu, Mar 28, 2019 at 06:28:25PM +0100, David Sterba wrote: > On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote: > > From: Goldwyn Rodrigues <rgoldwyn@suse.com> > > > > This sets S_DAX in inode->i_flags, which can be used with > > IS_DAX(). > > > > The dax option is restricted to non multi-device mounts. > > dax interacts with the device directly instead of using bio, so > > all bio-hooks which we use for multi-device cannot be performed > > here. While regular read/writes could be manipulated with > > RAID0/1, mmap() is still an issue. > > I'm looking for other features that would not work with dax. Disabling > multiple devices strikes out device add, device delete and device > replace. > > To be verified: > > - balance - at least profile changes must be forbidden > > - defrag > > - compression - as it needs COW, it won't work on the nodatacow files, > thus seems to be ok > > - resize/grow - this could work without changes, no block relocation is > needed, only new structures created > > - resize/shrink - depends on balance, the block groups from the removed > area need to be relocated > > - swapfiles - I'm sure somebody will try that; I haven't seen any > IS_DAX checks in the swapfile activation code I've never tried dax swapfiles, but so long as you can issue bios to the underlying bdev they ought to work, right? Even if the swap code could be streamlined by memcpy in the dax case. (Granted I have no idea what sort of bio-hooks magic btrfs does, maybe it really doesn't work there...) --D
On 18:28 28/03, David Sterba wrote: > On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote: > > From: Goldwyn Rodrigues <rgoldwyn@suse.com> > > > > This sets S_DAX in inode->i_flags, which can be used with > > IS_DAX(). > > > > The dax option is restricted to non multi-device mounts. > > dax interacts with the device directly instead of using bio, so > > all bio-hooks which we use for multi-device cannot be performed > > here. While regular read/writes could be manipulated with > > RAID0/1, mmap() is still an issue. > > I'm looking for other features that would not work with dax. Disabling > multiple devices strikes out device add, device delete and device > replace. I am glad you brought this up. > > To be verified: > > - balance - at least profile changes must be forbidden > > - defrag Disabled for now with EOPNOTSUPP primarily because of (null) readpages() in dax. > > - compression - as it needs COW, it won't work on the nodatacow files, > thus seems to be ok Compression would require some sort of processing before writes. However, users are writing directly to disk. So no, this can't be supported either. > > - resize/grow - this could work without changes, no block relocation is > needed, only new structures created > > - resize/shrink - depends on balance, the block groups from the removed > area need to be relocated > > - swapfiles - I'm sure somebody will try that; I haven't seen any > IS_DAX checks in the swapfile activation code
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index b3642367a595..8ca1c0d120f4 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1067,6 +1067,7 @@ struct btrfs_fs_info { u32 metadata_ratio; void *bdev_holder; + struct dax_device *dax_dev; /* private scrub information */ struct mutex scrub_lock; @@ -1442,6 +1443,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info) #define BTRFS_MOUNT_FREE_SPACE_TREE (1 << 26) #define BTRFS_MOUNT_NOLOGREPLAY (1 << 27) #define BTRFS_MOUNT_REF_VERIFY (1 << 28) +#define BTRFS_MOUNT_DAX (1 << 29) #define BTRFS_DEFAULT_COMMIT_INTERVAL (30) #define BTRFS_DEFAULT_MAX_INLINE (2048) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 6fe9197f6ee4..2bbb63b2fcff 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -16,6 +16,7 @@ #include <linux/uuid.h> #include <linux/semaphore.h> #include <linux/error-injection.h> +#include <linux/dax.h> #include <linux/crc32c.h> #include <linux/sched/mm.h> #include <asm/unaligned.h> @@ -2805,6 +2806,8 @@ int open_ctree(struct super_block *sb, goto fail_alloc; } + fs_info->dax_dev = fs_dax_get_by_bdev(fs_devices->latest_bdev); + /* * We want to check superblock checksum, the type is stored inside. * Pass the whole disk block of size BTRFS_SUPER_INFO_SIZE (4k). @@ -4043,6 +4046,7 @@ void close_ctree(struct btrfs_fs_info *fs_info) #endif btrfs_close_devices(fs_info->fs_devices); + fs_put_dax(fs_info->dax_dev); btrfs_mapping_tree_free(&fs_info->mapping_tree); percpu_counter_destroy(&fs_info->dirty_metadata_bytes); diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index ec2d8919e7fb..e66426e7692d 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -149,8 +149,11 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode *inode) if (binode->flags & BTRFS_INODE_DIRSYNC) new_fl |= S_DIRSYNC; + if ((btrfs_test_opt(btrfs_sb(inode->i_sb), DAX)) && S_ISREG(inode->i_mode)) + new_fl |= S_DAX; + set_mask_bits(&inode->i_flags, - S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC, + S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC | S_DAX, new_fl); } diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 120e4340792a..2d448b9d6004 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -326,6 +326,7 @@ enum { Opt_treelog, Opt_notreelog, Opt_usebackuproot, Opt_user_subvol_rm_allowed, + Opt_dax, /* Deprecated options */ Opt_alloc_start, @@ -393,6 +394,7 @@ static const match_table_t tokens = { {Opt_notreelog, "notreelog"}, {Opt_usebackuproot, "usebackuproot"}, {Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"}, + {Opt_dax, "dax"}, /* Deprecated options */ {Opt_alloc_start, "alloc_start=%s"}, @@ -745,6 +747,28 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options, case Opt_user_subvol_rm_allowed: btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED); break; + case Opt_dax: +#ifdef CONFIG_FS_DAX + if (btrfs_super_num_devices(info->super_copy) > 1) { + btrfs_info(info, + "dax not supported for multi-device btrfs partition\n"); + ret = -EOPNOTSUPP; + goto out; + } + btrfs_set_opt(info->mount_opt, DAX); + btrfs_warn(info, "DAX enabled. Warning: EXPERIMENTAL, use at your own risk\n"); + btrfs_set_and_info(info, NODATASUM, + "auto-setting nodatasum (dax)"); + btrfs_clear_opt(info->mount_opt, SPACE_CACHE); + btrfs_set_and_info(info, FREE_SPACE_TREE, + "auto-setting free space tree (dax)"); + break; +#else + btrfs_err(info, + "DAX option not supported\n"); + ret = -EINVAL; + goto out; +#endif case Opt_enospc_debug: btrfs_set_opt(info->mount_opt, ENOSPC_DEBUG); break; @@ -1335,6 +1359,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry) seq_puts(seq, ",clear_cache"); if (btrfs_test_opt(info, USER_SUBVOL_RM_ALLOWED)) seq_puts(seq, ",user_subvol_rm_allowed"); + if (btrfs_test_opt(info, DAX)) + seq_puts(seq, ",dax"); if (btrfs_test_opt(info, ENOSPC_DEBUG)) seq_puts(seq, ",enospc_debug"); if (btrfs_test_opt(info, AUTO_DEFRAG))