diff mbox series

[01/15] btrfs: create a mount option for dax

Message ID 20190326190301.32365-2-rgoldwyn@suse.de (mailing list archive)
State New, archived
Headers show
Series [01/15] btrfs: create a mount option for dax | expand

Commit Message

Goldwyn Rodrigues March 26, 2019, 7:02 p.m. UTC
From: Goldwyn Rodrigues <rgoldwyn@suse.com>

This sets S_DAX in inode->i_flags, which can be used with
IS_DAX().

The dax option is restricted to non multi-device mounts.
dax interacts with the device directly instead of using bio, so
all bio-hooks which we use for multi-device cannot be performed
here. While regular read/writes could be manipulated with
RAID0/1, mmap() is still an issue.

Auto-setting free space tree, because dealing with free space
inode (specifically readpages) is a nightmare.
Auto-setting nodatasum because we don't get callback for writing
checksums after mmap()s.

Store the dax_device in fs_info which will be used in iomap code.
Question: Since we have only one dax device, I thought fs_info is
the best place. However, should it moved to btrfs_device?

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h   |  2 ++
 fs/btrfs/disk-io.c |  4 ++++
 fs/btrfs/ioctl.c   |  5 ++++-
 fs/btrfs/super.c   | 26 ++++++++++++++++++++++++++
 4 files changed, 36 insertions(+), 1 deletion(-)

Comments

Matthew Wilcox March 26, 2019, 7:10 p.m. UTC | #1
On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote:
> This sets S_DAX in inode->i_flags, which can be used with
> IS_DAX().
> 
> The dax option is restricted to non multi-device mounts.
> dax interacts with the device directly instead of using bio, so
> all bio-hooks which we use for multi-device cannot be performed
> here. While regular read/writes could be manipulated with
> RAID0/1, mmap() is still an issue.
> 
> Auto-setting free space tree, because dealing with free space
> inode (specifically readpages) is a nightmare.
> Auto-setting nodatasum because we don't get callback for writing
> checksums after mmap()s.

Congratulations on getting the bear to dance.  But why?

To me, the point of btrfs is all the cool stuff it does with built-in
checksumming and snapshots and RAID and so on.  DAX doesn't let you do
any of that, so why would somebody want to use btrfs to manage DAX?
Goldwyn Rodrigues March 27, 2019, 11 a.m. UTC | #2
On 12:10 26/03, Matthew Wilcox wrote:
> On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote:
> > This sets S_DAX in inode->i_flags, which can be used with
> > IS_DAX().
> > 
> > The dax option is restricted to non multi-device mounts.
> > dax interacts with the device directly instead of using bio, so
> > all bio-hooks which we use for multi-device cannot be performed
> > here. While regular read/writes could be manipulated with
> > RAID0/1, mmap() is still an issue.
> > 
> > Auto-setting free space tree, because dealing with free space
> > inode (specifically readpages) is a nightmare.
> > Auto-setting nodatasum because we don't get callback for writing
> > checksums after mmap()s.
> 
> Congratulations on getting the bear to dance.  But why?
> 

Why not ? ;)

> To me, the point of btrfs is all the cool stuff it does with built-in
> checksumming and snapshots and RAID and so on.  DAX doesn't let you do
> any of that, so why would somebody want to use btrfs to manage DAX?
> 

There are users who are asking for advantages of dax on btrfs.

I have tried to make it work with snapshots in this series.
Checksumming should be possible, but would require some more hacks. I am
looking into it.
multi-device is an issue for mmap() and I don't think we can work around
it.

I agree there is a price to pay to use dax, but I am sure the users
would know about that.
Matthew Wilcox March 27, 2019, noon UTC | #3
On Wed, Mar 27, 2019 at 06:00:52AM -0500, Goldwyn Rodrigues wrote:
> On 12:10 26/03, Matthew Wilcox wrote:
> > On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote:
> > > This sets S_DAX in inode->i_flags, which can be used with
> > > IS_DAX().
> > > 
> > > The dax option is restricted to non multi-device mounts.
> > > dax interacts with the device directly instead of using bio, so
> > > all bio-hooks which we use for multi-device cannot be performed
> > > here. While regular read/writes could be manipulated with
> > > RAID0/1, mmap() is still an issue.
> > > 
> > > Auto-setting free space tree, because dealing with free space
> > > inode (specifically readpages) is a nightmare.
> > > Auto-setting nodatasum because we don't get callback for writing
> > > checksums after mmap()s.
> > 
> > Congratulations on getting the bear to dance.  But why?
> 
> Why not ? ;)

 18 files changed, 662 insertions(+), 77 deletions(-)

I want to know what advantage we're getting for that.

> > To me, the point of btrfs is all the cool stuff it does with built-in
> > checksumming and snapshots and RAID and so on.  DAX doesn't let you do
> > any of that, so why would somebody want to use btrfs to manage DAX?
> 
> There are users who are asking for advantages of dax on btrfs.

There are also people asking for perpetual motion machines.  Do these
users understand the tradeoffs?

> I have tried to make it work with snapshots in this series.
> Checksumming should be possible, but would require some more hacks. I am
> looking into it.
> multi-device is an issue for mmap() and I don't think we can work around
> it.
> 
> I agree there is a price to pay to use dax, but I am sure the users
> would know about that.

I really doubt it, to be honest.
Goldwyn Rodrigues March 27, 2019, 12:26 p.m. UTC | #4
On  5:00 27/03, Matthew Wilcox wrote:
> On Wed, Mar 27, 2019 at 06:00:52AM -0500, Goldwyn Rodrigues wrote:
> > On 12:10 26/03, Matthew Wilcox wrote:
> > > On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote:
> > > > This sets S_DAX in inode->i_flags, which can be used with
> > > > IS_DAX().
> > > > 
> > > > The dax option is restricted to non multi-device mounts.
> > > > dax interacts with the device directly instead of using bio, so
> > > > all bio-hooks which we use for multi-device cannot be performed
> > > > here. While regular read/writes could be manipulated with
> > > > RAID0/1, mmap() is still an issue.
> > > > 
> > > > Auto-setting free space tree, because dealing with free space
> > > > inode (specifically readpages) is a nightmare.
> > > > Auto-setting nodatasum because we don't get callback for writing
> > > > checksums after mmap()s.
> > > 
> > > Congratulations on getting the bear to dance.  But why?
> > 
> > Why not ? ;)
> 
>  18 files changed, 662 insertions(+), 77 deletions(-)
> 
> I want to know what advantage we're getting for that.

Direct device mmap'd files, faster access, lower pagecache usage...
all the advantages you would get from a dax device.

> 
> > > To me, the point of btrfs is all the cool stuff it does with built-in
> > > checksumming and snapshots and RAID and so on.  DAX doesn't let you do
> > > any of that, so why would somebody want to use btrfs to manage DAX?
> > 
> > There are users who are asking for advantages of dax on btrfs.
> 
> There are also people asking for perpetual motion machines.  Do these
> users understand the tradeoffs?

The only major feature they would not be able to use is
multi-device, which will be relayed at mount time.

> 
> > I have tried to make it work with snapshots in this series.
> > Checksumming should be possible, but would require some more hacks. I am
> > looking into it.
> > multi-device is an issue for mmap() and I don't think we can work around
> > it.
> > 
> > I agree there is a price to pay to use dax, but I am sure the users
> > would know about that.
> 
> I really doubt it, to be honest.

Care to elaborate?
Adam Borowski March 27, 2019, 5:38 p.m. UTC | #5
On Tue, Mar 26, 2019 at 12:10:01PM -0700, Matthew Wilcox wrote:
> On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote:
> > The dax option is restricted to non multi-device mounts.
> > dax interacts with the device directly instead of using bio, so
> > all bio-hooks which we use for multi-device cannot be performed
> > here. While regular read/writes could be manipulated with
> > RAID0/1, mmap() is still an issue.
> > 
> > Auto-setting free space tree, because dealing with free space
> > inode (specifically readpages) is a nightmare.
> > Auto-setting nodatasum because we don't get callback for writing
> > checksums after mmap()s.
> 
> Congratulations on getting the bear to dance.  But why?
> 
> To me, the point of btrfs is all the cool stuff it does with built-in
> checksumming and snapshots and RAID and so on.  DAX doesn't let you do
> any of that, so why would somebody want to use btrfs to manage DAX?

If I read this correctly (I merely glanced at it), this patchset _does_
provide the full snapshot functionality.  This is something other
filesystems don't allow: ext4 has no CoW at all, and IIRC on XFS reflinks
and DAX are mutually exclusive.

Obviously, the usual btrfs way of CoWing every write would remove all
(write) upsides of DAX, thus NOCOW (ie, CoW once) is the way to go: a page
fault should happen no more than once per page per snapshot.


On the other hand, checksumming seems useless to me.  Data corruption can
happen either in transit or at rest.  For at rest, disks already have their
own checksums -- and [NV]DIMMs have ECC.  On the other hand, the majority of
the time when someone seeks help on the btrfs mailing list, it turns out to
be a matter of bad RAM, bad motherboard or bad cabling.  This doesn't apply
to pmem.  The usual path is:

   CPU
    |<--->memory
    |
  SATA controller
    |
    (SATA cable)
    |
  disk

The data goes to memory (very unlikely to to remain in the cache before
getting checksummed), then has to travel all the way down.  On the other
hand, the path on pmem is:

  CPU
   |---->memory

So the data written by userspace goes to memory... and that's it.


As for multi-device, at least single block groups would be very nice (to
have a filesystem than spans regions) and easyish to implement, while RAID0
might spoil hugepage fun but may still be straightforward.


Meow!
Goldwyn Rodrigues March 27, 2019, 11:31 p.m. UTC | #6
On  5:00 27/03, Matthew Wilcox wrote:
> On Wed, Mar 27, 2019 at 06:00:52AM -0500, Goldwyn Rodrigues wrote:
> > On 12:10 26/03, Matthew Wilcox wrote:
> > > On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote:
> > > > This sets S_DAX in inode->i_flags, which can be used with
> > > > IS_DAX().
> > > > 
> > > > The dax option is restricted to non multi-device mounts.
> > > > dax interacts with the device directly instead of using bio, so
> > > > all bio-hooks which we use for multi-device cannot be performed
> > > > here. While regular read/writes could be manipulated with
> > > > RAID0/1, mmap() is still an issue.
> > > > 
> > > > Auto-setting free space tree, because dealing with free space
> > > > inode (specifically readpages) is a nightmare.
> > > > Auto-setting nodatasum because we don't get callback for writing
> > > > checksums after mmap()s.
> > > 
> > > Congratulations on getting the bear to dance.  But why?
> > 
> > Why not ? ;)
> 
>  18 files changed, 662 insertions(+), 77 deletions(-)
> 
> I want to know what advantage we're getting for that.

So the prime advantages are btrfs filesystem snapshots on the dax device.
Leveraging btrfs features, reflinks works just as well.
(In addition to all the advantages dax has to offer)
David Sterba March 28, 2019, 2:49 p.m. UTC | #7
On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> This sets S_DAX in inode->i_flags, which can be used with
> IS_DAX().

I haven't followed the dax developments recently, IIRC the mount option
'dax' was meant to be a temporary solution. I don't know if this has
stuck, or is discouraged to be added to new code. The replacement is the
per-inode xflag.

> The dax option is restricted to non multi-device mounts.
> dax interacts with the device directly instead of using bio, so
> all bio-hooks which we use for multi-device cannot be performed
> here. While regular read/writes could be manipulated with
> RAID0/1, mmap() is still an issue.
> 
> Auto-setting free space tree, because dealing with free space
> inode (specifically readpages) is a nightmare.
> Auto-setting nodatasum because we don't get callback for writing
> checksums after mmap()s.

As discussed before, mandating free-space-tree for dax is ok and even
more that it does not complicate the space cache v1 code.

> Store the dax_device in fs_info which will be used in iomap code.
> Question: Since we have only one dax device, I thought fs_info is
> the best place. However, should it moved to btrfs_device?

At this point it's probably ok to store it in fs_info as the
multi-device support does not exist. If this changes, then the device is
the right place.

As Adam mentioned, support for the 'single' profile would be nice. It's
the simplest profile that over multiple devices only splits the logical
address space (of btrfs) into more chunks. However I don't know if the
dax side does require something that makes this impossible to coexist.
David Sterba March 28, 2019, 5:28 p.m. UTC | #8
On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> This sets S_DAX in inode->i_flags, which can be used with
> IS_DAX().
> 
> The dax option is restricted to non multi-device mounts.
> dax interacts with the device directly instead of using bio, so
> all bio-hooks which we use for multi-device cannot be performed
> here. While regular read/writes could be manipulated with
> RAID0/1, mmap() is still an issue.

I'm looking for other features that would not work with dax. Disabling
multiple devices strikes out device add, device delete and device
replace.

To be verified:

- balance - at least profile changes must be forbidden

- defrag

- compression - as it needs COW, it won't work on the nodatacow files,
  thus seems to be ok

- resize/grow - this could work without changes, no block relocation is
  needed, only new structures created

- resize/shrink - depends on balance, the block groups from the removed
  area need to be relocated

- swapfiles - I'm sure somebody will try that; I haven't seen any
  IS_DAX checks in the swapfile activation code
Darrick J. Wong March 28, 2019, 5:57 p.m. UTC | #9
On Thu, Mar 28, 2019 at 06:28:25PM +0100, David Sterba wrote:
> On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > 
> > This sets S_DAX in inode->i_flags, which can be used with
> > IS_DAX().
> > 
> > The dax option is restricted to non multi-device mounts.
> > dax interacts with the device directly instead of using bio, so
> > all bio-hooks which we use for multi-device cannot be performed
> > here. While regular read/writes could be manipulated with
> > RAID0/1, mmap() is still an issue.
> 
> I'm looking for other features that would not work with dax. Disabling
> multiple devices strikes out device add, device delete and device
> replace.
> 
> To be verified:
> 
> - balance - at least profile changes must be forbidden
> 
> - defrag
> 
> - compression - as it needs COW, it won't work on the nodatacow files,
>   thus seems to be ok
> 
> - resize/grow - this could work without changes, no block relocation is
>   needed, only new structures created
> 
> - resize/shrink - depends on balance, the block groups from the removed
>   area need to be relocated
> 
> - swapfiles - I'm sure somebody will try that; I haven't seen any
>   IS_DAX checks in the swapfile activation code

I've never tried dax swapfiles, but so long as you can issue bios to the
underlying bdev they ought to work, right?  Even if the swap code could
be streamlined by memcpy in the dax case.

(Granted I have no idea what sort of bio-hooks magic btrfs does, maybe
it really doesn't work there...)

--D
Goldwyn Rodrigues April 1, 2019, 8:43 p.m. UTC | #10
On 18:28 28/03, David Sterba wrote:
> On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > 
> > This sets S_DAX in inode->i_flags, which can be used with
> > IS_DAX().
> > 
> > The dax option is restricted to non multi-device mounts.
> > dax interacts with the device directly instead of using bio, so
> > all bio-hooks which we use for multi-device cannot be performed
> > here. While regular read/writes could be manipulated with
> > RAID0/1, mmap() is still an issue.
> 
> I'm looking for other features that would not work with dax. Disabling
> multiple devices strikes out device add, device delete and device
> replace.

I am glad you brought this up.
> 
> To be verified:
> 
> - balance - at least profile changes must be forbidden
> 
> - defrag

Disabled for now with EOPNOTSUPP primarily because of (null) readpages()
in dax.

> 
> - compression - as it needs COW, it won't work on the nodatacow files,
>   thus seems to be ok

Compression would require some sort of processing before writes. However,
users are writing directly to disk. So no, this can't be supported either.

> 
> - resize/grow - this could work without changes, no block relocation is
>   needed, only new structures created
> 
> - resize/shrink - depends on balance, the block groups from the removed
>   area need to be relocated
> 
> - swapfiles - I'm sure somebody will try that; I haven't seen any
>   IS_DAX checks in the swapfile activation code
diff mbox series

Patch

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b3642367a595..8ca1c0d120f4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1067,6 +1067,7 @@  struct btrfs_fs_info {
 	u32 metadata_ratio;
 
 	void *bdev_holder;
+	struct dax_device *dax_dev;
 
 	/* private scrub information */
 	struct mutex scrub_lock;
@@ -1442,6 +1443,7 @@  static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
 #define BTRFS_MOUNT_FREE_SPACE_TREE	(1 << 26)
 #define BTRFS_MOUNT_NOLOGREPLAY		(1 << 27)
 #define BTRFS_MOUNT_REF_VERIFY		(1 << 28)
+#define BTRFS_MOUNT_DAX			(1 << 29)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
 #define BTRFS_DEFAULT_MAX_INLINE	(2048)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6fe9197f6ee4..2bbb63b2fcff 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -16,6 +16,7 @@ 
 #include <linux/uuid.h>
 #include <linux/semaphore.h>
 #include <linux/error-injection.h>
+#include <linux/dax.h>
 #include <linux/crc32c.h>
 #include <linux/sched/mm.h>
 #include <asm/unaligned.h>
@@ -2805,6 +2806,8 @@  int open_ctree(struct super_block *sb,
 		goto fail_alloc;
 	}
 
+	fs_info->dax_dev = fs_dax_get_by_bdev(fs_devices->latest_bdev);
+
 	/*
 	 * We want to check superblock checksum, the type is stored inside.
 	 * Pass the whole disk block of size BTRFS_SUPER_INFO_SIZE (4k).
@@ -4043,6 +4046,7 @@  void close_ctree(struct btrfs_fs_info *fs_info)
 #endif
 
 	btrfs_close_devices(fs_info->fs_devices);
+	fs_put_dax(fs_info->dax_dev);
 	btrfs_mapping_tree_free(&fs_info->mapping_tree);
 
 	percpu_counter_destroy(&fs_info->dirty_metadata_bytes);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index ec2d8919e7fb..e66426e7692d 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -149,8 +149,11 @@  void btrfs_sync_inode_flags_to_i_flags(struct inode *inode)
 	if (binode->flags & BTRFS_INODE_DIRSYNC)
 		new_fl |= S_DIRSYNC;
 
+	if ((btrfs_test_opt(btrfs_sb(inode->i_sb), DAX)) && S_ISREG(inode->i_mode))
+		new_fl |= S_DAX;
+
 	set_mask_bits(&inode->i_flags,
-		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC,
+		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC | S_DAX,
 		      new_fl);
 }
 
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 120e4340792a..2d448b9d6004 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -326,6 +326,7 @@  enum {
 	Opt_treelog, Opt_notreelog,
 	Opt_usebackuproot,
 	Opt_user_subvol_rm_allowed,
+	Opt_dax,
 
 	/* Deprecated options */
 	Opt_alloc_start,
@@ -393,6 +394,7 @@  static const match_table_t tokens = {
 	{Opt_notreelog, "notreelog"},
 	{Opt_usebackuproot, "usebackuproot"},
 	{Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"},
+	{Opt_dax, "dax"},
 
 	/* Deprecated options */
 	{Opt_alloc_start, "alloc_start=%s"},
@@ -745,6 +747,28 @@  int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 		case Opt_user_subvol_rm_allowed:
 			btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED);
 			break;
+		case Opt_dax:
+#ifdef CONFIG_FS_DAX
+			if (btrfs_super_num_devices(info->super_copy) > 1) {
+				btrfs_info(info,
+					   "dax not supported for multi-device btrfs partition\n");
+				ret = -EOPNOTSUPP;
+				goto out;
+			}
+			btrfs_set_opt(info->mount_opt, DAX);
+			btrfs_warn(info, "DAX enabled. Warning: EXPERIMENTAL, use at your own risk\n");
+			btrfs_set_and_info(info, NODATASUM,
+					   "auto-setting nodatasum (dax)");
+			btrfs_clear_opt(info->mount_opt, SPACE_CACHE);
+			btrfs_set_and_info(info, FREE_SPACE_TREE,
+					"auto-setting free space tree (dax)");
+			break;
+#else
+			btrfs_err(info,
+				  "DAX option not supported\n");
+			ret = -EINVAL;
+			goto out;
+#endif
 		case Opt_enospc_debug:
 			btrfs_set_opt(info->mount_opt, ENOSPC_DEBUG);
 			break;
@@ -1335,6 +1359,8 @@  static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",clear_cache");
 	if (btrfs_test_opt(info, USER_SUBVOL_RM_ALLOWED))
 		seq_puts(seq, ",user_subvol_rm_allowed");
+	if (btrfs_test_opt(info, DAX))
+		seq_puts(seq, ",dax");
 	if (btrfs_test_opt(info, ENOSPC_DEBUG))
 		seq_puts(seq, ",enospc_debug");
 	if (btrfs_test_opt(info, AUTO_DEFRAG))