diff mbox series

[1/4] btrfs: include subvol identifier in inode number if -o inumbits=...

Message ID 162848132771.25823.16399620921119050413.stgit@noble.brown (mailing list archive)
State New, archived
Headers show
Series Attempt to make progress with btrfs dev number strangeness. | expand

Commit Message

NeilBrown Aug. 9, 2021, 3:55 a.m. UTC
A btrfs filesystem uses 112 bits to identify an object: 48 for the
object-id of the subvolume (really a 'subtree') and 64 for the object
within that tree.

It only exposes the 64bits in i_ino (and st_ino for stat()) and attempts
to hide the non-uniqueness by reporting a different st_dev in stat() for
each different subvol.

This is incomplete and doesn't scale.

It is incomplete because there are places other than 'stat()' where the
device number is visible to user-space, including /proc/$PID/mountinfo,
/proc/$PID/maps and /proc/locks.  These report the device-number for the
whole filesystem together with i_ino, and so do not match the
st_dev+st_ino reported by stat().

It is also incomplete because nfsd doesn't notice the st_dev value,
depending only of the existence of mount points to discover filesystem
changes.

It doesn't scale because there are a limited number of anon device
numbers that can be allocated, which is approximately 20 bits, much less
than the 48 bits worth of subvols which btrfs supports.

I believe that we *must* extend the user-space API to properly support
btrfs - trying to fit within it will always cause pain, at least in the
extremes.  I believe that the use of varying device numbers is not a
viable long-term solution and must be deprecated.

This patch is a first step towards deprecating the use of device numbers
to add uniqueness.  It changes the inode numbers reported so they are
unique across all subvols for modest sized filesystems.  It does this
by creating an 'overlay' from the subvol number and xor-ing that into
the file's object id.  This results in reported inode numbers being
completely unique within a subvol, and mostly unique between subvols.

The overlay is *not* xor-ed in when it exactly matches the objectid, as
that would produce zero.

A few placed in the code assume that ->i_ino is the objectid.  Those few
are changed to call btrfs_ino(), and ino now subtracts the overlay.

The "overlay" is created by byte-swapping the subvol identifier, then
optionally shifting down a few bits so there is unused space at the
top-end.  When the maximum objectid in use requires many fewer than 64
bits, and the maximum subvol id in use does not use all of the
remaining bits, complete uniqueness can be provided.  For larger
fileystems, complete uniqueness cannot be guaranteed.

The size of the shift can be set using the "inumbits" mount option.  A
value of 64 suppresses any shift and maximum uniqueness is provided.  A
value of 0 (the default) disables the overlay functionality.

A generally good value on 64bit systems which might use overlayfs with
btrdfs is 56, as this provides broad uniqueness, while leaving some bits
for overlayfs to differentiate between merged filesystems.

The only material improvement from this patch alone will come to
applications and tools which do not pay attention to st_dev (as that is
unchanged).  In particular, nfsd will, when "-o inumbits=56", report
(mostly) unique inode numbers to the NFS client, and some problems
caused by "find" and related tools detecting that the root of a subvol
has the name inode number as the root of its parent - both of which
appear to NFS to be in the same filesystem.

Subsequent patches will build on this base to allow the use of multiple
devices to be controlled, and then to allow complete uniqueness through
interface extensions.

ISSUE: In btrfs, inode numbers below the highest-currently-allocated are
   never reused.  This allows the highest inode number to be arbitrarily
   higher than the number of inodes.  This means that an "old"
   filesystem can trigger a risk of non-uniqueness just as a large
   filesystem can.

ISSUE: I don't understand the role of BTRFS_EMPTY_SUBVOL_DIR_OBJECTID
   so I don't know if I have to do anything when that value is assigned
   to i_ino.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/btrfs/btrfs_inode.h |   22 ++++++++++++++--------
 fs/btrfs/ctree.h       |    4 ++++
 fs/btrfs/disk-io.c     |   15 +++++++++++++++
 fs/btrfs/inode.c       |   17 ++++++++++++++---
 fs/btrfs/ioctl.c       |    2 +-
 fs/btrfs/super.c       |   24 +++++++++++++++++++++++-
 6 files changed, 71 insertions(+), 13 deletions(-)
diff mbox series

Patch

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index c652e19ad74e..18e1b071bb69 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -246,13 +246,6 @@  static inline unsigned long btrfs_inode_hash(u64 objectid,
 	return (unsigned long)h;
 }
 
-static inline void btrfs_insert_inode_hash(struct inode *inode)
-{
-	unsigned long h = btrfs_inode_hash(inode->i_ino, BTRFS_I(inode)->root);
-
-	__insert_inode_hash(inode, h);
-}
-
 static inline u64 btrfs_ino(const struct btrfs_inode *inode)
 {
 	u64 ino = inode->location.objectid;
@@ -261,11 +254,24 @@  static inline u64 btrfs_ino(const struct btrfs_inode *inode)
 	 * !ino: btree_inode
 	 * type == BTRFS_ROOT_ITEM_KEY: subvol dir
 	 */
-	if (!ino || inode->location.type == BTRFS_ROOT_ITEM_KEY)
+	if (!ino || inode->location.type == BTRFS_ROOT_ITEM_KEY) {
+		/* vfs_inode.i_ino has inum_overlay merged in, when
+		 * that wouldn't produce zero. We need to remove it here.
+		 */
 		ino = inode->vfs_inode.i_ino;
+		if (ino != inode->root->inum_overlay)
+			ino ^= inode->root->inum_overlay;
+	}
 	return ino;
 }
 
+static inline void btrfs_insert_inode_hash(struct inode *inode)
+{
+	unsigned long h = btrfs_inode_hash(btrfs_ino(BTRFS_I(inode)), BTRFS_I(inode)->root);
+
+	__insert_inode_hash(inode, h);
+}
+
 static inline void btrfs_i_size_write(struct btrfs_inode *inode, u64 size)
 {
 	i_size_write(&inode->vfs_inode, size);
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e5e53e592d4f..0ef557db3a8b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -987,6 +987,8 @@  struct btrfs_fs_info {
 	u32 csums_per_leaf;
 	u32 stripesize;
 
+	unsigned short inumbits;
+
 	/* Block groups and devices containing active swapfiles. */
 	spinlock_t swapfile_pins_lock;
 	struct rb_root swapfile_pins;
@@ -1145,6 +1147,8 @@  struct btrfs_root {
 
 	u64 last_trans;
 
+	u64 inum_overlay;
+
 	u32 type;
 
 	u64 free_objectid;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a59ab7b9aea0..7f3bfa042d66 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1202,6 +1202,12 @@  static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
 	memset(&root->defrag_progress, 0, sizeof(root->defrag_progress));
 	root->root_key.objectid = objectid;
 	root->anon_dev = 0;
+	if (fs_info->inumbits &&
+	    objectid != BTRFS_FS_TREE_OBJECTID &&
+	    is_fstree(objectid))
+		root->inum_overlay = swab64(objectid) >> (64 - fs_info->inumbits);
+	else
+		root->inum_overlay = 0;
 
 	spin_lock_init(&root->root_item_lock);
 	btrfs_qgroup_init_swapped_blocks(&root->swapped_blocks);
@@ -3314,12 +3320,21 @@  int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 	 */
 	fs_info->compress_type = BTRFS_COMPRESS_ZLIB;
 
+	fs_info->inumbits = BITS_PER_LONG + 1; /* impossible value */
+
 	ret = btrfs_parse_options(fs_info, options, sb->s_flags);
 	if (ret) {
 		err = ret;
 		goto fail_alloc;
 	}
 
+	/* By default, use inumbits=0 to avoid behaviour change.
+	 * "-o inumbits" can over-ride this default.
+	 * BITS_PER_LONG * 7 / 8 is a good value to use
+	 */
+	if (fs_info->inumbits > BITS_PER_LONG)
+		fs_info->inumbits = 0;
+
 	features = btrfs_super_incompat_flags(disk_super) &
 		~BTRFS_FEATURE_INCOMPAT_SUPP;
 	if (features) {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 01099bf602fb..860cb5045123 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5782,6 +5782,11 @@  static int btrfs_init_locked_inode(struct inode *inode, void *p)
 	struct btrfs_iget_args *args = p;
 
 	inode->i_ino = args->ino;
+	if (args->root && args->ino != args->root->inum_overlay)
+		/* This inode number will still be unique within this
+		 * 'root', and should be nearly unique across the filesystem.
+		 */
+		inode->i_ino ^= args->root->inum_overlay;
 	BTRFS_I(inode)->location.objectid = args->ino;
 	BTRFS_I(inode)->location.type = BTRFS_INODE_ITEM_KEY;
 	BTRFS_I(inode)->location.offset = 0;
@@ -6092,6 +6097,7 @@  static int btrfs_real_readdir(struct file *file, struct dir_context *ctx)
 
 	while (1) {
 		struct dir_entry *entry;
+		u64 inum;
 
 		leaf = path->nodes[0];
 		slot = path->slots[0];
@@ -6136,7 +6142,10 @@  static int btrfs_real_readdir(struct file *file, struct dir_context *ctx)
 		put_unaligned(fs_ftype_to_dtype(btrfs_dir_type(leaf, di)),
 				&entry->type);
 		btrfs_dir_item_key_to_cpu(leaf, di, &location);
-		put_unaligned(location.objectid, &entry->ino);
+		inum = location.objectid;
+		if (inum != root->inum_overlay)
+			inum ^= root->inum_overlay;
+		put_unaligned(inum, &entry->ino);
 		put_unaligned(found_key.offset, &entry->offset);
 		entries++;
 		addr += sizeof(struct dir_entry) + name_len;
@@ -6333,7 +6342,7 @@  static int btrfs_insert_inode_locked(struct inode *inode)
 	args.root = BTRFS_I(inode)->root;
 
 	return insert_inode_locked4(inode,
-		   btrfs_inode_hash(inode->i_ino, BTRFS_I(inode)->root),
+		   btrfs_inode_hash(btrfs_ino(BTRFS_I(inode)), BTRFS_I(inode)->root),
 		   btrfs_find_actor, &args);
 }
 
@@ -6412,6 +6421,8 @@  static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
 	 * number if we fail afterwards in this function.
 	 */
 	inode->i_ino = objectid;
+	if (objectid != root->inum_overlay)
+		inode->i_ino ^= root->inum_overlay;
 
 	if (dir && name) {
 		trace_btrfs_inode_request(dir);
@@ -9515,7 +9526,7 @@  static int btrfs_rename(struct inode *old_dir, struct dentry *old_dentry,
 
 
 	/* check for collisions, even if the  name isn't there */
-	ret = btrfs_check_dir_item_collision(dest, new_dir->i_ino,
+	ret = btrfs_check_dir_item_collision(dest, btrfs_ino(BTRFS_I(new_dir)),
 			     new_dentry->d_name.name,
 			     new_dentry->d_name.len);
 
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0ba98e08a029..e008a9ceb827 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -907,7 +907,7 @@  static noinline int btrfs_mksubvol(const struct path *parent,
 	 * check for them now when we can safely fail
 	 */
 	error = btrfs_check_dir_item_collision(BTRFS_I(dir)->root,
-					       dir->i_ino, name,
+					       btrfs_ino(BTRFS_I(dir)), name,
 					       namelen);
 	if (error)
 		goto out_dput;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index d07b18b2b250..5f3350e2f7ec 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -359,6 +359,7 @@  enum {
 	Opt_defrag, Opt_nodefrag,
 	Opt_discard, Opt_nodiscard,
 	Opt_discard_mode,
+	Opt_inumbits,
 	Opt_norecovery,
 	Opt_ratio,
 	Opt_rescan_uuid_tree,
@@ -427,6 +428,7 @@  static const match_table_t tokens = {
 	{Opt_nodefrag, "noautodefrag"},
 	{Opt_discard, "discard"},
 	{Opt_discard_mode, "discard=%s"},
+	{Opt_inumbits, "inumbits=%u"},
 	{Opt_nodiscard, "nodiscard"},
 	{Opt_norecovery, "norecovery"},
 	{Opt_ratio, "metadata_ratio=%u"},
@@ -830,6 +832,25 @@  int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 			btrfs_clear_and_info(info, FLUSHONCOMMIT,
 					     "turning off flush-on-commit");
 			break;
+		case Opt_inumbits:
+			if (info->inumbits <= BITS_PER_LONG)
+				/* silently ignore subsequent change
+				 * e.g. on remount
+				 */
+				break;
+			ret = match_int(&args[0], &intarg);
+			if (ret)
+				goto out;
+			if (intarg > BITS_PER_LONG ||
+			    (intarg && intarg < BITS_PER_LONG / 2)) {
+				btrfs_err(info,
+					  "inumbits must be 0 or in range [%d..%d]",
+					  BITS_PER_LONG/2, BITS_PER_LONG);
+				ret = -EINVAL;
+				goto out;
+			}
+			info->inumbits = intarg;
+			break;
 		case Opt_ratio:
 			ret = match_int(&args[0], &intarg);
 			if (ret)
@@ -1537,6 +1558,7 @@  static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 #endif
 	if (btrfs_test_opt(info, REF_VERIFY))
 		seq_puts(seq, ",ref_verify");
+	seq_printf(seq, ",inumbits=%u", info->inumbits);
 	seq_printf(seq, ",subvolid=%llu",
 		  BTRFS_I(d_inode(dentry))->root->root_key.objectid);
 	subvol_name = btrfs_get_subvol_name_from_objectid(info,
@@ -1570,7 +1592,7 @@  static int btrfs_set_super(struct super_block *s, void *data)
  */
 static inline int is_subvolume_inode(struct inode *inode)
 {
-	if (inode && inode->i_ino == BTRFS_FIRST_FREE_OBJECTID)
+	if (inode && btrfs_ino(BTRFS_I(inode)) == BTRFS_FIRST_FREE_OBJECTID)
 		return 1;
 	return 0;
 }