[PATCH/RFC,0/4] Attempt to make progress with btrfs dev number strangeness.

From: NeilBrown <neilb@suse.de>
To: Josef Bacik <josef@toxicpanda.com>, Chris Mason <clm@fb.com>, David Sterba <dsterba@suse.com>
Cc: linux-fsdevel@vger.kernel.org, Linux NFS list <linux-nfs@vger.kernel.org>, Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Date: Mon, 09 Aug 2021 13:55:27 +1000
Message-ID: <162848123483.25823.15844774651164477866.stgit@noble.brown>

NeilBrown Aug. 9, 2021, 3:55 a.m. UTC

I continue to search for a way forward for btrfs so that its behaviour
with respect to device numbers and subvols is somewhat coherent.

This series implements some of the ideas in my "A Third perspective"[1],
though with changes in various details.

I introduce two new mount options, which default to
no-change-in-behaviour.

 -o inumbits=  causes inode numbers to be more unique across a whole btrfs
               filesystem, and in many cases completely unique.  Mounting
               with "-o inumbits=56" will resolve the NFS issues that
               started me tilting at this particular windmill.

 -o numdevs=  can reduce the number of distinct devices reported by
              stat(), either to 2 or to 1.
              Both ease problems for sites that exhaust their supply of
              device numbers.
              '2' allows "du -x" to continue to work, but is otherwise
              rather strange.
              '1' breaks the use of "du -x" and similar to examine a
              single subvol which might have subvol descendants, but
              provides generally sane behaviour.
              "-o numdevs=1" also forces inumbits to have a useful value.

I introduce a "tree id" which can be discovered using statx().  Two
files with the same dev and ino might still be different if the tree-ids
are different.  Connected files with the same tree-id may be usefully
considered to be related.
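
As a rough illustration of that rule, the sketch below treats file identity
as the triple (dev, tree id, ino) rather than the usual (dev, ino) pair.
The names FileId, same_file and possibly_related are hypothetical, and the
tree id is assumed to be a plain integer already obtained from the proposed
statx() field; nothing here calls statx() itself.

```python
from typing import NamedTuple


class FileId(NamedTuple):
    """Identity of a file under the proposed scheme (hypothetical type)."""
    dev: int      # device number as reported by stat()
    tree_id: int  # proposed tree id, assumed to be a plain integer
    ino: int      # inode number as reported by stat()


def same_file(a: FileId, b: FileId) -> bool:
    # Equal dev and ino are not sufficient: the tree ids must match too.
    return a == b


def possibly_related(a: FileId, b: FileId) -> bool:
    # Files sharing a device and a tree id live in the same subvolume tree.
    return a.dev == b.dev and a.tree_id == b.tree_id
```

A tool that detects hard links by remembering (dev, ino) pairs would, under
this assumption, need to remember the full triple instead.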

I also change various /proc files (only when numdevs=1 is used) to
provide extra information so they are useful with btrfs despite subvols.
/proc/maps /proc/smaps /proc/locks /proc/X/fdinfo/Y are affected.
The inode number becomes "XX:YY" where XX is the subvol number (tree id)
and YY is the inode number.
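
To make the compatibility question concrete, here is a minimal sketch of a
parser that accepts both the classic single-number inode field and the
"XX:YY" form.  The function name parse_inode_field is hypothetical, and the
sketch assumes both numbers are printed in decimal, which the description
above does not state.

```python
def parse_inode_field(field: str):
    """Split an inode field into (tree_id, ino).

    tree_id is None for the classic form that carries only an inode number.
    Both numbers are assumed to be decimal.
    """
    tree, sep, ino = field.rpartition(":")
    if not sep:
        return None, int(field)
    return int(tree), int(ino)
```

A parser that simply calls int() on the field would raise ValueError on
"256:257", which is the "badly formatted line" case discussed in this
message.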

An alternative might be to report a number which might use up to 128 bits.
Which is less likely to seriously break code?

Note that code which ignores badly formatted lines is safe, because it
will never currently find a match for a btrfs file in these files
anyway.  The device number they report is never returned in st_dev for
stat() on any file.

The audit subsystem and one or two other places report dev/ino and so
need to be enhanced, but I haven't tried to address those.

Various trace points also report dev/ino.  I haven't tried thinking
about those either.

Thanks for your upcoming replies!

NeilBrown

---

NeilBrown (4):
      btrfs: include subvol identifier in inode number if -o inumbits=...
      btrfs: add numdevs= mount option.
      VFS/btrfs: add STATX_TREE_ID
      Add "tree" number to "inode" number in various /proc files.


 fs/btrfs/ctree.h          | 17 +++++++++++++++--
 fs/btrfs/disk-io.c        | 24 +++++++++++++++++++++---
 fs/btrfs/inode.c          | 39 ++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/ioctl.c          |  6 ++++--
 fs/btrfs/super.c          | 31 +++++++++++++++++++++++++++++++
 fs/inode.c                |  1 +
 fs/locks.c                | 12 +++++++++---
 fs/notify/fdinfo.c        | 19 ++++++++++++++-----
 fs/proc/nommu.c           | 11 ++++++++---
 fs/proc/task_mmu.c        | 17 ++++++++++++-----
 fs/proc/task_nommu.c      | 11 ++++++++---
 fs/stat.c                 |  2 ++
 include/linux/fs.h        |  3 ++-
 include/linux/stat.h      | 13 +++++++++++++
 include/uapi/linux/stat.h |  3 ++-
 samples/vfs/test-statx.c  |  4 +++-
 16 files changed, 183 insertions(+), 30 deletions(-)


Josef Bacik Aug. 10, 2021, 8:51 p.m. UTC | #1

On 8/8/21 11:55 PM, NeilBrown wrote:
> I continue to search for a way forward for btrfs so that its behaviour
> with respect to device numbers and subvols is somewhat coherent.
> 
> This series implements some of the ideas in my "A Third perspective"[1],
> though with changes is various details.
> 
> I introduce two new mount options, which default to
> no-change-in-behaviour.
> 
>   -o inumbits=  causes inode numbers to be more unique across a whole btrfs
>                 filesystem, and is many cases completely unique.  Mounting
>                 with "-i inumbits=56" will resolve the NFS issues that
>                 started me tilting at this particular windmill.
> 
>   -o numdevs=  can reduce the number of distinct devices reported by
>                stat(), either to 2 or to 1.
>                Both ease problems for sites that exhaust their supply of
>                device numbers.
>                '2' allows "du -x" to continue to work, but is otherwise
>                rather strange.
>                '1' breaks the use of "du -x" and similar to examine a
>                single subvol which might have subvol descendants, but
>                provides generally sane behaviour
>                "-o numdevs=1" also forces inumbits to have a useful value.
> 
> I introduce a "tree id" which can be discovered using statx().  Two
> files with the same dev and ino might still be different if the tree-ids
> are different.  Connected files with the same tree-id may be usefully
> considered to be related.
> 
> I also change various /proc files (only when numdevs=1 is used) to
> provide extra information so they are useful with btrfs despite subvols.
> /proc/maps /proc/smaps /proc/locks /proc/X/fdinfo/Y are affected.
> The inode number becomes "XX:YY" where XX is the subvol number (tree id)
> and YY is the inode number.
> 
> An alternate might be to report a number which might use up to 128 bits.
> Which is less likely to seriously break code?
> 
> Note that code which ignores badly formatted lines is safe, because it
> will never currently find a match for a btrfs file in these files
> anyway.  The device number they report is never returned in st_dev for
> stat() on any file.
> 
> The audit subsystem and one or two other places report dev/ino and so
> need enhanced, but I haven't tried to address those.
> 
> Various trace points also report dev/ino.  I haven't tried thinking
> about those either.

I think this is a step in the right direction, but I want to figure out a way to 
accomplish this without magical mount points that users must be aware of.

I think the stat() st_dev ship has sailed, we're stuck with that.  However 
Christoph does have a valid point where it breaks the various info spit out by 
/proc.  You've done a good job with the treeid here, but it still makes it 
impossible for somebody to map the st_dev back to the correct mount.

I think we aren't going to solve that problem, at least not with stat().  I 
think with statx() spitting out treeid we have given userspace a way to 
differentiate subvolumes, and so we should fix statx() to spit out the super 
block device, that way new userspace things can do their appropriate lookup if 
they so choose.

This leaves the problem of nfsd.  Can you just integrate this new treeid into 
nfsd, and use that to either change the ino within nfsd itself, or do something 
similar to what your first patchset did and generate a fsid based on the treeid?

Mount options are messy, and are just going to lead to distros turning them on 
without understanding what's going on and then we have to support them forever. 
I want to get this fixed in a way that we all hate the least with as little 
opportunity for confused users to make bad decisions.  Thanks,

Josef

NeilBrown Aug. 11, 2021, 10:13 p.m. UTC | #2

On Wed, 11 Aug 2021, Josef Bacik wrote:
> 
> I think this is a step in the right direction, but I want to figure out a way to 
> accomplish this without magical mount points that users must be aware of.

magic mount *options* ???

> 
> I think the stat() st_dev ship as sailed, we're stuck with that.  However 
> Christoph does have a valid point where it breaks the various info spit out by 
> /proc.  You've done a good job with the treeid here, but it still makes it 
> impossible for somebody to map the st_dev back to the correct mount.

The ship might have sailed, but it is not water tight.  And as the world
is round, it can still come back to bite us from behind.
Anything can be transitioned away from, whether it is devfs or 32-bit
time or giving different device numbers to different file-trees.

The linkage between device number and filesystem is quite strong.
We could modify all of /proc and /sys/ and audit and whatever else to
report the fake device number, but we cannot get the fake device number
into the mount table (without making the mount table unmanageably
large).
And if subtrees aren't in the mount-table for the NFS server, I don't
think they should be in the mount-table of the NFS client.  So we cannot
export them to NFS.

I understand your dislike for mount options.  An alternative with
different costs and benefits would be to introduce a new filesystem type
- btrfs2 or maybe betrfs.  This would provide numdevs=1 semantics and do
whatever we decided was best with inode numbers.  How much would you
hate that?

> 
> I think we aren't going to solve that problem, at least not with stat().  I 
> think with statx() spitting out treeid we have given userspace a way to 
> differentiate subvolumes, and so we should fix statx() to spit out the the super 
> block device, that way new userspace things can do their appropriate lookup if 
> they so choose.

I don't think we should normalize having multiple devnums per filesystem
by encoding it in statx().  It *would* make sense to add a btrfs ioctl
which reports the real device number of a file.  Tools that really need
to work with btrfs could use that, but it would always be obvious that
it was an exception.

> 
> This leaves the problem of nfsd.  Can you just integrate this new treeid into 
> nfsd, and use that to either change the ino within nfsd itself, or do something 
> similar to what your first patchset did and generate a fsid based on the treeid?

I would only want nfsd to change the inode number.  I no longer think it
is acceptable for nfsd to report a different device number (as I mention
above).
I would want the new inode number to be explicitly provided by the
filesystem.  Whether that is a new export_operation or a new field in
'struct kstat' doesn't really bother me.  I'd *prefer* it to be st_ino,
but I can live without that.

On the topic of inode numbers....  I've recently learned that btrfs
never reuses inode (objectid) numbers (except possibly after an
unmount).  Equally it doesn't re-use subvol numbers.  How much does this
contribute to the 64 bits not being enough for subtree+inode?

It would be nice if we could be comfortable limiting the objectid number
to 40 bits and the root.objectid (filetree) number to 24 bits, and
combine them into a 64bit inode number.
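
A minimal sketch of such a split, assuming the 24-bit filetree number goes
in the high bits and the 40-bit objectid in the low bits (the layout is an
assumption; only the widths come from the paragraph above).  pack_ino and
unpack_ino are hypothetical names.

```python
OBJECTID_BITS = 40  # width proposed for the per-tree objectid
SUBVOL_BITS = 24    # width proposed for the root.objectid (filetree) number


def pack_ino(subvol: int, objectid: int) -> int:
    """Combine a filetree number and an objectid into one 64-bit value."""
    if not 0 <= subvol < (1 << SUBVOL_BITS):
        raise ValueError("subvol does not fit in 24 bits")
    if not 0 <= objectid < (1 << OBJECTID_BITS):
        raise ValueError("objectid does not fit in 40 bits")
    return (subvol << OBJECTID_BITS) | objectid


def unpack_ino(ino: int):
    """Recover (subvol, objectid) from a packed 64-bit inode number."""
    return ino >> OBJECTID_BITS, ino & ((1 << OBJECTID_BITS) - 1)
```

With fixed widths the mapping is reversible and collision-free, but only as
long as neither counter ever exceeds its field, which is why number reuse
matters here.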

If we added an inode number reuse scheme that was suitably performant,
would that make this possible?  That would remove the need for a treeid,
and allow us to use project-id to identify subtrees.

> 
> Mount options are messy, and are just going to lead to distro's turning them on 
> without understanding what's going on and then we have to support them forever. 
>   I want to get this fixed in a way that we all hate the least with as little 
> opportunity for confused users to make bad decisions.  Thanks,

Hence my question: how much do you hate creating a new filesystem type
to fix the problems?

Thanks,
NeilBrown

Josef Bacik Aug. 12, 2021, 1:54 p.m. UTC | #3

On 8/11/21 6:13 PM, NeilBrown wrote:
> On Wed, 11 Aug 2021, Josef Bacik wrote:
>>
>> I think this is a step in the right direction, but I want to figure out a way to
>> accomplish this without magical mount points that users must be aware of.
> 
> magic mount *options* ???
> 
>>
>> I think the stat() st_dev ship as sailed, we're stuck with that.  However
>> Christoph does have a valid point where it breaks the various info spit out by
>> /proc.  You've done a good job with the treeid here, but it still makes it
>> impossible for somebody to map the st_dev back to the correct mount.
> 
> The ship might have sailed, but it is not water tight.  And as the world
> it round, it can still come back to bite us from behind.
> Anything can be transitioned away from, whether it is devfs or 32-bit
> time or giving different device numbers to different file-trees.
> 
> The linkage between device number and and filesystem is quite strong.
> We could modified all of /proc and /sys/ and audit and whatever else to
> report the fake device number, but we cannot get the fake device number
> into the mount table (without making the mount table unmanageablely
> large).
> And if subtrees aren't in the mount-table for the NFS server, I don't
> think they should be in the mount-table of the NFS client.  So we cannot
> export them to NFS.
> 
> I understand your dislike for mount options.  An alternative with
> different costs and benefits would be to introduce a new filesystem type
> - btrfs2 or maybe betrfs.  This would provide numdevs=1 semantics and do
> whatever we decided was best with inode numbers.  How much would you
> hate that?
> 

A lot more ;).

>>
>> I think we aren't going to solve that problem, at least not with stat().  I
>> think with statx() spitting out treeid we have given userspace a way to
>> differentiate subvolumes, and so we should fix statx() to spit out the the super
>> block device, that way new userspace things can do their appropriate lookup if
>> they so choose.
> 
> I don't think we should normalize having multiple devnums per filesystem
> by encoding it in statx().  It *would* make sense to add a btrfs ioctl
> which reports the real device number of a file.  Tools that really need
> to work with btrfs could use that, but it would always be obvious that
> it was an exception.

That's not what I'm saying.  I'm saying that stat() continues to behave the way 
it currently does, for legacy users.

And then for statx() it returns the correct devnum like any other file system, 
with the augmentation of the treeid so that future userspace programs can use 
the treeid to decide if they want to wander into a subvolume.

This way moving forward we have a way to map back to a mount point because 
statx() will return the actual devnum for the mountpoint, and then we can use 
the treeid to be smart about when we wander into a subvolume.

And if we're going to add a treeid, I would actually like to add a parent_treeid 
as well so we could tell if we're a snapshot or just a normal subvolume.

> 
>>
>> This leaves the problem of nfsd.  Can you just integrate this new treeid into
>> nfsd, and use that to either change the ino within nfsd itself, or do something
>> similar to what your first patchset did and generate a fsid based on the treeid?
> 
> I would only want nfsd to change the inode number.  I no longer think it
> is acceptable for nfsd to report different device number (as I mention
> above).
> I would want the new inode number to be explicitly provided by the
> filesystem.  Whether that is a new export_operation or a new field in
> 'struct kstat' doesn't really bother me.  I'd *prefer* it to be st_ino,
> but I can live without that.
>

Right, I'm not saying nfsd has to propagate our dev_t thing, I'm saying that you 
could accomplish the same behavior without the mount options.  We add either a 
new SB_I_HAS_TREEID or FS_HAS_TREEID, depending on if you prefer to tag the sb 
or the fs_type, and then NFS does the inode number magic transformation 
automatically and we are good to go.

> On the topic of inode numbers....  I've recently learned that btrfs
> never reuses inode (objectid) numbers (except possibly after an
> unmount).  Equally it doesn't re-use subvol numbers.  How much does this
> contribute to the 64 bits not being enough for subtree+inode?
> 
> It would be nice if we could be comfortable limiting the objectid number
> to 40 bits and the root.objectid (filetree) number to 24 bits, and
> combine them into a 64bit inode number.
> 
> If we added a inode number reuse scheme that was suitably performant,
> would that make this possible?  That would remove the need for a treeid,
> and allow us to use project-id to identify subtrees.
> 

We had a reuse scheme, we deprecated and deleted it.  I don't want to 
arbitrarily limit objectids to work around this issue.

>>
>> Mount options are messy, and are just going to lead to distro's turning them on
>> without understanding what's going on and then we have to support them forever.
>>    I want to get this fixed in a way that we all hate the least with as little
>> opportunity for confused users to make bad decisions.  Thanks,
> 
> Hence my question: how much do you hate creating a new filesystem type
> to fix the problems?
> 

I'm still not convinced we can't solve this without adding new options or 
fstypes.  I think flags to indicate that we're special and to use a treeid that 
we stuff into the inode would be a reasonable solution.  That being said I'm a 
little sleep deprived so I could be missing why my plan is a bad one, so I'm 
willing to be convinced that mount options are the solution to this, but I want 
to make sure we're damned certain that's the best way forward.  Thanks,

Josef

Hugo Mills Aug. 12, 2021, 2:06 p.m. UTC | #4

On Thu, Aug 12, 2021 at 09:54:54AM -0400, Josef Bacik wrote:
> On 8/11/21 6:13 PM, NeilBrown wrote:
> > On Wed, 11 Aug 2021, Josef Bacik wrote:
> > > 
> > > I think this is a step in the right direction, but I want to figure out a way to
> > > accomplish this without magical mount points that users must be aware of.
> > 
> > magic mount *options* ???
> > 
> > > 
> > > I think the stat() st_dev ship as sailed, we're stuck with that.  However
> > > Christoph does have a valid point where it breaks the various info spit out by
> > > /proc.  You've done a good job with the treeid here, but it still makes it
> > > impossible for somebody to map the st_dev back to the correct mount.
> > 
> > The ship might have sailed, but it is not water tight.  And as the world
> > it round, it can still come back to bite us from behind.
> > Anything can be transitioned away from, whether it is devfs or 32-bit
> > time or giving different device numbers to different file-trees.
> > 
> > The linkage between device number and and filesystem is quite strong.
> > We could modified all of /proc and /sys/ and audit and whatever else to
> > report the fake device number, but we cannot get the fake device number
> > into the mount table (without making the mount table unmanageablely
> > large).
> > And if subtrees aren't in the mount-table for the NFS server, I don't
> > think they should be in the mount-table of the NFS client.  So we cannot
> > export them to NFS.
> > 
> > I understand your dislike for mount options.  An alternative with
> > different costs and benefits would be to introduce a new filesystem type
> > - btrfs2 or maybe betrfs.  This would provide numdevs=1 semantics and do
> > whatever we decided was best with inode numbers.  How much would you
> > hate that?
> > 
> 
> A lot more ;).
> 
> > > 
> > > I think we aren't going to solve that problem, at least not with stat().  I
> > > think with statx() spitting out treeid we have given userspace a way to
> > > differentiate subvolumes, and so we should fix statx() to spit out the the super
> > > block device, that way new userspace things can do their appropriate lookup if
> > > they so choose.
> > 
> > I don't think we should normalize having multiple devnums per filesystem
> > by encoding it in statx().  It *would* make sense to add a btrfs ioctl
> > which reports the real device number of a file.  Tools that really need
> > to work with btrfs could use that, but it would always be obvious that
> > it was an exception.
> 
> That's not what I'm saying.  I'm saying that stat() continues to behave the
> way it currently does, for legacy users.
> 
> And then for statx() it returns the correct devnum like any other file
> system, with the augmentation of the treeid so that future userspace
> programs can use the treeid to decide if they want to wander into a
> subvolume.
> 
> This way moving forward we have a way to map back to a mount point because
> statx() will return the actual devnum for the mountpoint, and then we can
> use the treeid to be smart about when we wander into a subvolume.
> 
> And if we're going to add a treeid, I would actually like to add a
> parent_treeid as well so we could tell if we're a snapshot or just a normal
> subvolume.

   Can I make a request to call it something other than a
"parent"? There are at least three different usages of "parent" for
three different concepts related to subvolumes in btrfs(*), and it'd
be nice to avoid the inevitable confusion.

(*) 1. "subvolume containing this one",
    2. "subvolume that was snapshotted to make this one", and,
    3. at least informally, "subvolume that was sent/received to make this one"

   Hugo.

[snip to end]

NeilBrown Aug. 12, 2021, 10:35 p.m. UTC | #5

On Thu, 12 Aug 2021, Josef Bacik wrote:
> On 8/11/21 6:13 PM, NeilBrown wrote:
> > On Wed, 11 Aug 2021, Josef Bacik wrote:
> >>
> >> I think this is a step in the right direction, but I want to figure out a way to
> >> accomplish this without magical mount points that users must be aware of.
> > 
> > magic mount *options* ???
> > 
> >>
> >> I think the stat() st_dev ship as sailed, we're stuck with that.  However
> >> Christoph does have a valid point where it breaks the various info spit out by
> >> /proc.  You've done a good job with the treeid here, but it still makes it
> >> impossible for somebody to map the st_dev back to the correct mount.
> > 
> > The ship might have sailed, but it is not water tight.  And as the world
> > it round, it can still come back to bite us from behind.
> > Anything can be transitioned away from, whether it is devfs or 32-bit
> > time or giving different device numbers to different file-trees.
> > 
> > The linkage between device number and and filesystem is quite strong.
> > We could modified all of /proc and /sys/ and audit and whatever else to
> > report the fake device number, but we cannot get the fake device number
> > into the mount table (without making the mount table unmanageablely
> > large).
> > And if subtrees aren't in the mount-table for the NFS server, I don't
> > think they should be in the mount-table of the NFS client.  So we cannot
> > export them to NFS.
> > 
> > I understand your dislike for mount options.  An alternative with
> > different costs and benefits would be to introduce a new filesystem type
> > - btrfs2 or maybe betrfs.  This would provide numdevs=1 semantics and do
> > whatever we decided was best with inode numbers.  How much would you
> > hate that?
> > 
> 
> A lot more ;).
> 
> >>
> >> I think we aren't going to solve that problem, at least not with stat().  I
> >> think with statx() spitting out treeid we have given userspace a way to
> >> differentiate subvolumes, and so we should fix statx() to spit out the the super
> >> block device, that way new userspace things can do their appropriate lookup if
> >> they so choose.
> > 
> > I don't think we should normalize having multiple devnums per filesystem
> > by encoding it in statx().  It *would* make sense to add a btrfs ioctl
> > which reports the real device number of a file.  Tools that really need
> > to work with btrfs could use that, but it would always be obvious that
> > it was an exception.
> 
> That's not what I'm saying.  I'm saying that stat() continues to behave the way 
> it currently does, for legacy users.
> 
> And then for statx() it returns the correct devnum like any other file system, 
> with the augmentation of the treeid so that future userspace programs can use 
> the treeid to decide if they want to wander into a subvolume.

Yes, that is what I thought you were saying.  It implies that the
possibility of a file having two different device numbers becomes
normalised in the API - one returned by stat(), the other by statx()
(presumably in a new field - the FS cannot tell what libc call the
application made).  I don't like that.

> 
> This way moving forward we have a way to map back to a mount point because 
> statx() will return the actual devnum for the mountpoint, and then we can use 
> the treeid to be smart about when we wander into a subvolume.

We already have a way to map back to a mountpoint.  statx reports a
mnt_id with result flag STATX_MNT_ID.  This is the number at the start
of the line in mountinfo.  Hmmm, this isn't in the manpage.  It has been
in the kernel since Linux 5.8.  I'll send a patch for the manpage.
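
As a rough sketch of that lookup, the following parses mountinfo-format
text and maps each mount ID (the first field) to its mount point (the fifth
field).  The function name and the sample text are made up for
illustration, and the sketch assumes the mnt_id is already in hand; it does
not call statx().

```python
def mount_points_by_id(mountinfo_text: str) -> dict:
    """Map mount ID -> mount point from /proc/<pid>/mountinfo style text."""
    table = {}
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or truncated lines
        # fields: mount ID, parent ID, major:minor, root, mount point, ...
        table[int(fields[0])] = fields[4]
    return table


# Illustrative sample only; the IDs, devices and paths are invented.
SAMPLE = """\
36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue
41 36 0:45 / /data rw,relatime - btrfs /dev/sdb1 rw,space_cache
"""
```

On a live system the text would come from /proc/self/mountinfo.  Mount
points containing spaces appear there with octal escapes such as \040,
which this sketch leaves undecoded.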

So we could pursue a path where the device-id no longer defines the
filesystem (or mount), but instead it defines some arbitrary grouping of
objects within a filesystem.  So instead of my proposed
   dev-id  /  subtree-id / inode-number
we would have
   dev-id-in-mountinfo / mnt_id / dev-id-in-stat / inode-number

In some ways this would be a smoother path forward - no change to statx,
no new concepts, just formalizing some de-facto concepts.
In other ways it might be rougher - we would need to convince the
community to use the stat() dev-id in all those proc files etc.

I think having the two meanings for a device-id would cause confusion for
quite some years..... but then any change will probably cause confusion.

> 
> And if we're going to add a treeid, I would actually like to add a parent_treeid 
> as well so we could tell if we're a snapshot or just a normal subvolume.

Is this a well-defined concept? Isn't "snapshot" just one possible
use-case for the btrfs functionality of creating a reflink to a subtree?
What happens to the "parent_treeid" reference when that "parent" gets
deleted?

I understand the desire to track this sort of connection, but I wonder
if the filesystem is really the right place to track it.  Maybe having
the tools track it would be better.

> 
> > 
> >>
> >> This leaves the problem of nfsd.  Can you just integrate this new treeid into
> >> nfsd, and use that to either change the ino within nfsd itself, or do something
> >> similar to what your first patchset did and generate a fsid based on the treeid?
> > 
> > I would only want nfsd to change the inode number.  I no longer think it
> > is acceptable for nfsd to report different device number (as I mention
> > above).
> > I would want the new inode number to be explicitly provided by the
> > filesystem.  Whether that is a new export_operation or a new field in
> > 'struct kstat' doesn't really bother me.  I'd *prefer* it to be st_ino,
> > but I can live without that.
> >
> 
> Right, I'm not saying nfsd has to propagate our dev_t thing, I'm saying that you 
> could accomplish the same behavior without the mount options.  We add either a 
> new SB_I_HAS_TREEID or FS_HAS_TREEID, depending on if you prefer to tag the sb 
> or the fs_type, and then NFS does the inode number magic transformation 
> automatically and we are good to go.

I really don't want nfsd to do the magic transformations.  I want the
filesystem to do those if they need to be done.  I could cope with nfsd
xor-ing some provided number with i_ino, but I wouldn't like nfsd to
have the responsibility of doing the swab64().
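
For readers unfamiliar with the operation, here is a small sketch of what
"xor with a byte-swapped number" could look like.  It assumes the value
being mixed in is the subvolume number and that the result is
objectid ^ swab64(subvol); that formula is an illustration and is not taken
from the patches.  mix_ino is a hypothetical name.

```python
MASK64 = (1 << 64) - 1


def swab64(value: int) -> int:
    """Reverse the byte order of a 64-bit unsigned value."""
    return int.from_bytes((value & MASK64).to_bytes(8, "little"), "big")


def mix_ino(objectid: int, subvol: int) -> int:
    # Byte-swapping moves a small subvol number into the high-order bytes,
    # so it rarely overlaps the low-order bits used by small objectids.
    return (objectid ^ swab64(subvol)) & MASK64
```

Because XOR is its own inverse, applying the same swab64(subvol) again
recovers the original objectid.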

> 
> > On the topic of inode numbers....  I've recently learned that btrfs
> > never reuses inode (objectid) numbers (except possibly after an
> > unmount).  Equally it doesn't re-use subvol numbers.  How much does this
> > contribute to the 64 bits not being enough for subtree+inode?
> > 
> > It would be nice if we could be comfortable limiting the objectid number
> > to 40 bits and the root.objectid (filetree) number to 24 bits, and
> > combine them into a 64bit inode number.
> > 
> > If we added a inode number reuse scheme that was suitably performant,
> > would that make this possible?  That would remove the need for a treeid,
> > and allow us to use project-id to identify subtrees.
> > 
> 
> We had a resuse scheme, we deprecated and deleted it.  I don't want to 
> arbitrarily limit objectid's to work around this issue.

These are computers we are working with.  There are always arbitrary
limits.
The syscall interface places an arbitrary limit of 64bits on the
identity of any object in a filesystem.  btrfs clearly doesn't like that
arbitrary limit, and plays games with device number to increase it to a
new arbitrary limit of 84 bits (sort-of).

I'm fully open to the possibility that last year's arbitrary limits are
no longer comfortable and that we might need to push the boundaries.
But I'd rather the justification was a bit stronger than "we cannot be
bothered reusing old inode numbers".

Are you at all aware of any site coming anywhere vaguely close to one trillion
concurrent inodes - maybe even 16 billion?
Or anything close to 16 million concurrent subvolumes?
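
The round figures above can be checked with a few lines of arithmetic.
This sketch assumes "one trillion" refers to the 40-bit objectid field and
"16 million" to the 24-bit filetree field suggested earlier in the thread;
reading "16 billion" as 34 bits, and "84 bits" as a 64-bit inode number
plus the 20 minor bits of a Linux device number, is an interpretation
rather than something stated here.

```python
def capacity(bits: int) -> int:
    """Number of distinct values representable in the given bit width."""
    return 1 << bits


LIMITS = {
    "40-bit objectid": capacity(40),  # 1,099,511,627,776 (about 1.1 trillion)
    "34-bit objectid": capacity(34),  # 17,179,869,184 (16 * 2**30)
    "24-bit subvol": capacity(24),    # 16,777,216 (16 * 2**20)
}

INODE_BITS = 64
MINOR_BITS = 20  # assumption: width of the minor number in a Linux dev_t
EFFECTIVE_BITS = INODE_BITS + MINOR_BITS
```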

> 
> >>
> >> Mount options are messy, and are just going to lead to distro's turning them on
> >> without understanding what's going on and then we have to support them forever.
> >>    I want to get this fixed in a way that we all hate the least with as little
> >> opportunity for confused users to make bad decisions.  Thanks,
> > 
> > Hence my question: how much do you hate creating a new filesystem type
> > to fix the problems?
> > 
> 
> I'm still not convinced we can't solve this without adding new options or 
> fstypes.  I think flags to indicate that we're special and to use a treeid that 
> we stuff into the inode would be a reasonable solution.  That being said I'm a 
> little sleep deprived so I could be missing why my plan is a bad one, so I'm 
> willing to be convinced that mount options are the solution to this, but I want 
> to make sure we're damned certain that's the best way forward.  Thanks,

I don't think "best way forward" is the appropriate goal - impossible to
assess.

What we need is a chosen way forward.  Someone - and ultimately that
someone needs to be the BTRFS maintainer team - needs to decide what
breakage they are willing to bear the cost of, and what breakage is
unacceptable to them, and to choose a way to move forward.  I cannot
make that decision for you because I'm just an interested bystander.  Al
Viro and Linus cannot either, though they are in a position to veto some
decisions.

The current choice appears to be "ignore the problem and hope it goes
away", though I appreciate that appearances can be deceiving.

You appear very keen to preserve as much of the status quo as possible.
Given that, I think you really need to push to get all the procfs files
changed to use the same device number as stat - so push the patch which
SUSE has that adds inode_get_dev().

https://github.com/SUSE/kernel-source/blob/master/patches.suse/vfs-add-super_operations-get_inode_dev

(though the change to show_mountinfo() in that patch would need careful consideration).

If that lands, you have a clear way forward, and we can find some
solution for NFSd (and other network filesystems), and for user-space to
use mnt_id.
If you cannot overcome the pushback, then you know you will have to
find another path - make a 64bit inode number unique, or add more bits
to the effective inode number.  Or something.

NeilBrown
