diff mbox series

[v5,2/2] tmpfs: Support 64-bit inums per-sb

Message ID ae9306ab10ce3d794c13b1836f5473e89562b98c.1578225806.git.chris@chrisdown.name (mailing list archive)
State New, archived
Headers show
Series fs: inode: shmem: Reduce risk of inum overflow | expand

Commit Message

Chris Down Jan. 5, 2020, 12:06 p.m. UTC
The default is still set to inode32 for backwards compatibility, but
system administrators can opt in to the new 64-bit inode numbers by
either:

1. Passing inode64 on the command line when mounting, or
2. Configuring the kernel with CONFIG_TMPFS_INODE64=y

The inode64 and inode32 names are used based on existing precedent from
XFS.

Signed-off-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-mm@kvack.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kernel-team@fb.com
---
 Documentation/filesystems/tmpfs.txt | 11 +++++
 fs/Kconfig                          | 15 +++++++
 include/linux/shmem_fs.h            |  1 +
 mm/shmem.c                          | 66 ++++++++++++++++++++++++++++-
 4 files changed, 92 insertions(+), 1 deletion(-)

v5: Nothing in code, just resending with correct linux-mm domain.

Comments

Dave Chinner Jan. 7, 2020, 12:10 a.m. UTC | #1
On Sun, Jan 05, 2020 at 12:06:05PM +0000, Chris Down wrote:
> The default is still set to inode32 for backwards compatibility, but
> system administrators can opt in to the new 64-bit inode numbers by
> either:
> 
> 1. Passing inode64 on the command line when mounting, or
> 2. Configuring the kernel with CONFIG_TMPFS_INODE64=y
> 
> The inode64 and inode32 names are used based on existing precedent from
> XFS.

Please don't copy this misfeature of XFS.

The inode32/inode64 XFS options were a horrible hack made more than
20 years ago when NFSv2 was still in use and 64 bit inodes could
not be used for NFSv2 exports. It was then continued to be used
because 32bit NFSv3 clients were unable to handle 64 bit inodes.

It took 15 years for us to be able to essentially deprecate
inode32 (inode64 is the default behaviour), and we were very happy
to get that albatross off our necks.  In reality, almost everything
out there in the world handles 64 bit inodes correctly
including 32 bit machines and 32bit binaries on 64 bit machines.
And, IMNSHO, there no excuse these days for 32 bit binaries that
don't using the *64() syscall variants directly and hence support
64 bit inodes correctlyi out of the box on all platforms.

I don't think we should be repeating past mistakes by trying to
cater for broken 32 bit applications on 64 bit machines in this day
and age.

Cheers,

Dave.
Chris Down Jan. 7, 2020, 12:16 a.m. UTC | #2
Dave Chinner writes:
>It took 15 years for us to be able to essentially deprecate
>inode32 (inode64 is the default behaviour), and we were very happy
>to get that albatross off our necks.  In reality, almost everything
>out there in the world handles 64 bit inodes correctly
>including 32 bit machines and 32bit binaries on 64 bit machines.
>And, IMNSHO, there no excuse these days for 32 bit binaries that
>don't using the *64() syscall variants directly and hence support
>64 bit inodes correctlyi out of the box on all platforms.
>
>I don't think we should be repeating past mistakes by trying to
>cater for broken 32 bit applications on 64 bit machines in this day
>and age.

I'm very glad to hear that. I strongly support moving to 64-bit inums in all 
cases if there is precedent that it's not a compatibility issue, but from the 
comments on my original[0] patch (especially that they strayed from the 
original patches' change to use ino_t directly into slab reuse), I'd been given 
the impression that it was known to be one.

 From my perspective I have no evidence that inode32 is needed other than the 
comment from Jeff above get_next_ino. If that turns out not to be a problem, I 
am more than happy to just wholesale migrate 64-bit inodes per-sb in tmpfs.

0: https://lore.kernel.org/patchwork/patch/1170963/
Dave Chinner Jan. 7, 2020, 12:39 a.m. UTC | #3
On Tue, Jan 07, 2020 at 12:16:43AM +0000, Chris Down wrote:
> Dave Chinner writes:
> > It took 15 years for us to be able to essentially deprecate
> > inode32 (inode64 is the default behaviour), and we were very happy
> > to get that albatross off our necks.  In reality, almost everything
> > out there in the world handles 64 bit inodes correctly
> > including 32 bit machines and 32bit binaries on 64 bit machines.
> > And, IMNSHO, there no excuse these days for 32 bit binaries that
> > don't using the *64() syscall variants directly and hence support
> > 64 bit inodes correctlyi out of the box on all platforms.
> > 
> > I don't think we should be repeating past mistakes by trying to
> > cater for broken 32 bit applications on 64 bit machines in this day
> > and age.
> 
> I'm very glad to hear that. I strongly support moving to 64-bit inums in all
> cases if there is precedent that it's not a compatibility issue, but from
> the comments on my original[0] patch (especially that they strayed from the
> original patches' change to use ino_t directly into slab reuse), I'd been
> given the impression that it was known to be one.
> 
> From my perspective I have no evidence that inode32 is needed other than the
> comment from Jeff above get_next_ino. If that turns out not to be a problem,
> I am more than happy to just wholesale migrate 64-bit inodes per-sb in
> tmpfs.

Well, that's my comment above about 32 bit apps using non-LFS
compliant interfaces in this day and age. It's essentially a legacy
interface these days, and anyone trying to access a modern linux
filesystem (btrfs, XFS, ext4, etc) ion 64 bit systems need to handle
64 bit inodes because they all can create >32bit inode numbers
in their default configurations.

Cheers,

Dave.
Amir Goldstein Jan. 7, 2020, 6:54 a.m. UTC | #4
On Tue, Jan 7, 2020 at 2:40 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Jan 07, 2020 at 12:16:43AM +0000, Chris Down wrote:
> > Dave Chinner writes:
> > > It took 15 years for us to be able to essentially deprecate
> > > inode32 (inode64 is the default behaviour), and we were very happy
> > > to get that albatross off our necks.  In reality, almost everything
> > > out there in the world handles 64 bit inodes correctly
> > > including 32 bit machines and 32bit binaries on 64 bit machines.
> > > And, IMNSHO, there no excuse these days for 32 bit binaries that
> > > don't using the *64() syscall variants directly and hence support
> > > 64 bit inodes correctlyi out of the box on all platforms.
> > >
> > > I don't think we should be repeating past mistakes by trying to
> > > cater for broken 32 bit applications on 64 bit machines in this day
> > > and age.
> >
> > I'm very glad to hear that. I strongly support moving to 64-bit inums in all
> > cases if there is precedent that it's not a compatibility issue, but from
> > the comments on my original[0] patch (especially that they strayed from the
> > original patches' change to use ino_t directly into slab reuse), I'd been
> > given the impression that it was known to be one.
> >
> > From my perspective I have no evidence that inode32 is needed other than the
> > comment from Jeff above get_next_ino. If that turns out not to be a problem,
> > I am more than happy to just wholesale migrate 64-bit inodes per-sb in
> > tmpfs.
>
> Well, that's my comment above about 32 bit apps using non-LFS
> compliant interfaces in this day and age. It's essentially a legacy
> interface these days, and anyone trying to access a modern linux
> filesystem (btrfs, XFS, ext4, etc) ion 64 bit systems need to handle
> 64 bit inodes because they all can create >32bit inode numbers
> in their default configurations.
>

Chris,

Following Dave's comment, let's keep the config option, but make it
default to Y and remove the mount option for now.
If somebody shouts, mount option could be re-added later.
This way at least we leave an option for workaround for an unlikely
breakage.

Thanks,
Amir.
Hugh Dickins Jan. 7, 2020, 8:36 a.m. UTC | #5
On Tue, 7 Jan 2020, Amir Goldstein wrote:
> On Tue, Jan 7, 2020 at 2:40 AM Dave Chinner <david@fromorbit.com> wrote:
> > On Tue, Jan 07, 2020 at 12:16:43AM +0000, Chris Down wrote:
> > > Dave Chinner writes:
> > > > It took 15 years for us to be able to essentially deprecate
> > > > inode32 (inode64 is the default behaviour), and we were very happy
> > > > to get that albatross off our necks.  In reality, almost everything
> > > > out there in the world handles 64 bit inodes correctly
> > > > including 32 bit machines and 32bit binaries on 64 bit machines.
> > > > And, IMNSHO, there no excuse these days for 32 bit binaries that
> > > > don't using the *64() syscall variants directly and hence support
> > > > 64 bit inodes correctlyi out of the box on all platforms.

Interesting take on it.  I'd all along imagined we would have to resort
to a mount option for safety, but Dave is right that I was too focused on
avoiding tmpfs regressions, without properly realizing that people were
very unlikely to have written such tools for tmpfs in particular, but
written them for all filesystems, and already encountered and fixed
such EOVERFLOWs for other filesystems.

Hmm, though how readily does XFS actually reach the high inos on
ordinary users' systems?

> > > >
> > > > I don't think we should be repeating past mistakes by trying to
> > > > cater for broken 32 bit applications on 64 bit machines in this day
> > > > and age.
> > >
> > > I'm very glad to hear that. I strongly support moving to 64-bit inums in all
> > > cases if there is precedent that it's not a compatibility issue, but from
> > > the comments on my original[0] patch (especially that they strayed from the
> > > original patches' change to use ino_t directly into slab reuse), I'd been
> > > given the impression that it was known to be one.
> > >
> > > From my perspective I have no evidence that inode32 is needed other than the
> > > comment from Jeff above get_next_ino. If that turns out not to be a problem,
> > > I am more than happy to just wholesale migrate 64-bit inodes per-sb in
> > > tmpfs.
> >
> > Well, that's my comment above about 32 bit apps using non-LFS
> > compliant interfaces in this day and age. It's essentially a legacy
> > interface these days, and anyone trying to access a modern linux
> > filesystem (btrfs, XFS, ext4, etc) ion 64 bit systems need to handle
> > 64 bit inodes because they all can create >32bit inode numbers
> > in their default configurations.

I thought ext4 still deals in 32-bit inos, so ext4-world would not
necessarily have caught up?  Though, arguing the other way, IIUC 64-bit
ino support comes bundled with file sizes over 2GB (on all architectures?),
and it's hard to imagine who gets by with a 2GB file size limit nowadays.

> >
> 
> Chris,
> 
> Following Dave's comment, let's keep the config option, but make it
> default to Y and remove the mount option for now.
> If somebody shouts, mount option could be re-added later.
> This way at least we leave an option for workaround for an unlikely
> breakage.

Though I don't share Dave's strength of aversion to the inode32 and
inode64 mount options, I do agree that it's preferable not to offer
them if they're so very unlikely to be necessary.

Do we even need the config option?  We certainly do need to hold the
patch with config option and mount options "in our back pocket", so
that if a regression does get reported, then upstream and stable can
respond to fix it quickly; but do we need more than that?

My main concern is that a new EOVERFLOW might not be reported as such,
and waste a lot of someone's time to track down.

Hugh
Amir Goldstein Jan. 7, 2020, 10:12 a.m. UTC | #6
On Tue, Jan 7, 2020 at 10:36 AM Hugh Dickins <hughd@google.com> wrote:
>
> On Tue, 7 Jan 2020, Amir Goldstein wrote:
> > On Tue, Jan 7, 2020 at 2:40 AM Dave Chinner <david@fromorbit.com> wrote:
> > > On Tue, Jan 07, 2020 at 12:16:43AM +0000, Chris Down wrote:
> > > > Dave Chinner writes:
> > > > > It took 15 years for us to be able to essentially deprecate
> > > > > inode32 (inode64 is the default behaviour), and we were very happy
> > > > > to get that albatross off our necks.  In reality, almost everything
> > > > > out there in the world handles 64 bit inodes correctly
> > > > > including 32 bit machines and 32bit binaries on 64 bit machines.
> > > > > And, IMNSHO, there no excuse these days for 32 bit binaries that
> > > > > don't using the *64() syscall variants directly and hence support
> > > > > 64 bit inodes correctlyi out of the box on all platforms.
>
> Interesting take on it.  I'd all along imagined we would have to resort
> to a mount option for safety, but Dave is right that I was too focused on
> avoiding tmpfs regressions, without properly realizing that people were
> very unlikely to have written such tools for tmpfs in particular, but
> written them for all filesystems, and already encountered and fixed
> such EOVERFLOWs for other filesystems.
>
> Hmm, though how readily does XFS actually reach the high inos on
> ordinary users' systems?
>

Define 'ordinary'
I my calculations are correct, with default mkfs.xfs any inode allocated
from logical offset > 2TB on a volume has high ino bits set.
Besides, a deployment with more than 4G inodes shouldn't be hard to find.

Thanks,
Amir.
Dave Chinner Jan. 7, 2020, 8:59 p.m. UTC | #7
On Tue, Jan 07, 2020 at 12:36:25AM -0800, Hugh Dickins wrote:
> On Tue, 7 Jan 2020, Amir Goldstein wrote:
> > On Tue, Jan 7, 2020 at 2:40 AM Dave Chinner <david@fromorbit.com> wrote:
> > > On Tue, Jan 07, 2020 at 12:16:43AM +0000, Chris Down wrote:
> > > > Dave Chinner writes:
> > > > > It took 15 years for us to be able to essentially deprecate
> > > > > inode32 (inode64 is the default behaviour), and we were very happy
> > > > > to get that albatross off our necks.  In reality, almost everything
> > > > > out there in the world handles 64 bit inodes correctly
> > > > > including 32 bit machines and 32bit binaries on 64 bit machines.
> > > > > And, IMNSHO, there no excuse these days for 32 bit binaries that
> > > > > don't using the *64() syscall variants directly and hence support
> > > > > 64 bit inodes correctlyi out of the box on all platforms.
> 
> Interesting take on it.  I'd all along imagined we would have to resort
> to a mount option for safety, but Dave is right that I was too focused on
> avoiding tmpfs regressions, without properly realizing that people were
> very unlikely to have written such tools for tmpfs in particular, but
> written them for all filesystems, and already encountered and fixed
> such EOVERFLOWs for other filesystems.
> 
> Hmm, though how readily does XFS actually reach the high inos on
> ordinary users' systems?

Quite frequently these days - the threshold for exceeding 32 bits
in the inode number is dependent on the inode size.

Old v4 filesystems use 256 byte inodes by default, so they overflow
32bits when the filesystem size is >1TB.

Current v5 filesystems use 512 byte inodes, so they overflow 32 bits
on filesytsems >2TB.

> > > > I'm very glad to hear that. I strongly support moving to 64-bit inums in all
> > > > cases if there is precedent that it's not a compatibility issue, but from
> > > > the comments on my original[0] patch (especially that they strayed from the
> > > > original patches' change to use ino_t directly into slab reuse), I'd been
> > > > given the impression that it was known to be one.
> > > >
> > > > From my perspective I have no evidence that inode32 is needed other than the
> > > > comment from Jeff above get_next_ino. If that turns out not to be a problem,
> > > > I am more than happy to just wholesale migrate 64-bit inodes per-sb in
> > > > tmpfs.
> > >
> > > Well, that's my comment above about 32 bit apps using non-LFS
> > > compliant interfaces in this day and age. It's essentially a legacy
> > > interface these days, and anyone trying to access a modern linux
> > > filesystem (btrfs, XFS, ext4, etc) ion 64 bit systems need to handle
> > > 64 bit inodes because they all can create >32bit inode numbers
> > > in their default configurations.
> 
> I thought ext4 still deals in 32-bit inos, so ext4-world would not
> necessarily have caught up?

Hmmm - I might have got my wires crossed there - there *was* a
project some time ago to increase the ext4 inode number size for
large filesystems. Ah, yeah, there we go:

https://lore.kernel.org/linux-ext4/20171101212455.47964-1-artem.blagodarenko@gmail.com/

I thought it had been rolled in with all the other "make stuff
larger" features that ext4 has...

> Though, arguing the other way, IIUC 64-bit
> ino support comes bundled with file sizes over 2GB (on all architectures?),
> and it's hard to imagine who gets by with a 2GB file size limit nowadays.

Right, you need to define LFS support for >2GB file support and
you get 64 bit inode support with that for free. It's only legacy
binaries that haven't been rebuilt in the past 15 years that are an
issue here, but there are so many other things that can go trivially
wrong with such binaries I think that inode numbers are the least of
the worries here...

Cheers,

Dave.
Dave Chinner Jan. 7, 2020, 9:07 p.m. UTC | #8
On Tue, Jan 07, 2020 at 12:12:00PM +0200, Amir Goldstein wrote:
> On Tue, Jan 7, 2020 at 10:36 AM Hugh Dickins <hughd@google.com> wrote:
> >
> > On Tue, 7 Jan 2020, Amir Goldstein wrote:
> > > On Tue, Jan 7, 2020 at 2:40 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > On Tue, Jan 07, 2020 at 12:16:43AM +0000, Chris Down wrote:
> > > > > Dave Chinner writes:
> > > > > > It took 15 years for us to be able to essentially deprecate
> > > > > > inode32 (inode64 is the default behaviour), and we were very happy
> > > > > > to get that albatross off our necks.  In reality, almost everything
> > > > > > out there in the world handles 64 bit inodes correctly
> > > > > > including 32 bit machines and 32bit binaries on 64 bit machines.
> > > > > > And, IMNSHO, there no excuse these days for 32 bit binaries that
> > > > > > don't using the *64() syscall variants directly and hence support
> > > > > > 64 bit inodes correctlyi out of the box on all platforms.
> >
> > Interesting take on it.  I'd all along imagined we would have to resort
> > to a mount option for safety, but Dave is right that I was too focused on
> > avoiding tmpfs regressions, without properly realizing that people were
> > very unlikely to have written such tools for tmpfs in particular, but
> > written them for all filesystems, and already encountered and fixed
> > such EOVERFLOWs for other filesystems.
> >
> > Hmm, though how readily does XFS actually reach the high inos on
> > ordinary users' systems?
> >
> 
> Define 'ordinary'
> I my calculations are correct, with default mkfs.xfs any inode allocated
> from logical offset > 2TB on a volume has high ino bits set.
> Besides, a deployment with more than 4G inodes shouldn't be hard to find.

You don't need to allocate 4 billion inodes to get >32bit inodes in
XFS - the inode number is an encoding of the physical location of
the inode in the filesystem. Hence we just have to allocate the
inode at a disk address higher than 2TB into the device and we
overflow 32bits.

e.g. make a sparse fs image file of 10TB, mount it, create 50
subdirs, then start creating zero length files spread across the 50
separate subdirectories. You should see >32bit inode numbers almost
immediately. (i.e. as soon as we allocate inodes in AG 2 or higher)

IOWs, there are *lots* of 64bit inode numbers out there on XFS
filesystems....

Cheers,

Dave.
Chris Mason Jan. 7, 2020, 9:37 p.m. UTC | #9
On 7 Jan 2020, at 16:07, Dave Chinner wrote:

> IOWs, there are *lots* of 64bit inode numbers out there on XFS
> filesystems....

It's less likely in btrfs but +1 to all of Dave's comments.  I'm happy 
to run a scan on machines in the fleet and see how many have 64 bit 
inodes (either buttery or x-y), but it's going to be a lot.

-chris
Hugh Dickins Jan. 8, 2020, 11:24 a.m. UTC | #10
On Tue, 7 Jan 2020, Chris Mason wrote:
> On 7 Jan 2020, at 16:07, Dave Chinner wrote:
> 
> > IOWs, there are *lots* of 64bit inode numbers out there on XFS
> > filesystems....
> 
> It's less likely in btrfs but +1 to all of Dave's comments.  I'm happy 
> to run a scan on machines in the fleet and see how many have 64 bit 
> inodes (either buttery or x-y), but it's going to be a lot.

Dave, Amir, Chris, many thanks for the info you've filled in -
and absolutely no need to run any scan on your fleet for this,
I think we can be confident that even if fb had some 15-year-old tool
in use on its fleet of 2GB-file filesystems, it would not be the one
to insist on a kernel revert of 64-bit tmpfs inos.

The picture looks clear now: while ChrisD does need to hold on to his
config option and inode32/inode64 mount option patch, it is much better
left out of the kernel until (very unlikely) proved necessary.

Thanks,
Hugh
Mikael Magnusson Jan. 8, 2020, 2:37 p.m. UTC | #11
Dave Chinner writes:
> On Sun, Jan 05, 2020 at 12:06:05PM +0000, Chris Down wrote:
> > The default is still set to inode32 for backwards compatibility, but
> > system administrators can opt in to the new 64-bit inode numbers by
> > either:
> > 
> > 1. Passing inode64 on the command line when mounting, or
> > 2. Configuring the kernel with CONFIG_TMPFS_INODE64=y
> > 
> > The inode64 and inode32 names are used based on existing precedent from
> > XFS.
> 
> Please don't copy this misfeature of XFS.
> 
> The inode32/inode64 XFS options were a horrible hack made more than
> 20 years ago when NFSv2 was still in use and 64 bit inodes could
> not be used for NFSv2 exports. It was then continued to be used
> because 32bit NFSv3 clients were unable to handle 64 bit inodes.
> 
> It took 15 years for us to be able to essentially deprecate
> inode32 (inode64 is the default behaviour), and we were very happy
> to get that albatross off our necks.  In reality, almost everything
> out there in the world handles 64 bit inodes correctly
> including 32 bit machines and 32bit binaries on 64 bit machines.
> And, IMNSHO, there no excuse these days for 32 bit binaries that
> don't using the *64() syscall variants directly and hence support
> 64 bit inodes correctlyi out of the box on all platforms.
> 
> I don't think we should be repeating past mistakes by trying to
> cater for broken 32 bit applications on 64 bit machines in this day
> and age.

Hi,

It's unfortunately not true that everything handles this correctly.
32-bit binaries for games on Steam that use stat() without the 64 is
so prevalent that I got tired of adding the LD_PRELOAD wrapper script
and just patched out the EOVERFLOW return from glibc instead. (They
obviously don't care about the inode value at all, and I don't use any
other 32-bit binaries that do). This is probably a class of binaries
you don't care very much about, and not very likely to be installed on
a tmpfs that has wrapped around, but I thought it was worth mentioning
that they do exist anyway.
Jeff Layton Jan. 9, 2020, 12:43 a.m. UTC | #12
On Wed, 2020-01-08 at 03:24 -0800, Hugh Dickins wrote:
> On Tue, 7 Jan 2020, Chris Mason wrote:
> > On 7 Jan 2020, at 16:07, Dave Chinner wrote:
> > 
> > > IOWs, there are *lots* of 64bit inode numbers out there on XFS
> > > filesystems....
> > 
> > It's less likely in btrfs but +1 to all of Dave's comments.  I'm happy 
> > to run a scan on machines in the fleet and see how many have 64 bit 
> > inodes (either buttery or x-y), but it's going to be a lot.
> 
> Dave, Amir, Chris, many thanks for the info you've filled in -
> and absolutely no need to run any scan on your fleet for this,
> I think we can be confident that even if fb had some 15-year-old tool
> in use on its fleet of 2GB-file filesystems, it would not be the one
> to insist on a kernel revert of 64-bit tmpfs inos.
> 
> The picture looks clear now: while ChrisD does need to hold on to his
> config option and inode32/inode64 mount option patch, it is much better
> left out of the kernel until (very unlikely) proved necessary.

This approach seems like the best course to me.

FWIW, at the time we capped this at 32-bits (2007), 64-bit machines were
really just becoming widely available, and it was quite common to run
32-bit, non-LFS apps on a 64-bit kernel. Users were hitting spurious
EOVERFLOW errors all over the place so this seemed like the best way to
address it.

The world has changed a lot since then though, and one would hope that
almost everything these days is compiled with FILE_OFFSET_BITS=64.

Fingers crossed!
Chris Down Jan. 10, 2020, 4:45 p.m. UTC | #13
Hi Hugh, Dave,

Hugh Dickins writes:
>Dave, Amir, Chris, many thanks for the info you've filled in -
>and absolutely no need to run any scan on your fleet for this,
>I think we can be confident that even if fb had some 15-year-old tool
>in use on its fleet of 2GB-file filesystems, it would not be the one
>to insist on a kernel revert of 64-bit tmpfs inos.
>
>The picture looks clear now: while ChrisD does need to hold on to his
>config option and inode32/inode64 mount option patch, it is much better
>left out of the kernel until (very unlikely) proved necessary.

Based on Mikael's comment above about Steam binaries, and the lack of 
likelihood that they can be rebuilt, I'm inclined to still keep inode{64,32}, 
but make legacy behaviour require explicit opt-in. That is:

- Default it to inode64
- Remove the Kconfig option
- Only print it as an option if tmpfs was explicitly mounted with inode32

The reason I suggest keeping this is that I'm mildly concerned that the kind of 
users who might be impacted by this change due to 32-bit _FILE_OFFSET_BITS -- 
like the not-too-uncommon case that Mikael brings up -- seem unlikely to be the 
kind of people that would find it in an rc.

Other than that, the first patch could be similar to how it is now, 
incorporating Hugh's improvements to the first patch to put everything under 
the same stat_lock in shmem_reserve_inode.

What do you folks think?

Thanks,

Chris
Hugh Dickins Jan. 13, 2020, 6:58 a.m. UTC | #14
On Wed, 8 Jan 2020, Mikael Magnusson wrote:
> 
> It's unfortunately not true that everything handles this correctly.
> 32-bit binaries for games on Steam that use stat() without the 64 is
> so prevalent that I got tired of adding the LD_PRELOAD wrapper script
> and just patched out the EOVERFLOW return from glibc instead. (They
> obviously don't care about the inode value at all, and I don't use any
> other 32-bit binaries that do). This is probably a class of binaries
> you don't care very much about, and not very likely to be installed on
> a tmpfs that has wrapped around, but I thought it was worth mentioning
> that they do exist anyway.

Thank you for alerting us to reality, Mikael: not what any of us wanted
to hear, but we do care about them, and it was well worth mentioning.

Hugh
Hugh Dickins Jan. 13, 2020, 7:36 a.m. UTC | #15
On Fri, 10 Jan 2020, Chris Down wrote:
> Hugh Dickins writes:
> > Dave, Amir, Chris, many thanks for the info you've filled in -
> > and absolutely no need to run any scan on your fleet for this,
> > I think we can be confident that even if fb had some 15-year-old tool
> > in use on its fleet of 2GB-file filesystems, it would not be the one
> > to insist on a kernel revert of 64-bit tmpfs inos.
> > 
> > The picture looks clear now: while ChrisD does need to hold on to his
> > config option and inode32/inode64 mount option patch, it is much better
> > left out of the kernel until (very unlikely) proved necessary.
> 
> Based on Mikael's comment above about Steam binaries, and the lack of
> likelihood that they can be rebuilt, I'm inclined to still keep inode{64,32},
> but make legacy behaviour require explicit opt-in. That is:
> 
> - Default it to inode64
> - Remove the Kconfig option
> - Only print it as an option if tmpfs was explicitly mounted with inode32
> 
> The reason I suggest keeping this is that I'm mildly concerned that the kind
> of users who might be impacted by this change due to 32-bit _FILE_OFFSET_BITS
> -- like the not-too-uncommon case that Mikael brings up -- seem unlikely to
> be the kind of people that would find it in an rc.

Okay.  None of us are thrilled with it, but I agree that
Mikael's observation should override our developer's preference.

So the "inode64" option will be accepted but redundant on mounting,
but exists for use as a remount option after mounting or remounting
with "inode32": allowing the admin to switch temporarily to mask off
the high ino bits with "inode32" when needing to run a limited binary.

Documentation and commit message to alert Andrew and Linus and distros
that we are risking some breakage with this, but supplying the antidote
(not breakage of any distros themselves, no doubt they're all good;
but breakage of what some users might run on them).

> 
> Other than that, the first patch could be similar to how it is now,
> incorporating Hugh's improvements to the first patch to put everything under
> the same stat_lock in shmem_reserve_inode.

So, I persuaded Amir to the other aspects my version, but did not
persuade you?  Well, I can live with that (or if not, can send mods
on top of yours): but please read again why I was uncomfortable with
yours, to check that you still prefer it (I agree that your patch is
simpler, and none of my discomfort decisive).

Thanks,
Hugh
Chris Down Jan. 20, 2020, 3:11 p.m. UTC | #16
Hi Hugh,

Sorry this response took so long, I had some non-work issues that took a lot of 
time last week.

Hugh Dickins writes:
>On Fri, 10 Jan 2020, Chris Down wrote:
>> Hugh Dickins writes:
>> > Dave, Amir, Chris, many thanks for the info you've filled in -
>> > and absolutely no need to run any scan on your fleet for this,
>> > I think we can be confident that even if fb had some 15-year-old tool
>> > in use on its fleet of 2GB-file filesystems, it would not be the one
>> > to insist on a kernel revert of 64-bit tmpfs inos.
>> >
>> > The picture looks clear now: while ChrisD does need to hold on to his
>> > config option and inode32/inode64 mount option patch, it is much better
>> > left out of the kernel until (very unlikely) proved necessary.
>>
>> Based on Mikael's comment above about Steam binaries, and the lack of
>> likelihood that they can be rebuilt, I'm inclined to still keep inode{64,32},
>> but make legacy behaviour require explicit opt-in. That is:
>>
>> - Default it to inode64
>> - Remove the Kconfig option
>> - Only print it as an option if tmpfs was explicitly mounted with inode32
>>
>> The reason I suggest keeping this is that I'm mildly concerned that the kind
>> of users who might be impacted by this change due to 32-bit _FILE_OFFSET_BITS
>> -- like the not-too-uncommon case that Mikael brings up -- seem unlikely to
>> be the kind of people that would find it in an rc.
>
>Okay.  None of us are thrilled with it, but I agree that
>Mikael's observation should override our developer's preference.
>
>So the "inode64" option will be accepted but redundant on mounting,
>but exists for use as a remount option after mounting or remounting
>with "inode32": allowing the admin to switch temporarily to mask off
>the high ino bits with "inode32" when needing to run a limited binary.
>
>Documentation and commit message to alert Andrew and Linus and distros
>that we are risking some breakage with this, but supplying the antidote
>(not breakage of any distros themselves, no doubt they're all good;
>but breakage of what some users might run on them).

Sounds good.

>>
>> Other than that, the first patch could be similar to how it is now,
>> incorporating Hugh's improvements to the first patch to put everything under
>> the same stat_lock in shmem_reserve_inode.
>
>So, I persuaded Amir to the other aspects my version, but did not
>persuade you?  Well, I can live with that (or if not, can send mods
>on top of yours): but please read again why I was uncomfortable with
>yours, to check that you still prefer it (I agree that your patch is
>simpler, and none of my discomfort decisive).

Hmm, which bit were you thinking of? The lack of batching, shmem_encode_fh(), 
or the fact that nr_inodes can now be 0 on non-internal mounts?

For batching, I'm neutral. I'm happy to use the approach from your patch and 
integrate it (and credit you, of course).

For shmem_encode_fh, I'm not totally sure I understand the concern, if that's 
what you mean.

For nr_inodes, I agree that intentional or unintentional, we should at least 
handle this case for now and can adjust later if the behaviour changes.

Thanks again,

Chris
Hugh Dickins Feb. 25, 2020, 11:14 p.m. UTC | #17
On Mon, 20 Jan 2020, Chris Down wrote:
> Hi Hugh,
> 
> Sorry this response took so long, I had some non-work issues that took a lot
> of time last week.

No, clearly it's I who must apologize to you for very slow response.

> 
> Hugh Dickins writes:
> > 
> > So the "inode64" option will be accepted but redundant on mounting,
> > but exists for use as a remount option after mounting or remounting
> > with "inode32": allowing the admin to switch temporarily to mask off
> > the high ino bits with "inode32" when needing to run a limited binary.
> > 
> > Documentation and commit message to alert Andrew and Linus and distros
> > that we are risking some breakage with this, but supplying the antidote
> > (not breakage of any distros themselves, no doubt they're all good;
> > but breakage of what some users might run on them).
> 
> Sounds good.
> 
> > > 
> > > Other than that, the first patch could be similar to how it is now,
> > > incorporating Hugh's improvements to the first patch to put everything
> > > under
> > > the same stat_lock in shmem_reserve_inode.
> > 
> > So, I persuaded Amir to the other aspects my version, but did not
> > persuade you?  Well, I can live with that (or if not, can send mods
> > on top of yours): but please read again why I was uncomfortable with
> > yours, to check that you still prefer it (I agree that your patch is
> > simpler, and none of my discomfort decisive).
> 
> Hmm, which bit were you thinking of? The lack of batching, shmem_encode_fh(),
> or the fact that nr_inodes can now be 0 on non-internal mounts?

I was uncomfortable with tmpfs depleting get_next_ino()'s pool in some
configurations, and wanted the (get_next_ino-like) per-cpu but per-sb
batching for nr_inodes=0, to minimize its stat_lock contention.

I did not have a patch to shmem_encode_fh(), had gone through thinking
that 64-bit inos made an additional fix there necessary; but Amir then
educated us that it is safe as is, though could be cleaned up later.

nr_inodes can be 0 on non-internal mounts, for many years.

> 
> For batching, I'm neutral. I'm happy to use the approach from your patch and
> integrate it (and credit you, of course).

Credit not important, but you may well want to blame me for that
complication :)

> 
> For shmem_encode_fh, I'm not totally sure I understand the concern, if that's
> what you mean.

My concern had been that shmem_encode_fh() builds up an fh from i_ino
and more, looks well prepared for a 64-bit ino, but appeared to be
announcing a 32-bit ino in its return value: Amir reassures us that
that return value does not matter.

> 
> For nr_inodes, I agree that intentional or unintentional, we should at least
> handle this case for now and can adjust later if the behaviour changes.

nr_inodes=0 is an intentional configuration (but 0 denoting infinity:
not very clean, and I've sometimes regretted that choice).

If there's any behavior change, that's a separate matter from these
64-bit ino patches; maybe I mentioned it in passing and confused us -
that we seem to have recently allowed a remounting limited<->unlimited
that was not permitted before, and might or might not need a fix patch.

Sorry again for delaying you, Chris: not at all what I'd wanted to do.
Hugh
diff mbox series

Patch

diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt
index 5ecbc03e6b2f..203e12a684c9 100644
--- a/Documentation/filesystems/tmpfs.txt
+++ b/Documentation/filesystems/tmpfs.txt
@@ -136,6 +136,15 @@  These options do not have any effect on remount. You can change these
 parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem.
 
 
+tmpfs has a mount option to select whether it will wrap at 32- or 64-bit inode
+numbers:
+
+inode64   Use 64-bit inode numbers
+inode32   Use 32-bit inode numbers
+
+On 64-bit, the default is set by CONFIG_TMPFS_INODE64. On 32-bit, inode64 is
+not legal and will produce an error at mount time.
+
 So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs'
 will give you tmpfs instance on /mytmpfs which can allocate 10GB
 RAM/SWAP in 10240 inodes and it is only accessible by root.
@@ -147,3 +156,5 @@  Updated:
    Hugh Dickins, 4 June 2007
 Updated:
    KOSAKI Motohiro, 16 Mar 2010
+Updated:
+   Chris Down, 2 Jan 2020
diff --git a/fs/Kconfig b/fs/Kconfig
index 7b623e9fc1b0..e74f127df22a 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -199,6 +199,21 @@  config TMPFS_XATTR
 
 	  If unsure, say N.
 
+config TMPFS_INODE64
+	bool "Use 64-bit ino_t by default in tmpfs"
+	depends on TMPFS && 64BIT
+	default n
+	help
+	  tmpfs has historically used only inode numbers as wide as an unsigned
+	  int. In some cases this can cause wraparound, potentially resulting in
+	  multiple files with the same inode number on a single device. This option
+	  makes tmpfs use the full width of ino_t by default, similarly to the
+	  inode64 mount option.
+
+	  To override this default, use the inode32 or inode64 mount options.
+
+	  If unsure, say N.
+
 config HUGETLBFS
 	bool "HugeTLB file system support"
 	depends on X86 || IA64 || SPARC64 || (S390 && 64BIT) || \
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 7fac91f490dc..8925eb774518 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -35,6 +35,7 @@  struct shmem_sb_info {
 	unsigned char huge;	    /* Whether to try for hugepages */
 	kuid_t uid;		    /* Mount uid for root directory */
 	kgid_t gid;		    /* Mount gid for root directory */
+	bool full_inums;	    /* If i_ino should be uint or ino_t */
 	ino_t next_ino;		    /* The next per-sb inode number to use */
 	struct mempolicy *mpol;     /* default memory policy for mappings */
 	spinlock_t shrinklist_lock;   /* Protects shrinklist */
diff --git a/mm/shmem.c b/mm/shmem.c
index 9e97ba972225..05de65112bed 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -115,11 +115,13 @@  struct shmem_options {
 	kuid_t uid;
 	kgid_t gid;
 	umode_t mode;
+	bool full_inums;
 	int huge;
 	int seen;
 #define SHMEM_SEEN_BLOCKS 1
 #define SHMEM_SEEN_INODES 2
 #define SHMEM_SEEN_HUGE 4
+#define SHMEM_SEEN_INUMS 8
 };
 
 #ifdef CONFIG_TMPFS
@@ -2261,15 +2263,24 @@  static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
 			 * since max_inodes is always 0, and is called from
 			 * potentially unknown contexts. As such, use the global
 			 * allocator which doesn't require the per-sb stat_lock.
+			 *
+			 * No special behaviour is needed for
+			 * sbinfo->full_inums, because it's not possible to
+			 * manually set on callers of this type, and
+			 * CONFIG_TMPFS_INODE64 only applies to user-visible
+			 * mounts.
 			 */
 			inode->i_ino = get_next_ino();
 		} else {
 			spin_lock(&sbinfo->stat_lock);
-			if (unlikely(sbinfo->next_ino > UINT_MAX)) {
+			if (unlikely(!sbinfo->full_inums &&
+				     sbinfo->next_ino > UINT_MAX)) {
 				/*
 				 * Emulate get_next_ino uint wraparound for
 				 * compatibility
 				 */
+				pr_warn("%s: inode number overflow on device %d, consider using inode64 mount option\n",
+					__func__, MINOR(sb->s_dev));
 				sbinfo->next_ino = 1;
 			}
 			inode->i_ino = sbinfo->next_ino++;
@@ -3406,6 +3417,8 @@  enum shmem_param {
 	Opt_nr_inodes,
 	Opt_size,
 	Opt_uid,
+	Opt_inode32,
+	Opt_inode64,
 };
 
 static const struct fs_parameter_spec shmem_param_specs[] = {
@@ -3417,6 +3430,8 @@  static const struct fs_parameter_spec shmem_param_specs[] = {
 	fsparam_string("nr_inodes",	Opt_nr_inodes),
 	fsparam_string("size",		Opt_size),
 	fsparam_u32   ("uid",		Opt_uid),
+	fsparam_flag  ("inode32",	Opt_inode32),
+	fsparam_flag  ("inode64",	Opt_inode64),
 	{}
 };
 
@@ -3441,6 +3456,7 @@  static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
 	unsigned long long size;
 	char *rest;
 	int opt;
+	const char *err;
 
 	opt = fs_parse(fc, &shmem_fs_parameters, param, &result);
 	if (opt < 0)
@@ -3502,6 +3518,18 @@  static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
 			break;
 		}
 		goto unsupported_parameter;
+	case Opt_inode32:
+		ctx->full_inums = false;
+		ctx->seen |= SHMEM_SEEN_INUMS;
+		break;
+	case Opt_inode64:
+		if (sizeof(ino_t) < 8) {
+			err = "Cannot use inode64 with <64bit inums in kernel";
+			goto err_msg;
+		}
+		ctx->full_inums = true;
+		ctx->seen |= SHMEM_SEEN_INUMS;
+		break;
 	}
 	return 0;
 
@@ -3509,6 +3537,8 @@  static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
 	return invalf(fc, "tmpfs: Unsupported parameter '%s'", param->key);
 bad_value:
 	return invalf(fc, "tmpfs: Bad value for '%s'", param->key);
+err_msg:
+	return invalf(fc, "tmpfs: %s", err);
 }
 
 static int shmem_parse_options(struct fs_context *fc, void *data)
@@ -3593,8 +3623,16 @@  static int shmem_reconfigure(struct fs_context *fc)
 		}
 	}
 
+	if ((ctx->seen & SHMEM_SEEN_INUMS) && !ctx->full_inums &&
+	    sbinfo->next_ino > UINT_MAX) {
+		err = "Current inum too high to switch to 32-bit inums";
+		goto out;
+	}
+
 	if (ctx->seen & SHMEM_SEEN_HUGE)
 		sbinfo->huge = ctx->huge;
+	if (ctx->seen & SHMEM_SEEN_INUMS)
+		sbinfo->full_inums = ctx->full_inums;
 	if (ctx->seen & SHMEM_SEEN_BLOCKS)
 		sbinfo->max_blocks  = ctx->blocks;
 	if (ctx->seen & SHMEM_SEEN_INODES) {
@@ -3634,6 +3672,29 @@  static int shmem_show_options(struct seq_file *seq, struct dentry *root)
 	if (!gid_eq(sbinfo->gid, GLOBAL_ROOT_GID))
 		seq_printf(seq, ",gid=%u",
 				from_kgid_munged(&init_user_ns, sbinfo->gid));
+
+	/*
+	 * Showing inode{64,32} might be useful even if it's the system default,
+	 * since then people don't have to resort to checking both here and
+	 * /proc/config.gz to confirm 64-bit inums were successfully applied
+	 * (which may not even exist if IKCONFIG_PROC isn't enabled).
+	 *
+	 * We hide it when inode64 isn't the default and we are using 32-bit
+	 * inodes, since that probably just means the feature isn't even under
+	 * consideration.
+	 *
+	 * As such:
+	 *
+	 *                     +-----------------+-----------------+
+	 *                     | TMPFS_INODE64=y | TMPFS_INODE64=n |
+	 *  +------------------+-----------------+-----------------+
+	 *  | full_inums=true  | show            | show            |
+	 *  | full_inums=false | show            | hide            |
+	 *  +------------------+-----------------+-----------------+
+	 *
+	 */
+	if (IS_ENABLED(CONFIG_TMPFS_INODE64) || sbinfo->full_inums)
+		seq_printf(seq, ",inode%d", (sbinfo->full_inums ? 64 : 32));
 #ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
 	/* Rightly or wrongly, show huge mount option unmasked by shmem_huge */
 	if (sbinfo->huge)
@@ -3681,6 +3742,8 @@  static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
 			ctx->blocks = shmem_default_max_blocks();
 		if (!(ctx->seen & SHMEM_SEEN_INODES))
 			ctx->inodes = shmem_default_max_inodes();
+		if (!(ctx->seen & SHMEM_SEEN_INUMS))
+			ctx->full_inums = IS_ENABLED(CONFIG_TMPFS_INODE64);
 	} else {
 		sb->s_flags |= SB_NOUSER;
 	}
@@ -3694,6 +3757,7 @@  static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
 	sbinfo->free_inodes = sbinfo->max_inodes = ctx->inodes;
 	sbinfo->uid = ctx->uid;
 	sbinfo->gid = ctx->gid;
+	sbinfo->full_inums = ctx->full_inums;
 	sbinfo->mode = ctx->mode;
 	sbinfo->huge = ctx->huge;
 	sbinfo->mpol = ctx->mpol;