diff mbox series

[09/20] ext4: Initialize timestamps limits

Message ID 20190730014924.2193-10-deepa.kernel@gmail.com (mailing list archive)
State New, archived
Headers show
Series vfs: Add support for timestamp limits | expand

Commit Message

Deepa Dinamani July 30, 2019, 1:49 a.m. UTC
ext4 has different overflow limits for max filesystem
timestamps based on the extra bytes available.

The timestamp limits are calculated according to the
encoding table in
a4dad1ae24f85i(ext4: Fix handling of extended tv_sec):

* extra  msb of                         adjust for signed
* epoch  32-bit                         32-bit tv_sec to
* bits   time    decoded 64-bit tv_sec  64-bit tv_sec      valid time range
* 0 0    1    -0x80000000..-0x00000001  0x000000000   1901-12-13..1969-12-31
* 0 0    0    0x000000000..0x07fffffff  0x000000000   1970-01-01..2038-01-19
* 0 1    1    0x080000000..0x0ffffffff  0x100000000   2038-01-19..2106-02-07
* 0 1    0    0x100000000..0x17fffffff  0x100000000   2106-02-07..2174-02-25
* 1 0    1    0x180000000..0x1ffffffff  0x200000000   2174-02-25..2242-03-16
* 1 0    0    0x200000000..0x27fffffff  0x200000000   2242-03-16..2310-04-04
* 1 1    1    0x280000000..0x2ffffffff  0x300000000   2310-04-04..2378-04-22
* 1 1    0    0x300000000..0x37fffffff  0x300000000   2378-04-22..2446-05-10

Note that the time limits are not correct for deletion times.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: tytso@mit.edu
Cc: adilger.kernel@dilger.ca
Cc: linux-ext4@vger.kernel.org
---
 fs/ext4/ext4.h  |  4 ++++
 fs/ext4/super.c | 17 +++++++++++++++--
 2 files changed, 19 insertions(+), 2 deletions(-)

Comments

Darrick J. Wong July 31, 2019, 3:26 p.m. UTC | #1
On Mon, Jul 29, 2019 at 06:49:13PM -0700, Deepa Dinamani wrote:
> ext4 has different overflow limits for max filesystem
> timestamps based on the extra bytes available.
> 
> The timestamp limits are calculated according to the
> encoding table in
> a4dad1ae24f85i(ext4: Fix handling of extended tv_sec):
> 
> * extra  msb of                         adjust for signed
> * epoch  32-bit                         32-bit tv_sec to
> * bits   time    decoded 64-bit tv_sec  64-bit tv_sec      valid time range
> * 0 0    1    -0x80000000..-0x00000001  0x000000000   1901-12-13..1969-12-31
> * 0 0    0    0x000000000..0x07fffffff  0x000000000   1970-01-01..2038-01-19
> * 0 1    1    0x080000000..0x0ffffffff  0x100000000   2038-01-19..2106-02-07
> * 0 1    0    0x100000000..0x17fffffff  0x100000000   2106-02-07..2174-02-25
> * 1 0    1    0x180000000..0x1ffffffff  0x200000000   2174-02-25..2242-03-16
> * 1 0    0    0x200000000..0x27fffffff  0x200000000   2242-03-16..2310-04-04
> * 1 1    1    0x280000000..0x2ffffffff  0x300000000   2310-04-04..2378-04-22
> * 1 1    0    0x300000000..0x37fffffff  0x300000000   2378-04-22..2446-05-10

My recollection of ext4 has gotten rusty, so this could be a bogus
question:

Say you have a filesystem with s_inode_size > 128 where not all of the
ondisk inodes have been upgraded to i_extra_isize > 0 and therefore
don't support nanoseconds or times beyond 2038.  I think this happens on
ext3 filesystems that reserved extra space for inode attrs that are
subsequently converted to ext4?

In any case, that means that you have some inodes that support 34-bit
tv_sec and some inodes that only support 32-bit tv_sec.  For the inodes
with 32-bit tv_sec, I think all that happens is that a large timestamp
will be truncated further, right?

And no mount time warning because at least /some/ of the inodes are
ready to go for the next 30 years?

--D

> Note that the time limits are not correct for deletion times.
> 
> Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
> Cc: tytso@mit.edu
> Cc: adilger.kernel@dilger.ca
> Cc: linux-ext4@vger.kernel.org
> ---
>  fs/ext4/ext4.h  |  4 ++++
>  fs/ext4/super.c | 17 +++++++++++++++--
>  2 files changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 1cb67859e051..3f13cf12ae7f 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1631,6 +1631,10 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
>  
>  #define EXT4_GOOD_OLD_INODE_SIZE 128
>  
> +#define EXT4_EXTRA_TIMESTAMP_MAX	(((s64)1 << 34) - 1  + S32_MIN)
> +#define EXT4_NON_EXTRA_TIMESTAMP_MAX	S32_MAX
> +#define EXT4_TIMESTAMP_MIN		S32_MIN
> +
>  /*
>   * Feature set definitions
>   */
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 4079605d437a..3ea2d60f33aa 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4035,8 +4035,21 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  			       sbi->s_inode_size);
>  			goto failed_mount;
>  		}
> -		if (sbi->s_inode_size > EXT4_GOOD_OLD_INODE_SIZE)
> -			sb->s_time_gran = 1 << (EXT4_EPOCH_BITS - 2);
> +		/*
> +		 * i_atime_extra is the last extra field available for [acm]times in
> +		 * struct ext4_inode. Checking for that field should suffice to ensure
> +		 * we have extra space for all three.
> +		 */
> +		if (sbi->s_inode_size >= offsetof(struct ext4_inode, i_atime_extra) +
> +			sizeof(((struct ext4_inode *)0)->i_atime_extra)) {
> +			sb->s_time_gran = 1;
> +			sb->s_time_max = EXT4_EXTRA_TIMESTAMP_MAX;
> +		} else {
> +			sb->s_time_gran = NSEC_PER_SEC;
> +			sb->s_time_max = EXT4_NON_EXTRA_TIMESTAMP_MAX;
> +		}
> +
> +		sb->s_time_min = EXT4_TIMESTAMP_MIN;
>  	}
>  
>  	sbi->s_desc_size = le16_to_cpu(es->s_desc_size);
> -- 
> 2.17.1
>
Deepa Dinamani Aug. 1, 2019, 7:18 p.m. UTC | #2
On Wed, Jul 31, 2019 at 8:26 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> On Mon, Jul 29, 2019 at 06:49:13PM -0700, Deepa Dinamani wrote:
> > ext4 has different overflow limits for max filesystem
> > timestamps based on the extra bytes available.
> >
> > The timestamp limits are calculated according to the
> > encoding table in
> > a4dad1ae24f85i(ext4: Fix handling of extended tv_sec):
> >
> > * extra  msb of                         adjust for signed
> > * epoch  32-bit                         32-bit tv_sec to
> > * bits   time    decoded 64-bit tv_sec  64-bit tv_sec      valid time range
> > * 0 0    1    -0x80000000..-0x00000001  0x000000000   1901-12-13..1969-12-31
> > * 0 0    0    0x000000000..0x07fffffff  0x000000000   1970-01-01..2038-01-19
> > * 0 1    1    0x080000000..0x0ffffffff  0x100000000   2038-01-19..2106-02-07
> > * 0 1    0    0x100000000..0x17fffffff  0x100000000   2106-02-07..2174-02-25
> > * 1 0    1    0x180000000..0x1ffffffff  0x200000000   2174-02-25..2242-03-16
> > * 1 0    0    0x200000000..0x27fffffff  0x200000000   2242-03-16..2310-04-04
> > * 1 1    1    0x280000000..0x2ffffffff  0x300000000   2310-04-04..2378-04-22
> > * 1 1    0    0x300000000..0x37fffffff  0x300000000   2378-04-22..2446-05-10
>
> My recollection of ext4 has gotten rusty, so this could be a bogus
> question:
>
> Say you have a filesystem with s_inode_size > 128 where not all of the
> ondisk inodes have been upgraded to i_extra_isize > 0 and therefore
> don't support nanoseconds or times beyond 2038.  I think this happens on
> ext3 filesystems that reserved extra space for inode attrs that are
> subsequently converted to ext4?
>
> In any case, that means that you have some inodes that support 34-bit
> tv_sec and some inodes that only support 32-bit tv_sec.  For the inodes
> with 32-bit tv_sec, I think all that happens is that a large timestamp
> will be truncated further, right?
>
> And no mount time warning because at least /some/ of the inodes are
> ready to go for the next 30 years?

I'm confused about ext3 being converted to ext4. If the converted
inodes have extra space, then ext4_iget() will start using the extra
space when it modifies the on disk inode, won't it?

But, if there are 32 bit tv_sec and 34 bit tv_sec in a superblock then
from this macro below, if an inode has space for extra bits in
timestamps, it uses it. Otherwise, only the first 32 bits are copied
to the on disk timestamp. This matches the behavior today for 32 bit
tv_sec. But, the 34 bit tv_sec has a better behavior after the series
because of the clamping and warning.

#define EXT4_INODE_SET_XTIME(xtime, inode, raw_inode)                \
do {                                        \
    (raw_inode)->xtime = cpu_to_le32((inode)->xtime.tv_sec);        \
    if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra))     {\
        (raw_inode)->xtime ## _extra =                    \
                ext4_encode_extra_time(&(inode)->xtime);    \
        }                                \
} while (0)

I'm not sure if this corner case if important. Maybe the maintainers
can help me here. If this is important, then the inode time updates
for an ext4 inode should always be through ext4_setattr() and we can
clamp the timestamps there as a special case. And, this patch can be
added separately?

Thanks,
Deepa
Theodore Ts'o Aug. 1, 2019, 10:43 p.m. UTC | #3
On Thu, Aug 01, 2019 at 12:18:28PM -0700, Deepa Dinamani wrote:
> > Say you have a filesystem with s_inode_size > 128 where not all of the
> > ondisk inodes have been upgraded to i_extra_isize > 0 and therefore
> > don't support nanoseconds or times beyond 2038.  I think this happens on
> > ext3 filesystems that reserved extra space for inode attrs that are
> > subsequently converted to ext4?
> 
> I'm confused about ext3 being converted to ext4. If the converted
> inodes have extra space, then ext4_iget() will start using the extra
> space when it modifies the on disk inode, won't it?i

It is possible that you can have an ext3 file system with (for
example) 256 byte inodes, and all of the extra space was used for
extended attributes, then ext4 won't have the extra space available.
This is going toh be on an inode-by-inode basis, and if an extended
attribute is motdified or deleted, the space would become available,t
and then inode would start getting a higher resolution timestamp.

I really don't think it's worth worrying about that, though.  It's
highly unlikely ext3 file systems will be still be in service by the
time it's needed in 2038.  And if so, it's highly unlikely they would
be converted to ext4.

						- Ted
Arnd Bergmann Aug. 2, 2019, 10:39 a.m. UTC | #4
On Fri, Aug 2, 2019 at 12:43 AM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> On Thu, Aug 01, 2019 at 12:18:28PM -0700, Deepa Dinamani wrote:
> > > Say you have a filesystem with s_inode_size > 128 where not all of the
> > > ondisk inodes have been upgraded to i_extra_isize > 0 and therefore
> > > don't support nanoseconds or times beyond 2038.  I think this happens on
> > > ext3 filesystems that reserved extra space for inode attrs that are
> > > subsequently converted to ext4?
> >
> > I'm confused about ext3 being converted to ext4. If the converted
> > inodes have extra space, then ext4_iget() will start using the extra
> > space when it modifies the on disk inode, won't it?i
>
> It is possible that you can have an ext3 file system with (for
> example) 256 byte inodes, and all of the extra space was used for
> extended attributes, then ext4 won't have the extra space available.
> This is going toh be on an inode-by-inode basis, and if an extended
> attribute is motdified or deleted, the space would become available,t
> and then inode would start getting a higher resolution timestamp.

Is it correct to assume that this kind of file would have to be
created using the ext3.ko file system implementation that was
removed in linux-4.3, but not using ext2.ko or ext4.ko (which
would always set the extended timestamps even in "-t ext2" or
"-t ext3" mode)?

I tried to reproduce this on a modern kernel and with and
moderately old debugfs (1.42.13) but failed.

> I really don't think it's worth worrying about that, though.  It's
> highly unlikely ext3 file systems will be still be in service by the
> time it's needed in 2038.  And if so, it's highly unlikely they would
> be converted to ext4.

As the difference is easily visible even before y2038 by using
utimensat(old_inode, future_date) on a file, we should at least
decide what the sanest behavior is that we can easily implement,
and then document what is expected to happen here.

If we check for s_min_extra_isize instead of s_inode_size
to determine s_time_gran/s_time_max, we would warn
at mount time as well as and consistently truncate all
timestamps to full 32-bit seconds, regardless of whether
there is actually space or not.

Alternatively, we could warn if s_min_extra_isize is
too small, but use i_inode_size to determine
s_time_gran/s_time_max anyway.

From looking at e2fsprogs git history, I see that
s_min_extra_isize has always been set by mkfs since
2008, but I'm not sure if there would have been a
case in which it remains set but the ext3.ko would
ignore it and use that space anyway.

       Arnd
Theodore Ts'o Aug. 2, 2019, 3:43 p.m. UTC | #5
On Fri, Aug 02, 2019 at 12:39:41PM +0200, Arnd Bergmann wrote:
> Is it correct to assume that this kind of file would have to be
> created using the ext3.ko file system implementation that was
> removed in linux-4.3, but not usiing ext2.ko or ext4.ko (which
> would always set the extended timestamps even in "-t ext2" or
> "-t ext3" mode)?

Correct.  Some of the enterprise distro's were using ext4 to support
"mount -t ext3" even before 4.3.  There's a CONFIG option to enable
using ext4 for ext2 or ext3 if they aren't enabled.

> If we check for s_min_extra_isize instead of s_inode_size
> to determine s_time_gran/s_time_max, we would warn
> at mount time as well as and consistently truncate all
> timestamps to full 32-bit seconds, regardless of whether
> there is actually space or not.
> 
> Alternatively, we could warn if s_min_extra_isize is
> too small, but use i_inode_size to determine
> s_time_gran/s_time_max anyway.

Even with ext4, s_min_extra_isize doesn't guarantee that will be able
to expand the inode.  This can fail if (a) we aren't able to expand
existing the transaction handle because there isn't enough space in
the journal, or (b) there is already an external xattr block which is
also full, so there is no space to evacuate an extended attribute out
of the inode's extra space.

We could be more aggressive by trying to expand make room in the inode
in ext4_iget (when we're reading in the inode, assuming the file
system isn't mounted read/only), instead of in the middle of
mark_inode_dirty().  That will eliminate failure mode (a) --- which is
statistically rare --- but it won't eliminate failure mode (b).

Ultimately, the question is which is worse: having a timestamp be
wrong, or randomly dropping an xattr from the inode to make room for
the extended timestamp.  We've come down on it being less harmful to
have the timestamp be wrong.

But again, this is a pretty rare case.  I'm not convinced it's worth
stressing about, since it's going to require multiple things to go
wrong before a timestamp will be bad.

					- Ted
Arnd Bergmann Aug. 2, 2019, 7 p.m. UTC | #6
On Fri, Aug 2, 2019 at 5:43 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> On Fri, Aug 02, 2019 at 12:39:41PM +0200, Arnd Bergmann wrote:
> > Is it correct to assume that this kind of file would have to be
> > created using the ext3.ko file system implementation that was
> > removed in linux-4.3, but not usiing ext2.ko or ext4.ko (which
> > would always set the extended timestamps even in "-t ext2" or
> > "-t ext3" mode)?
>
> Correct.  Some of the enterprise distro's were using ext4 to support
> "mount -t ext3" even before 4.3.  There's a CONFIG option to enable
> using ext4 for ext2 or ext3 if they aren't enabled.
>
> > If we check for s_min_extra_isize instead of s_inode_size
> > to determine s_time_gran/s_time_max, we would warn
> > at mount time as well as and consistently truncate all
> > timestamps to full 32-bit seconds, regardless of whether
> > there is actually space or not.
> >
> > Alternatively, we could warn if s_min_extra_isize is
> > too small, but use i_inode_size to determine
> > s_time_gran/s_time_max anyway.
>
> Even with ext4, s_min_extra_isize doesn't guarantee that will be able
> to expand the inode.  This can fail if (a) we aren't able to expand
> existing the transaction handle because there isn't enough space in
> the journal, or (b) there is already an external xattr block which is
> also full, so there is no space to evacuate an extended attribute out
> of the inode's extra space.

I must have misunderstood what the field says. I expected that
with s_min_extra_isize set beyond the nanosecond fields, there
would be a guarantee that all inodes have at least as many
extra bytes already allocated. What circumstances would lead to
an i_extra_isize smaller than s_min_extra_isize?

> We could be more aggressive by trying to expand make room in the inode
> in ext4_iget (when we're reading in the inode, assuming the file
> system isn't mounted read/only), instead of in the middle of
> mark_inode_dirty().  That will eliminate failure mode (a) --- which is
> statistically rare --- but it won't eliminate failure mode (b).
>
> Ultimately, the question is which is worse: having a timestamp be
> wrong, or randomly dropping an xattr from the inode to make room for
> the extended timestamp.  We've come down on it being less harmful to
> have the timestamp be wrong.
>
> But again, this is a pretty rare case.  I'm not convinced it's worth
> stressing about, since it's going to require multiple things to go
> wrong before a timestamp will be bad.

Agreed, I'm not overly worried about this happening frequently,
I'd just feel better if we could reliably warn about the few instances
that might theoretically be affected.

        Arnd
Theodore Ts'o Aug. 2, 2019, 9:39 p.m. UTC | #7
On Fri, Aug 02, 2019 at 09:00:52PM +0200, Arnd Bergmann wrote:
> 
> I must have misunderstood what the field says. I expected that
> with s_min_extra_isize set beyond the nanosecond fields, there
> would be a guarantee that all inodes have at least as many
> extra bytes already allocated. What circumstances would lead to
> an i_extra_isize smaller than s_min_extra_isize?

When allocating new inodes, i_extra_isize is set to
s_want_extra_isize.  When modifying existing inodes, if i_extra_isize
is less than s_min_extra_isize, then we will attempt to move out
extended attribute(s) to the external xattr block.  So the
s_min_extra_isize field is not a guarantee, but rather an aspirationa
goal.  The idea is that at some point when we want to enable a new
feature, which needs more extra inode space, we can adjust
s_min_extra_size and s_want_extra_size, and the file system will
migrate things to meet these constraints.

The plan was to teach e2fsck how to fix all of the inodes to meet theh
s_min_extra_size value, but that never got implemented, and we even
then, e2fsck would have to deal with the case where tit couldn't move
the extended attribute(s) in the inode out, because there was no place
to put them.

In practice, this hasn't been that much of a limitation because we
haven't been adding that many extra inode fields.  Keep in mind that
Red Hat for example, has explicitly said they will *never* support
adding new features to an existing file system.  Their only supported
method is back up the file system, reformat it with the new file
system features, and then restore the file system.

Of course, if the backup/restore includes backing up the extended
attributes, and then restoring them, the xattr restore could fail,
unless the user also increased the inode size (e.g., from 256 bytes to
512 bytes).

Getting this right in the general case is *hard*.  Fortunately, the
corner cases really don't happen that often in practice, at least not
for pure Linux workloads.  Windows which can have arbitrarily large
security id's and ACL's might make this harder, of course --- although
ext4's EA in inode feature would make this better, modulo needing to
write more complex file system code to handle moving xattrs around.

Since the extended timestamps were one of the first extra inode fields
to be added, I strongly suggest that we not try to borrow trouble.
Solving the general case problem is *hard*.

					- Ted
Arnd Bergmann Aug. 3, 2019, 9:30 a.m. UTC | #8
On Fri, Aug 2, 2019 at 11:39 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> On Fri, Aug 02, 2019 at 09:00:52PM +0200, Arnd Bergmann wrote:
> >
> > I must have misunderstood what the field says. I expected that
> > with s_min_extra_isize set beyond the nanosecond fields, there
> > would be a guarantee that all inodes have at least as many
> > extra bytes already allocated. What circumstances would lead to
> > an i_extra_isize smaller than s_min_extra_isize?
>
> When allocating new inodes, i_extra_isize is set to
> s_want_extra_isize.  When modifying existing inodes, if i_extra_isize
> is less than s_min_extra_isize, then we will attempt to move out
> extended attribute(s) to the external xattr block.  So the
> s_min_extra_isize field is not a guarantee, but rather an aspirationa
> goal.  The idea is that at some point when we want to enable a new
> feature, which needs more extra inode space, we can adjust
> s_min_extra_size and s_want_extra_size, and the file system will
> migrate things to meet these constraints.

I see in the ext4 code that we always try to expand i_extra_size
to s_want_extra_isize in ext4_mark_inode_dirty(), and that
s_want_extra_isize is always at least  s_min_extra_isize, so
we constantly try to expand the inode to fit.

What I still don't see is how any inode on the file system
image could have ended up with less than s_min_extra_isize
in the first place if s_min_extra_isize is never modified and
all inodes in the file system would have originally been
created with  i_extra_isize >= s_min_extra_isize if
EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE is set.

Did older versions of ext4 or ext3 ignore s_min_extra_isize
when creating inodes despite
EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE,
or is there another possibility I'm missing?

> Since the extended timestamps were one of the first extra inode fields
> to be added, I strongly suggest that we not try to borrow trouble.
> Solving the general case problem is *hard*.

As I said before, I absolutely don't suggest we solve the problem
of reliably setting the timestamps, I'm just trying to find out if there
is a way to know for sure that it cannot happen and alert the user
otherwise. So far I think we have concluded

- just checking s_inode_size is not sufficient because ext3
  may have created inodes with s_extra_isize too small
- checking s_min_extra_isize may not be sufficient either, for
  similar reasons I don't yet fully understand (see above).

If there is any other way to be sure that the file system
has never been mounted as a writable ext3, maybe that can
be used instead?

        Arnd
Theodore Ts'o Aug. 3, 2019, 4:02 p.m. UTC | #9
On Sat, Aug 03, 2019 at 11:30:22AM +0200, Arnd Bergmann wrote:
> 
> I see in the ext4 code that we always try to expand i_extra_size
> to s_want_extra_isize in ext4_mark_inode_dirty(), and that
> s_want_extra_isize is always at least  s_min_extra_isize, so
> we constantly try to expand the inode to fit.

Yes, we *try*.  But we may not succeed.  There may actually be a
problem here if the cause is due to there simply is no space in the
external xattr block, so we might try and try every time we try to
modify that inode, and it would be a performance mess.  If it's due to
there being no room in the current transaction, then it's highly
likely it will succeed the next time.

> Did older versions of ext4 or ext3 ignore s_min_extra_isize
> when creating inodes despite
> EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE,
> or is there another possibility I'm missing?

s_min_extra_isize could get changed in order to make room for some new
file system feature --- such as extended timestamps.  That's how we
might take an old ext3 file system with an inode size > 128, and try
to evacuate space for extended timestamps, on a best efforts basis.
And since it's best efforts is why Red Hat refuses to support that
case.  It'll work 99.9% of the time, but they don't want to deal with
the 0.01% cases showing up at their help desk.

If you want to pretend that file systems never get upgraded, then life
is much simpler.  The general approach is that for less-sophisticated
customers (e.g., most people running enterprise distros) file system
upgrades are not a thing.  But for sophisticated users, we do try to
make thing work for people who are aware of the risks / caveats /
rough edges.  Google won't have been able to upgrade thousands and
thousands of servers in data centers all over the world if we limited
ourselves to Red Hat's support restrictions.  Backup / reformat /
restore really isn't a practical rollout strategy for many exabytes of
file systems.

It sounds like your safety checks / warnings are mostly targeted at
low-information customers, no?

					- Ted
Arnd Bergmann Aug. 3, 2019, 8:24 p.m. UTC | #10
On Sat, Aug 3, 2019 at 6:03 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> On Sat, Aug 03, 2019 at 11:30:22AM +0200, Arnd Bergmann wrote:
> >
> > I see in the ext4 code that we always try to expand i_extra_size
> > to s_want_extra_isize in ext4_mark_inode_dirty(), and that
> > s_want_extra_isize is always at least  s_min_extra_isize, so
> > we constantly try to expand the inode to fit.
>
> Yes, we *try*.  But we may not succeed.  There may actually be a
> problem here if the cause is due to there simply is no space in the
> external xattr block, so we might try and try every time we try to
> modify that inode, and it would be a performance mess.  If it's due to
> there being no room in the current transaction, then it's highly
> likely it will succeed the next time.
>
> > Did older versions of ext4 or ext3 ignore s_min_extra_isize
> > when creating inodes despite
> > EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE,
> > or is there another possibility I'm missing?
>
> s_min_extra_isize could get changed in order to make room for some new
> file system feature --- such as extended timestamps.

Ok, that explains it. I assumed s_min_extra_isize was meant to
not be modifiable, and did not find a way to change it using the
kernel or tune2fs, but now I can see that debugfs can set it.

> If you want to pretend that file systems never get upgraded, then life
> is much simpler.  The general approach is that for less-sophisticated
> customers (e.g., most people running enterprise distros) file system
> upgrades are not a thing.  But for sophisticated users, we do try to
> make thing work for people who are aware of the risks / caveats /
> rough edges.  Google won't have been able to upgrade thousands and
> thousands of servers in data centers all over the world if we limited
> ourselves to Red Hat's support restrictions.  Backup / reformat /
> restore really isn't a practical rollout strategy for many exabytes of
> file systems.
>
> It sounds like your safety checks / warnings are mostly targeted at
> low-information customers, no?

Yes, that seems like a reasonable compromise: just warn based
on s_min_extra_isize, and assume that anyone who used debugfs
to set s_min_extra_isize to a higher value from an ext3 file system
during the migration to ext4 was aware of the risks already.

That leaves the question of what we should set the s_time_gran
and s_time_max to on a superblock with s_min_extra_isize<16
and s_want_extra_isize>=16.

If we base it on s_min_extra_isize, we never try to set a timestamp
later than 2038 and so will never fail, but anyone with a grandfathered
s_min_extra_isize from ext3 won't be able to set extended
timestamps on any files any more. Based on s_want_extra_isize
we would keep the current behavior, but could add a custom
warning in the ext4 code about the small s_min_extra_isize
indicating a theoretical problem.

       Arnd
Andreas Dilger Aug. 7, 2019, 6:04 p.m. UTC | #11
On Aug 3, 2019, at 2:24 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> 
> On Sat, Aug 3, 2019 at 6:03 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>> 
>> On Sat, Aug 03, 2019 at 11:30:22AM +0200, Arnd Bergmann wrote:
>>> 
>>> I see in the ext4 code that we always try to expand i_extra_size
>>> to s_want_extra_isize in ext4_mark_inode_dirty(), and that
>>> s_want_extra_isize is always at least  s_min_extra_isize, so
>>> we constantly try to expand the inode to fit.
>> 
>> Yes, we *try*.  But we may not succeed.  There may actually be a
>> problem here if the cause is due to there simply is no space in the
>> external xattr block, so we might try and try every time we try to
>> modify that inode, and it would be a performance mess.  If it's due to
>> there being no room in the current transaction, then it's highly
>> likely it will succeed the next time.
>> 
>>> Did older versions of ext4 or ext3 ignore s_min_extra_isize
>>> when creating inodes despite
>>> EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE,
>>> or is there another possibility I'm missing?
>> 
>> s_min_extra_isize could get changed in order to make room for some new
>> file system feature --- such as extended timestamps.
> 
> Ok, that explains it. I assumed s_min_extra_isize was meant to
> not be modifiable, and did not find a way to change it using the
> kernel or tune2fs, but now I can see that debugfs can set it.
> 
>> If you want to pretend that file systems never get upgraded, then life
>> is much simpler.  The general approach is that for less-sophisticated
>> customers (e.g., most people running enterprise distros) file system
>> upgrades are not a thing.  But for sophisticated users, we do try to
>> make thing work for people who are aware of the risks / caveats /
>> rough edges.  Google won't have been able to upgrade thousands and
>> thousands of servers in data centers all over the world if we limited
>> ourselves to Red Hat's support restrictions.  Backup / reformat /
>> restore really isn't a practical rollout strategy for many exabytes of
>> file systems.
>> 
>> It sounds like your safety checks / warnings are mostly targeted at
>> low-information customers, no?
> 
> Yes, that seems like a reasonable compromise: just warn based
> on s_min_extra_isize, and assume that anyone who used debugfs
> to set s_min_extra_isize to a higher value from an ext3 file system
> during the migration to ext4 was aware of the risks already.
> 
> That leaves the question of what we should set the s_time_gran
> and s_time_max to on a superblock with s_min_extra_isize<16
> and s_want_extra_isize>=16.
> 
> If we base it on s_min_extra_isize, we never try to set a timestamp
> later than 2038 and so will never fail, but anyone with a grandfathered
> s_min_extra_isize from ext3 won't be able to set extended
> timestamps on any files any more. Based on s_want_extra_isize
> we would keep the current behavior, but could add a custom
> warning in the ext4 code about the small s_min_extra_isize
> indicating a theoretical problem.

I think it makes the most sense to always try to set timestamps on
inodes that have enough space for them.  The chance of running into
a filesystem with 256-byte inode size but *no* space in the inode to
store an extended timestamp, but is *also* being modified by a new
kernel after 2038 is vanishingly small.  This would require formatting
the filesystem with non-default mke2fs for ext3, using the filesystem
and storing enough xattrs on inodes that there isn't space for 12 bytes
of extra isize, and using it for 30+ years without upgrading to use
ext4 (which will also try to expand the inode to store the nsec
timestamps) and then modifying the inode after 2038.

Rather than printing a warning at mount time (which may be confusing
to users for a problem they may never see), it makes sense to only
print such a warning in the vanishingly small case that someone actually
tries to modify the inode timestamp but it doesn't fit, rather than on
the theoretical case that may never happen.


Cheers, Andreas
Deepa Dinamani Aug. 8, 2019, 6:27 p.m. UTC | #12
> Rather than printing a warning at mount time (which may be confusing
> to users for a problem they may never see), it makes sense to only
> print such a warning in the vanishingly small case that someone actually
> tries to modify the inode timestamp but it doesn't fit, rather than on
> the theoretical case that may never happen.

Arnd and I were discussing and we came to a similar conclusion that we
would not warn at mount. Arnd suggested maybe we include a
pr_warn_ratelimited() when we do write to these special inodes.

In general, there is an agreement to leave the fs granularity to a
higher resolution for such super blocks. And hence, the timestamps for
these special cases will never be clamped in memory.(Assuming we do
not want to make more changes and try to do something like
__ext4_expand_extra_isize() for in memory inode updates)
We could easily try to clamp these timestamps when they get written
out to the disk by modifying the EXT4_INODE_SET_XTIME to include such
a clamp. And, at this point we could also warn.

If this is acceptable, I could post an update.

-Deepa
diff mbox series

Patch

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 1cb67859e051..3f13cf12ae7f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1631,6 +1631,10 @@  static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
 
 #define EXT4_GOOD_OLD_INODE_SIZE 128
 
+#define EXT4_EXTRA_TIMESTAMP_MAX	(((s64)1 << 34) - 1  + S32_MIN)
+#define EXT4_NON_EXTRA_TIMESTAMP_MAX	S32_MAX
+#define EXT4_TIMESTAMP_MIN		S32_MIN
+
 /*
  * Feature set definitions
  */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4079605d437a..3ea2d60f33aa 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4035,8 +4035,21 @@  static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 			       sbi->s_inode_size);
 			goto failed_mount;
 		}
-		if (sbi->s_inode_size > EXT4_GOOD_OLD_INODE_SIZE)
-			sb->s_time_gran = 1 << (EXT4_EPOCH_BITS - 2);
+		/*
+		 * i_atime_extra is the last extra field available for [acm]times in
+		 * struct ext4_inode. Checking for that field should suffice to ensure
+		 * we have extra space for all three.
+		 */
+		if (sbi->s_inode_size >= offsetof(struct ext4_inode, i_atime_extra) +
+			sizeof(((struct ext4_inode *)0)->i_atime_extra)) {
+			sb->s_time_gran = 1;
+			sb->s_time_max = EXT4_EXTRA_TIMESTAMP_MAX;
+		} else {
+			sb->s_time_gran = NSEC_PER_SEC;
+			sb->s_time_max = EXT4_NON_EXTRA_TIMESTAMP_MAX;
+		}
+
+		sb->s_time_min = EXT4_TIMESTAMP_MIN;
 	}
 
 	sbi->s_desc_size = le16_to_cpu(es->s_desc_size);