Message ID | 159797594159.965217.2504039364311840477.stgit@magnolia (mailing list archive) |
---|---|
State | Deferred, archived |
Headers | show |
Series | xfs: widen timestamps to deal with y2038 | expand |
> * in the AGI header so that we can skip the finobt walk at mount time when > @@ -855,12 +862,18 @@ struct xfs_agfl { > * > * Inode timestamps consist of signed 32-bit counters for seconds and > * nanoseconds; time zero is the Unix epoch, Jan 1 00:00:00 UTC 1970. > + * > + * When bigtime is enabled, timestamps become an unsigned 64-bit nanoseconds > + * counter. Time zero is the start of the classic timestamp range. > */ > union xfs_timestamp { > struct { > __be32 t_sec; /* timestamp seconds */ > __be32 t_nsec; /* timestamp nanoseconds */ > }; > + > + /* Nanoseconds since the bigtime epoch. */ > + __be64 t_bigtime; > }; So do we really need the union here? What about: (1) keep the typedef instead of removing it (2) switch the typedef to be just a __be64, and use trivial helpers to extract the two separate legacy sec/nsec field (3) PROFIT!!! > +/* Convert an ondisk timestamp into the 64-bit safe incore format. */ > void > xfs_inode_from_disk_timestamp( > + struct xfs_dinode *dip, > struct timespec64 *tv, > const union xfs_timestamp *ts) I think passing ts by value might lead to somewhat better code generation on modern ABIs (and older ABIs just fall back to pass by reference transparently). > { > + if (dip->di_version >= 3 && > + (dip->di_flags2 & cpu_to_be64(XFS_DIFLAG2_BIGTIME))) { Do we want a helper for this condition? > + uint64_t t = be64_to_cpu(ts->t_bigtime); > + uint64_t s; > + uint32_t n; > + > + s = div_u64_rem(t, NSEC_PER_SEC, &n); > + tv->tv_sec = s - XFS_INO_BIGTIME_EPOCH; > + tv->tv_nsec = n; > + return; > + } > + > tv->tv_sec = (int)be32_to_cpu(ts->t_sec); > tv->tv_nsec = (int)be32_to_cpu(ts->t_nsec); Nit: for these kinds of symmetric conditions and if/else feels a little more natural. > + xfs_log_dinode_to_disk_ts(from, &to->di_crtime, &from->di_crtime); This adds a > 80 char line. > + if (from->di_flags2 & XFS_DIFLAG2_BIGTIME) { > + uint64_t t; > + > + t = (uint64_t)(ts->tv_sec + XFS_INO_BIGTIME_EPOCH); > + t *= NSEC_PER_SEC; > + its->t_bigtime = t + ts->tv_nsec; This calculation is dupliated in two places, might be worth adding a little helper (which will need to get the sec/nsec values passed separately due to the different structures). > + xfs_inode_to_log_dinode_ts(from, &to->di_crtime, &from->di_crtime); Another line over 8 characters here. > + if (xfs_sb_version_hasbigtime(&mp->m_sb)) { > + sb->s_time_min = XFS_INO_BIGTIME_MIN; > + sb->s_time_max = XFS_INO_BIGTIME_MAX; > + } else { > + sb->s_time_min = XFS_INO_TIME_MIN; > + sb->s_time_max = XFS_INO_TIME_MAX; > + } This is really a comment on the earlier patch, but maybe we should name the old constants with "OLD" or "LEGACY" or "SMALL" in the name? > @@ -1494,6 +1499,10 @@ xfs_fc_fill_super( > if (XFS_SB_VERSION_NUM(&mp->m_sb) == XFS_SB_VERSION_5) > sb->s_flags |= SB_I_VERSION; > > + if (xfs_sb_version_hasbigtime(&mp->m_sb)) > + xfs_warn(mp, > + "EXPERIMENTAL big timestamp feature in use. Use at your own risk!"); > + Is there any good reason to mark this experimental?
On Thu, Aug 20, 2020 at 07:12:21PM -0700, Darrick J. Wong wrote: > From: Darrick J. Wong <darrick.wong@oracle.com> > > Redesign the ondisk timestamps to be a simple unsigned 64-bit counter of > nanoseconds since 14 Dec 1901 (i.e. the minimum time in the 32-bit unix > time epoch). This enables us to handle dates up to 2486, which solves > the y2038 problem. > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > Reviewed-by: Amir Goldstein <amir73il@gmail.com> .... > @@ -875,6 +888,25 @@ union xfs_timestamp { > */ > #define XFS_INO_TIME_MAX ((int64_t)S32_MAX) > > +/* > + * Number of seconds between the start of the bigtime timestamp range and the > + * start of the Unix epoch. > + */ > +#define XFS_INO_BIGTIME_EPOCH (-XFS_INO_TIME_MIN) This is confusing. It's taken me 15 minutes so far to get my head around this because the reference frame for all these definitions is not clear. I though these had something to do with nanosecond timestamp limits because that's what BIGTIME records, but..... The start of the epoch is a negative number based on the definition of the on-disk format for the minimum number of seconds that the "Unix" timestamp format can store? Why is this not defined in nanoseconds given that is what is stored on disk? XFS_INO_BIGTIME_EPOCH = (-XFS_INO_TIME_MIN) = (-((int64_t)S32_MIN)) = (-((int64_t)-2^31)) = 2^31? So the bigtime epoch is considered to be 2^31 *seconds* into the range of the on-disk nanosecond timestamp? Huh? > + > +/* > + * Smallest possible timestamp with big timestamps, which is > + * Dec 13 20:45:52 UTC 1901. > + */ > +#define XFS_INO_BIGTIME_MIN (XFS_INO_TIME_MIN) Same here. The reference here is "seconds before the Unix epoch", not the minimum valid on disk nanosecond value. Oh, these are defining the post-disk-to-in-memory-format conversion limits? They have nothing to do with on-disk limits nor that on-disk format is stored in nanosecond? i.e. the reference frame for these limits is actually still the in-kernel Unix epoch definition? > +/* > + * Largest possible timestamp with big timestamps, which is > + * Jul 2 20:20:25 UTC 2486. > + */ > +#define XFS_INO_BIGTIME_MAX ((int64_t)((-1ULL / NSEC_PER_SEC) - \ > + XFS_INO_BIGTIME_EPOCH)) Urk. This makes my head -hurt-. It's converting the maximum on-disk format nanosecond count to a maximum second count then taking away the second count for the epoch and then casting it to a signed int because the in-memory format for seconds is signed? Oh, and the 64 bit division is via constants, so the compiler does it and doesn't need to use something like div_u64(), right? Mind you, I'm just guessing here that the "-1ULL" is the representation of the maximum on-disk nanosecond timestamp here, because that doesn't seem to be defined anywhere.... > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c > index cc1316a5fe0c..c59ddb56bb90 100644 > --- a/fs/xfs/libxfs/xfs_inode_buf.c > +++ b/fs/xfs/libxfs/xfs_inode_buf.c > @@ -157,11 +157,25 @@ xfs_imap_to_bp( > return 0; > } > > +/* Convert an ondisk timestamp into the 64-bit safe incore format. */ > void > xfs_inode_from_disk_timestamp( > + struct xfs_dinode *dip, > struct timespec64 *tv, > const union xfs_timestamp *ts) > { > + if (dip->di_version >= 3 && > + (dip->di_flags2 & cpu_to_be64(XFS_DIFLAG2_BIGTIME))) { Helper. xfs_dinode_has_bigtime() was what I suggested previously. > + uint64_t t = be64_to_cpu(ts->t_bigtime); > + uint64_t s; > + uint32_t n; > + > + s = div_u64_rem(t, NSEC_PER_SEC, &n); > + tv->tv_sec = s - XFS_INO_BIGTIME_EPOCH; > + tv->tv_nsec = n; > + return; > + } > + > tv->tv_sec = (int)be32_to_cpu(ts->t_sec); > tv->tv_nsec = (int)be32_to_cpu(ts->t_nsec); > } I still don't really like the way this turned out :( > @@ -220,9 +234,9 @@ xfs_inode_from_disk( > * a time before epoch is converted to a time long after epoch > * on 64 bit systems. > */ > - xfs_inode_from_disk_timestamp(&inode->i_atime, &from->di_atime); > - xfs_inode_from_disk_timestamp(&inode->i_mtime, &from->di_mtime); > - xfs_inode_from_disk_timestamp(&inode->i_ctime, &from->di_ctime); > + xfs_inode_from_disk_timestamp(from, &inode->i_atime, &from->di_atime); > + xfs_inode_from_disk_timestamp(from, &inode->i_mtime, &from->di_mtime); > + xfs_inode_from_disk_timestamp(from, &inode->i_ctime, &from->di_ctime); > > to->di_size = be64_to_cpu(from->di_size); > to->di_nblocks = be64_to_cpu(from->di_nblocks); > @@ -235,9 +249,17 @@ xfs_inode_from_disk( > if (xfs_sb_version_has_v3inode(&ip->i_mount->m_sb)) { > inode_set_iversion_queried(inode, > be64_to_cpu(from->di_changecount)); > - xfs_inode_from_disk_timestamp(&to->di_crtime, &from->di_crtime); > + xfs_inode_from_disk_timestamp(from, &to->di_crtime, > + &from->di_crtime); > to->di_flags2 = be64_to_cpu(from->di_flags2); > to->di_cowextsize = be32_to_cpu(from->di_cowextsize); > + /* > + * Set the bigtime flag incore so that we automatically convert > + * this inode's ondisk timestamps to bigtime format the next > + * time we write the inode core to disk. > + */ > + if (xfs_sb_version_hasbigtime(&ip->i_mount->m_sb)) > + to->di_flags2 |= XFS_DIFLAG2_BIGTIME; > } iAs I said before, you can't set this flag here - it needs to be done transactionally when the timestamp is next logged. > @@ -259,9 +281,19 @@ xfs_inode_from_disk( > > void > xfs_inode_to_disk_timestamp( > + struct xfs_icdinode *from, > union xfs_timestamp *ts, > const struct timespec64 *tv) > { > + if (from->di_flags2 & XFS_DIFLAG2_BIGTIME) { Shouldn't this also check the inode version number like the dinode decoder? Perhaps a helper like xfs_inode_has_bigtime(ip)? > + uint64_t t; > + > + t = (uint64_t)(tv->tv_sec + XFS_INO_BIGTIME_EPOCH); tv_sec can still overflow, right? t = (uint64_t)tv->tv_sec + XFS_INO_BIGTIME_EPOCH; > + t *= NSEC_PER_SEC; > + ts->t_bigtime = cpu_to_be64(t + tv->tv_nsec); > + return; > + } > + > ts->t_sec = cpu_to_be32(tv->tv_sec); > ts->t_nsec = cpu_to_be32(tv->tv_nsec); > } > @@ -285,9 +317,9 @@ xfs_inode_to_disk( > to->di_projid_hi = cpu_to_be16(from->di_projid >> 16); > > memset(to->di_pad, 0, sizeof(to->di_pad)); > - xfs_inode_to_disk_timestamp(&to->di_atime, &inode->i_atime); > - xfs_inode_to_disk_timestamp(&to->di_mtime, &inode->i_mtime); > - xfs_inode_to_disk_timestamp(&to->di_ctime, &inode->i_ctime); > + xfs_inode_to_disk_timestamp(from, &to->di_atime, &inode->i_atime); > + xfs_inode_to_disk_timestamp(from, &to->di_mtime, &inode->i_mtime); > + xfs_inode_to_disk_timestamp(from, &to->di_ctime, &inode->i_ctime); > to->di_nlink = cpu_to_be32(inode->i_nlink); > to->di_gen = cpu_to_be32(inode->i_generation); > to->di_mode = cpu_to_be16(inode->i_mode); > @@ -306,7 +338,8 @@ xfs_inode_to_disk( > if (xfs_sb_version_has_v3inode(&ip->i_mount->m_sb)) { > to->di_version = 3; > to->di_changecount = cpu_to_be64(inode_peek_iversion(inode)); > - xfs_inode_to_disk_timestamp(&to->di_crtime, &from->di_crtime); > + xfs_inode_to_disk_timestamp(from, &to->di_crtime, > + &from->di_crtime); > to->di_flags2 = cpu_to_be64(from->di_flags2); > to->di_cowextsize = cpu_to_be32(from->di_cowextsize); > to->di_ino = cpu_to_be64(ip->i_ino); > @@ -526,6 +559,11 @@ xfs_dinode_verify( > if (fa) > return fa; > > + /* bigtime iflag can only happen on bigtime filesystems */ > + if ((flags2 & (XFS_DIFLAG2_BIGTIME)) && > + !xfs_sb_version_hasbigtime(&mp->m_sb)) if (xfs_dinode_has_bigtime(dip) && !xfs_sb_version_hasbigtime(&mp->m_sb)) > +void xfs_inode_to_disk_timestamp(struct xfs_icdinode *from, > + union xfs_timestamp *ts, const struct timespec64 *tv); > > #endif /* __XFS_INODE_BUF_H__ */ > diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h > index 17c83d29998c..569721f7f9e5 100644 > --- a/fs/xfs/libxfs/xfs_log_format.h > +++ b/fs/xfs/libxfs/xfs_log_format.h > @@ -373,6 +373,9 @@ union xfs_ictimestamp { > int32_t t_sec; /* timestamp seconds */ > int32_t t_nsec; /* timestamp nanoseconds */ > }; > + > + /* Nanoseconds since the bigtime epoch. */ > + uint64_t t_bigtime; > }; Where are we using this again? Right now the timestamps are converted directly into the VFS inode timestamp fields so we can get rid of these incore timestamp fields. So shouldn't we be trying to get rid of this structure rather than adding more functionality to it? > @@ -131,6 +131,17 @@ xfs_trans_log_inode( > iversion_flags = XFS_ILOG_CORE; > } > > + /* > + * If we're updating the inode core or the timestamps and it's possible > + * to upgrade this inode to bigtime format, do so now. > + */ > + if ((flags & (XFS_ILOG_CORE | XFS_ILOG_TIMESTAMP)) && > + xfs_sb_version_hasbigtime(&tp->t_mountp->m_sb) && > + !(ip->i_d.di_flags2 & XFS_DIFLAG2_BIGTIME)) { The latter two checks could be wrapped up into a helper named something obvious like xfs_inode_need_bigtime(ip)? > + ip->i_d.di_flags2 |= XFS_DIFLAG2_BIGTIME; > + flags |= XFS_ILOG_CORE; > + } > + > /* > * Record the specific change for fdatasync optimisation. This allows > * fdatasync to skip log forces for inodes that are only timestamp > diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c > index 9f036053fdb7..6f00309de2d4 100644 > --- a/fs/xfs/scrub/inode.c > +++ b/fs/xfs/scrub/inode.c > @@ -190,6 +190,11 @@ xchk_inode_flags2( > if ((flags2 & XFS_DIFLAG2_DAX) && (flags2 & XFS_DIFLAG2_REFLINK)) > goto bad; > > + /* no bigtime iflag without the bigtime feature */ > + if (!xfs_sb_version_hasbigtime(&mp->m_sb) && > + (flags2 & XFS_DIFLAG2_BIGTIME)) Can we get all these open coded checks wrapped up into a single helper? > + xchk_dinode_nsec(sc, ino, dip, &dip->di_crtime); > xchk_inode_flags2(sc, dip, ino, mode, flags, flags2); > xchk_inode_cowextsize(sc, dip, ino, mode, flags, > flags2); > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c > index c06129cffba9..bbc309bc70c4 100644 > --- a/fs/xfs/xfs_inode.c > +++ b/fs/xfs/xfs_inode.c > @@ -841,6 +841,8 @@ xfs_ialloc( > if (xfs_sb_version_has_v3inode(&mp->m_sb)) { > inode_set_iversion(inode, 1); > ip->i_d.di_flags2 = 0; > + if (xfs_sb_version_hasbigtime(&mp->m_sb)) > + ip->i_d.di_flags2 |= XFS_DIFLAG2_BIGTIME; Rather than calculate the initial inode falgs on every allocation, shouldn't we just have the defaults pre-calculated at mount time? > ip->i_d.di_cowextsize = 0; > ip->i_d.di_crtime = tv; > } > @@ -2717,7 +2719,11 @@ xfs_ifree( > > VFS_I(ip)->i_mode = 0; /* mark incore inode as free */ > ip->i_d.di_flags = 0; > - ip->i_d.di_flags2 = 0; > + /* > + * Preserve the bigtime flag so that di_ctime accurately stores the > + * deletion time. > + */ > + ip->i_d.di_flags2 &= XFS_DIFLAG2_BIGTIME; Oh, that's a nasty wart. > ip->i_d.di_dmevmask = 0; > ip->i_d.di_forkoff = 0; /* mark the attr fork not in use */ > ip->i_df.if_format = XFS_DINODE_FMT_EXTENTS; > @@ -3503,6 +3509,13 @@ xfs_iflush( > __func__, ip->i_ino, ip->i_d.di_forkoff, ip); > goto flush_out; > } > + if (xfs_sb_version_hasbigtime(&mp->m_sb) && > + !(ip->i_d.di_flags2 & XFS_DIFLAG2_BIGTIME)) { > + xfs_alert_tag(mp, XFS_PTAG_IFLUSH, > + "%s: bad inode %Lu, bigtime unset, ptr "PTR_FMT, > + __func__, ip->i_ino, ip); > + goto flush_out; > + } Why is this a fatal corruption error? if the bigtime flag is not set, we can still store and read the timestamp if it is within the unix epoch range... > > /* > * Inode item log recovery for v2 inodes are dependent on the > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c > index cec6d4d82bc4..c38af3eea48f 100644 > --- a/fs/xfs/xfs_inode_item.c > +++ b/fs/xfs/xfs_inode_item.c > @@ -298,9 +298,16 @@ xfs_inode_item_format_attr_fork( > /* Write a log_dinode timestamp into an ondisk inode timestamp. */ > static inline void > xfs_log_dinode_to_disk_ts( > + struct xfs_log_dinode *from, > union xfs_timestamp *ts, > const union xfs_ictimestamp *its) > { > + if (from->di_version >= 3 && > + (from->di_flags2 & XFS_DIFLAG2_BIGTIME)) { helper. > + ts->t_bigtime = cpu_to_be64(its->t_bigtime); > + return; > + } > + > ts->t_sec = cpu_to_be32(its->t_sec); > ts->t_nsec = cpu_to_be32(its->t_nsec); > } > @@ -322,9 +329,9 @@ xfs_log_dinode_to_disk( > to->di_projid_hi = cpu_to_be16(from->di_projid_hi); > memcpy(to->di_pad, from->di_pad, sizeof(to->di_pad)); > > - xfs_log_dinode_to_disk_ts(&to->di_atime, &from->di_atime); > - xfs_log_dinode_to_disk_ts(&to->di_mtime, &from->di_mtime); > - xfs_log_dinode_to_disk_ts(&to->di_ctime, &from->di_ctime); > + xfs_log_dinode_to_disk_ts(from, &to->di_atime, &from->di_atime); > + xfs_log_dinode_to_disk_ts(from, &to->di_mtime, &from->di_mtime); > + xfs_log_dinode_to_disk_ts(from, &to->di_ctime, &from->di_ctime); > > to->di_size = cpu_to_be64(from->di_size); > to->di_nblocks = cpu_to_be64(from->di_nblocks); > @@ -340,7 +347,7 @@ xfs_log_dinode_to_disk( > > if (from->di_version == 3) { > to->di_changecount = cpu_to_be64(from->di_changecount); > - xfs_log_dinode_to_disk_ts(&to->di_crtime, &from->di_crtime); > + xfs_log_dinode_to_disk_ts(from, &to->di_crtime, &from->di_crtime); > to->di_flags2 = cpu_to_be64(from->di_flags2); > to->di_cowextsize = cpu_to_be32(from->di_cowextsize); > to->di_ino = cpu_to_be64(from->di_ino); > @@ -356,9 +363,19 @@ xfs_log_dinode_to_disk( > /* Write an incore inode timestamp into a log_dinode timestamp. */ > static inline void > xfs_inode_to_log_dinode_ts( > + struct xfs_icdinode *from, > union xfs_ictimestamp *its, > const struct timespec64 *ts) > { > + if (from->di_flags2 & XFS_DIFLAG2_BIGTIME) { > + uint64_t t; > + > + t = (uint64_t)(ts->tv_sec + XFS_INO_BIGTIME_EPOCH); > + t *= NSEC_PER_SEC; > + its->t_bigtime = t + ts->tv_nsec; helper, > + return; > + } > + > its->t_sec = ts->tv_sec; > its->t_nsec = ts->tv_nsec; > } > @@ -381,9 +398,9 @@ xfs_inode_to_log_dinode( > > memset(to->di_pad, 0, sizeof(to->di_pad)); > memset(to->di_pad3, 0, sizeof(to->di_pad3)); > - xfs_inode_to_log_dinode_ts(&to->di_atime, &inode->i_atime); > - xfs_inode_to_log_dinode_ts(&to->di_mtime, &inode->i_mtime); > - xfs_inode_to_log_dinode_ts(&to->di_ctime, &inode->i_ctime); > + xfs_inode_to_log_dinode_ts(from, &to->di_atime, &inode->i_atime); > + xfs_inode_to_log_dinode_ts(from, &to->di_mtime, &inode->i_mtime); > + xfs_inode_to_log_dinode_ts(from, &to->di_ctime, &inode->i_ctime); > to->di_nlink = inode->i_nlink; > to->di_gen = inode->i_generation; > to->di_mode = inode->i_mode; > @@ -405,7 +422,7 @@ xfs_inode_to_log_dinode( > if (xfs_sb_version_has_v3inode(&ip->i_mount->m_sb)) { > to->di_version = 3; > to->di_changecount = inode_peek_iversion(inode); > - xfs_inode_to_log_dinode_ts(&to->di_crtime, &from->di_crtime); > + xfs_inode_to_log_dinode_ts(from, &to->di_crtime, &from->di_crtime); > to->di_flags2 = from->di_flags2; > to->di_cowextsize = from->di_cowextsize; > to->di_ino = ip->i_ino; > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c > index 6f22a66777cd..13396c3665d1 100644 > --- a/fs/xfs/xfs_ioctl.c > +++ b/fs/xfs/xfs_ioctl.c > @@ -1190,7 +1190,8 @@ xfs_flags2diflags2( > unsigned int xflags) > { > uint64_t di_flags2 = > - (ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK); > + (ip->i_d.di_flags2 & (XFS_DIFLAG2_REFLINK | > + XFS_DIFLAG2_BIGTIME)); > > if (xflags & FS_XFLAG_DAX) > di_flags2 |= XFS_DIFLAG2_DAX; > diff --git a/fs/xfs/xfs_ondisk.h b/fs/xfs/xfs_ondisk.h > index 7158a8de719f..3e0c677cff15 100644 > --- a/fs/xfs/xfs_ondisk.h > +++ b/fs/xfs/xfs_ondisk.h > @@ -25,6 +25,9 @@ xfs_check_limits(void) > /* make sure timestamp limits are correct */ > XFS_CHECK_VALUE(XFS_INO_TIME_MIN, -2147483648LL); > XFS_CHECK_VALUE(XFS_INO_TIME_MAX, 2147483647LL); > + XFS_CHECK_VALUE(XFS_INO_BIGTIME_EPOCH, 2147483648LL); > + XFS_CHECK_VALUE(XFS_INO_BIGTIME_MIN, -2147483648LL); That still just doesn't look right to me :/ This implies that the epoch is 2^32 seconds after then minimum supported time (2038), when in fact it is only 2^31 seconds after the minimum supported timestamp (1970). :/ > + XFS_CHECK_VALUE(XFS_INO_BIGTIME_MAX, 16299260425LL); Hmmm. I got 16299260424 when I just ran this through a simple calc. Mind you, no calculator app I found could handle unsigned 64 bit values natively (signed 64 bit is good enough for everyone!) so maybe I got an off-by one here... Cheers, Dave.
On Sat, Aug 22, 2020 at 08:33:19AM +0100, Christoph Hellwig wrote: > > * in the AGI header so that we can skip the finobt walk at mount time when > > @@ -855,12 +862,18 @@ struct xfs_agfl { > > * > > * Inode timestamps consist of signed 32-bit counters for seconds and > > * nanoseconds; time zero is the Unix epoch, Jan 1 00:00:00 UTC 1970. > > + * > > + * When bigtime is enabled, timestamps become an unsigned 64-bit nanoseconds > > + * counter. Time zero is the start of the classic timestamp range. > > */ > > union xfs_timestamp { > > struct { > > __be32 t_sec; /* timestamp seconds */ > > __be32 t_nsec; /* timestamp nanoseconds */ > > }; > > + > > + /* Nanoseconds since the bigtime epoch. */ > > + __be64 t_bigtime; > > }; > > So do we really need the union here? What about: > > (1) keep the typedef instead of removing it > (2) switch the typedef to be just a __be64, and use trivial helpers > to extract the two separate legacy sec/nsec field > (3) PROFIT!!! Been there, done that. Dave suggested some replacement code (which corrupted the values), then I modified that into a correct version, which then made smatch angry because it doesn't like code that does bit shifts on __be64 values. > > +/* Convert an ondisk timestamp into the 64-bit safe incore format. */ > > void > > xfs_inode_from_disk_timestamp( > > + struct xfs_dinode *dip, > > struct timespec64 *tv, > > const union xfs_timestamp *ts) > > I think passing ts by value might lead to somewhat better code > generation on modern ABIs (and older ABIs just fall back to pass > by reference transparently). Hm, ok. I did not know that. :) > > { > > + if (dip->di_version >= 3 && > > + (dip->di_flags2 & cpu_to_be64(XFS_DIFLAG2_BIGTIME))) { > > Do we want a helper for this condition? Yes, yes we do. Will add. > > + uint64_t t = be64_to_cpu(ts->t_bigtime); > > + uint64_t s; > > + uint32_t n; > > + > > + s = div_u64_rem(t, NSEC_PER_SEC, &n); > > + tv->tv_sec = s - XFS_INO_BIGTIME_EPOCH; > > + tv->tv_nsec = n; > > + return; > > + } > > + > > tv->tv_sec = (int)be32_to_cpu(ts->t_sec); > > tv->tv_nsec = (int)be32_to_cpu(ts->t_nsec); > > Nit: for these kinds of symmetric conditions and if/else feels a little > more natural. > > > + xfs_log_dinode_to_disk_ts(from, &to->di_crtime, &from->di_crtime); > > This adds a > 80 char line. Do we care now that checkpatch has been changed to allow up to 100 columns? > > + if (from->di_flags2 & XFS_DIFLAG2_BIGTIME) { > > + uint64_t t; > > + > > + t = (uint64_t)(ts->tv_sec + XFS_INO_BIGTIME_EPOCH); > > + t *= NSEC_PER_SEC; > > + its->t_bigtime = t + ts->tv_nsec; > > This calculation is dupliated in two places, might be worth > adding a little helper (which will need to get the sec/nsec values > passed separately due to the different structures). > > > + xfs_inode_to_log_dinode_ts(from, &to->di_crtime, &from->di_crtime); > > Another line over 8 characters here. > > > + if (xfs_sb_version_hasbigtime(&mp->m_sb)) { > > + sb->s_time_min = XFS_INO_BIGTIME_MIN; > > + sb->s_time_max = XFS_INO_BIGTIME_MAX; > > + } else { > > + sb->s_time_min = XFS_INO_TIME_MIN; > > + sb->s_time_max = XFS_INO_TIME_MAX; > > + } > > This is really a comment on the earlier patch, but maybe we should > name the old constants with "OLD" or "LEGACY" or "SMALL" in the name? Yes, good suggestion! > > @@ -1494,6 +1499,10 @@ xfs_fc_fill_super( > > if (XFS_SB_VERSION_NUM(&mp->m_sb) == XFS_SB_VERSION_5) > > sb->s_flags |= SB_I_VERSION; > > > > + if (xfs_sb_version_hasbigtime(&mp->m_sb)) > > + xfs_warn(mp, > > + "EXPERIMENTAL big timestamp feature in use. Use at your own risk!"); > > + > > Is there any good reason to mark this experimental? As you and Dave have both pointed out, there are plenty of stupid bugs still in this. I think I'd like to have at least one EXPERIMENTAL cycle to make sure I didn't commit anything pathologically stupid in here. <cough> ext4 34-bit sign extension bug <cough>. --D
On Mon, Aug 25, 2020 at 11:25:27AM +1000, Dave Chinner wrote: > On Thu, Aug 20, 2020 at 07:12:21PM -0700, Darrick J. Wong wrote: > > From: Darrick J. Wong <darrick.wong@oracle.com> > > > > Redesign the ondisk timestamps to be a simple unsigned 64-bit counter of > > nanoseconds since 14 Dec 1901 (i.e. the minimum time in the 32-bit unix > > time epoch). This enables us to handle dates up to 2486, which solves > > the y2038 problem. > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > Reviewed-by: Amir Goldstein <amir73il@gmail.com> > > .... > > > @@ -875,6 +888,25 @@ union xfs_timestamp { > > */ > > #define XFS_INO_TIME_MAX ((int64_t)S32_MAX) > > > > +/* > > + * Number of seconds between the start of the bigtime timestamp range and the > > + * start of the Unix epoch. > > + */ > > +#define XFS_INO_BIGTIME_EPOCH (-XFS_INO_TIME_MIN) > > This is confusing. It's taken me 15 minutes so far to get my head > around this because the reference frame for all these definitions is > not clear. I though these had something to do with nanosecond > timestamp limits because that's what BIGTIME records, but..... > > The start of the epoch is a negative number based on the definition > of the on-disk format for the minimum number of seconds that the > "Unix" timestamp format can store? Why is this not defined in > nanoseconds given that is what is stored on disk? > > XFS_INO_BIGTIME_EPOCH = (-XFS_INO_TIME_MIN) > = (-((int64_t)S32_MIN)) > = (-((int64_t)-2^31)) > = 2^31? > > So the bigtime epoch is considered to be 2^31 *seconds* into the > range of the on-disk nanosecond timestamp? Huh? They're the incore limits, not the ondisk limits. Prior to bigtime, the ondisk timestamp epoch was the Unix epoch. This isn't the case anymore in bigtime (bigtime's epoch is Dec. 1901, aka the minimum timestamp under the old scheme), so that misnamed XFS_INO_BIGTIME_EPOCH value is the conversion factor between epochs. (I'll come back to this at the bottom.) > > + > > +/* > > + * Smallest possible timestamp with big timestamps, which is > > + * Dec 13 20:45:52 UTC 1901. > > + */ > > +#define XFS_INO_BIGTIME_MIN (XFS_INO_TIME_MIN) > > Same here. The reference here is "seconds before the Unix epoch", > not the minimum valid on disk nanosecond value. > > Oh, these are defining the post-disk-to-in-memory-format conversion > limits? They have nothing to do with on-disk limits nor that on-disk > format is stored in nanosecond? i.e. the reference frame for these > limits is actually still the in-kernel Unix epoch definition? Yes. The on-disk limits are 0 and -1ULL with an epoch of Dec 1901, and these XFS_INO_BIGTIME_{MIN,MAX} constants are the incore limits, which of course have to be relative to the Unix epoch. > > +/* > > + * Largest possible timestamp with big timestamps, which is > > + * Jul 2 20:20:25 UTC 2486. > > + */ > > +#define XFS_INO_BIGTIME_MAX ((int64_t)((-1ULL / NSEC_PER_SEC) - \ > > + XFS_INO_BIGTIME_EPOCH)) > > Urk. This makes my head -hurt-. > > It's converting the maximum on-disk format nanosecond count to a > maximum second count then taking away the second count for the epoch > and then casting it to a signed int because the in-memory format for > seconds is signed? Oh, and the 64 bit division is via constants, so > the compiler does it and doesn't need to use something like > div_u64(), right? Right. > Mind you, I'm just guessing here that the "-1ULL" is the > representation of the maximum on-disk nanosecond timestamp here, > because that doesn't seem to be defined anywhere.... Yes. Clearly I shouldn't have taken that shortcut. > > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c > > index cc1316a5fe0c..c59ddb56bb90 100644 > > --- a/fs/xfs/libxfs/xfs_inode_buf.c > > +++ b/fs/xfs/libxfs/xfs_inode_buf.c > > @@ -157,11 +157,25 @@ xfs_imap_to_bp( > > return 0; > > } > > > > +/* Convert an ondisk timestamp into the 64-bit safe incore format. */ > > void > > xfs_inode_from_disk_timestamp( > > + struct xfs_dinode *dip, > > struct timespec64 *tv, > > const union xfs_timestamp *ts) > > { > > + if (dip->di_version >= 3 && > > + (dip->di_flags2 & cpu_to_be64(XFS_DIFLAG2_BIGTIME))) { > > Helper. xfs_dinode_has_bigtime() was what I suggested previously. > > > + uint64_t t = be64_to_cpu(ts->t_bigtime); > > + uint64_t s; > > + uint32_t n; > > + > > + s = div_u64_rem(t, NSEC_PER_SEC, &n); > > + tv->tv_sec = s - XFS_INO_BIGTIME_EPOCH; > > + tv->tv_nsec = n; > > + return; > > + } > > + > > tv->tv_sec = (int)be32_to_cpu(ts->t_sec); > > tv->tv_nsec = (int)be32_to_cpu(ts->t_nsec); > > } > > I still don't really like the way this turned out :( I'll think about this further and hope that hch comes up with something that's both functional and doesn't piss off smatch/sparse. Note that I also don't have any big endian machines anymore, so I don't really have a good way to test this. powerpc32 and sparc are verrrrry dead now. > > @@ -220,9 +234,9 @@ xfs_inode_from_disk( > > * a time before epoch is converted to a time long after epoch > > * on 64 bit systems. > > */ > > - xfs_inode_from_disk_timestamp(&inode->i_atime, &from->di_atime); > > - xfs_inode_from_disk_timestamp(&inode->i_mtime, &from->di_mtime); > > - xfs_inode_from_disk_timestamp(&inode->i_ctime, &from->di_ctime); > > + xfs_inode_from_disk_timestamp(from, &inode->i_atime, &from->di_atime); > > + xfs_inode_from_disk_timestamp(from, &inode->i_mtime, &from->di_mtime); > > + xfs_inode_from_disk_timestamp(from, &inode->i_ctime, &from->di_ctime); > > > > to->di_size = be64_to_cpu(from->di_size); > > to->di_nblocks = be64_to_cpu(from->di_nblocks); > > @@ -235,9 +249,17 @@ xfs_inode_from_disk( > > if (xfs_sb_version_has_v3inode(&ip->i_mount->m_sb)) { > > inode_set_iversion_queried(inode, > > be64_to_cpu(from->di_changecount)); > > - xfs_inode_from_disk_timestamp(&to->di_crtime, &from->di_crtime); > > + xfs_inode_from_disk_timestamp(from, &to->di_crtime, > > + &from->di_crtime); > > to->di_flags2 = be64_to_cpu(from->di_flags2); > > to->di_cowextsize = be32_to_cpu(from->di_cowextsize); > > + /* > > + * Set the bigtime flag incore so that we automatically convert > > + * this inode's ondisk timestamps to bigtime format the next > > + * time we write the inode core to disk. > > + */ > > + if (xfs_sb_version_hasbigtime(&ip->i_mount->m_sb)) > > + to->di_flags2 |= XFS_DIFLAG2_BIGTIME; > > } > > iAs I said before, you can't set this flag here - it needs to be > done transactionally when the timestamp is next logged. ARRGH. I added code to xfs_trans_log_inode to do the conversion, and then forgot to remove this. I /swear/ I removed it, but there it still is. > > > @@ -259,9 +281,19 @@ xfs_inode_from_disk( > > > > void > > xfs_inode_to_disk_timestamp( > > + struct xfs_icdinode *from, > > union xfs_timestamp *ts, > > const struct timespec64 *tv) > > { > > + if (from->di_flags2 & XFS_DIFLAG2_BIGTIME) { > > Shouldn't this also check the inode version number like the dinode > decoder? Perhaps a helper like xfs_inode_has_bigtime(ip)? Yes, will add. > > + uint64_t t; > > + > > + t = (uint64_t)(tv->tv_sec + XFS_INO_BIGTIME_EPOCH); > > tv_sec can still overflow, right? > > t = (uint64_t)tv->tv_sec + XFS_INO_BIGTIME_EPOCH; It /shouldn't/ since we also set the superblock timestamp limits such that the VFS will never pass us an over-large value, but I guess we should be defensive here anyway. > > > + t *= NSEC_PER_SEC; > > + ts->t_bigtime = cpu_to_be64(t + tv->tv_nsec); > > + return; > > + } > > + > > ts->t_sec = cpu_to_be32(tv->tv_sec); > > ts->t_nsec = cpu_to_be32(tv->tv_nsec); > > } > > @@ -285,9 +317,9 @@ xfs_inode_to_disk( > > to->di_projid_hi = cpu_to_be16(from->di_projid >> 16); > > > > memset(to->di_pad, 0, sizeof(to->di_pad)); > > - xfs_inode_to_disk_timestamp(&to->di_atime, &inode->i_atime); > > - xfs_inode_to_disk_timestamp(&to->di_mtime, &inode->i_mtime); > > - xfs_inode_to_disk_timestamp(&to->di_ctime, &inode->i_ctime); > > + xfs_inode_to_disk_timestamp(from, &to->di_atime, &inode->i_atime); > > + xfs_inode_to_disk_timestamp(from, &to->di_mtime, &inode->i_mtime); > > + xfs_inode_to_disk_timestamp(from, &to->di_ctime, &inode->i_ctime); > > to->di_nlink = cpu_to_be32(inode->i_nlink); > > to->di_gen = cpu_to_be32(inode->i_generation); > > to->di_mode = cpu_to_be16(inode->i_mode); > > @@ -306,7 +338,8 @@ xfs_inode_to_disk( > > if (xfs_sb_version_has_v3inode(&ip->i_mount->m_sb)) { > > to->di_version = 3; > > to->di_changecount = cpu_to_be64(inode_peek_iversion(inode)); > > - xfs_inode_to_disk_timestamp(&to->di_crtime, &from->di_crtime); > > + xfs_inode_to_disk_timestamp(from, &to->di_crtime, > > + &from->di_crtime); > > to->di_flags2 = cpu_to_be64(from->di_flags2); > > to->di_cowextsize = cpu_to_be32(from->di_cowextsize); > > to->di_ino = cpu_to_be64(ip->i_ino); > > @@ -526,6 +559,11 @@ xfs_dinode_verify( > > if (fa) > > return fa; > > > > + /* bigtime iflag can only happen on bigtime filesystems */ > > + if ((flags2 & (XFS_DIFLAG2_BIGTIME)) && > > + !xfs_sb_version_hasbigtime(&mp->m_sb)) > > if (xfs_dinode_has_bigtime(dip) && > !xfs_sb_version_hasbigtime(&mp->m_sb)) <nod> > > > +void xfs_inode_to_disk_timestamp(struct xfs_icdinode *from, > > + union xfs_timestamp *ts, const struct timespec64 *tv); > > > > #endif /* __XFS_INODE_BUF_H__ */ > > diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h > > index 17c83d29998c..569721f7f9e5 100644 > > --- a/fs/xfs/libxfs/xfs_log_format.h > > +++ b/fs/xfs/libxfs/xfs_log_format.h > > @@ -373,6 +373,9 @@ union xfs_ictimestamp { > > int32_t t_sec; /* timestamp seconds */ > > int32_t t_nsec; /* timestamp nanoseconds */ > > }; > > + > > + /* Nanoseconds since the bigtime epoch. */ > > + uint64_t t_bigtime; > > }; > > Where are we using this again? Right now the timestamps are > converted directly into the VFS inode timestamp fields so we can get > rid of these incore timestamp fields. So shouldn't we be trying to > get rid of this structure rather than adding more functionality to > it? We would have to enlarge xfs_log_dinode to log a full timespec64-like entity. I understand that it's annoying to convert a vfs timestamp back into a u64 nanoseconds counter for the sake of the log, but doing so will add complexity to the log for absolutely zero gain because having 96 bits per timestamp in the log doesn't buy us anything. > > @@ -131,6 +131,17 @@ xfs_trans_log_inode( > > iversion_flags = XFS_ILOG_CORE; > > } > > > > + /* > > + * If we're updating the inode core or the timestamps and it's possible > > + * to upgrade this inode to bigtime format, do so now. > > + */ > > + if ((flags & (XFS_ILOG_CORE | XFS_ILOG_TIMESTAMP)) && > > + xfs_sb_version_hasbigtime(&tp->t_mountp->m_sb) && > > + !(ip->i_d.di_flags2 & XFS_DIFLAG2_BIGTIME)) { > > The latter two checks could be wrapped up into a helper named > something obvious like xfs_inode_need_bigtime(ip)? Ok. > > + ip->i_d.di_flags2 |= XFS_DIFLAG2_BIGTIME; > > + flags |= XFS_ILOG_CORE; > > + } > > + > > /* > > * Record the specific change for fdatasync optimisation. This allows > > * fdatasync to skip log forces for inodes that are only timestamp > > diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c > > index 9f036053fdb7..6f00309de2d4 100644 > > --- a/fs/xfs/scrub/inode.c > > +++ b/fs/xfs/scrub/inode.c > > @@ -190,6 +190,11 @@ xchk_inode_flags2( > > if ((flags2 & XFS_DIFLAG2_DAX) && (flags2 & XFS_DIFLAG2_REFLINK)) > > goto bad; > > > > + /* no bigtime iflag without the bigtime feature */ > > + if (!xfs_sb_version_hasbigtime(&mp->m_sb) && > > + (flags2 & XFS_DIFLAG2_BIGTIME)) > > Can we get all these open coded checks wrapped up into a single > helper? Ok. > > + xchk_dinode_nsec(sc, ino, dip, &dip->di_crtime); > > xchk_inode_flags2(sc, dip, ino, mode, flags, flags2); > > xchk_inode_cowextsize(sc, dip, ino, mode, flags, > > flags2); > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c > > index c06129cffba9..bbc309bc70c4 100644 > > --- a/fs/xfs/xfs_inode.c > > +++ b/fs/xfs/xfs_inode.c > > @@ -841,6 +841,8 @@ xfs_ialloc( > > if (xfs_sb_version_has_v3inode(&mp->m_sb)) { > > inode_set_iversion(inode, 1); > > ip->i_d.di_flags2 = 0; > > + if (xfs_sb_version_hasbigtime(&mp->m_sb)) > > + ip->i_d.di_flags2 |= XFS_DIFLAG2_BIGTIME; > > Rather than calculate the initial inode falgs on every allocation, > shouldn't we just have the defaults pre-calculated at mount time? Hm, yes. Add that to the inode geometry structure? > > ip->i_d.di_cowextsize = 0; > > ip->i_d.di_crtime = tv; > > } > > @@ -2717,7 +2719,11 @@ xfs_ifree( > > > > VFS_I(ip)->i_mode = 0; /* mark incore inode as free */ > > ip->i_d.di_flags = 0; > > - ip->i_d.di_flags2 = 0; > > + /* > > + * Preserve the bigtime flag so that di_ctime accurately stores the > > + * deletion time. > > + */ > > + ip->i_d.di_flags2 &= XFS_DIFLAG2_BIGTIME; > > Oh, that's a nasty wart. And here again? > > ip->i_d.di_forkoff = 0; /* mark the attr fork not in use */ > > ip->i_df.if_format = XFS_DINODE_FMT_EXTENTS; > > @@ -3503,6 +3509,13 @@ xfs_iflush( > > __func__, ip->i_ino, ip->i_d.di_forkoff, ip); > > goto flush_out; > > } > > + if (xfs_sb_version_hasbigtime(&mp->m_sb) && > > + !(ip->i_d.di_flags2 & XFS_DIFLAG2_BIGTIME)) { > > + xfs_alert_tag(mp, XFS_PTAG_IFLUSH, > > + "%s: bad inode %Lu, bigtime unset, ptr "PTR_FMT, > > + __func__, ip->i_ino, ip); > > + goto flush_out; > > + } > > Why is this a fatal corruption error? if the bigtime flag is not > set, we can still store and read the timestamp if it is within > the unix epoch range... Ugh, it's not. It's a leftover artifact from when I would just set the flag incore unconditionally, all of which should have been removed when I added the modifications to xfs_trans_log_inode, but clearly I forgot. > > > > /* > > * Inode item log recovery for v2 inodes are dependent on the > > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c > > index cec6d4d82bc4..c38af3eea48f 100644 > > --- a/fs/xfs/xfs_inode_item.c > > +++ b/fs/xfs/xfs_inode_item.c > > @@ -298,9 +298,16 @@ xfs_inode_item_format_attr_fork( > > /* Write a log_dinode timestamp into an ondisk inode timestamp. */ > > static inline void > > xfs_log_dinode_to_disk_ts( > > + struct xfs_log_dinode *from, > > union xfs_timestamp *ts, > > const union xfs_ictimestamp *its) > > { > > + if (from->di_version >= 3 && > > + (from->di_flags2 & XFS_DIFLAG2_BIGTIME)) { > > helper. > > > + ts->t_bigtime = cpu_to_be64(its->t_bigtime); > > + return; > > + } > > + > > ts->t_sec = cpu_to_be32(its->t_sec); > > ts->t_nsec = cpu_to_be32(its->t_nsec); > > } > > @@ -322,9 +329,9 @@ xfs_log_dinode_to_disk( > > to->di_projid_hi = cpu_to_be16(from->di_projid_hi); > > memcpy(to->di_pad, from->di_pad, sizeof(to->di_pad)); > > > > - xfs_log_dinode_to_disk_ts(&to->di_atime, &from->di_atime); > > - xfs_log_dinode_to_disk_ts(&to->di_mtime, &from->di_mtime); > > - xfs_log_dinode_to_disk_ts(&to->di_ctime, &from->di_ctime); > > + xfs_log_dinode_to_disk_ts(from, &to->di_atime, &from->di_atime); > > + xfs_log_dinode_to_disk_ts(from, &to->di_mtime, &from->di_mtime); > > + xfs_log_dinode_to_disk_ts(from, &to->di_ctime, &from->di_ctime); > > > > to->di_size = cpu_to_be64(from->di_size); > > to->di_nblocks = cpu_to_be64(from->di_nblocks); > > @@ -340,7 +347,7 @@ xfs_log_dinode_to_disk( > > > > if (from->di_version == 3) { > > to->di_changecount = cpu_to_be64(from->di_changecount); > > - xfs_log_dinode_to_disk_ts(&to->di_crtime, &from->di_crtime); > > + xfs_log_dinode_to_disk_ts(from, &to->di_crtime, &from->di_crtime); > > to->di_flags2 = cpu_to_be64(from->di_flags2); > > to->di_cowextsize = cpu_to_be32(from->di_cowextsize); > > to->di_ino = cpu_to_be64(from->di_ino); > > @@ -356,9 +363,19 @@ xfs_log_dinode_to_disk( > > /* Write an incore inode timestamp into a log_dinode timestamp. */ > > static inline void > > xfs_inode_to_log_dinode_ts( > > + struct xfs_icdinode *from, > > union xfs_ictimestamp *its, > > const struct timespec64 *ts) > > { > > + if (from->di_flags2 & XFS_DIFLAG2_BIGTIME) { > > + uint64_t t; > > + > > + t = (uint64_t)(ts->tv_sec + XFS_INO_BIGTIME_EPOCH); > > + t *= NSEC_PER_SEC; > > + its->t_bigtime = t + ts->tv_nsec; > > helper, > > > + return; > > + } > > + > > its->t_sec = ts->tv_sec; > > its->t_nsec = ts->tv_nsec; > > } > > @@ -381,9 +398,9 @@ xfs_inode_to_log_dinode( > > > > memset(to->di_pad, 0, sizeof(to->di_pad)); > > memset(to->di_pad3, 0, sizeof(to->di_pad3)); > > - xfs_inode_to_log_dinode_ts(&to->di_atime, &inode->i_atime); > > - xfs_inode_to_log_dinode_ts(&to->di_mtime, &inode->i_mtime); > > - xfs_inode_to_log_dinode_ts(&to->di_ctime, &inode->i_ctime); > > + xfs_inode_to_log_dinode_ts(from, &to->di_atime, &inode->i_atime); > > + xfs_inode_to_log_dinode_ts(from, &to->di_mtime, &inode->i_mtime); > > + xfs_inode_to_log_dinode_ts(from, &to->di_ctime, &inode->i_ctime); > > to->di_nlink = inode->i_nlink; > > to->di_gen = inode->i_generation; > > to->di_mode = inode->i_mode; > > @@ -405,7 +422,7 @@ xfs_inode_to_log_dinode( > > if (xfs_sb_version_has_v3inode(&ip->i_mount->m_sb)) { > > to->di_version = 3; > > to->di_changecount = inode_peek_iversion(inode); > > - xfs_inode_to_log_dinode_ts(&to->di_crtime, &from->di_crtime); > > + xfs_inode_to_log_dinode_ts(from, &to->di_crtime, &from->di_crtime); > > to->di_flags2 = from->di_flags2; > > to->di_cowextsize = from->di_cowextsize; > > to->di_ino = ip->i_ino; > > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c > > index 6f22a66777cd..13396c3665d1 100644 > > --- a/fs/xfs/xfs_ioctl.c > > +++ b/fs/xfs/xfs_ioctl.c > > @@ -1190,7 +1190,8 @@ xfs_flags2diflags2( > > unsigned int xflags) > > { > > uint64_t di_flags2 = > > - (ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK); > > + (ip->i_d.di_flags2 & (XFS_DIFLAG2_REFLINK | > > + XFS_DIFLAG2_BIGTIME)); > > > > if (xflags & FS_XFLAG_DAX) > > di_flags2 |= XFS_DIFLAG2_DAX; > > diff --git a/fs/xfs/xfs_ondisk.h b/fs/xfs/xfs_ondisk.h > > index 7158a8de719f..3e0c677cff15 100644 > > --- a/fs/xfs/xfs_ondisk.h > > +++ b/fs/xfs/xfs_ondisk.h > > @@ -25,6 +25,9 @@ xfs_check_limits(void) > > /* make sure timestamp limits are correct */ > > XFS_CHECK_VALUE(XFS_INO_TIME_MIN, -2147483648LL); > > XFS_CHECK_VALUE(XFS_INO_TIME_MAX, 2147483647LL); > > + XFS_CHECK_VALUE(XFS_INO_BIGTIME_EPOCH, 2147483648LL); > > + XFS_CHECK_VALUE(XFS_INO_BIGTIME_MIN, -2147483648LL); > > That still just doesn't look right to me :/ > > This implies that the epoch is 2^32 seconds after then minimum > supported time (2038), when in fact it is only 2^31 seconds after the > minimum supported timestamp (1970). :/ Ok, so XFS_INO_UNIX_BIGTIME_MIN is -2147483648, to signify that the smallest bigtime timestamp is (still) December 1901. That thing currently known as XFS_INO_BIGTIME_EPOCH should probably get renamed to something less confusing, like... /* * Since the bigtime epoch is Dec. 1901, add this number of seconds to * an ondisk bigtime timestamp to convert it to the Unix epoch. */ #define XFS_BIGTIME_TO_UNIX (-XFS_INO_UNIX_BIGTIME_MIN) /* * Subtract this many seconds from a Unix epoch timestamp to get the * ondisk bigtime timestamp. */ #define XFS_UNIX_TO_BIGTIME (-XFS_BIGTIME_TO_UNIX) Is that clearer? > > + XFS_CHECK_VALUE(XFS_INO_BIGTIME_MAX, 16299260425LL); > > Hmmm. I got 16299260424 when I just ran this through a simple calc. > Mind you, no calculator app I found could handle unsigned 64 bit > values natively (signed 64 bit is good enough for everyone!) so > maybe I got an off-by one here... -1ULL = 18,446,744,073,709,551,615 -1ULL / NSEC_PER_SEC = 18,446,744,073 (-1ULL / NSEC_PER_SEC) - XFS_INO_BIGTIME_EPOCH = 16,299,260,425 Assuming you accept XFS_INO_BIGTIME_EPOCH being 2^31. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Sun, Aug 23, 2020 at 08:13:54PM -0700, Darrick J. Wong wrote: > On Mon, Aug 25, 2020 at 11:25:27AM +1000, Dave Chinner wrote: > > On Thu, Aug 20, 2020 at 07:12:21PM -0700, Darrick J. Wong wrote: > > > From: Darrick J. Wong <darrick.wong@oracle.com> > > > > > > Redesign the ondisk timestamps to be a simple unsigned 64-bit counter of > > > nanoseconds since 14 Dec 1901 (i.e. the minimum time in the 32-bit unix > > > time epoch). This enables us to handle dates up to 2486, which solves > > > the y2038 problem. > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > > Reviewed-by: Amir Goldstein <amir73il@gmail.com> > > > > .... > > > > > @@ -875,6 +888,25 @@ union xfs_timestamp { > > > */ > > > #define XFS_INO_TIME_MAX ((int64_t)S32_MAX) > > > > > > +/* > > > + * Number of seconds between the start of the bigtime timestamp range and the > > > + * start of the Unix epoch. > > > + */ > > > +#define XFS_INO_BIGTIME_EPOCH (-XFS_INO_TIME_MIN) > > > > This is confusing. It's taken me 15 minutes so far to get my head > > around this because the reference frame for all these definitions is > > not clear. I though these had something to do with nanosecond > > timestamp limits because that's what BIGTIME records, but..... > > > > The start of the epoch is a negative number based on the definition > > of the on-disk format for the minimum number of seconds that the > > "Unix" timestamp format can store? Why is this not defined in > > nanoseconds given that is what is stored on disk? > > > > XFS_INO_BIGTIME_EPOCH = (-XFS_INO_TIME_MIN) > > = (-((int64_t)S32_MIN)) > > = (-((int64_t)-2^31)) > > = 2^31? > > > > So the bigtime epoch is considered to be 2^31 *seconds* into the > > range of the on-disk nanosecond timestamp? Huh? > > They're the incore limits, not the ondisk limits. > > Prior to bigtime, the ondisk timestamp epoch was the Unix epoch. This > isn't the case anymore in bigtime (bigtime's epoch is Dec. 1901, aka the > minimum timestamp under the old scheme), so that misnamed > XFS_INO_BIGTIME_EPOCH value is the conversion factor between epochs. > > (I'll come back to this at the bottom.) Ok, I'll come back to that at the bottom :) > > > + uint64_t t = be64_to_cpu(ts->t_bigtime); > > > + uint64_t s; > > > + uint32_t n; > > > + > > > + s = div_u64_rem(t, NSEC_PER_SEC, &n); > > > + tv->tv_sec = s - XFS_INO_BIGTIME_EPOCH; > > > + tv->tv_nsec = n; > > > + return; > > > + } > > > + > > > tv->tv_sec = (int)be32_to_cpu(ts->t_sec); > > > tv->tv_nsec = (int)be32_to_cpu(ts->t_nsec); > > > } > > > > I still don't really like the way this turned out :( > > I'll think about this further and hope that hch comes up with something > that's both functional and doesn't piss off smatch/sparse. Note that I > also don't have any big endian machines anymore, so I don't really have > a good way to test this. powerpc32 and sparc are verrrrry dead now. I'm not sure that anyone has current BE machines to test on.... > > > +void xfs_inode_to_disk_timestamp(struct xfs_icdinode *from, > > > + union xfs_timestamp *ts, const struct timespec64 *tv); > > > > > > #endif /* __XFS_INODE_BUF_H__ */ > > > diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h > > > index 17c83d29998c..569721f7f9e5 100644 > > > --- a/fs/xfs/libxfs/xfs_log_format.h > > > +++ b/fs/xfs/libxfs/xfs_log_format.h > > > @@ -373,6 +373,9 @@ union xfs_ictimestamp { > > > int32_t t_sec; /* timestamp seconds */ > > > int32_t t_nsec; /* timestamp nanoseconds */ > > > }; > > > + > > > + /* Nanoseconds since the bigtime epoch. */ > > > + uint64_t t_bigtime; > > > }; > > > > Where are we using this again? Right now the timestamps are > > converted directly into the VFS inode timestamp fields so we can get > > rid of these incore timestamp fields. So shouldn't we be trying to > > get rid of this structure rather than adding more functionality to > > it? > > We would have to enlarge xfs_log_dinode to log a full timespec64-like > entity. I understand that it's annoying to convert a vfs timestamp > back into a u64 nanoseconds counter for the sake of the log, but doing > so will add complexity to the log for absolutely zero gain because > having 96 bits per timestamp in the log doesn't buy us anything. Sure, I understand that we only need to log a 64bit value, but we don't actually need a structure for that as the log is in native endian format. Hence it can just be a 64 bit field that we mask and shift for !bigtime inodes... Note that we have to be real careful about dynamic conversion, especially in recovery, as the inode read from disk might be in small time format, but logged and recovered in bigtime format. I didn't actually check the recovery code does that correctly, because it only just occurred to me that the logged timestamp format may not match the inode flags read from disk during recovery... > > > --- a/fs/xfs/xfs_inode.c > > > +++ b/fs/xfs/xfs_inode.c > > > @@ -841,6 +841,8 @@ xfs_ialloc( > > > if (xfs_sb_version_has_v3inode(&mp->m_sb)) { > > > inode_set_iversion(inode, 1); > > > ip->i_d.di_flags2 = 0; > > > + if (xfs_sb_version_hasbigtime(&mp->m_sb)) > > > + ip->i_d.di_flags2 |= XFS_DIFLAG2_BIGTIME; > > > > Rather than calculate the initial inode falgs on every allocation, > > shouldn't we just have the defaults pre-calculated at mount time? > > Hm, yes. Add that to the inode geometry structure? Sounds like a reasonable place to me. > > > ip->i_d.di_cowextsize = 0; > > > ip->i_d.di_crtime = tv; > > > } > > > @@ -2717,7 +2719,11 @@ xfs_ifree( > > > > > > VFS_I(ip)->i_mode = 0; /* mark incore inode as free */ > > > ip->i_d.di_flags = 0; > > > - ip->i_d.di_flags2 = 0; > > > + /* > > > + * Preserve the bigtime flag so that di_ctime accurately stores the > > > + * deletion time. > > > + */ > > > + ip->i_d.di_flags2 &= XFS_DIFLAG2_BIGTIME; > > > > Oh, that's a nasty wart. > > And here again? *nod*. Good idea - we will have logged the inode core and converted it in-core to bigtime by this point... > > > diff --git a/fs/xfs/xfs_ondisk.h b/fs/xfs/xfs_ondisk.h > > > index 7158a8de719f..3e0c677cff15 100644 > > > --- a/fs/xfs/xfs_ondisk.h > > > +++ b/fs/xfs/xfs_ondisk.h > > > @@ -25,6 +25,9 @@ xfs_check_limits(void) > > > /* make sure timestamp limits are correct */ > > > XFS_CHECK_VALUE(XFS_INO_TIME_MIN, -2147483648LL); > > > XFS_CHECK_VALUE(XFS_INO_TIME_MAX, 2147483647LL); > > > + XFS_CHECK_VALUE(XFS_INO_BIGTIME_EPOCH, 2147483648LL); > > > + XFS_CHECK_VALUE(XFS_INO_BIGTIME_MIN, -2147483648LL); > > > > That still just doesn't look right to me :/ > > > > This implies that the epoch is 2^32 seconds after then minimum > > supported time (2038), when in fact it is only 2^31 seconds after the > > minimum supported timestamp (1970). :/ > > Ok, so XFS_INO_UNIX_BIGTIME_MIN is -2147483648, to signify that the > smallest bigtime timestamp is (still) December 1901. Let's drop the "ino" from the name - it's unnecessary, I think. > That thing currently known as XFS_INO_BIGTIME_EPOCH should probably get > renamed to something less confusing, like... > > /* > * Since the bigtime epoch is Dec. 1901, add this number of seconds to > * an ondisk bigtime timestamp to convert it to the Unix epoch. > */ > #define XFS_BIGTIME_TO_UNIX (-XFS_INO_UNIX_BIGTIME_MIN) > > /* > * Subtract this many seconds from a Unix epoch timestamp to get the > * ondisk bigtime timestamp. > */ > #define XFS_UNIX_TO_BIGTIME (-XFS_BIGTIME_TO_UNIX) > > Is that clearer? Hmmm. Definitely better, but how about: /* * Bigtime epoch is set exactly to the minimum time value that a * traditional 32 bit timestamp can represent when using the Unix * epoch as a reference. Hence the Unix epoch is at a fixed offset * into the supported bigtime timestamp range. * * The bigtime epoch also matches the minimum value an on-disk 32 * bit XFS timestamp can represent so we will not lose any fidelity * in converting to/from unix and bigtime timestamps. */ #define XFS_BIGTIME_EPOCH_OFFSET (XFS_INO_TIME_MIN) And then two static inline helpers follow immediately - xfs_bigtime_to_unix() and xfs_bigtime_from_unix() can do the conversion between the two formats and the XFS_BIGTIME_EPOCH_OFFSET variable never gets seen anywhere else in the code. To set the max timestamp value the superblock holds for the filesystem, just calculate it directly via a call to xfs_bigtime_to_unix(-1ULL, ...) > > Hmmm. I got 16299260424 when I just ran this through a simple calc. > > Mind you, no calculator app I found could handle unsigned 64 bit > > values natively (signed 64 bit is good enough for everyone!) so > > maybe I got an off-by one here... > > -1ULL = 18,446,744,073,709,551,615 > -1ULL / NSEC_PER_SEC = 18,446,744,073 > (-1ULL / NSEC_PER_SEC) - XFS_INO_BIGTIME_EPOCH = 16,299,260,425 Yup, I got an off by one thanks to integer rounding on the division. I should have just done it long hand like that... Cheers, Dave.
On Mon, Aug 24, 2020 at 04:15:31PM +1000, Dave Chinner wrote: > On Sun, Aug 23, 2020 at 08:13:54PM -0700, Darrick J. Wong wrote: > > On Mon, Aug 25, 2020 at 11:25:27AM +1000, Dave Chinner wrote: > > > On Thu, Aug 20, 2020 at 07:12:21PM -0700, Darrick J. Wong wrote: > > > > From: Darrick J. Wong <darrick.wong@oracle.com> > > > > > > > > Redesign the ondisk timestamps to be a simple unsigned 64-bit counter of > > > > nanoseconds since 14 Dec 1901 (i.e. the minimum time in the 32-bit unix > > > > time epoch). This enables us to handle dates up to 2486, which solves > > > > the y2038 problem. > > > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > > > Reviewed-by: Amir Goldstein <amir73il@gmail.com> > > > > > > .... > > > > > > > @@ -875,6 +888,25 @@ union xfs_timestamp { > > > > */ > > > > #define XFS_INO_TIME_MAX ((int64_t)S32_MAX) > > > > > > > > +/* > > > > + * Number of seconds between the start of the bigtime timestamp range and the > > > > + * start of the Unix epoch. > > > > + */ > > > > +#define XFS_INO_BIGTIME_EPOCH (-XFS_INO_TIME_MIN) > > > > > > This is confusing. It's taken me 15 minutes so far to get my head > > > around this because the reference frame for all these definitions is > > > not clear. I though these had something to do with nanosecond > > > timestamp limits because that's what BIGTIME records, but..... > > > > > > The start of the epoch is a negative number based on the definition > > > of the on-disk format for the minimum number of seconds that the > > > "Unix" timestamp format can store? Why is this not defined in > > > nanoseconds given that is what is stored on disk? > > > > > > XFS_INO_BIGTIME_EPOCH = (-XFS_INO_TIME_MIN) > > > = (-((int64_t)S32_MIN)) > > > = (-((int64_t)-2^31)) > > > = 2^31? > > > > > > So the bigtime epoch is considered to be 2^31 *seconds* into the > > > range of the on-disk nanosecond timestamp? Huh? > > > > They're the incore limits, not the ondisk limits. > > > > Prior to bigtime, the ondisk timestamp epoch was the Unix epoch. This > > isn't the case anymore in bigtime (bigtime's epoch is Dec. 1901, aka the > > minimum timestamp under the old scheme), so that misnamed > > XFS_INO_BIGTIME_EPOCH value is the conversion factor between epochs. > > > > (I'll come back to this at the bottom.) > > Ok, I'll come back to that at the bottom :) > > > > > + uint64_t t = be64_to_cpu(ts->t_bigtime); > > > > + uint64_t s; > > > > + uint32_t n; > > > > + > > > > + s = div_u64_rem(t, NSEC_PER_SEC, &n); > > > > + tv->tv_sec = s - XFS_INO_BIGTIME_EPOCH; > > > > + tv->tv_nsec = n; > > > > + return; > > > > + } > > > > + > > > > tv->tv_sec = (int)be32_to_cpu(ts->t_sec); > > > > tv->tv_nsec = (int)be32_to_cpu(ts->t_nsec); > > > > } > > > > > > I still don't really like the way this turned out :( > > > > I'll think about this further and hope that hch comes up with something > > that's both functional and doesn't piss off smatch/sparse. Note that I > > also don't have any big endian machines anymore, so I don't really have > > a good way to test this. powerpc32 and sparc are verrrrry dead now. > > I'm not sure that anyone has current BE machines to test on.... ...which makes me all the more nervous about replacing the timestamp union with open-coded bit shifting. We know the existing code does the conversions properly with the separate sec/nsec fields since that code has been around for a while. We can use BUILD_BUG_ON macros to ensure that inside the union, the bigtime nanoseconds counter is overlayed /exactly/ on top of the old structure. There's a feature flag within the ondisk structure, which means that reasoning about this code is no more difficult than any other tagged union. Flag == 0? Use the same old code from before. Flag == 1? Use the new code. I was about to say that I'll experiment with this as a new patch at the end of the series, but I guess converting xfs_timestamp back to a typedef is more churn and belongs at the start of the series... > > > > +void xfs_inode_to_disk_timestamp(struct xfs_icdinode *from, > > > > + union xfs_timestamp *ts, const struct timespec64 *tv); > > > > > > > > #endif /* __XFS_INODE_BUF_H__ */ > > > > diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h > > > > index 17c83d29998c..569721f7f9e5 100644 > > > > --- a/fs/xfs/libxfs/xfs_log_format.h > > > > +++ b/fs/xfs/libxfs/xfs_log_format.h > > > > @@ -373,6 +373,9 @@ union xfs_ictimestamp { > > > > int32_t t_sec; /* timestamp seconds */ > > > > int32_t t_nsec; /* timestamp nanoseconds */ > > > > }; > > > > + > > > > + /* Nanoseconds since the bigtime epoch. */ > > > > + uint64_t t_bigtime; > > > > }; > > > > > > Where are we using this again? Right now the timestamps are > > > converted directly into the VFS inode timestamp fields so we can get > > > rid of these incore timestamp fields. So shouldn't we be trying to > > > get rid of this structure rather than adding more functionality to > > > it? > > > > We would have to enlarge xfs_log_dinode to log a full timespec64-like > > entity. I understand that it's annoying to convert a vfs timestamp > > back into a u64 nanoseconds counter for the sake of the log, but doing > > so will add complexity to the log for absolutely zero gain because > > having 96 bits per timestamp in the log doesn't buy us anything. > > Sure, I understand that we only need to log a 64bit value, but we > don't actually need a structure for that as the log is in native > endian format. Hence it can just be a 64 bit field that we mask and > shift for !bigtime inodes... > > Note that we have to be real careful about dynamic conversion, > especially in recovery, as the inode read from disk might be in > small time format, but logged and recovered in bigtime format. I > didn't actually check the recovery code does that correctly, because > it only just occurred to me that the logged timestamp format may not > match the inode flags read from disk during recovery... Oh my, you're right, that xfs_log_dinode_to_disk_timestamp needs to be more careful to convert whatever we logged into something that is agnostic to disk format, and then convert it to whatever is the xfs_dinode format. I'll throw that on the fixme pile too. > > > > --- a/fs/xfs/xfs_inode.c > > > > +++ b/fs/xfs/xfs_inode.c > > > > @@ -841,6 +841,8 @@ xfs_ialloc( > > > > if (xfs_sb_version_has_v3inode(&mp->m_sb)) { > > > > inode_set_iversion(inode, 1); > > > > ip->i_d.di_flags2 = 0; > > > > + if (xfs_sb_version_hasbigtime(&mp->m_sb)) > > > > + ip->i_d.di_flags2 |= XFS_DIFLAG2_BIGTIME; > > > > > > Rather than calculate the initial inode falgs on every allocation, > > > shouldn't we just have the defaults pre-calculated at mount time? > > > > Hm, yes. Add that to the inode geometry structure? > > Sounds like a reasonable place to me. > > > > > ip->i_d.di_cowextsize = 0; > > > > ip->i_d.di_crtime = tv; > > > > } > > > > @@ -2717,7 +2719,11 @@ xfs_ifree( > > > > > > > > VFS_I(ip)->i_mode = 0; /* mark incore inode as free */ > > > > ip->i_d.di_flags = 0; > > > > - ip->i_d.di_flags2 = 0; > > > > + /* > > > > + * Preserve the bigtime flag so that di_ctime accurately stores the > > > > + * deletion time. > > > > + */ > > > > + ip->i_d.di_flags2 &= XFS_DIFLAG2_BIGTIME; > > > > > > Oh, that's a nasty wart. > > > > And here again? > > *nod*. Good idea - we will have logged the inode core and converted > it in-core to bigtime by this point... > > > > > diff --git a/fs/xfs/xfs_ondisk.h b/fs/xfs/xfs_ondisk.h > > > > index 7158a8de719f..3e0c677cff15 100644 > > > > --- a/fs/xfs/xfs_ondisk.h > > > > +++ b/fs/xfs/xfs_ondisk.h > > > > @@ -25,6 +25,9 @@ xfs_check_limits(void) > > > > /* make sure timestamp limits are correct */ > > > > XFS_CHECK_VALUE(XFS_INO_TIME_MIN, -2147483648LL); > > > > XFS_CHECK_VALUE(XFS_INO_TIME_MAX, 2147483647LL); > > > > + XFS_CHECK_VALUE(XFS_INO_BIGTIME_EPOCH, 2147483648LL); > > > > + XFS_CHECK_VALUE(XFS_INO_BIGTIME_MIN, -2147483648LL); > > > > > > That still just doesn't look right to me :/ > > > > > > This implies that the epoch is 2^32 seconds after then minimum > > > supported time (2038), when in fact it is only 2^31 seconds after the > > > minimum supported timestamp (1970). :/ > > > > Ok, so XFS_INO_UNIX_BIGTIME_MIN is -2147483648, to signify that the > > smallest bigtime timestamp is (still) December 1901. > > Let's drop the "ino" from the name - it's unnecessary, I think. Ok. > > That thing currently known as XFS_INO_BIGTIME_EPOCH should probably get > > renamed to something less confusing, like... > > > > /* > > * Since the bigtime epoch is Dec. 1901, add this number of seconds to > > * an ondisk bigtime timestamp to convert it to the Unix epoch. > > */ > > #define XFS_BIGTIME_TO_UNIX (-XFS_INO_UNIX_BIGTIME_MIN) > > > > /* > > * Subtract this many seconds from a Unix epoch timestamp to get the > > * ondisk bigtime timestamp. > > */ > > #define XFS_UNIX_TO_BIGTIME (-XFS_BIGTIME_TO_UNIX) > > > > Is that clearer? > > Hmmm. Definitely better, but how about: > > /* > * Bigtime epoch is set exactly to the minimum time value that a > * traditional 32 bit timestamp can represent when using the Unix > * epoch as a reference. Hence the Unix epoch is at a fixed offset > * into the supported bigtime timestamp range. > * > * The bigtime epoch also matches the minimum value an on-disk 32 > * bit XFS timestamp can represent so we will not lose any fidelity > * in converting to/from unix and bigtime timestamps. > */ > #define XFS_BIGTIME_EPOCH_OFFSET (XFS_INO_TIME_MIN) > > And then two static inline helpers follow immediately - > xfs_bigtime_to_unix() and xfs_bigtime_from_unix() can do the > conversion between the two formats and the XFS_BIGTIME_EPOCH_OFFSET > variable never gets seen anywhere else in the code. To set the max > timestamp value the superblock holds for the filesystem, just > calculate it directly via a call to xfs_bigtime_to_unix(-1ULL, ...) <nod> --D > > > Hmmm. I got 16299260424 when I just ran this through a simple calc. > > > Mind you, no calculator app I found could handle unsigned 64 bit > > > values natively (signed 64 bit is good enough for everyone!) so > > > maybe I got an off-by one here... > > > > -1ULL = 18,446,744,073,709,551,615 > > -1ULL / NSEC_PER_SEC = 18,446,744,073 > > (-1ULL / NSEC_PER_SEC) - XFS_INO_BIGTIME_EPOCH = 16,299,260,425 > > Yup, I got an off by one thanks to integer rounding on the > division. I should have just done it long hand like that... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Mon, Aug 24, 2020 at 09:24:20AM -0700, Darrick J. Wong wrote: > On Mon, Aug 24, 2020 at 04:15:31PM +1000, Dave Chinner wrote: > > On Sun, Aug 23, 2020 at 08:13:54PM -0700, Darrick J. Wong wrote: > > > On Mon, Aug 25, 2020 at 11:25:27AM +1000, Dave Chinner wrote: > > > > On Thu, Aug 20, 2020 at 07:12:21PM -0700, Darrick J. Wong wrote: > > > > > From: Darrick J. Wong <darrick.wong@oracle.com> > > > > > > > > > > Redesign the ondisk timestamps to be a simple unsigned 64-bit counter of > > > > > nanoseconds since 14 Dec 1901 (i.e. the minimum time in the 32-bit unix > > > > > time epoch). This enables us to handle dates up to 2486, which solves > > > > > the y2038 problem. > > > > > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > > > > Reviewed-by: Amir Goldstein <amir73il@gmail.com> > > > > > > > > .... > > > > > > > > > @@ -875,6 +888,25 @@ union xfs_timestamp { > > > > > */ > > > > > #define XFS_INO_TIME_MAX ((int64_t)S32_MAX) > > > > > > > > > > +/* > > > > > + * Number of seconds between the start of the bigtime timestamp range and the > > > > > + * start of the Unix epoch. > > > > > + */ > > > > > +#define XFS_INO_BIGTIME_EPOCH (-XFS_INO_TIME_MIN) > > > > > > > > This is confusing. It's taken me 15 minutes so far to get my head > > > > around this because the reference frame for all these definitions is > > > > not clear. I though these had something to do with nanosecond > > > > timestamp limits because that's what BIGTIME records, but..... > > > > > > > > The start of the epoch is a negative number based on the definition > > > > of the on-disk format for the minimum number of seconds that the > > > > "Unix" timestamp format can store? Why is this not defined in > > > > nanoseconds given that is what is stored on disk? > > > > > > > > XFS_INO_BIGTIME_EPOCH = (-XFS_INO_TIME_MIN) > > > > = (-((int64_t)S32_MIN)) > > > > = (-((int64_t)-2^31)) > > > > = 2^31? > > > > > > > > So the bigtime epoch is considered to be 2^31 *seconds* into the > > > > range of the on-disk nanosecond timestamp? Huh? > > > > > > They're the incore limits, not the ondisk limits. > > > > > > Prior to bigtime, the ondisk timestamp epoch was the Unix epoch. This > > > isn't the case anymore in bigtime (bigtime's epoch is Dec. 1901, aka the > > > minimum timestamp under the old scheme), so that misnamed > > > XFS_INO_BIGTIME_EPOCH value is the conversion factor between epochs. > > > > > > (I'll come back to this at the bottom.) > > > > Ok, I'll come back to that at the bottom :) > > > > > > > + uint64_t t = be64_to_cpu(ts->t_bigtime); > > > > > + uint64_t s; > > > > > + uint32_t n; > > > > > + > > > > > + s = div_u64_rem(t, NSEC_PER_SEC, &n); > > > > > + tv->tv_sec = s - XFS_INO_BIGTIME_EPOCH; > > > > > + tv->tv_nsec = n; > > > > > + return; > > > > > + } > > > > > + > > > > > tv->tv_sec = (int)be32_to_cpu(ts->t_sec); > > > > > tv->tv_nsec = (int)be32_to_cpu(ts->t_nsec); > > > > > } > > > > > > > > I still don't really like the way this turned out :( > > > > > > I'll think about this further and hope that hch comes up with something > > > that's both functional and doesn't piss off smatch/sparse. Note that I > > > also don't have any big endian machines anymore, so I don't really have > > > a good way to test this. powerpc32 and sparc are verrrrry dead now. > > > > I'm not sure that anyone has current BE machines to test on.... > > ...which makes me all the more nervous about replacing the timestamp > union with open-coded bit shifting. We know the existing code does the > conversions properly with the separate sec/nsec fields since that code > has been around for a while. We can use BUILD_BUG_ON macros to ensure > that inside the union, the bigtime nanoseconds counter is overlayed > /exactly/ on top of the old structure. There's a feature flag within > the ondisk structure, which means that reasoning about this code is no > more difficult than any other tagged union. > > Flag == 0? Use the same old code from before. > Flag == 1? Use the new code. > > I was about to say that I'll experiment with this as a new patch at the > end of the series, but I guess converting xfs_timestamp back to a > typedef is more churn and belongs at the start of the series... > > > > > > +void xfs_inode_to_disk_timestamp(struct xfs_icdinode *from, > > > > > + union xfs_timestamp *ts, const struct timespec64 *tv); > > > > > > > > > > #endif /* __XFS_INODE_BUF_H__ */ > > > > > diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h > > > > > index 17c83d29998c..569721f7f9e5 100644 > > > > > --- a/fs/xfs/libxfs/xfs_log_format.h > > > > > +++ b/fs/xfs/libxfs/xfs_log_format.h > > > > > @@ -373,6 +373,9 @@ union xfs_ictimestamp { > > > > > int32_t t_sec; /* timestamp seconds */ > > > > > int32_t t_nsec; /* timestamp nanoseconds */ > > > > > }; > > > > > + > > > > > + /* Nanoseconds since the bigtime epoch. */ > > > > > + uint64_t t_bigtime; > > > > > }; > > > > > > > > Where are we using this again? Right now the timestamps are > > > > converted directly into the VFS inode timestamp fields so we can get > > > > rid of these incore timestamp fields. So shouldn't we be trying to > > > > get rid of this structure rather than adding more functionality to > > > > it? > > > > > > We would have to enlarge xfs_log_dinode to log a full timespec64-like > > > entity. I understand that it's annoying to convert a vfs timestamp > > > back into a u64 nanoseconds counter for the sake of the log, but doing > > > so will add complexity to the log for absolutely zero gain because > > > having 96 bits per timestamp in the log doesn't buy us anything. > > > > Sure, I understand that we only need to log a 64bit value, but we > > don't actually need a structure for that as the log is in native > > endian format. Hence it can just be a 64 bit field that we mask and > > shift for !bigtime inodes... > > > > Note that we have to be real careful about dynamic conversion, > > especially in recovery, as the inode read from disk might be in > > small time format, but logged and recovered in bigtime format. I > > didn't actually check the recovery code does that correctly, because > > it only just occurred to me that the logged timestamp format may not > > match the inode flags read from disk during recovery... > > Oh my, you're right, that xfs_log_dinode_to_disk_timestamp needs to be > more careful to convert whatever we logged into something that is > agnostic to disk format, and then convert it to whatever is the > xfs_dinode format. On second thought, I think the inode logging code handles the timestamp format correctly as it is now written because we always log the inode core even if we're doing an XFS_ILOG_TIMESTAMP update, right? Which means that we always log the timestamps along with flags2 in xfs_log_dinode, and therefore there's no chance for inconsistency. --D > I'll throw that on the fixme pile too. > > > > > > --- a/fs/xfs/xfs_inode.c > > > > > +++ b/fs/xfs/xfs_inode.c > > > > > @@ -841,6 +841,8 @@ xfs_ialloc( > > > > > if (xfs_sb_version_has_v3inode(&mp->m_sb)) { > > > > > inode_set_iversion(inode, 1); > > > > > ip->i_d.di_flags2 = 0; > > > > > + if (xfs_sb_version_hasbigtime(&mp->m_sb)) > > > > > + ip->i_d.di_flags2 |= XFS_DIFLAG2_BIGTIME; > > > > > > > > Rather than calculate the initial inode falgs on every allocation, > > > > shouldn't we just have the defaults pre-calculated at mount time? > > > > > > Hm, yes. Add that to the inode geometry structure? > > > > Sounds like a reasonable place to me. > > > > > > > ip->i_d.di_cowextsize = 0; > > > > > ip->i_d.di_crtime = tv; > > > > > } > > > > > @@ -2717,7 +2719,11 @@ xfs_ifree( > > > > > > > > > > VFS_I(ip)->i_mode = 0; /* mark incore inode as free */ > > > > > ip->i_d.di_flags = 0; > > > > > - ip->i_d.di_flags2 = 0; > > > > > + /* > > > > > + * Preserve the bigtime flag so that di_ctime accurately stores the > > > > > + * deletion time. > > > > > + */ > > > > > + ip->i_d.di_flags2 &= XFS_DIFLAG2_BIGTIME; > > > > > > > > Oh, that's a nasty wart. > > > > > > And here again? > > > > *nod*. Good idea - we will have logged the inode core and converted > > it in-core to bigtime by this point... > > > > > > > diff --git a/fs/xfs/xfs_ondisk.h b/fs/xfs/xfs_ondisk.h > > > > > index 7158a8de719f..3e0c677cff15 100644 > > > > > --- a/fs/xfs/xfs_ondisk.h > > > > > +++ b/fs/xfs/xfs_ondisk.h > > > > > @@ -25,6 +25,9 @@ xfs_check_limits(void) > > > > > /* make sure timestamp limits are correct */ > > > > > XFS_CHECK_VALUE(XFS_INO_TIME_MIN, -2147483648LL); > > > > > XFS_CHECK_VALUE(XFS_INO_TIME_MAX, 2147483647LL); > > > > > + XFS_CHECK_VALUE(XFS_INO_BIGTIME_EPOCH, 2147483648LL); > > > > > + XFS_CHECK_VALUE(XFS_INO_BIGTIME_MIN, -2147483648LL); > > > > > > > > That still just doesn't look right to me :/ > > > > > > > > This implies that the epoch is 2^32 seconds after then minimum > > > > supported time (2038), when in fact it is only 2^31 seconds after the > > > > minimum supported timestamp (1970). :/ > > > > > > Ok, so XFS_INO_UNIX_BIGTIME_MIN is -2147483648, to signify that the > > > smallest bigtime timestamp is (still) December 1901. > > > > Let's drop the "ino" from the name - it's unnecessary, I think. > > Ok. > > > > That thing currently known as XFS_INO_BIGTIME_EPOCH should probably get > > > renamed to something less confusing, like... > > > > > > /* > > > * Since the bigtime epoch is Dec. 1901, add this number of seconds to > > > * an ondisk bigtime timestamp to convert it to the Unix epoch. > > > */ > > > #define XFS_BIGTIME_TO_UNIX (-XFS_INO_UNIX_BIGTIME_MIN) > > > > > > /* > > > * Subtract this many seconds from a Unix epoch timestamp to get the > > > * ondisk bigtime timestamp. > > > */ > > > #define XFS_UNIX_TO_BIGTIME (-XFS_BIGTIME_TO_UNIX) > > > > > > Is that clearer? > > > > Hmmm. Definitely better, but how about: > > > > /* > > * Bigtime epoch is set exactly to the minimum time value that a > > * traditional 32 bit timestamp can represent when using the Unix > > * epoch as a reference. Hence the Unix epoch is at a fixed offset > > * into the supported bigtime timestamp range. > > * > > * The bigtime epoch also matches the minimum value an on-disk 32 > > * bit XFS timestamp can represent so we will not lose any fidelity > > * in converting to/from unix and bigtime timestamps. > > */ > > #define XFS_BIGTIME_EPOCH_OFFSET (XFS_INO_TIME_MIN) > > > > And then two static inline helpers follow immediately - > > xfs_bigtime_to_unix() and xfs_bigtime_from_unix() can do the > > conversion between the two formats and the XFS_BIGTIME_EPOCH_OFFSET > > variable never gets seen anywhere else in the code. To set the max > > timestamp value the superblock holds for the filesystem, just > > calculate it directly via a call to xfs_bigtime_to_unix(-1ULL, ...) > > <nod> > > --D > > > > > Hmmm. I got 16299260424 when I just ran this through a simple calc. > > > > Mind you, no calculator app I found could handle unsigned 64 bit > > > > values natively (signed 64 bit is good enough for everyone!) so > > > > maybe I got an off-by one here... > > > > > > -1ULL = 18,446,744,073,709,551,615 > > > -1ULL / NSEC_PER_SEC = 18,446,744,073 > > > (-1ULL / NSEC_PER_SEC) - XFS_INO_BIGTIME_EPOCH = 16,299,260,425 > > > > Yup, I got an off by one thanks to integer rounding on the > > division. I should have just done it long hand like that... > > > > Cheers, > > > > Dave. > > -- > > Dave Chinner > > david@fromorbit.com
On Sun, Aug 23, 2020 at 07:43:41PM -0700, Darrick J. Wong wrote: > On Sat, Aug 22, 2020 at 08:33:19AM +0100, Christoph Hellwig wrote: > > > * in the AGI header so that we can skip the finobt walk at mount time when > > > @@ -855,12 +862,18 @@ struct xfs_agfl { > > > * > > > * Inode timestamps consist of signed 32-bit counters for seconds and > > > * nanoseconds; time zero is the Unix epoch, Jan 1 00:00:00 UTC 1970. > > > + * > > > + * When bigtime is enabled, timestamps become an unsigned 64-bit nanoseconds > > > + * counter. Time zero is the start of the classic timestamp range. > > > */ > > > union xfs_timestamp { > > > struct { > > > __be32 t_sec; /* timestamp seconds */ > > > __be32 t_nsec; /* timestamp nanoseconds */ > > > }; > > > + > > > + /* Nanoseconds since the bigtime epoch. */ > > > + __be64 t_bigtime; > > > }; > > > > So do we really need the union here? What about: > > > > (1) keep the typedef instead of removing it > > (2) switch the typedef to be just a __be64, and use trivial helpers > > to extract the two separate legacy sec/nsec field > > (3) PROFIT!!! > > Been there, done that. Dave suggested some replacement code (which > corrupted the values), then I modified that into a correct version, > which then made smatch angry because it doesn't like code that does bit > shifts on __be64 values. Backing up here, I've realized that my own analysis of Dave's pseudocode was incorrect. On a little endian machine, we'll start with the following. A is the LSB of seconds; D is the MSB of seconds; E is the LSB of nsec, and H is the MSB of nsec. sec nsec (incore) l m l m ABCD EFGH Now we encode that with an old kernel, which calls cpu_to_be32 to turn that into: sec nsec (ondisk) m l m l DCBA HGFE Move over to a new kernel, and that becomes: tstamp (ondisk) m l DCBAHGFE Next we decode with be64_to_cpu: tstamp (incore) l m EFGHABCD Now we extract nsec from (tstamp & -1U) and sec from (tstamp >> 32): sec nsec l m l m ABCD EFGH So yes, masking and shifting /after/ the endian conversion works just fine and doesn't throw any sparse/smatch errors. Now on a big endian machine: sec nsec (incore) m l m l DCBA HGFE Now we encode that with an old kernel, which calls cpu_to_be32 (a nop) to turn that into: sec nsec (ondisk) m l m l DCBA HGFE Move over to a new kernel, and that becomes: tstamp (ondisk) m l DCBAHGFE Next we decode with be64_to_cpu (a nop): tstamp (incore) m l DCBAHGFE Now we extract nsec from (tstamp & -1U) and sec from (tstamp >> 32): sec nsec m l m l DCBA HGFE Works fine here too. Now the /truly/ nasty case here is xfs_ictimestamp, since we log the inode core in host endian format. If we start with this the vfs timestamp on a new kernel: sec nsec (incore) l m l m ABCD EFGH We need to encode that as: tstamp (ondisk) l m ABCDEFGH The only way to do this is: (nsec << 32) | (sec & -1U). That makes the log timestamp encoding is the opposite of what we do for the ondisk inodes, because log formats don't use cpu_to_be64. At least for a big endian machine, log timestamp coding is easy: sec nsec (incore) m l m l DCBA HGFE We need to encode that as: tstamp (ondisk) m l DCBAHGFE And the only way to get there is (sec << 32) | (nsec & -1U), which is what the ondisk inode timestamp coding does. I still think this is grody, but at least now now I have a new fstest to make sure that log recovery doesn't trip over this. So, you were technically right and I was wrong. We'll see how you like the new stuff. ;) --D > > > +/* Convert an ondisk timestamp into the 64-bit safe incore format. */ > > > void > > > xfs_inode_from_disk_timestamp( > > > + struct xfs_dinode *dip, > > > struct timespec64 *tv, > > > const union xfs_timestamp *ts) > > > > I think passing ts by value might lead to somewhat better code > > generation on modern ABIs (and older ABIs just fall back to pass > > by reference transparently). > > Hm, ok. I did not know that. :) > > > > { > > > + if (dip->di_version >= 3 && > > > + (dip->di_flags2 & cpu_to_be64(XFS_DIFLAG2_BIGTIME))) { > > > > Do we want a helper for this condition? > > Yes, yes we do. Will add. > > > > + uint64_t t = be64_to_cpu(ts->t_bigtime); > > > + uint64_t s; > > > + uint32_t n; > > > + > > > + s = div_u64_rem(t, NSEC_PER_SEC, &n); > > > + tv->tv_sec = s - XFS_INO_BIGTIME_EPOCH; > > > + tv->tv_nsec = n; > > > + return; > > > + } > > > + > > > tv->tv_sec = (int)be32_to_cpu(ts->t_sec); > > > tv->tv_nsec = (int)be32_to_cpu(ts->t_nsec); > > > > Nit: for these kinds of symmetric conditions and if/else feels a little > > more natural. > > > > > + xfs_log_dinode_to_disk_ts(from, &to->di_crtime, &from->di_crtime); > > > > This adds a > 80 char line. > > Do we care now that checkpatch has been changed to allow up to 100 > columns? > > > > + if (from->di_flags2 & XFS_DIFLAG2_BIGTIME) { > > > + uint64_t t; > > > + > > > + t = (uint64_t)(ts->tv_sec + XFS_INO_BIGTIME_EPOCH); > > > + t *= NSEC_PER_SEC; > > > + its->t_bigtime = t + ts->tv_nsec; > > > > This calculation is dupliated in two places, might be worth > > adding a little helper (which will need to get the sec/nsec values > > passed separately due to the different structures). > > > > > + xfs_inode_to_log_dinode_ts(from, &to->di_crtime, &from->di_crtime); > > > > Another line over 8 characters here. > > > > > + if (xfs_sb_version_hasbigtime(&mp->m_sb)) { > > > + sb->s_time_min = XFS_INO_BIGTIME_MIN; > > > + sb->s_time_max = XFS_INO_BIGTIME_MAX; > > > + } else { > > > + sb->s_time_min = XFS_INO_TIME_MIN; > > > + sb->s_time_max = XFS_INO_TIME_MAX; > > > + } > > > > This is really a comment on the earlier patch, but maybe we should > > name the old constants with "OLD" or "LEGACY" or "SMALL" in the name? > > Yes, good suggestion! > > > > @@ -1494,6 +1499,10 @@ xfs_fc_fill_super( > > > if (XFS_SB_VERSION_NUM(&mp->m_sb) == XFS_SB_VERSION_5) > > > sb->s_flags |= SB_I_VERSION; > > > > > > + if (xfs_sb_version_hasbigtime(&mp->m_sb)) > > > + xfs_warn(mp, > > > + "EXPERIMENTAL big timestamp feature in use. Use at your own risk!"); > > > + > > > > Is there any good reason to mark this experimental? > > As you and Dave have both pointed out, there are plenty of stupid bugs > still in this. I think I'd like to have at least one EXPERIMENTAL cycle > to make sure I didn't commit anything pathologically stupid in here. > > <cough> ext4 34-bit sign extension bug <cough>. > > --D
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h index 772113db41aa..57343973e5e5 100644 --- a/fs/xfs/libxfs/xfs_format.h +++ b/fs/xfs/libxfs/xfs_format.h @@ -467,6 +467,7 @@ xfs_sb_has_ro_compat_feature( #define XFS_SB_FEAT_INCOMPAT_FTYPE (1 << 0) /* filetype in dirent */ #define XFS_SB_FEAT_INCOMPAT_SPINODES (1 << 1) /* sparse inode chunks */ #define XFS_SB_FEAT_INCOMPAT_META_UUID (1 << 2) /* metadata UUID */ +#define XFS_SB_FEAT_INCOMPAT_BIGTIME (1 << 3) /* large timestamps */ #define XFS_SB_FEAT_INCOMPAT_ALL \ (XFS_SB_FEAT_INCOMPAT_FTYPE| \ XFS_SB_FEAT_INCOMPAT_SPINODES| \ @@ -565,6 +566,12 @@ static inline bool xfs_sb_version_hasreflink(struct xfs_sb *sbp) (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_REFLINK); } +static inline bool xfs_sb_version_hasbigtime(struct xfs_sb *sbp) +{ + return XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5 && + (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_BIGTIME); +} + /* * Inode btree block counter. We record the number of inobt and finobt blocks * in the AGI header so that we can skip the finobt walk at mount time when @@ -855,12 +862,18 @@ struct xfs_agfl { * * Inode timestamps consist of signed 32-bit counters for seconds and * nanoseconds; time zero is the Unix epoch, Jan 1 00:00:00 UTC 1970. + * + * When bigtime is enabled, timestamps become an unsigned 64-bit nanoseconds + * counter. Time zero is the start of the classic timestamp range. */ union xfs_timestamp { struct { __be32 t_sec; /* timestamp seconds */ __be32 t_nsec; /* timestamp nanoseconds */ }; + + /* Nanoseconds since the bigtime epoch. */ + __be64 t_bigtime; }; /* @@ -875,6 +888,25 @@ union xfs_timestamp { */ #define XFS_INO_TIME_MAX ((int64_t)S32_MAX) +/* + * Number of seconds between the start of the bigtime timestamp range and the + * start of the Unix epoch. + */ +#define XFS_INO_BIGTIME_EPOCH (-XFS_INO_TIME_MIN) + +/* + * Smallest possible timestamp with big timestamps, which is + * Dec 13 20:45:52 UTC 1901. + */ +#define XFS_INO_BIGTIME_MIN (XFS_INO_TIME_MIN) + +/* + * Largest possible timestamp with big timestamps, which is + * Jul 2 20:20:25 UTC 2486. + */ +#define XFS_INO_BIGTIME_MAX ((int64_t)((-1ULL / NSEC_PER_SEC) - \ + XFS_INO_BIGTIME_EPOCH)) + /* * On-disk inode structure. * @@ -1100,12 +1132,16 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev) #define XFS_DIFLAG2_DAX_BIT 0 /* use DAX for this inode */ #define XFS_DIFLAG2_REFLINK_BIT 1 /* file's blocks may be shared */ #define XFS_DIFLAG2_COWEXTSIZE_BIT 2 /* copy on write extent size hint */ +#define XFS_DIFLAG2_BIGTIME_BIT 3 /* big timestamps */ + #define XFS_DIFLAG2_DAX (1 << XFS_DIFLAG2_DAX_BIT) #define XFS_DIFLAG2_REFLINK (1 << XFS_DIFLAG2_REFLINK_BIT) #define XFS_DIFLAG2_COWEXTSIZE (1 << XFS_DIFLAG2_COWEXTSIZE_BIT) +#define XFS_DIFLAG2_BIGTIME (1 << XFS_DIFLAG2_BIGTIME_BIT) #define XFS_DIFLAG2_ANY \ - (XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE) + (XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \ + XFS_DIFLAG2_BIGTIME) /* * Inode number format: diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h index 84bcffa87753..2a2e3cfd94f0 100644 --- a/fs/xfs/libxfs/xfs_fs.h +++ b/fs/xfs/libxfs/xfs_fs.h @@ -249,6 +249,7 @@ typedef struct xfs_fsop_resblks { #define XFS_FSOP_GEOM_FLAGS_SPINODES (1 << 18) /* sparse inode chunks */ #define XFS_FSOP_GEOM_FLAGS_RMAPBT (1 << 19) /* reverse mapping btree */ #define XFS_FSOP_GEOM_FLAGS_REFLINK (1 << 20) /* files can share blocks */ +#define XFS_FSOP_GEOM_FLAGS_BIGTIME (1 << 21) /* 64-bit nsec timestamps */ /* * Minimum and maximum sizes need for growth checks. diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c index cc1316a5fe0c..c59ddb56bb90 100644 --- a/fs/xfs/libxfs/xfs_inode_buf.c +++ b/fs/xfs/libxfs/xfs_inode_buf.c @@ -157,11 +157,25 @@ xfs_imap_to_bp( return 0; } +/* Convert an ondisk timestamp into the 64-bit safe incore format. */ void xfs_inode_from_disk_timestamp( + struct xfs_dinode *dip, struct timespec64 *tv, const union xfs_timestamp *ts) { + if (dip->di_version >= 3 && + (dip->di_flags2 & cpu_to_be64(XFS_DIFLAG2_BIGTIME))) { + uint64_t t = be64_to_cpu(ts->t_bigtime); + uint64_t s; + uint32_t n; + + s = div_u64_rem(t, NSEC_PER_SEC, &n); + tv->tv_sec = s - XFS_INO_BIGTIME_EPOCH; + tv->tv_nsec = n; + return; + } + tv->tv_sec = (int)be32_to_cpu(ts->t_sec); tv->tv_nsec = (int)be32_to_cpu(ts->t_nsec); } @@ -220,9 +234,9 @@ xfs_inode_from_disk( * a time before epoch is converted to a time long after epoch * on 64 bit systems. */ - xfs_inode_from_disk_timestamp(&inode->i_atime, &from->di_atime); - xfs_inode_from_disk_timestamp(&inode->i_mtime, &from->di_mtime); - xfs_inode_from_disk_timestamp(&inode->i_ctime, &from->di_ctime); + xfs_inode_from_disk_timestamp(from, &inode->i_atime, &from->di_atime); + xfs_inode_from_disk_timestamp(from, &inode->i_mtime, &from->di_mtime); + xfs_inode_from_disk_timestamp(from, &inode->i_ctime, &from->di_ctime); to->di_size = be64_to_cpu(from->di_size); to->di_nblocks = be64_to_cpu(from->di_nblocks); @@ -235,9 +249,17 @@ xfs_inode_from_disk( if (xfs_sb_version_has_v3inode(&ip->i_mount->m_sb)) { inode_set_iversion_queried(inode, be64_to_cpu(from->di_changecount)); - xfs_inode_from_disk_timestamp(&to->di_crtime, &from->di_crtime); + xfs_inode_from_disk_timestamp(from, &to->di_crtime, + &from->di_crtime); to->di_flags2 = be64_to_cpu(from->di_flags2); to->di_cowextsize = be32_to_cpu(from->di_cowextsize); + /* + * Set the bigtime flag incore so that we automatically convert + * this inode's ondisk timestamps to bigtime format the next + * time we write the inode core to disk. + */ + if (xfs_sb_version_hasbigtime(&ip->i_mount->m_sb)) + to->di_flags2 |= XFS_DIFLAG2_BIGTIME; } error = xfs_iformat_data_fork(ip, from); @@ -259,9 +281,19 @@ xfs_inode_from_disk( void xfs_inode_to_disk_timestamp( + struct xfs_icdinode *from, union xfs_timestamp *ts, const struct timespec64 *tv) { + if (from->di_flags2 & XFS_DIFLAG2_BIGTIME) { + uint64_t t; + + t = (uint64_t)(tv->tv_sec + XFS_INO_BIGTIME_EPOCH); + t *= NSEC_PER_SEC; + ts->t_bigtime = cpu_to_be64(t + tv->tv_nsec); + return; + } + ts->t_sec = cpu_to_be32(tv->tv_sec); ts->t_nsec = cpu_to_be32(tv->tv_nsec); } @@ -285,9 +317,9 @@ xfs_inode_to_disk( to->di_projid_hi = cpu_to_be16(from->di_projid >> 16); memset(to->di_pad, 0, sizeof(to->di_pad)); - xfs_inode_to_disk_timestamp(&to->di_atime, &inode->i_atime); - xfs_inode_to_disk_timestamp(&to->di_mtime, &inode->i_mtime); - xfs_inode_to_disk_timestamp(&to->di_ctime, &inode->i_ctime); + xfs_inode_to_disk_timestamp(from, &to->di_atime, &inode->i_atime); + xfs_inode_to_disk_timestamp(from, &to->di_mtime, &inode->i_mtime); + xfs_inode_to_disk_timestamp(from, &to->di_ctime, &inode->i_ctime); to->di_nlink = cpu_to_be32(inode->i_nlink); to->di_gen = cpu_to_be32(inode->i_generation); to->di_mode = cpu_to_be16(inode->i_mode); @@ -306,7 +338,8 @@ xfs_inode_to_disk( if (xfs_sb_version_has_v3inode(&ip->i_mount->m_sb)) { to->di_version = 3; to->di_changecount = cpu_to_be64(inode_peek_iversion(inode)); - xfs_inode_to_disk_timestamp(&to->di_crtime, &from->di_crtime); + xfs_inode_to_disk_timestamp(from, &to->di_crtime, + &from->di_crtime); to->di_flags2 = cpu_to_be64(from->di_flags2); to->di_cowextsize = cpu_to_be32(from->di_cowextsize); to->di_ino = cpu_to_be64(ip->i_ino); @@ -526,6 +559,11 @@ xfs_dinode_verify( if (fa) return fa; + /* bigtime iflag can only happen on bigtime filesystems */ + if ((flags2 & (XFS_DIFLAG2_BIGTIME)) && + !xfs_sb_version_hasbigtime(&mp->m_sb)) + return __this_address; + return NULL; } diff --git a/fs/xfs/libxfs/xfs_inode_buf.h b/fs/xfs/libxfs/xfs_inode_buf.h index f6160033fcbd..3b25e43eafa1 100644 --- a/fs/xfs/libxfs/xfs_inode_buf.h +++ b/fs/xfs/libxfs/xfs_inode_buf.h @@ -58,9 +58,9 @@ xfs_failaddr_t xfs_inode_validate_cowextsize(struct xfs_mount *mp, uint32_t cowextsize, uint16_t mode, uint16_t flags, uint64_t flags2); -void xfs_inode_from_disk_timestamp(struct timespec64 *tv, - const union xfs_timestamp *ts); -void xfs_inode_to_disk_timestamp(union xfs_timestamp *ts, - const struct timespec64 *tv); +void xfs_inode_from_disk_timestamp(struct xfs_dinode *dip, + struct timespec64 *tv, const union xfs_timestamp *ts); +void xfs_inode_to_disk_timestamp(struct xfs_icdinode *from, + union xfs_timestamp *ts, const struct timespec64 *tv); #endif /* __XFS_INODE_BUF_H__ */ diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h index 17c83d29998c..569721f7f9e5 100644 --- a/fs/xfs/libxfs/xfs_log_format.h +++ b/fs/xfs/libxfs/xfs_log_format.h @@ -373,6 +373,9 @@ union xfs_ictimestamp { int32_t t_sec; /* timestamp seconds */ int32_t t_nsec; /* timestamp nanoseconds */ }; + + /* Nanoseconds since the bigtime epoch. */ + uint64_t t_bigtime; }; /* diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c index ae9aaf1f34bf..d5d60cd1c2ea 100644 --- a/fs/xfs/libxfs/xfs_sb.c +++ b/fs/xfs/libxfs/xfs_sb.c @@ -1166,6 +1166,8 @@ xfs_fs_geometry( geo->flags |= XFS_FSOP_GEOM_FLAGS_RMAPBT; if (xfs_sb_version_hasreflink(sbp)) geo->flags |= XFS_FSOP_GEOM_FLAGS_REFLINK; + if (xfs_sb_version_hasbigtime(sbp)) + geo->flags |= XFS_FSOP_GEOM_FLAGS_BIGTIME; if (xfs_sb_version_hassector(sbp)) geo->logsectsize = sbp->sb_logsectsize; else diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c index e15129647e00..71aca4587715 100644 --- a/fs/xfs/libxfs/xfs_trans_inode.c +++ b/fs/xfs/libxfs/xfs_trans_inode.c @@ -131,6 +131,17 @@ xfs_trans_log_inode( iversion_flags = XFS_ILOG_CORE; } + /* + * If we're updating the inode core or the timestamps and it's possible + * to upgrade this inode to bigtime format, do so now. + */ + if ((flags & (XFS_ILOG_CORE | XFS_ILOG_TIMESTAMP)) && + xfs_sb_version_hasbigtime(&tp->t_mountp->m_sb) && + !(ip->i_d.di_flags2 & XFS_DIFLAG2_BIGTIME)) { + ip->i_d.di_flags2 |= XFS_DIFLAG2_BIGTIME; + flags |= XFS_ILOG_CORE; + } + /* * Record the specific change for fdatasync optimisation. This allows * fdatasync to skip log forces for inodes that are only timestamp diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c index 9f036053fdb7..6f00309de2d4 100644 --- a/fs/xfs/scrub/inode.c +++ b/fs/xfs/scrub/inode.c @@ -190,6 +190,11 @@ xchk_inode_flags2( if ((flags2 & XFS_DIFLAG2_DAX) && (flags2 & XFS_DIFLAG2_REFLINK)) goto bad; + /* no bigtime iflag without the bigtime feature */ + if (!xfs_sb_version_hasbigtime(&mp->m_sb) && + (flags2 & XFS_DIFLAG2_BIGTIME)) + goto bad; + return; bad: xchk_ino_set_corrupt(sc, ino); @@ -199,11 +204,12 @@ static inline void xchk_dinode_nsec( struct xfs_scrub *sc, xfs_ino_t ino, + struct xfs_dinode *dip, const union xfs_timestamp *ts) { struct timespec64 tv; - xfs_inode_from_disk_timestamp(&tv, ts); + xfs_inode_from_disk_timestamp(dip, &tv, ts); if (tv.tv_nsec < 0 || tv.tv_nsec >= NSEC_PER_SEC) xchk_ino_set_corrupt(sc, ino); } @@ -306,9 +312,9 @@ xchk_dinode( } /* di_[amc]time.nsec */ - xchk_dinode_nsec(sc, ino, &dip->di_atime); - xchk_dinode_nsec(sc, ino, &dip->di_mtime); - xchk_dinode_nsec(sc, ino, &dip->di_ctime); + xchk_dinode_nsec(sc, ino, dip, &dip->di_atime); + xchk_dinode_nsec(sc, ino, dip, &dip->di_mtime); + xchk_dinode_nsec(sc, ino, dip, &dip->di_ctime); /* * di_size. xfs_dinode_verify checks for things that screw up @@ -413,7 +419,7 @@ xchk_dinode( } if (dip->di_version >= 3) { - xchk_dinode_nsec(sc, ino, &dip->di_crtime); + xchk_dinode_nsec(sc, ino, dip, &dip->di_crtime); xchk_inode_flags2(sc, dip, ino, mode, flags, flags2); xchk_inode_cowextsize(sc, dip, ino, mode, flags, flags2); diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index c06129cffba9..bbc309bc70c4 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -841,6 +841,8 @@ xfs_ialloc( if (xfs_sb_version_has_v3inode(&mp->m_sb)) { inode_set_iversion(inode, 1); ip->i_d.di_flags2 = 0; + if (xfs_sb_version_hasbigtime(&mp->m_sb)) + ip->i_d.di_flags2 |= XFS_DIFLAG2_BIGTIME; ip->i_d.di_cowextsize = 0; ip->i_d.di_crtime = tv; } @@ -2717,7 +2719,11 @@ xfs_ifree( VFS_I(ip)->i_mode = 0; /* mark incore inode as free */ ip->i_d.di_flags = 0; - ip->i_d.di_flags2 = 0; + /* + * Preserve the bigtime flag so that di_ctime accurately stores the + * deletion time. + */ + ip->i_d.di_flags2 &= XFS_DIFLAG2_BIGTIME; ip->i_d.di_dmevmask = 0; ip->i_d.di_forkoff = 0; /* mark the attr fork not in use */ ip->i_df.if_format = XFS_DINODE_FMT_EXTENTS; @@ -3503,6 +3509,13 @@ xfs_iflush( __func__, ip->i_ino, ip->i_d.di_forkoff, ip); goto flush_out; } + if (xfs_sb_version_hasbigtime(&mp->m_sb) && + !(ip->i_d.di_flags2 & XFS_DIFLAG2_BIGTIME)) { + xfs_alert_tag(mp, XFS_PTAG_IFLUSH, + "%s: bad inode %Lu, bigtime unset, ptr "PTR_FMT, + __func__, ip->i_ino, ip); + goto flush_out; + } /* * Inode item log recovery for v2 inodes are dependent on the diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c index cec6d4d82bc4..c38af3eea48f 100644 --- a/fs/xfs/xfs_inode_item.c +++ b/fs/xfs/xfs_inode_item.c @@ -298,9 +298,16 @@ xfs_inode_item_format_attr_fork( /* Write a log_dinode timestamp into an ondisk inode timestamp. */ static inline void xfs_log_dinode_to_disk_ts( + struct xfs_log_dinode *from, union xfs_timestamp *ts, const union xfs_ictimestamp *its) { + if (from->di_version >= 3 && + (from->di_flags2 & XFS_DIFLAG2_BIGTIME)) { + ts->t_bigtime = cpu_to_be64(its->t_bigtime); + return; + } + ts->t_sec = cpu_to_be32(its->t_sec); ts->t_nsec = cpu_to_be32(its->t_nsec); } @@ -322,9 +329,9 @@ xfs_log_dinode_to_disk( to->di_projid_hi = cpu_to_be16(from->di_projid_hi); memcpy(to->di_pad, from->di_pad, sizeof(to->di_pad)); - xfs_log_dinode_to_disk_ts(&to->di_atime, &from->di_atime); - xfs_log_dinode_to_disk_ts(&to->di_mtime, &from->di_mtime); - xfs_log_dinode_to_disk_ts(&to->di_ctime, &from->di_ctime); + xfs_log_dinode_to_disk_ts(from, &to->di_atime, &from->di_atime); + xfs_log_dinode_to_disk_ts(from, &to->di_mtime, &from->di_mtime); + xfs_log_dinode_to_disk_ts(from, &to->di_ctime, &from->di_ctime); to->di_size = cpu_to_be64(from->di_size); to->di_nblocks = cpu_to_be64(from->di_nblocks); @@ -340,7 +347,7 @@ xfs_log_dinode_to_disk( if (from->di_version == 3) { to->di_changecount = cpu_to_be64(from->di_changecount); - xfs_log_dinode_to_disk_ts(&to->di_crtime, &from->di_crtime); + xfs_log_dinode_to_disk_ts(from, &to->di_crtime, &from->di_crtime); to->di_flags2 = cpu_to_be64(from->di_flags2); to->di_cowextsize = cpu_to_be32(from->di_cowextsize); to->di_ino = cpu_to_be64(from->di_ino); @@ -356,9 +363,19 @@ xfs_log_dinode_to_disk( /* Write an incore inode timestamp into a log_dinode timestamp. */ static inline void xfs_inode_to_log_dinode_ts( + struct xfs_icdinode *from, union xfs_ictimestamp *its, const struct timespec64 *ts) { + if (from->di_flags2 & XFS_DIFLAG2_BIGTIME) { + uint64_t t; + + t = (uint64_t)(ts->tv_sec + XFS_INO_BIGTIME_EPOCH); + t *= NSEC_PER_SEC; + its->t_bigtime = t + ts->tv_nsec; + return; + } + its->t_sec = ts->tv_sec; its->t_nsec = ts->tv_nsec; } @@ -381,9 +398,9 @@ xfs_inode_to_log_dinode( memset(to->di_pad, 0, sizeof(to->di_pad)); memset(to->di_pad3, 0, sizeof(to->di_pad3)); - xfs_inode_to_log_dinode_ts(&to->di_atime, &inode->i_atime); - xfs_inode_to_log_dinode_ts(&to->di_mtime, &inode->i_mtime); - xfs_inode_to_log_dinode_ts(&to->di_ctime, &inode->i_ctime); + xfs_inode_to_log_dinode_ts(from, &to->di_atime, &inode->i_atime); + xfs_inode_to_log_dinode_ts(from, &to->di_mtime, &inode->i_mtime); + xfs_inode_to_log_dinode_ts(from, &to->di_ctime, &inode->i_ctime); to->di_nlink = inode->i_nlink; to->di_gen = inode->i_generation; to->di_mode = inode->i_mode; @@ -405,7 +422,7 @@ xfs_inode_to_log_dinode( if (xfs_sb_version_has_v3inode(&ip->i_mount->m_sb)) { to->di_version = 3; to->di_changecount = inode_peek_iversion(inode); - xfs_inode_to_log_dinode_ts(&to->di_crtime, &from->di_crtime); + xfs_inode_to_log_dinode_ts(from, &to->di_crtime, &from->di_crtime); to->di_flags2 = from->di_flags2; to->di_cowextsize = from->di_cowextsize; to->di_ino = ip->i_ino; diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c index 6f22a66777cd..13396c3665d1 100644 --- a/fs/xfs/xfs_ioctl.c +++ b/fs/xfs/xfs_ioctl.c @@ -1190,7 +1190,8 @@ xfs_flags2diflags2( unsigned int xflags) { uint64_t di_flags2 = - (ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK); + (ip->i_d.di_flags2 & (XFS_DIFLAG2_REFLINK | + XFS_DIFLAG2_BIGTIME)); if (xflags & FS_XFLAG_DAX) di_flags2 |= XFS_DIFLAG2_DAX; diff --git a/fs/xfs/xfs_ondisk.h b/fs/xfs/xfs_ondisk.h index 7158a8de719f..3e0c677cff15 100644 --- a/fs/xfs/xfs_ondisk.h +++ b/fs/xfs/xfs_ondisk.h @@ -25,6 +25,9 @@ xfs_check_limits(void) /* make sure timestamp limits are correct */ XFS_CHECK_VALUE(XFS_INO_TIME_MIN, -2147483648LL); XFS_CHECK_VALUE(XFS_INO_TIME_MAX, 2147483647LL); + XFS_CHECK_VALUE(XFS_INO_BIGTIME_EPOCH, 2147483648LL); + XFS_CHECK_VALUE(XFS_INO_BIGTIME_MIN, -2147483648LL); + XFS_CHECK_VALUE(XFS_INO_BIGTIME_MAX, 16299260425LL); XFS_CHECK_VALUE(XFS_DQ_TIMEOUT_MIN, 1LL); XFS_CHECK_VALUE(XFS_DQ_TIMEOUT_MAX, 4294967295LL); XFS_CHECK_VALUE(XFS_DQ_GRACE_MIN, 0LL); @@ -58,6 +61,9 @@ xfs_check_ondisk_structs(void) XFS_CHECK_STRUCT_SIZE(struct xfs_rmap_key, 20); XFS_CHECK_STRUCT_SIZE(struct xfs_rmap_rec, 24); XFS_CHECK_STRUCT_SIZE(union xfs_timestamp, 8); + XFS_CHECK_OFFSET(union xfs_timestamp, t_bigtime, 0); + XFS_CHECK_OFFSET(union xfs_timestamp, t_sec, 0); + XFS_CHECK_OFFSET(union xfs_timestamp, t_nsec, 4); XFS_CHECK_STRUCT_SIZE(xfs_alloc_key_t, 8); XFS_CHECK_STRUCT_SIZE(xfs_alloc_ptr_t, 4); XFS_CHECK_STRUCT_SIZE(xfs_alloc_rec_t, 8); diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 375f05a47ba4..b19ae052f7a3 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1484,8 +1484,13 @@ xfs_fc_fill_super( sb->s_maxbytes = MAX_LFS_FILESIZE; sb->s_max_links = XFS_MAXLINK; sb->s_time_gran = 1; - sb->s_time_min = XFS_INO_TIME_MIN; - sb->s_time_max = XFS_INO_TIME_MAX; + if (xfs_sb_version_hasbigtime(&mp->m_sb)) { + sb->s_time_min = XFS_INO_BIGTIME_MIN; + sb->s_time_max = XFS_INO_BIGTIME_MAX; + } else { + sb->s_time_min = XFS_INO_TIME_MIN; + sb->s_time_max = XFS_INO_TIME_MAX; + } sb->s_iflags |= SB_I_CGROUPWB; set_posix_acl_flag(sb); @@ -1494,6 +1499,10 @@ xfs_fc_fill_super( if (XFS_SB_VERSION_NUM(&mp->m_sb) == XFS_SB_VERSION_5) sb->s_flags |= SB_I_VERSION; + if (xfs_sb_version_hasbigtime(&mp->m_sb)) + xfs_warn(mp, + "EXPERIMENTAL big timestamp feature in use. Use at your own risk!"); + if (mp->m_flags & XFS_MOUNT_DAX_ALWAYS) { bool rtdev_is_dax = false, datadev_is_dax;