
[2/2] xfs: don't dirty snapshot logs for unlinked inode recovery

Message ID 896a0202-aac8-e43f-7ea6-3718591e32aa@sandeen.net (mailing list archive)
State New, archived
Headers show

Commit Message

Eric Sandeen March 7, 2018, 11:33 p.m. UTC
Now that unlinked inode recovery is done outside of
log recovery, there is no need to dirty the log on
snapshots just to handle unlinked inodes.  This means
that readonly snapshots can be mounted without requiring
-o ro,norecovery to avoid the log replay that can't happen
on a readonly block device.

(unlinked inodes will just hang out in the agi buckets until
the next writable mount)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---



Comments

Darrick J. Wong March 24, 2018, 4:20 p.m. UTC | #1
On Wed, Mar 07, 2018 at 05:33:48PM -0600, Eric Sandeen wrote:
> Now that unlinked inode recovery is done outside of
> log recovery, there is no need to dirty the log on
> snapshots just to handle unlinked inodes.  This means
> that readonly snapshots can be mounted without requiring
> -o ro,norecovery to avoid the log replay that can't happen
> on a readonly block device.
> 
> (unlinked inodes will just hang out in the agi buckets until
> the next writable mount)

FWIW I put these two in a test kernel to see what would happen and
generic/311 failures popped up.  It looked like the _check_scratch_fs
found incorrect block counts on the snapshot(?)

--D

> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> ---
> 
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 93588ea..5669525 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1419,9 +1419,10 @@ struct proc_xfs_info {
>  
>  /*
>   * Second stage of a freeze. The data is already frozen so we only
> - * need to take care of the metadata. Once that's done sync the superblock
> - * to the log to dirty it in case of a crash while frozen. This ensures that we
> - * will recover the unlinked inode lists on the next mount.
> + * need to take care of the metadata.
> + * Any unlinked inode lists will remain at this point, and be recovered
> + * on the next writable mount if we crash while frozen, or create
> + * a snapshot from the frozen filesystem.
>   */
>  STATIC int
>  xfs_fs_freeze(
> @@ -1431,7 +1432,7 @@ struct proc_xfs_info {
>  
>  	xfs_save_resvblks(mp);
>  	xfs_quiesce_attr(mp);
> -	return xfs_sync_sb(mp, true);
> +	return 0;
>  }
>  
>  STATIC int
> 
Brian Foster March 26, 2018, 12:46 p.m. UTC | #2
On Sat, Mar 24, 2018 at 09:20:49AM -0700, Darrick J. Wong wrote:
> On Wed, Mar 07, 2018 at 05:33:48PM -0600, Eric Sandeen wrote:
> > Now that unlinked inode recovery is done outside of
> > log recovery, there is no need to dirty the log on
> > snapshots just to handle unlinked inodes.  This means
> > that readonly snapshots can be mounted without requiring
> > -o ro,norecovery to avoid the log replay that can't happen
> > on a readonly block device.
> > 
> > (unlinked inodes will just hang out in the agi buckets until
> > the next writable mount)
> 
> FWIW I put these two in a test kernel to see what would happen and
> generic/311 failures popped up.  It looked like the _check_scratch_fs
> found incorrect block counts on the snapshot(?)
> 

Interesting. Just a wild guess, but perhaps it has something to do with
lazy sb accounting..? I see we call xfs_initialize_perag_data() when
mounting an unclean fs.

Brian
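
For context, the mount-time check being referred to here looks roughly like
this (paraphrased from xfs_mountfs() of that era, not a verbatim quote).
With lazy sb accounting, mounting an unclean filesystem re-derives the
summary counters from the AG headers, which would hide stale superblock
counters on a frozen image; a clean log skips that resync and trusts
whatever the freeze wrote out:

	/*
	 * Only re-sum the superblock counters from the per-AG headers when
	 * the previous unmount was not clean (and the fs isn't mid-mkfs).
	 */
	if (xfs_sb_version_haslazysbcount(&mp->m_sb) &&
	    !XFS_LAST_UNMOUNT_WAS_CLEAN(mp) &&
	    !mp->m_sb.sb_inprogress) {
		error = xfs_initialize_perag_data(mp, mp->m_sb.sb_agcount);
		if (error)
			goto out_log_dealloc;
	}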

> --D
> 
> > Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> > ---
> > 
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index 93588ea..5669525 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -1419,9 +1419,10 @@ struct proc_xfs_info {
> >  
> >  /*
> >   * Second stage of a freeze. The data is already frozen so we only
> > - * need to take care of the metadata. Once that's done sync the superblock
> > - * to the log to dirty it in case of a crash while frozen. This ensures that we
> > - * will recover the unlinked inode lists on the next mount.
> > + * need to take care of the metadata.
> > + * Any unlinked inode lists will remain at this point, and be recovered
> > + * on the next writable mount if we crash while frozen, or create
> > + * a snapshot from the frozen filesystem.
> >   */
> >  STATIC int
> >  xfs_fs_freeze(
> > @@ -1431,7 +1432,7 @@ struct proc_xfs_info {
> >  
> >  	xfs_save_resvblks(mp);
> >  	xfs_quiesce_attr(mp);
> > -	return xfs_sync_sb(mp, true);
> > +	return 0;
> >  }
> >  
> >  STATIC int
> > 
Dave Chinner March 27, 2018, 9:17 p.m. UTC | #3
On Mon, Mar 26, 2018 at 08:46:49AM -0400, Brian Foster wrote:
> On Sat, Mar 24, 2018 at 09:20:49AM -0700, Darrick J. Wong wrote:
> > On Wed, Mar 07, 2018 at 05:33:48PM -0600, Eric Sandeen wrote:
> > > Now that unlinked inode recovery is done outside of
> > > log recovery, there is no need to dirty the log on
> > > snapshots just to handle unlinked inodes.  This means
> > > that readonly snapshots can be mounted without requiring
> > > -o ro,norecovery to avoid the log replay that can't happen
> > > on a readonly block device.
> > > 
> > > (unlinked inodes will just hang out in the agi buckets until
> > > the next writable mount)
> > 
> > FWIW I put these two in a test kernel to see what would happen and
> > generic/311 failures popped up.  It looked like the _check_scratch_fs
> > found incorrect block counts on the snapshot(?)
> > 
> 
> Interesting. Just a wild guess, but perhaps it has something to do with
> lazy sb accounting..? I see we call xfs_initialize_perag_data() when
> mounting an unclean fs.

The freeze path calls xfs_log_sbcount(), which should update the
superblock counters from the in-memory counters and write them to
disk.

If they are out, I'm guessing it's because the in-memory per-ag
reservations are not being returned to the global pool before the
in-memory counters are summed during a freeze....

Cheers,

Dave.
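
If that guess is right, the fix would be to hand the per-AG reservations back
to the global free block pool before the quiesce path sums and logs the
superblock counters. A minimal, untested sketch against the xfs_fs_freeze()
in this patch, assuming the existing xfs_fs_unreserve_ag_blocks() helper (the
unfreeze path would then need the matching xfs_fs_reserve_ag_blocks() call):

STATIC int
xfs_fs_freeze(
	struct super_block	*sb)
{
	struct xfs_mount	*mp = XFS_M(sb);
	int			error;

	/*
	 * Return the per-AG metadata reservations to fdblocks before
	 * xfs_quiesce_attr() -> xfs_log_sbcount() sums the in-memory
	 * counters and writes them to the on-disk superblock.
	 */
	error = xfs_fs_unreserve_ag_blocks(mp);
	if (error)
		return error;

	xfs_save_resvblks(mp);
	xfs_quiesce_attr(mp);
	return 0;
}

For the 2021 codebase this is essentially what the
xfs_fs_unreserve_ag_blocks()/xfs_fs_reserve_ag_blocks() calls in Gao Xiang's
follow-up below do.
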
Gao Xiang Feb. 23, 2021, 1:42 p.m. UTC | #4
Hi folks,

On Wed, Mar 28, 2018 at 08:17:28AM +1100, Dave Chinner wrote:
> On Mon, Mar 26, 2018 at 08:46:49AM -0400, Brian Foster wrote:
> > On Sat, Mar 24, 2018 at 09:20:49AM -0700, Darrick J. Wong wrote:
> > > On Wed, Mar 07, 2018 at 05:33:48PM -0600, Eric Sandeen wrote:
> > > > Now that unlinked inode recovery is done outside of
> > > > log recovery, there is no need to dirty the log on
> > > > snapshots just to handle unlinked inodes.  This means
> > > > that readonly snapshots can be mounted without requiring
> > > > -o ro,norecovery to avoid the log replay that can't happen
> > > > on a readonly block device.
> > > > 
> > > > (unlinked inodes will just hang out in the agi buckets until
> > > > the next writable mount)
> > > 
> > > FWIW I put these two in a test kernel to see what would happen and
> > > generic/311 failures popped up.  It looked like the _check_scratch_fs
> > > found incorrect block counts on the snapshot(?)
> > > 
> > 
> > Interesting. Just a wild guess, but perhaps it has something to do with
> > lazy sb accounting..? I see we call xfs_initialize_perag_data() when
> > mounting an unclean fs.
> 
> The freeze is calls xfs_log_sbcount() which should update the
> superblock counters from the in-memory counters and write them to
> disk.
> 
> If they are out, I'm guessing it's because the in-memory per-ag
> reservations are not being returned to the global pool before the
> in-memory counters are summed during a freeze....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

I spent some time tracking down this problem. I've made a quick
modification to handle the per-AG reservation and tested it with
generic/311; it seems fine. My current question is how fsfreeze'd
images like this (which now mount clean) work with old kernels that
don't have [PATCH 1/1]? I'm afraid orphan inodes won't be freed on
such old kernels.... Am I missing something?

Thanks,
Gao Xiang

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 06041834daa3..79d6d8858dcf 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -963,12 +963,17 @@ xfs_log_quiesce(
 	return xfs_log_cover(mp);
 }
 
-void
+int
 xfs_log_clean(
 	struct xfs_mount	*mp)
 {
-	xfs_log_quiesce(mp);
+	int ret;
+
+	ret = xfs_log_quiesce(mp);
+	if (ret)
+		return ret;
 	xfs_log_unmount_write(mp);
+	return 0;
 }
 
 /*
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 044e02cb8921..4061a219bfde 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -139,7 +139,7 @@ bool	xfs_log_item_in_current_chkpt(struct xfs_log_item *lip);
 
 void	xfs_log_work_queue(struct xfs_mount *mp);
 int	xfs_log_quiesce(struct xfs_mount *mp);
-void	xfs_log_clean(struct xfs_mount *mp);
+int	xfs_log_clean(struct xfs_mount *mp);
 bool	xfs_log_check_lsn(struct xfs_mount *, xfs_lsn_t);
 bool	xfs_log_in_recovery(struct xfs_mount *);
 
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 97f31308de03..3ef21f589d6b 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -3478,6 +3478,7 @@ xlog_recover_finish(
 						     : "internal");
 		log->l_flags &= ~XLOG_RECOVERY_NEEDED;
 	} else {
+		xlog_recover_process_iunlinks(log);
 		xfs_info(log->l_mp, "Ending clean mount");
 	}
 	return 0;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 21b1d034aca3..0db1e7e0e0c8 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -884,7 +884,7 @@ xfs_fs_freeze(
 {
 	struct xfs_mount	*mp = XFS_M(sb);
 	unsigned int		flags;
-	int			ret;
+	int			error;
 
 	/*
 	 * The filesystem is now frozen far enough that memory reclaim
@@ -893,10 +893,25 @@ xfs_fs_freeze(
 	 */
 	flags = memalloc_nofs_save();
 	xfs_blockgc_stop(mp);
+
+	/* Get rid of any leftover CoW reservations... */
+	error = xfs_blockgc_free_space(mp, NULL);
+	if (error) {
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+		return error;
+	}
+
+	/* Free the per-AG metadata reservation pool. */
+	error = xfs_fs_unreserve_ag_blocks(mp);
+	if (error) {
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+		return error;
+	}
+
 	xfs_save_resvblks(mp);
-	ret = xfs_log_quiesce(mp);
+	error = xfs_log_clean(mp);
 	memalloc_nofs_restore(flags);
-	return ret;
+	return error;
 }
 
 STATIC int
@@ -904,10 +919,26 @@ xfs_fs_unfreeze(
 	struct super_block	*sb)
 {
 	struct xfs_mount	*mp = XFS_M(sb);
+	int error;
 
 	xfs_restore_resvblks(mp);
 	xfs_log_work_queue(mp);
+
+	/* Recover any CoW blocks that never got remapped. */
+	error = xfs_reflink_recover_cow(mp);
+	if (error) {
+		xfs_err(mp,
+			"Error %d recovering leftover CoW allocations.", error);
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+		return error;
+	}
+
 	xfs_blockgc_start(mp);
+
+	/* Create the per-AG metadata reservation pool. */
+	error = xfs_fs_reserve_ag_blocks(mp);
+	if (error && error != -ENOSPC)
+		return error;
 	return 0;
 }
 
@@ -1440,7 +1471,6 @@ xfs_fs_fill_super(
 #endif
 	}
 
-	/* Filesystem claims it needs repair, so refuse the mount. */
 	if (xfs_sb_version_needsrepair(&mp->m_sb)) {
 		xfs_warn(mp, "Filesystem needs repair.  Please run xfs_repair.");
 		error = -EFSCORRUPTED;
Eric Sandeen Feb. 23, 2021, 2:40 p.m. UTC | #5
On 2/23/21 7:42 AM, Gao Xiang wrote:
> Hi folks,
> 
> On Wed, Mar 28, 2018 at 08:17:28AM +1100, Dave Chinner wrote:
>> On Mon, Mar 26, 2018 at 08:46:49AM -0400, Brian Foster wrote:
>>> On Sat, Mar 24, 2018 at 09:20:49AM -0700, Darrick J. Wong wrote:
>>>> On Wed, Mar 07, 2018 at 05:33:48PM -0600, Eric Sandeen wrote:
>>>>> Now that unlinked inode recovery is done outside of
>>>>> log recovery, there is no need to dirty the log on
>>>>> snapshots just to handle unlinked inodes.  This means
>>>>> that readonly snapshots can be mounted without requiring
>>>>> -o ro,norecovery to avoid the log replay that can't happen
>>>>> on a readonly block device.
>>>>>
>>>>> (unlinked inodes will just hang out in the agi buckets until
>>>>> the next writable mount)
>>>>
>>>> FWIW I put these two in a test kernel to see what would happen and
>>>> generic/311 failures popped up.  It looked like the _check_scratch_fs
>>>> found incorrect block counts on the snapshot(?)
>>>>
>>>
>>> Interesting. Just a wild guess, but perhaps it has something to do with
>>> lazy sb accounting..? I see we call xfs_initialize_perag_data() when
>>> mounting an unclean fs.
>>
>> The freeze is calls xfs_log_sbcount() which should update the
>> superblock counters from the in-memory counters and write them to
>> disk.
>>
>> If they are out, I'm guessing it's because the in-memory per-ag
>> reservations are not being returned to the global pool before the
>> in-memory counters are summed during a freeze....
>>
>> Cheers,
>>
>> Dave.
>> -- 
>> Dave Chinner
>> david@fromorbit.com
> 
> I spend some time on tracking this problem. I've made a quick
> modification with per-AG reservation and tested with generic/311
> it seems fine. My current question is that how such fsfreezed
> images (with clean mount) work with old kernels without [PATCH 1/1]?
> I'm afraid orphan inodes won't be freed with such old kernels....
> Am I missing something?

It's true, a snapshot created with these patches will not have its unlinked
inodes processed if mounted on an older kernel. I'm not sure how much of a
problem that is; the filesystem is not inconsistent, but some space is lost,
I guess. I'm not sure it's common to take a snapshot of a frozen filesystem on
one kernel and then move it back to an older kernel.  Maybe others have
thoughts on this.

-Eric
Gao Xiang Feb. 23, 2021, 3:03 p.m. UTC | #6
On Tue, Feb 23, 2021 at 08:40:56AM -0600, Eric Sandeen wrote:
> On 2/23/21 7:42 AM, Gao Xiang wrote:
> > Hi folks,
> > 
> > On Wed, Mar 28, 2018 at 08:17:28AM +1100, Dave Chinner wrote:
> >> On Mon, Mar 26, 2018 at 08:46:49AM -0400, Brian Foster wrote:
> >>> On Sat, Mar 24, 2018 at 09:20:49AM -0700, Darrick J. Wong wrote:
> >>>> On Wed, Mar 07, 2018 at 05:33:48PM -0600, Eric Sandeen wrote:
> >>>>> Now that unlinked inode recovery is done outside of
> >>>>> log recovery, there is no need to dirty the log on
> >>>>> snapshots just to handle unlinked inodes.  This means
> >>>>> that readonly snapshots can be mounted without requiring
> >>>>> -o ro,norecovery to avoid the log replay that can't happen
> >>>>> on a readonly block device.
> >>>>>
> >>>>> (unlinked inodes will just hang out in the agi buckets until
> >>>>> the next writable mount)
> >>>>
> >>>> FWIW I put these two in a test kernel to see what would happen and
> >>>> generic/311 failures popped up.  It looked like the _check_scratch_fs
> >>>> found incorrect block counts on the snapshot(?)
> >>>>
> >>>
> >>> Interesting. Just a wild guess, but perhaps it has something to do with
> >>> lazy sb accounting..? I see we call xfs_initialize_perag_data() when
> >>> mounting an unclean fs.
> >>
> >> The freeze is calls xfs_log_sbcount() which should update the
> >> superblock counters from the in-memory counters and write them to
> >> disk.
> >>
> >> If they are out, I'm guessing it's because the in-memory per-ag
> >> reservations are not being returned to the global pool before the
> >> in-memory counters are summed during a freeze....
> >>
> >> Cheers,
> >>
> >> Dave.
> >> -- 
> >> Dave Chinner
> >> david@fromorbit.com
> > 
> > I spend some time on tracking this problem. I've made a quick
> > modification with per-AG reservation and tested with generic/311
> > it seems fine. My current question is that how such fsfreezed
> > images (with clean mount) work with old kernels without [PATCH 1/1]?
> > I'm afraid orphan inodes won't be freed with such old kernels....
> > Am I missing something?
> 
> It's true, a snapshot created with these patches will not have their unlinked
> inodes processed if mounted on an older kernel. I'm not sure how much of a
> problem that is; the filesystem is not inconsistent, but some space is lost,
> I guess. I'm not sure it's common to take a snapshot of a frozen filesystem on
> one kernel and then move it back to an older kernel.  Maybe others have
> thoughts on this.

My current thought is to write a clean log on freeze only when there
are no unlinked inodes, but to leave the log dirty if any unlinked
inodes exist, as Brian mentioned before, and not handle that case (?).
I'd like to hear more comments about this as well.

Thanks,
Gao Xiang

> 
> -Eric
>
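
For illustration only, here is the shape of that conditional behaviour in
terms of the xfs_log_quiesce()/xfs_log_clean() split shown in the diff above
(and assuming the int-returning xfs_log_clean() from it). The
xfs_has_unlinked_inodes() helper is hypothetical, not an existing upstream
function; it just scans the AGI unlinked buckets:

/* Hypothetical helper: report whether any AGI unlinked bucket is populated. */
static bool
xfs_has_unlinked_inodes(
	struct xfs_mount	*mp)
{
	struct xfs_buf		*agibp;
	struct xfs_agi		*agi;
	xfs_agnumber_t		agno;
	int			bucket;

	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
		if (xfs_read_agi(mp, NULL, agno, &agibp))
			return true;	/* be conservative on error */
		agi = agibp->b_addr;
		for (bucket = 0; bucket < XFS_AGI_UNLINKED_BUCKETS; bucket++) {
			if (agi->agi_unlinked[bucket] !=
			    cpu_to_be32(NULLAGINO)) {
				xfs_buf_relse(agibp);
				return true;
			}
		}
		xfs_buf_relse(agibp);
	}
	return false;
}

/*
 * Then, in xfs_fs_freeze(): only write an unmount record when nothing is
 * left on the AGI unlinked lists, otherwise leave the log dirty so that
 * older kernels still run recovery and free the orphans.
 */
	if (xfs_has_unlinked_inodes(mp))
		error = xfs_log_quiesce(mp);
	else
		error = xfs_log_clean(mp);
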
Eric Sandeen Feb. 23, 2021, 3:46 p.m. UTC | #7
On 2/23/21 9:03 AM, Gao Xiang wrote:
> On Tue, Feb 23, 2021 at 08:40:56AM -0600, Eric Sandeen wrote:
>> On 2/23/21 7:42 AM, Gao Xiang wrote:
>>> Hi folks,
>>>
>>> On Wed, Mar 28, 2018 at 08:17:28AM +1100, Dave Chinner wrote:
>>>> On Mon, Mar 26, 2018 at 08:46:49AM -0400, Brian Foster wrote:
>>>>> On Sat, Mar 24, 2018 at 09:20:49AM -0700, Darrick J. Wong wrote:
>>>>>> On Wed, Mar 07, 2018 at 05:33:48PM -0600, Eric Sandeen wrote:
>>>>>>> Now that unlinked inode recovery is done outside of
>>>>>>> log recovery, there is no need to dirty the log on
>>>>>>> snapshots just to handle unlinked inodes.  This means
>>>>>>> that readonly snapshots can be mounted without requiring
>>>>>>> -o ro,norecovery to avoid the log replay that can't happen
>>>>>>> on a readonly block device.
>>>>>>>
>>>>>>> (unlinked inodes will just hang out in the agi buckets until
>>>>>>> the next writable mount)
>>>>>>
>>>>>> FWIW I put these two in a test kernel to see what would happen and
>>>>>> generic/311 failures popped up.  It looked like the _check_scratch_fs
>>>>>> found incorrect block counts on the snapshot(?)
>>>>>>
>>>>>
>>>>> Interesting. Just a wild guess, but perhaps it has something to do with
>>>>> lazy sb accounting..? I see we call xfs_initialize_perag_data() when
>>>>> mounting an unclean fs.
>>>>
>>>> The freeze is calls xfs_log_sbcount() which should update the
>>>> superblock counters from the in-memory counters and write them to
>>>> disk.
>>>>
>>>> If they are out, I'm guessing it's because the in-memory per-ag
>>>> reservations are not being returned to the global pool before the
>>>> in-memory counters are summed during a freeze....
>>>>
>>>> Cheers,
>>>>
>>>> Dave.
>>>> -- 
>>>> Dave Chinner
>>>> david@fromorbit.com
>>>
>>> I spend some time on tracking this problem. I've made a quick
>>> modification with per-AG reservation and tested with generic/311
>>> it seems fine. My current question is that how such fsfreezed
>>> images (with clean mount) work with old kernels without [PATCH 1/1]?
>>> I'm afraid orphan inodes won't be freed with such old kernels....
>>> Am I missing something?
>>
>> It's true, a snapshot created with these patches will not have their unlinked
>> inodes processed if mounted on an older kernel. I'm not sure how much of a
>> problem that is; the filesystem is not inconsistent, but some space is lost,
>> I guess. I'm not sure it's common to take a snapshot of a frozen filesystem on
>> one kernel and then move it back to an older kernel.  Maybe others have
>> thoughts on this.
> 
> My current thought might be only to write clean mount without
> unlinked inodes when freezing, but leave log dirty if any
> unlinked inodes exist as Brian mentioned before and don't
> handle such case (?). I'd like to hear more comments about
> this as well.

I don't know if I had made this comment before ;) but I feel like that's even
more "surprise" (as in: gets further from the principle of least surprise)
and TBH I would rather not have that somewhat unpredictable behavior.

I think I'd rather /always/ make a dirty log than sometimes do it, other
times not. It'd just be more confusion for the admin IMHO.

Thanks,
-Eric

> Thanks,
> Gao Xiang
> 
>>
>> -Eric
>>
>
Gao Xiang Feb. 23, 2021, 3:58 p.m. UTC | #8
On Tue, Feb 23, 2021 at 09:46:38AM -0600, Eric Sandeen wrote:
> 
> 
> On 2/23/21 9:03 AM, Gao Xiang wrote:
> > On Tue, Feb 23, 2021 at 08:40:56AM -0600, Eric Sandeen wrote:
> >> On 2/23/21 7:42 AM, Gao Xiang wrote:
> >>> Hi folks,
> >>>
> >>> On Wed, Mar 28, 2018 at 08:17:28AM +1100, Dave Chinner wrote:
> >>>> On Mon, Mar 26, 2018 at 08:46:49AM -0400, Brian Foster wrote:
> >>>>> On Sat, Mar 24, 2018 at 09:20:49AM -0700, Darrick J. Wong wrote:
> >>>>>> On Wed, Mar 07, 2018 at 05:33:48PM -0600, Eric Sandeen wrote:
> >>>>>>> Now that unlinked inode recovery is done outside of
> >>>>>>> log recovery, there is no need to dirty the log on
> >>>>>>> snapshots just to handle unlinked inodes.  This means
> >>>>>>> that readonly snapshots can be mounted without requiring
> >>>>>>> -o ro,norecovery to avoid the log replay that can't happen
> >>>>>>> on a readonly block device.
> >>>>>>>
> >>>>>>> (unlinked inodes will just hang out in the agi buckets until
> >>>>>>> the next writable mount)
> >>>>>>
> >>>>>> FWIW I put these two in a test kernel to see what would happen and
> >>>>>> generic/311 failures popped up.  It looked like the _check_scratch_fs
> >>>>>> found incorrect block counts on the snapshot(?)
> >>>>>>
> >>>>>
> >>>>> Interesting. Just a wild guess, but perhaps it has something to do with
> >>>>> lazy sb accounting..? I see we call xfs_initialize_perag_data() when
> >>>>> mounting an unclean fs.
> >>>>
> >>>> The freeze is calls xfs_log_sbcount() which should update the
> >>>> superblock counters from the in-memory counters and write them to
> >>>> disk.
> >>>>
> >>>> If they are out, I'm guessing it's because the in-memory per-ag
> >>>> reservations are not being returned to the global pool before the
> >>>> in-memory counters are summed during a freeze....
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Dave.
> >>>> -- 
> >>>> Dave Chinner
> >>>> david@fromorbit.com
> >>>
> >>> I spend some time on tracking this problem. I've made a quick
> >>> modification with per-AG reservation and tested with generic/311
> >>> it seems fine. My current question is that how such fsfreezed
> >>> images (with clean mount) work with old kernels without [PATCH 1/1]?
> >>> I'm afraid orphan inodes won't be freed with such old kernels....
> >>> Am I missing something?
> >>
> >> It's true, a snapshot created with these patches will not have their unlinked
> >> inodes processed if mounted on an older kernel. I'm not sure how much of a
> >> problem that is; the filesystem is not inconsistent, but some space is lost,
> >> I guess. I'm not sure it's common to take a snapshot of a frozen filesystem on
> >> one kernel and then move it back to an older kernel.  Maybe others have
> >> thoughts on this.
> > 
> > My current thought might be only to write clean mount without
> > unlinked inodes when freezing, but leave log dirty if any
> > unlinked inodes exist as Brian mentioned before and don't
> > handle such case (?). I'd like to hear more comments about
> > this as well.
> 
> I don't know if I had made this comment before ;) but I feel like that's even
> more "surprise" (as in: gets further from the principle of least surprise)
> and TBH I would rather not have that somewhat unpredictable behavior.
> 

Yeah, I saw that comment as well....

> I think I'd rather /always/ make a dirty log than sometimes do it, other
> times not. It'd just be more confusion for the admin IMHO.

Ok, the other alternative approaches I can think of aren't trivial
(e.g. some hack in log recovery, etc.)... Any ideas / thoughts about
this are welcome :) Thanks!

Thanks,
Gao Xiang

> 
> Thanks,
> -Eric
> 
> > Thanks,
> > Gao Xiang
> > 
> >>
> >> -Eric
> >>
> > 
>
Darrick J. Wong Feb. 23, 2021, 4:25 p.m. UTC | #9
On Tue, Feb 23, 2021 at 09:46:38AM -0600, Eric Sandeen wrote:
> 
> 
> On 2/23/21 9:03 AM, Gao Xiang wrote:
> > On Tue, Feb 23, 2021 at 08:40:56AM -0600, Eric Sandeen wrote:
> >> On 2/23/21 7:42 AM, Gao Xiang wrote:
> >>> Hi folks,
> >>>
> >>> On Wed, Mar 28, 2018 at 08:17:28AM +1100, Dave Chinner wrote:
> >>>> On Mon, Mar 26, 2018 at 08:46:49AM -0400, Brian Foster wrote:
> >>>>> On Sat, Mar 24, 2018 at 09:20:49AM -0700, Darrick J. Wong wrote:
> >>>>>> On Wed, Mar 07, 2018 at 05:33:48PM -0600, Eric Sandeen wrote:
> >>>>>>> Now that unlinked inode recovery is done outside of
> >>>>>>> log recovery, there is no need to dirty the log on
> >>>>>>> snapshots just to handle unlinked inodes.  This means
> >>>>>>> that readonly snapshots can be mounted without requiring
> >>>>>>> -o ro,norecovery to avoid the log replay that can't happen
> >>>>>>> on a readonly block device.
> >>>>>>>
> >>>>>>> (unlinked inodes will just hang out in the agi buckets until
> >>>>>>> the next writable mount)
> >>>>>>
> >>>>>> FWIW I put these two in a test kernel to see what would happen and
> >>>>>> generic/311 failures popped up.  It looked like the _check_scratch_fs
> >>>>>> found incorrect block counts on the snapshot(?)
> >>>>>>
> >>>>>
> >>>>> Interesting. Just a wild guess, but perhaps it has something to do with
> >>>>> lazy sb accounting..? I see we call xfs_initialize_perag_data() when
> >>>>> mounting an unclean fs.
> >>>>
> >>>> The freeze is calls xfs_log_sbcount() which should update the
> >>>> superblock counters from the in-memory counters and write them to
> >>>> disk.
> >>>>
> >>>> If they are out, I'm guessing it's because the in-memory per-ag
> >>>> reservations are not being returned to the global pool before the
> >>>> in-memory counters are summed during a freeze....
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Dave.
> >>>> -- 
> >>>> Dave Chinner
> >>>> david@fromorbit.com
> >>>
> >>> I spend some time on tracking this problem. I've made a quick
> >>> modification with per-AG reservation and tested with generic/311
> >>> it seems fine. My current question is that how such fsfreezed
> >>> images (with clean mount) work with old kernels without [PATCH 1/1]?
> >>> I'm afraid orphan inodes won't be freed with such old kernels....
> >>> Am I missing something?
> >>
> >> It's true, a snapshot created with these patches will not have their unlinked
> >> inodes processed if mounted on an older kernel. I'm not sure how much of a
> >> problem that is; the filesystem is not inconsistent, but some space is lost,
> >> I guess. I'm not sure it's common to take a snapshot of a frozen filesystem on
> >> one kernel and then move it back to an older kernel.  Maybe others have
> >> thoughts on this.

Yes, I know of cloudy image generation factories that use old versions
of RHEL to generate images that are then frozen and copied to a
deployment system without an unmount.  I don't understand why they
insist that unmount is "too slow" but freeze isn't, nor why they then
file bugs that their instance deploy process is unacceptably slow
because of log recovery.

> > My current thought might be only to write clean mount without
> > unlinked inodes when freezing, but leave log dirty if any
> > unlinked inodes exist as Brian mentioned before and don't
> > handle such case (?). I'd like to hear more comments about
> > this as well.
> 
> I don't know if I had made this comment before ;) but I feel like that's even
> more "surprise" (as in: gets further from the principle of least surprise)
> and TBH I would rather not have that somewhat unpredictable behavior.
> 
> I think I'd rather /always/ make a dirty log than sometimes do it, other
> times not. It'd just be more confusion for the admin IMHO.

...but the next time anyone wants to introduce a new in/rocompat feature
flag for something inode related, then you can disable the "leave a
dirty log on freeze if there are unlinked inodes" behavior.

--D

> 
> Thanks,
> -Eric
> 
> > Thanks,
> > Gao Xiang
> > 
> >>
> >> -Eric
> >>
> >

Patch

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 93588ea..5669525 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1419,9 +1419,10 @@  struct proc_xfs_info {
 
 /*
  * Second stage of a freeze. The data is already frozen so we only
- * need to take care of the metadata. Once that's done sync the superblock
- * to the log to dirty it in case of a crash while frozen. This ensures that we
- * will recover the unlinked inode lists on the next mount.
+ * need to take care of the metadata.
+ * Any unlinked inode lists will remain at this point, and be recovered
+ * on the next writable mount if we crash while frozen, or create
+ * a snapshot from the frozen filesystem.
  */
 STATIC int
 xfs_fs_freeze(
@@ -1431,7 +1432,7 @@  struct proc_xfs_info {
 
 	xfs_save_resvblks(mp);
 	xfs_quiesce_attr(mp);
-	return xfs_sync_sb(mp, true);
+	return 0;
 }
 
 STATIC int