diff mbox

[14/14] xfs: add growfs support for changing usable blocks

Message ID 20171026083322.20428-15-david@fromorbit.com (mailing list archive)
State New, archived
Headers show

Commit Message

Dave Chinner Oct. 26, 2017, 8:33 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

Now that we have persistent usable block counts, we need to be able
to change them. This allows us to control thin provisioned
filesystem space usage at the filesystem level, not the block device
level.

If the grow operation grows the usable space beyond the current
LBA size of the filesystem, then we also need to physically grow the
filesystem to match the new size of the underlying device. Hence
grow behaves like it always has, expect for the fact that it wont'
grow physically until usable space would exceed the LBA size.

Being able to modify usable space also allows us to shrink the
filesystem on thin devices as easily as growing it. We simply reduce
the usable space and the free space, and we're done. The user then
needs to run a fstrim pass to ensure all the unused space in the
filesystem LBA is marked as unused by the underlying device. No data
or metadata movement is required as the underlying LBA space has not
changed.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_shared.h |   1 +
 fs/xfs/xfs_fsops.c         | 106 +++++++++++++++++++++++++++++++++++++--------
 fs/xfs/xfs_trans.c         |  21 +++++++++
 fs/xfs/xfs_trans.h         |   1 +
 4 files changed, 111 insertions(+), 18 deletions(-)

Comments

Amir Goldstein Oct. 26, 2017, 11:30 a.m. UTC | #1
On Thu, Oct 26, 2017 at 11:33 AM, Dave Chinner <david@fromorbit.com> wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Now that we have persistent usable block counts, we need to be able
> to change them. This allows us to control thin provisioned
> filesystem space usage at the filesystem level, not the block device
> level.
>
> If the grow operation grows the usable space beyond the current
> LBA size of the filesystem, then we also need to physically grow the
> filesystem to match the new size of the underlying device. Hence
> grow behaves like it always has, expect for the fact that it wont'
> grow physically until usable space would exceed the LBA size.
>
> Being able to modify usable space also allows us to shrink the
> filesystem on thin devices as easily as growing it. We simply reduce
> the usable space and the free space, and we're done. The user then
> needs to run a fstrim pass to ensure all the unused space in the
> filesystem LBA is marked as unused by the underlying device. No data
> or metadata movement is required as the underlying LBA space has not
> changed.
>
> Signed-Off-By: Dave Chinner <dchinner@redhat.com>

With this change, behavior of userspace program that tried to shrink filesystem
size will change from -EINVAL to success for filesystems that were configured
to allow that. But unmodified userspace program may still be caught by surprise
from this success return code that was never excersized in the past.

I have also argued elsewhere that the fact that the request to shrink the
"virtual" size vs. physical size is implicit and not explicit, that would hinder
future attempts to use the API as it was intended to implement physical shrink.

Suggestion:
Let userspace opt-in for the new "virtual grow" API by using the 3 upper
bytes in (struct xfs_growfs_data){.imaxpct}.
Those byes are guarantied to be zeroed by old application due to value
range check in current code, so there is plenty of room to add flags byte
and use it to request to grow USABLE_DBLOCK explicitly.

All the logic in your code stays the same (i.e. grow physical to accomodate
for growing virtual) only we stir away from being called by old apps by
mistake.

Cheers,
Amir.
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Oct. 26, 2017, 12:48 p.m. UTC | #2
On Thu, Oct 26, 2017 at 02:30:22PM +0300, Amir Goldstein wrote:
> On Thu, Oct 26, 2017 at 11:33 AM, Dave Chinner <david@fromorbit.com> wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Now that we have persistent usable block counts, we need to be able
> > to change them. This allows us to control thin provisioned
> > filesystem space usage at the filesystem level, not the block device
> > level.
> >
> > If the grow operation grows the usable space beyond the current
> > LBA size of the filesystem, then we also need to physically grow the
> > filesystem to match the new size of the underlying device. Hence
> > grow behaves like it always has, expect for the fact that it wont'
> > grow physically until usable space would exceed the LBA size.
> >
> > Being able to modify usable space also allows us to shrink the
> > filesystem on thin devices as easily as growing it. We simply reduce
> > the usable space and the free space, and we're done. The user then
> > needs to run a fstrim pass to ensure all the unused space in the
> > filesystem LBA is marked as unused by the underlying device. No data
> > or metadata movement is required as the underlying LBA space has not
> > changed.
> >
> > Signed-Off-By: Dave Chinner <dchinner@redhat.com>
> 
> With this change, behavior of userspace program that tried to shrink filesystem
> size will change from -EINVAL to success for filesystems that were configured
> to allow that. But unmodified userspace program may still be caught by surprise
> from this success return code that was never excersized in the past.

What userspace program would be trying to shrink XFS filesystems
that doesn't already handle grow operations from the same ioctl call
returning success? Hell, I'd like to know what app is even trying to
shrink XFS filesystems...

> I have also argued elsewhere that the fact that the request to shrink the
> "virtual" size vs. physical size is implicit and not explicit, that would hinder
> future attempts to use the API as it was intended to implement physical shrink.

No, feature bits decide the action to take without any ambiguity.

> Suggestion:
> Let userspace opt-in for the new "virtual grow" API by using the 3 upper
> bytes in (struct xfs_growfs_data){.imaxpct}.
> Those byes are guarantied to be zeroed by old application due to value
> range check in current code, so there is plenty of room to add flags byte
> and use it to request to grow USABLE_DBLOCK explicitly.

What's the point of adding this complexity? AFAICT it's a solution
for a problem that doesn't exist....

> All the logic in your code stays the same (i.e. grow physical to accomodate
> for growing virtual) only we stir away from being called by old apps by
> mistake.

My care factor about old 3rd party apps that have never been able to
test that shrink code path actually succeeded because the kernel
didn't support it is pretty damn close to zero.

Actually, wait ..... Ahhhhh. I have just reached the state of Care
Factor Zero. :)

Cheers,

Dave.
Amir Goldstein Oct. 26, 2017, 1:32 p.m. UTC | #3
On Thu, Oct 26, 2017 at 3:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Oct 26, 2017 at 02:30:22PM +0300, Amir Goldstein wrote:
>> On Thu, Oct 26, 2017 at 11:33 AM, Dave Chinner <david@fromorbit.com> wrote:
>> > From: Dave Chinner <dchinner@redhat.com>
>> >
>> > Now that we have persistent usable block counts, we need to be able
>> > to change them. This allows us to control thin provisioned
>> > filesystem space usage at the filesystem level, not the block device
>> > level.
>> >
>> > If the grow operation grows the usable space beyond the current
>> > LBA size of the filesystem, then we also need to physically grow the
>> > filesystem to match the new size of the underlying device. Hence
>> > grow behaves like it always has, expect for the fact that it wont'
>> > grow physically until usable space would exceed the LBA size.
>> >
>> > Being able to modify usable space also allows us to shrink the
>> > filesystem on thin devices as easily as growing it. We simply reduce
>> > the usable space and the free space, and we're done. The user then
>> > needs to run a fstrim pass to ensure all the unused space in the
>> > filesystem LBA is marked as unused by the underlying device. No data
>> > or metadata movement is required as the underlying LBA space has not
>> > changed.
>> >
>> > Signed-Off-By: Dave Chinner <dchinner@redhat.com>
>>
>> With this change, behavior of userspace program that tried to shrink filesystem
>> size will change from -EINVAL to success for filesystems that were configured
>> to allow that. But unmodified userspace program may still be caught by surprise
>> from this success return code that was never excersized in the past.
>
> What userspace program would be trying to shrink XFS filesystems
> that doesn't already handle grow operations from the same ioctl call
> returning success? Hell, I'd like to know what app is even trying to
> shrink XFS filesystems...
>

A buggy program of course ;-)

>> I have also argued elsewhere that the fact that the request to shrink the
>> "virtual" size vs. physical size is implicit and not explicit, that would hinder
>> future attempts to use the API as it was intended to implement physical shrink.
>
> No, feature bits decide the action to take without any ambiguity.
>

The ambiguity I was referring to was of the user program's intent.
Did it request to the set filesystem size to N or filesystem usable size to N.
When growing, the difference in intent doesn't change the result.
When shrinking AND feature bit is set, the intent makes a difference.

>> Suggestion:
>> Let userspace opt-in for the new "virtual grow" API by using the 3 upper
>> bytes in (struct xfs_growfs_data){.imaxpct}.
>> Those byes are guarantied to be zeroed by old application due to value
>> range check in current code, so there is plenty of room to add flags byte
>> and use it to request to grow USABLE_DBLOCK explicitly.
>
> What's the point of adding this complexity? AFAICT it's a solution
> for a problem that doesn't exist....
>

AFAICT you are right, but API review is a bit like legal contract review -
picking to problem that don't seem to exist - until one day we realize
that they do...

>> All the logic in your code stays the same (i.e. grow physical to accomodate
>> for growing virtual) only we stir away from being called by old apps by
>> mistake.
>
> My care factor about old 3rd party apps that have never been able to
> test that shrink code path actually succeeded because the kernel
> didn't support it is pretty damn close to zero.
>
> Actually, wait ..... Ahhhhh. I have just reached the state of Care
> Factor Zero. :)
>

Look to your right. I am right there with you :)
Anyway, I think that the cost of being wrong on this one is not so high
(famous last words)

Cheers,
Amir.
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Amir Goldstein Oct. 27, 2017, 10:26 a.m. UTC | #4
On Thu, Oct 26, 2017 at 4:32 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Thu, Oct 26, 2017 at 3:48 PM, Dave Chinner <david@fromorbit.com> wrote:
>> On Thu, Oct 26, 2017 at 02:30:22PM +0300, Amir Goldstein wrote:
>>> On Thu, Oct 26, 2017 at 11:33 AM, Dave Chinner <david@fromorbit.com> wrote:
>>> > From: Dave Chinner <dchinner@redhat.com>
>>> >
>>> > Now that we have persistent usable block counts, we need to be able
>>> > to change them. This allows us to control thin provisioned
>>> > filesystem space usage at the filesystem level, not the block device
>>> > level.
>>> >
>>> > If the grow operation grows the usable space beyond the current
>>> > LBA size of the filesystem, then we also need to physically grow the
>>> > filesystem to match the new size of the underlying device. Hence
>>> > grow behaves like it always has, expect for the fact that it wont'
>>> > grow physically until usable space would exceed the LBA size.
>>> >
>>> > Being able to modify usable space also allows us to shrink the
>>> > filesystem on thin devices as easily as growing it. We simply reduce
>>> > the usable space and the free space, and we're done. The user then
>>> > needs to run a fstrim pass to ensure all the unused space in the
>>> > filesystem LBA is marked as unused by the underlying device. No data
>>> > or metadata movement is required as the underlying LBA space has not
>>> > changed.
>>> >
>>> > Signed-Off-By: Dave Chinner <dchinner@redhat.com>
>>>
>>> With this change, behavior of userspace program that tried to shrink filesystem
>>> size will change from -EINVAL to success for filesystems that were configured
>>> to allow that. But unmodified userspace program may still be caught by surprise
>>> from this success return code that was never excersized in the past.
>>
>> What userspace program would be trying to shrink XFS filesystems
>> that doesn't already handle grow operations from the same ioctl call
>> returning success? Hell, I'd like to know what app is even trying to
>> shrink XFS filesystems...
>>
>
> A buggy program of course ;-)
>
>>> I have also argued elsewhere that the fact that the request to shrink the
>>> "virtual" size vs. physical size is implicit and not explicit, that would hinder
>>> future attempts to use the API as it was intended to implement physical shrink.
>>
>> No, feature bits decide the action to take without any ambiguity.
>>
>
> The ambiguity I was referring to was of the user program's intent.
> Did it request to the set filesystem size to N or filesystem usable size to N.
> When growing, the difference in intent doesn't change the result.

I take that back. I took a look at xfs_growfs.
If user requests to change only imaxpct xfs_growfs will set newblocks
to geo.datablocks, thus the user intent of "not changing data blocks"
when running an old version of xfs_growfs, is handled by current
patches as "blue moon is shining - give me the promised disk space".
Unless I am misunderstanding and geo.datablocks means usable
data block in FSGEOMETRY V4?

Don't see how you can get away without adding "my intent is to
modify usable blocks" to the API.
I hope you can prove me wrong, because the way you extended the
growfs API is very elegant.

Also, forgive me for not digging too deep myself to find the answer,
but is existing imaxpct being recalculated based on new usable_dblocks?
or does imaxpct still relative to total dblocks?

Thanks,
Amir.
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index 536fb353de03..41a34fb96047 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -106,6 +106,7 @@  int	xfs_log_calc_minimum_size(struct xfs_mount *);
 #define	XFS_TRANS_SB_RBLOCKS		0x00000800
 #define	XFS_TRANS_SB_REXTENTS		0x00001000
 #define	XFS_TRANS_SB_REXTSLOG		0x00002000
+#define	XFS_TRANS_SB_USABLE_DBLOCKS	0x00004000
 
 /*
  * Here we centralize the specification of XFS meta-data buffer reference count
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index e0565eb01c0b..d0a6e723e924 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -504,7 +504,7 @@  xfs_grow_ag_headers(
 }
 
 static int
-xfs_growfs_data_private(
+xfs_growfs_data_physical(
 	xfs_mount_t		*mp,		/* mount point for filesystem */
 	xfs_growfs_data_t	*in)		/* growfs data input struct */
 {
@@ -520,11 +520,11 @@  xfs_growfs_data_private(
 	xfs_trans_t		*tp;
 	LIST_HEAD		(buffer_list);
 	struct aghdr_init_data	id = {};
+	struct xfs_owner_info	oinfo;
 
 	nb = in->newblocks;
-	if (nb < mp->m_LBA_size)
-		return -EINVAL;
-	if ((error = xfs_sb_validate_fsb_count(&mp->m_sb, nb)))
+	error = xfs_sb_validate_fsb_count(&mp->m_sb, nb);
+	if (error)
 		return error;
 	error = xfs_buf_read_uncached(mp->m_ddev_targp,
 				XFS_FSB_TO_BB(mp, nb) - XFS_FSS_TO_BB(mp, 1),
@@ -539,7 +539,7 @@  xfs_growfs_data_private(
 	if (nb_mod && nb_mod < XFS_MIN_AG_BLOCKS) {
 		nagcount--;
 		nb = (xfs_rfsblock_t)nagcount * mp->m_sb.sb_agblocks;
-		if (nb < mp->m_LBA_size)
+		if (nb <= mp->m_LBA_size)
 			return -EINVAL;
 	}
 	new = nb - mp->m_LBA_size;
@@ -596,7 +596,6 @@  xfs_growfs_data_private(
 	 * There are new blocks in the old last a.g.
 	 */
 	if (new) {
-		struct xfs_owner_info	oinfo;
 
 		/*
 		 * Change the agi length.
@@ -649,9 +648,12 @@  xfs_growfs_data_private(
 	 */
 	if (nagcount > oagcount)
 		xfs_trans_mod_sb(tp, XFS_TRANS_SB_AGCOUNT, nagcount - oagcount);
-	if (nb > mp->m_LBA_size)
-		xfs_trans_mod_sb(tp, XFS_TRANS_SB_DBLOCKS,
-				 nb - mp->m_LBA_size);
+	if (nb > mp->m_LBA_size) {
+		int64_t delta = nb - mp->m_sb.sb_dblocks;
+
+		xfs_trans_mod_sb(tp, XFS_TRANS_SB_DBLOCKS, delta);
+		xfs_trans_mod_sb(tp, XFS_TRANS_SB_USABLE_DBLOCKS, delta);
+	}
 	if (id.nfree)
 		xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, id.nfree);
 	xfs_trans_set_sync(tp);
@@ -660,13 +662,12 @@  xfs_growfs_data_private(
 		return error;
 
 	/* New allocation groups fully initialized, so update mount struct */
-	mp->m_usable_blocks = mp->m_sb.sb_dblocks;
-	mp->m_LBA_size = mp->m_sb.sb_dblocks;
-
 	if (nagimax)
 		mp->m_maxagi = nagimax;
-	xfs_set_low_space_thresholds(mp);
-	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
+
+	/* This is a physical grow so the usable size matches the device size */
+	mp->m_LBA_size = mp->m_sb.sb_dblocks;
+	mp->m_usable_blocks = mp->m_LBA_size;
 
 	/*
 	 * If we expanded the last AG, free the per-AG reservation
@@ -801,6 +802,61 @@  xfs_growfs_update_superblocks(
 	return saved_error ? saved_error : error;
 }
 
+/*
+ * For thin filesystems, first we adjust the logical size of the filesystem
+ * to match the desired change. If the filesystem is physically not large
+ * enough, then we grow to the maximum logical size and leave the rest to
+ * the physical grow step. We also leave the the secondary superblock update
+ * to the physical grow step.
+ */
+static int
+xfs_growfs_data_thinspace(
+	struct xfs_mount	*mp,
+	struct xfs_growfs_data	*in)
+{
+	struct xfs_sb		*sbp = &mp->m_sb;
+	struct xfs_trans	*tp;
+	int64_t			delta;
+	int			error;
+
+	if (!xfs_sb_version_hasthinspace(sbp))
+		return 0;
+
+	delta = in->newblocks - sbp->sb_usable_dblocks;
+	if (!delta)
+		return 0;
+
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_growdata,
+			XFS_GROWFS_SPACE_RES(mp), 0, XFS_TRANS_RESERVE, &tp);
+	if (error)
+		return error;
+
+	/* grow to maximum logical size */
+	if (delta > 0) {
+		delta = min_t(int64_t, delta,
+			     sbp->sb_dblocks - sbp->sb_usable_dblocks);
+	}
+
+	/*
+	 * Modify incore free block counter. Shrink will return ENOSPC here if
+	 * there isn't free space available to shrink the amount requested.
+	 * We need this ENOSPC check here, which is why we can't use
+	 * xfs_trans_mod_sb() for this set of superblock modifications.
+	 */
+	error = xfs_mod_fdblocks(mp, delta, false);
+	if (error) {
+		xfs_trans_cancel(tp);
+		return error;
+	}
+
+	/* Update the new size and log the superblock. */
+	sbp->sb_usable_dblocks += delta;
+	mp->m_usable_blocks += delta;
+	xfs_log_sb(tp);
+	xfs_trans_set_sync(tp);
+	return xfs_trans_commit(tp);
+}
+
 /*
  * protected versions of growfs function acquire and release locks on the mount
  * point - exported through ioctls: XFS_IOC_FSGROWFSDATA, XFS_IOC_FSGROWFSLOG,
@@ -822,12 +878,24 @@  xfs_growfs_data(
 	if (in->imaxpct != mp->m_sb.sb_imax_pct) {
 		error = xfs_growfs_imaxpct(mp, in->imaxpct);
 		if (error)
-			goto out_error;
+			goto out_unlock;
 	}
 
-	error = xfs_growfs_data_private(mp, in);
+	error = xfs_growfs_data_thinspace(mp, in);
 	if (error)
-		goto out_error;
+		goto out_unlock;
+
+	/*
+	 * For thinspace filesystems, we can shrink the logical size and hence
+	 * newblocks can be less than the sb_dblocks. Shrinks will be done
+	 * entirely in thinspace, so only do a physical grow if it is needed.
+	 */
+	if (!xfs_sb_version_hasthinspace(&mp->m_sb) ||
+	    in->newblocks > mp->m_LBA_size) {
+		error = xfs_growfs_data_physical(mp, in);
+		if (error)
+			goto out_unlock;
+	}
 
 	/*
 	 * Post growfs calculations needed to reflect new state in operations
@@ -838,13 +906,15 @@  xfs_growfs_data(
 		mp->m_maxicount = icount << mp->m_sb.sb_inopblog;
 	} else
 		mp->m_maxicount = 0;
+	xfs_set_low_space_thresholds(mp);
+	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
 
 	/*
 	 * Update secondary superblocks now the physical grow has completed
 	 */
 	error = xfs_growfs_update_superblocks(mp);
 
-out_error:
+out_unlock:
 	/*
 	 * Increment the generation unconditionally, the error could be from
 	 * updating the secondary superblocks, in which case the new size
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index a87f657f59c9..eb1658deacd6 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -391,6 +391,9 @@  xfs_trans_mod_sb(
 	case XFS_TRANS_SB_REXTSLOG:
 		tp->t_rextslog_delta += delta;
 		break;
+	case XFS_TRANS_SB_USABLE_DBLOCKS:
+		tp->t_usable_dblock_delta += delta;
+		break;
 	default:
 		ASSERT(0);
 		return;
@@ -477,6 +480,15 @@  xfs_trans_apply_sb_deltas(
 		whole = 1;
 	}
 
+	/* Only modify the thinspace sb fields if enabled */
+	if (xfs_sb_version_hasthinspace(&tp->t_mountp->m_sb) &&
+	    tp->t_usable_dblock_delta) {
+		be64_add_cpu(&sbp->sb_usable_dblocks,
+			     tp->t_usable_dblock_delta);
+		whole = 1;
+	}
+
+
 	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SB_BUF);
 	if (whole)
 		/*
@@ -659,9 +671,18 @@  xfs_trans_unreserve_and_mod_sb(
 		if (error)
 			goto out_undo_rextents;
 	}
+	if (tp->t_usable_dblock_delta != 0) {
+		error = xfs_sb_mod64(&mp->m_sb.sb_usable_dblocks,
+				     tp->t_usable_dblock_delta);
+		if (error)
+			goto out_undo_rextslog;
+	}
 	spin_unlock(&mp->m_sb_lock);
 	return;
 
+out_undo_rextslog:
+	if (tp->t_rextslog_delta)
+		xfs_sb_mod8(&mp->m_sb.sb_rextslog, -tp->t_rextslog_delta);
 out_undo_rextents:
 	if (tp->t_rextents_delta)
 		xfs_sb_mod64(&mp->m_sb.sb_rextents, -tp->t_rextents_delta);
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 815b53d20e26..f8c816956ba2 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -131,6 +131,7 @@  typedef struct xfs_trans {
 	int64_t			t_rblocks_delta;/* superblock rblocks change */
 	int64_t			t_rextents_delta;/* superblocks rextents chg */
 	int64_t			t_rextslog_delta;/* superblocks rextslog chg */
+	int64_t			t_usable_dblock_delta;/* usable space */
 	struct list_head	t_items;	/* log item descriptors */
 	struct list_head	t_busy;		/* list of busy extents */
 	unsigned long		t_pflags;	/* saved process flags state */