[2/9] spaceman/defrag: pick up segments from target file

Message ID 20240709191028.2329-3-wen.gang.wang@oracle.com (mailing list archive)
State Accepted, archived
Series introduce defrag to xfs_spaceman

Commit Message

Wengang Wang July 9, 2024, 7:10 p.m. UTC
segments are the smallest unit to defragment.

A segment
1. Can't exceed the size limit
2. contains some extents
3. the contained extents can't be "unwritten"
4. the contained extents must be contiguous in file blocks

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
---
 spaceman/defrag.c | 204 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 204 insertions(+)

Comments

Darrick J. Wong July 9, 2024, 9:50 p.m. UTC | #1
On Tue, Jul 09, 2024 at 12:10:21PM -0700, Wengang Wang wrote:
> segments are the smallest unit to defragment.
> 
> A segment
> 1. Can't exceed size limit

What size limit?  Do you mean a segment can't extend beyond EOF?  Or did
you actually mean RLIMIT_FSIZE?

> 2. contains some extents
> 3. the contained extents can't be "unwritten"
> 4. the contained extents must be contigous in file blocks

As in the segment cannot contain sparse holes?

I think what I"m reading here is that a segment cannot extend beyond EOF
and must be completely filled with written extent mappings?

Is there an upper limit on the number of mappings per segment?

> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> ---
>  spaceman/defrag.c | 204 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 204 insertions(+)
> 
> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
> index c9732984..175cf461 100644
> --- a/spaceman/defrag.c
> +++ b/spaceman/defrag.c
> @@ -14,6 +14,32 @@
>  #include "space.h"
>  #include "input.h"
>  
> +#define MAPSIZE 512
> +/* used to fetch bmap */
> +struct getbmapx	g_mapx[MAPSIZE];

Each of these global arrays increases the bss segment size, which
increases the overall footprint of xfs_spaceman, even when it's not
being used to defragment files.

Could you switch this data to be dynamically allocated at the start of
defrag_f and freed at the end?

> +/* current offset of the file in units of 512 bytes, used to fetch bmap */
> +static long long 	g_offset = 0;

Unnecessary space after the 'long long'.

> +/* index to indentify next extent, used to get next extent */
> +static int		g_ext_next_idx = -1;
> +
> +/*
> + * segment, the smallest unit to defrag
> + * it includes some contiguous extents.
> + * no holes included,
> + * no unwritten extents included
> + * the size is limited by g_segment_size_lmt
> + */
> +struct defrag_segment {
> +	/* segment offset in units of 512 bytes */
> +	long long	ds_offset;
> +	/* length of segment in units of 512 bytes */
> +	long long	ds_length;
> +	/* number of extents in this segment */
> +	int		ds_nr;
> +	/* flag indicating if segment contains shared blocks */
> +	bool		ds_shared;

Maybe g_mapx belongs in here?  Wait, does a bunch of contiguous written
bmapx records comprise a segment, or is a segment created from (possibly
a subsection of) a particular written bmapx record?

> +};
> +
>  /* defrag segment size limit in units of 512 bytes */
>  #define MIN_SEGMENT_SIZE_LIMIT 8192 /* 4MiB */
>  #define DEFAULT_SEGMENT_SIZE_LIMIT 32768 /* 16MiB */
> @@ -78,6 +104,165 @@ defrag_check_file(char *path)
>  	return true;
>  }
>  
> +/*
> + * get next extent in the file.
> + * Note: next call will get the same extent unless move_next_extent() is called.
> + * returns:
> + * -1:	error happened.
> + * 0:	extent returned
> + * 1:	no more extent left
> + */
> +static int
> +defrag_get_next_extent(int fd, struct getbmapx *map_out)
> +{
> +	int err = 0, i;
> +
> +	/* when no extents are cached in g_mapx, fetch from kernel */
> +	if (g_ext_next_idx == -1) {
> +		g_mapx[0].bmv_offset = g_offset;
> +		g_mapx[0].bmv_length = -1LL;
> +		g_mapx[0].bmv_count = MAPSIZE;
> +		g_mapx[0].bmv_iflags = BMV_IF_NO_HOLES | BMV_IF_PREALLOC;
> +		err = ioctl(fd, XFS_IOC_GETBMAPX, g_mapx);
> +		if (err == -1) {
> +			perror("XFS_IOC_GETBMAPX failed");
> +			goto out;
> +		}
> +		/* for stats */
> +		g_ext_stats.nr_ext_total += g_mapx[0].bmv_entries;
> +
> +		/* no more extents */
> +		if (g_mapx[0].bmv_entries == 0) {
> +			err = 1;
> +			goto out;
> +		}
> +
> +		/* for stats */
> +		for (i = 1; i <= g_mapx[0].bmv_entries; i++) {
> +			if (g_mapx[i].bmv_oflags & BMV_OF_PREALLOC)
> +				g_ext_stats.nr_ext_unwritten++;
> +			if (g_mapx[i].bmv_oflags & BMV_OF_SHARED)
> +				g_ext_stats.nr_ext_shared++;
> +		}
> +
> +		g_ext_next_idx = 1;
> +		g_offset = g_mapx[g_mapx[0].bmv_entries].bmv_offset +
> +				g_mapx[g_mapx[0].bmv_entries].bmv_length;
> +	}

Huh.  AFAICT, g_ext_next_idx/g_mapx effectively act as a cursor over the
mappings for this file segment.  In that case, shouldn't
defrag_move_next_extent (which actually advances the cursor) be in
charge of grabbing bmapx records from the kernel, and
defrag_get_next_extent be the trivial helper to pass mappings to the
consumer of the bmapx objects (aka defrag_get_next_segment)?

> +
> +	map_out->bmv_offset = g_mapx[g_ext_next_idx].bmv_offset;
> +	map_out->bmv_length = g_mapx[g_ext_next_idx].bmv_length;
> +	map_out->bmv_oflags = g_mapx[g_ext_next_idx].bmv_oflags;
> +out:
> +	return err;
> +}
> +
> +/*
> + * move to next extent
> + */
> +static void
> +defrag_move_next_extent()
> +{
> +	if (g_ext_next_idx == g_mapx[0].bmv_entries)
> +		g_ext_next_idx = -1;
> +	else
> +		g_ext_next_idx += 1;
> +}
> +
> +/*
> + * check if the given extent is a defrag target.
> + * no need to check for holes as we are using BMV_IF_NO_HOLES
> + */
> +static bool
> +defrag_is_target(struct getbmapx *mapx)
> +{
> +	/* unwritten */
> +	if (mapx->bmv_oflags & BMV_OF_PREALLOC)
> +		return false;
> +	return mapx->bmv_length < g_segment_size_lmt;
> +}
> +
> +static bool
> +defrag_is_extent_shared(struct getbmapx *mapx)
> +{
> +	return !!(mapx->bmv_oflags & BMV_OF_SHARED);
> +}
> +
> +/*
> + * get next segment to defragment.
> + * returns:
> + * -1	error happened.
> + * 0	segment returned.
> + * 1	no more segments to return
> + */
> +static int
> +defrag_get_next_segment(int fd, struct defrag_segment *out)
> +{
> +	struct getbmapx mapx;
> +	int	ret;
> +
> +	out->ds_offset = 0;
> +	out->ds_length = 0;
> +	out->ds_nr = 0;
> +	out->ds_shared = false;
> +
> +	do {
> +		ret = defrag_get_next_extent(fd, &mapx);
> +		if (ret != 0) {
> +			/*
> +			 * no more extetns, return current segment if its not
> +			 * empty
> +			*/
> +			if (ret == 1 && out->ds_nr > 0)
> +				ret = 0;
> +			/* otherwise, error heppened, stop */
> +			break;
> +		}
> +
> +		/*
> +		 * If the extent is not a defrag target, skip it.
> +		 * go to next extent if the segment is empty;
> +		 * otherwise return the segment.
> +		 */
> +		if (!defrag_is_target(&mapx)) {
> +			defrag_move_next_extent();
> +			if (out->ds_nr == 0)
> +				continue;
> +			else
> +				break;
> +		}
> +
> +		/* check for segment size limitation */
> +		if (out->ds_length + mapx.bmv_length > g_segment_size_lmt)
> +			break;
> +
> +		/* the segment is empty now, add this extent to it for sure */
> +		if (out->ds_nr == 0) {
> +			out->ds_offset = mapx.bmv_offset;
> +			goto add_ext;
> +		}
> +
> +		/*
> +		 * the segment is not empty, check for hole since the last exent
> +		 * if a hole exist before this extent, this extent can't be
> +		 * added to the segment. return the segment
> +		 */
> +		if (out->ds_offset + out->ds_length != mapx.bmv_offset)
> +			break;
> +
> +add_ext:
> +		if (defrag_is_extent_shared(&mapx))
> +			out->ds_shared = true;
> +
> +		out->ds_length += mapx.bmv_length;
> +		out->ds_nr += 1;

OH, ok.  So we walk the mappings for a file.  If we can identify a run
of contiguous written mappings, we define a segment to be the file range
described by that run, up to whatever the maximum is (~4-16M).  Each of
these segments is defragmented (somehow).  Is that correct?

> +		defrag_move_next_extent();
> +
> +	} while (true);
> +
> +	return ret;
> +}
> +
>  /*
>   * defragment a file
>   * return 0 if successfully done, 1 otherwise
> @@ -92,6 +277,9 @@ defrag_xfs_defrag(char *file_path) {
>  	struct fsxattr	fsx;
>  	int	ret = 0;
>  
> +	g_offset = 0;
> +	g_ext_next_idx = -1;
> +
>  	fsx.fsx_nextents = 0;
>  	memset(&g_ext_stats, 0, sizeof(g_ext_stats));
>  
> @@ -119,6 +307,22 @@ defrag_xfs_defrag(char *file_path) {
>  		ret = 1;
>  		goto out;
>  	}
> +
> +	do {
> +		struct defrag_segment segment;
> +
> +		ret = defrag_get_next_segment(defrag_fd, &segment);
> +		/* no more segments, we are done */
> +		if (ret == 1) {
> +			ret = 0;
> +			break;
> +		}

If you reverse the polarity of the 0/1 return values (aka return 1 if
there is a segment and 0 if there is none) then you can shorten this
loop to:

	struct defrag_segment	segment;
	int			ret;

	while ((ret = defrag_get_next_segment(...)) == 1) {
		/* process segment */
	}

	return ret;

--D

> +		/* error happened when reading bmap, stop here */
> +		if (ret == -1) {
> +			ret = 1;
> +			break;
> +		}
> +	} while (true);
>  out:
>  	if (scratch_fd != -1) {
>  		close(scratch_fd);
> -- 
> 2.39.3 (Apple Git-146)
> 
>
Wengang Wang July 11, 2024, 10:37 p.m. UTC | #2
> On Jul 9, 2024, at 2:50 PM, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> On Tue, Jul 09, 2024 at 12:10:21PM -0700, Wengang Wang wrote:
>> segments are the smallest unit to defragment.
>> 
>> A segment
>> 1. Can't exceed size limit
> 
> What size limit?  Do you mean a segment can't extend beyond EOF?  Or did
> you actually mean RLIMIT_FSIZE?

We defrag by first sharing the (non-contiguous) blocks and then UNSHAREing them to get
contiguous blocks.  The size limit caps the range for which we issue a single UNSHARE
ioctl call: we gather contiguous extents (mappings) together (as a segment) for one
UNSHARE call.  We don't want a segment to be too big, because UNSHARE holds the file's
IO lock for the duration of the call.  The default size limit is 16MiB.
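
To make that concrete: assuming the UNSHARE call is issued via
fallocate(FALLOC_FL_UNSHARE_RANGE), unsharing one segment boils down to
something like the sketch below (illustration only, not the code in this
series; defrag_unshare_segment is a made-up name):

	/*
	 * Sketch: COW the shared blocks of one segment into fresh
	 * (ideally contiguous) allocations.  Needs _GNU_SOURCE and
	 * <fcntl.h> (or <linux/falloc.h>) for FALLOC_FL_UNSHARE_RANGE.
	 * ds_offset/ds_length are in 512-byte units; fallocate() takes
	 * bytes.
	 */
	static int
	defrag_unshare_segment(int fd, const struct defrag_segment *ds)
	{
		if (fallocate(fd, FALLOC_FL_UNSHARE_RANGE,
			      (off_t)ds->ds_offset << 9,
			      (off_t)ds->ds_length << 9) == -1) {
			perror("FALLOC_FL_UNSHARE_RANGE");
			return -1;
		}
		return 0;
	}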

> 
>> 2. contains some extents
>> 3. the contained extents can't be "unwritten"
>> 4. the contained extents must be contigous in file blocks
> 
> As in the segment cannot contain sparse holes?

Holes are meaningless for unshare calls, so they are not included in segments.

> 
> I think what I"m reading here is that a segment cannot extend beyond EOF
> and must be completely filled with written extent mappings?

Yes. A segment, as the unit/range for UNSHARE:
1) contains extents that are contiguous by file offset
2) contains no holes
3) contains no unwritten extents; unwritten extents are meaningless for UNSHARE.
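
Condensed into code, those rules look like this (a hypothetical helper
that just restates the checks defrag_is_target()/defrag_get_next_segment()
perform, not new logic):

	static bool
	segment_can_add(const struct defrag_segment *ds,
			const struct getbmapx *mapx)
	{
		/* unwritten extents are meaningless for UNSHARE */
		if (mapx->bmv_oflags & BMV_OF_PREALLOC)
			return false;
		/* must be contiguous by file offset with the segment */
		if (ds->ds_nr > 0 &&
		    ds->ds_offset + ds->ds_length != mapx->bmv_offset)
			return false;
		/* must not push the segment over the size limit */
		return ds->ds_length + mapx->bmv_length <= g_segment_size_lmt;
	}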

> 
> Is there an upper limit on the number of mappings per segment?
> 
>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
>> ---
>> spaceman/defrag.c | 204 ++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 204 insertions(+)
>> 
>> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
>> index c9732984..175cf461 100644
>> --- a/spaceman/defrag.c
>> +++ b/spaceman/defrag.c
>> @@ -14,6 +14,32 @@
>> #include "space.h"
>> #include "input.h"
>> 
>> +#define MAPSIZE 512
>> +/* used to fetch bmap */
>> +struct getbmapx g_mapx[MAPSIZE];
> 
> Each of these global arrays increases the bss segment size, which
> increases the overall footprint of xfs_spaceman, even when it's not
> being used to defragment files.
> 
> Could you switch this data to be dynamically allocated at the start of
> defrag_f and freed at the end?
> 

Yes, I can.
Though I wonder whether the binary size is that much of a concern?
After adding defrag, xfs_spaceman is 239K, which I don't think is too big.

$ ll -h spaceman/xfs_spaceman
-rwxrwxr-x 1 ubuntu ubuntu 239K Jul 10 17:20 spaceman/xfs_spaceman*



>> +/* current offset of the file in units of 512 bytes, used to fetch bmap */
>> +static long long  g_offset = 0;
> 
> Unnecessary space after the 'long long'.
> 

ok.

>> +/* index to indentify next extent, used to get next extent */
>> +static int g_ext_next_idx = -1;
>> +
>> +/*
>> + * segment, the smallest unit to defrag
>> + * it includes some contiguous extents.
>> + * no holes included,
>> + * no unwritten extents included
>> + * the size is limited by g_segment_size_lmt
>> + */
>> +struct defrag_segment {
>> + /* segment offset in units of 512 bytes */
>> + long long ds_offset;
>> + /* length of segment in units of 512 bytes */
>> + long long ds_length;
>> + /* number of extents in this segment */
>> + int ds_nr;
>> + /* flag indicating if segment contains shared blocks */
>> + bool ds_shared;
> 
> Maybe g_mapx belongs in here?  Wait, does a bunch of contiguous written
> bmapx records comprise a segment, or is a segment created from (possibly
> a subection of) a particular written bmapx record?

g_mapx is used to fetch the mappings from the beginning of the file to the end.
It:
1) can hold at most 511 (MAPSIZE - 1) extents/mappings per XFS_IOC_GETBMAPX call
2) can hold enough mappings to fill more than one segment
3) can hold only part of a single segment

We walk through the extents/mappings to form segments, and g_mapx is the
buffer used during that walk. For how a segment is formed, please see the
other functions.

> 
>> +};
>> +
>> /* defrag segment size limit in units of 512 bytes */
>> #define MIN_SEGMENT_SIZE_LIMIT 8192 /* 4MiB */
>> #define DEFAULT_SEGMENT_SIZE_LIMIT 32768 /* 16MiB */
>> @@ -78,6 +104,165 @@ defrag_check_file(char *path)
>> return true;
>> }
>> 
>> +/*
>> + * get next extent in the file.
>> + * Note: next call will get the same extent unless move_next_extent() is called.
>> + * returns:
>> + * -1: error happened.
>> + * 0: extent returned
>> + * 1: no more extent left
>> + */
>> +static int
>> +defrag_get_next_extent(int fd, struct getbmapx *map_out)
>> +{
>> + int err = 0, i;
>> +
>> + /* when no extents are cached in g_mapx, fetch from kernel */
>> + if (g_ext_next_idx == -1) {
>> + g_mapx[0].bmv_offset = g_offset;
>> + g_mapx[0].bmv_length = -1LL;
>> + g_mapx[0].bmv_count = MAPSIZE;
>> + g_mapx[0].bmv_iflags = BMV_IF_NO_HOLES | BMV_IF_PREALLOC;
>> + err = ioctl(fd, XFS_IOC_GETBMAPX, g_mapx);
>> + if (err == -1) {
>> + perror("XFS_IOC_GETBMAPX failed");
>> + goto out;
>> + }
>> + /* for stats */
>> + g_ext_stats.nr_ext_total += g_mapx[0].bmv_entries;
>> +
>> + /* no more extents */
>> + if (g_mapx[0].bmv_entries == 0) {
>> + err = 1;
>> + goto out;
>> + }
>> +
>> + /* for stats */
>> + for (i = 1; i <= g_mapx[0].bmv_entries; i++) {
>> + if (g_mapx[i].bmv_oflags & BMV_OF_PREALLOC)
>> + g_ext_stats.nr_ext_unwritten++;
>> + if (g_mapx[i].bmv_oflags & BMV_OF_SHARED)
>> + g_ext_stats.nr_ext_shared++;
>> + }
>> +
>> + g_ext_next_idx = 1;
>> + g_offset = g_mapx[g_mapx[0].bmv_entries].bmv_offset +
>> + g_mapx[g_mapx[0].bmv_entries].bmv_length;
>> + }
> 
> Huh.  AFAICT, g_ext_next_idx/g_mapx effectively act as a cursor over the
> mappings for this file segment.  

Yes, right.

> In that case, shouldn't
> defrag_move_next_extent (which actually advances the cursor) be in
> charge of grabbing bmapx records from the kernel, and
> defrag_get_next_extent be the trivial helper to pass mappings to the
> consumer of the bmapx objects (aka defrag_get_next_segment)?
> 

defrag_move_next_extent() moves the cursor; defrag_get_next_extent() returns the
extent at the cursor but doesn't move the cursor.
There is a case where adding the current extent (returned by defrag_get_next_extent())
to the current segment would make that segment too big. In that case we don't add the
extent to the current segment, but we will add it to the next segment, so
defrag_get_next_extent() is called again and must return the same extent.
In summary, we move the cursor only when:
1) the extent at the cursor is not a defrag target (unwritten or too big), or
2) the extent has been added to the current segment.
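
In code, the peek-then-consume pattern looks like this (segment_can_add()
and segment_add() are hypothetical helpers, shown only to illustrate the
shape of the loop body):

	ret = defrag_get_next_extent(fd, &mapx);	/* peek, cursor unmoved */
	if (ret == 0 && segment_can_add(&segment, &mapx)) {
		segment_add(&segment, &mapx);
		defrag_move_next_extent();		/* consume */
	}
	/* if it doesn't fit, the cursor stays put and the next peek
	 * returns the same extent for the next segment */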

>> +
>> + map_out->bmv_offset = g_mapx[g_ext_next_idx].bmv_offset;
>> + map_out->bmv_length = g_mapx[g_ext_next_idx].bmv_length;
>> + map_out->bmv_oflags = g_mapx[g_ext_next_idx].bmv_oflags;
>> +out:
>> + return err;
>> +}
>> +
>> +/*
>> + * move to next extent
>> + */
>> +static void
>> +defrag_move_next_extent()
>> +{
>> + if (g_ext_next_idx == g_mapx[0].bmv_entries)
>> + g_ext_next_idx = -1;
>> + else
>> + g_ext_next_idx += 1;
>> +}
>> +
>> +/*
>> + * check if the given extent is a defrag target.
>> + * no need to check for holes as we are using BMV_IF_NO_HOLES
>> + */
>> +static bool
>> +defrag_is_target(struct getbmapx *mapx)
>> +{
>> + /* unwritten */
>> + if (mapx->bmv_oflags & BMV_OF_PREALLOC)
>> + return false;
>> + return mapx->bmv_length < g_segment_size_lmt;
>> +}
>> +
>> +static bool
>> +defrag_is_extent_shared(struct getbmapx *mapx)
>> +{
>> + return !!(mapx->bmv_oflags & BMV_OF_SHARED);
>> +}
>> +
>> +/*
>> + * get next segment to defragment.
>> + * returns:
>> + * -1 error happened.
>> + * 0 segment returned.
>> + * 1 no more segments to return
>> + */
>> +static int
>> +defrag_get_next_segment(int fd, struct defrag_segment *out)
>> +{
>> + struct getbmapx mapx;
>> + int ret;
>> +
>> + out->ds_offset = 0;
>> + out->ds_length = 0;
>> + out->ds_nr = 0;
>> + out->ds_shared = false;
>> +
>> + do {
>> + ret = defrag_get_next_extent(fd, &mapx);
>> + if (ret != 0) {
>> + /*
>> +  * no more extetns, return current segment if its not
>> +  * empty
>> + */
>> + if (ret == 1 && out->ds_nr > 0)
>> + ret = 0;
>> + /* otherwise, error heppened, stop */
>> + break;
>> + }
>> +
>> + /*
>> +  * If the extent is not a defrag target, skip it.
>> +  * go to next extent if the segment is empty;
>> +  * otherwise return the segment.
>> +  */
>> + if (!defrag_is_target(&mapx)) {
>> + defrag_move_next_extent();
>> + if (out->ds_nr == 0)
>> + continue;
>> + else
>> + break;
>> + }
>> +
>> + /* check for segment size limitation */
>> + if (out->ds_length + mapx.bmv_length > g_segment_size_lmt)
>> + break;
>> +
>> + /* the segment is empty now, add this extent to it for sure */
>> + if (out->ds_nr == 0) {
>> + out->ds_offset = mapx.bmv_offset;
>> + goto add_ext;
>> + }
>> +
>> + /*
>> +  * the segment is not empty, check for hole since the last exent
>> +  * if a hole exist before this extent, this extent can't be
>> +  * added to the segment. return the segment
>> +  */
>> + if (out->ds_offset + out->ds_length != mapx.bmv_offset)
>> + break;
>> +
>> +add_ext:
>> + if (defrag_is_extent_shared(&mapx))
>> + out->ds_shared = true;
>> +
>> + out->ds_length += mapx.bmv_length;
>> + out->ds_nr += 1;
> 
> OH, ok.  So we walk the mappings for a file.  If we can identify a run
> of contiguous written mappings, we define a segment to be the file range
> described by that run, up to whatever the maximum is (~4-16M).  Each of
> these segments is defragmented (somehow).  Is that correct?

Yes, correct.

>> + defrag_move_next_extent();
>> +
>> + } while (true);
>> +
>> + return ret;
>> +}
>> +
>> /*
>>  * defragment a file
>>  * return 0 if successfully done, 1 otherwise
>> @@ -92,6 +277,9 @@ defrag_xfs_defrag(char *file_path) {
>> struct fsxattr fsx;
>> int ret = 0;
>> 
>> + g_offset = 0;
>> + g_ext_next_idx = -1;
>> +
>> fsx.fsx_nextents = 0;
>> memset(&g_ext_stats, 0, sizeof(g_ext_stats));
>> 
>> @@ -119,6 +307,22 @@ defrag_xfs_defrag(char *file_path) {
>> ret = 1;
>> goto out;
>> }
>> +
>> + do {
>> + struct defrag_segment segment;
>> +
>> + ret = defrag_get_next_segment(defrag_fd, &segment);
>> + /* no more segments, we are done */
>> + if (ret == 1) {
>> + ret = 0;
>> + break;
>> + }
> 
> If you reverse the polarity of the 0/1 return values (aka return 1 if
> there is a segment and 0 if there is none) then you can shorten this
> loop to:
> 
> struct defrag_segment segment;
> int ret;
> 
> while ((ret = defrag_get_next_segment(...)) == 1) {
> /* process segment */
> }
> 
> return ret;
> 

Yes, that might be better. What I was thinking is that "0" usually means a 'good'
return :D  Will consider this.

Thanks,
Wengang

> --D
> 
>> + /* error happened when reading bmap, stop here */
>> + if (ret == -1) {
>> + ret = 1;
>> + break;
>> + }
>> + } while (true);
>> out:
>> if (scratch_fd != -1) {
>> close(scratch_fd);
>> -- 
>> 2.39.3 (Apple Git-146)
Dave Chinner July 15, 2024, 11:40 p.m. UTC | #3
On Tue, Jul 09, 2024 at 12:10:21PM -0700, Wengang Wang wrote:
> segments are the smallest unit to defragment.
> 
> A segment
> 1. Can't exceed size limit
> 2. contains some extents
> 3. the contained extents can't be "unwritten"
> 4. the contained extents must be contigous in file blocks
> 
> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> ---
>  spaceman/defrag.c | 204 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 204 insertions(+)
> 
> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
> index c9732984..175cf461 100644
> --- a/spaceman/defrag.c
> +++ b/spaceman/defrag.c
> @@ -14,6 +14,32 @@
>  #include "space.h"
>  #include "input.h"
>  
> +#define MAPSIZE 512
> +/* used to fetch bmap */
> +struct getbmapx	g_mapx[MAPSIZE];
> +/* current offset of the file in units of 512 bytes, used to fetch bmap */
> +static long long 	g_offset = 0;
> +/* index to indentify next extent, used to get next extent */
> +static int		g_ext_next_idx = -1;

Please do not prefix global variables with "g_". This is not
useful, and simply makes the code hard to read.

That said, it is much better to pass these as function parameters so
they are specific to the mapping context and so are inherently
thread safe.

> +/*
> + * segment, the smallest unit to defrag
> + * it includes some contiguous extents.
> + * no holes included,
> + * no unwritten extents included
> + * the size is limited by g_segment_size_lmt
> + */

I have no idea what this comment is trying to tell me.

> +struct defrag_segment {
> +	/* segment offset in units of 512 bytes */
> +	long long	ds_offset;
> +	/* length of segment in units of 512 bytes */
> +	long long	ds_length;
> +	/* number of extents in this segment */
> +	int		ds_nr;
> +	/* flag indicating if segment contains shared blocks */
> +	bool		ds_shared;
> +};
> +
>  /* defrag segment size limit in units of 512 bytes */
>  #define MIN_SEGMENT_SIZE_LIMIT 8192 /* 4MiB */
>  #define DEFAULT_SEGMENT_SIZE_LIMIT 32768 /* 16MiB */
> @@ -78,6 +104,165 @@ defrag_check_file(char *path)
>  	return true;
>  }
>  
> +/*
> + * get next extent in the file.
> + * Note: next call will get the same extent unless move_next_extent() is called.
> + * returns:
> + * -1:	error happened.
> + * 0:	extent returned
> + * 1:	no more extent left
> + */
> +static int
> +defrag_get_next_extent(int fd, struct getbmapx *map_out)
> +{
> +	int err = 0, i;
> +
> +	/* when no extents are cached in g_mapx, fetch from kernel */
> +	if (g_ext_next_idx == -1) {
> +		g_mapx[0].bmv_offset = g_offset;
> +		g_mapx[0].bmv_length = -1LL;
> +		g_mapx[0].bmv_count = MAPSIZE;
> +		g_mapx[0].bmv_iflags = BMV_IF_NO_HOLES | BMV_IF_PREALLOC;
> +		err = ioctl(fd, XFS_IOC_GETBMAPX, g_mapx);
> +		if (err == -1) {
> +			perror("XFS_IOC_GETBMAPX failed");
> +			goto out;
> +		}
> +		/* for stats */
> +		g_ext_stats.nr_ext_total += g_mapx[0].bmv_entries;
> +
> +		/* no more extents */
> +		if (g_mapx[0].bmv_entries == 0) {
> +			err = 1;
> +			goto out;
> +		}
> +
> +		/* for stats */
> +		for (i = 1; i <= g_mapx[0].bmv_entries; i++) {
> +			if (g_mapx[i].bmv_oflags & BMV_OF_PREALLOC)
> +				g_ext_stats.nr_ext_unwritten++;
> +			if (g_mapx[i].bmv_oflags & BMV_OF_SHARED)
> +				g_ext_stats.nr_ext_shared++;
> +		}
> +
> +		g_ext_next_idx = 1;
> +		g_offset = g_mapx[g_mapx[0].bmv_entries].bmv_offset +
> +				g_mapx[g_mapx[0].bmv_entries].bmv_length;
> +	}
> +
> +	map_out->bmv_offset = g_mapx[g_ext_next_idx].bmv_offset;
> +	map_out->bmv_length = g_mapx[g_ext_next_idx].bmv_length;
> +	map_out->bmv_oflags = g_mapx[g_ext_next_idx].bmv_oflags;
> +out:
> +	return err;
> +}

Ok, so the global variables are just a bmap cache. That's a problem,
because this cache is stale the moment XFS_IOC_GETBMAPX returns to
userspace. Iterating it to decide exactly what to do next will
race with ongoing file modifications and so it's not going to be
accurate....

> +
> +/*
> + * move to next extent
> + */
> +static void
> +defrag_move_next_extent()
> +{
> +	if (g_ext_next_idx == g_mapx[0].bmv_entries)
> +		g_ext_next_idx = -1;
> +	else
> +		g_ext_next_idx += 1;
> +}
> +
> +/*
> + * check if the given extent is a defrag target.
> + * no need to check for holes as we are using BMV_IF_NO_HOLES
> + */
> +static bool
> +defrag_is_target(struct getbmapx *mapx)
> +{
> +	/* unwritten */
> +	if (mapx->bmv_oflags & BMV_OF_PREALLOC)
> +		return false;
> +	return mapx->bmv_length < g_segment_size_lmt;
> +}
> +
> +static bool
> +defrag_is_extent_shared(struct getbmapx *mapx)
> +{
> +	return !!(mapx->bmv_oflags & BMV_OF_SHARED);
> +}
> +
> +/*
> + * get next segment to defragment.
> + * returns:
> + * -1	error happened.
> + * 0	segment returned.
> + * 1	no more segments to return
> + */
> +static int
> +defrag_get_next_segment(int fd, struct defrag_segment *out)
> +{
> +	struct getbmapx mapx;
> +	int	ret;
> +
> +	out->ds_offset = 0;
> +	out->ds_length = 0;
> +	out->ds_nr = 0;
> +	out->ds_shared = false;

out->ds_nr is never set to anything but zero in this patch.

> +
> +	do {
> +		ret = defrag_get_next_extent(fd, &mapx);
> +		if (ret != 0) {
> +			/*
> +			 * no more extetns, return current segment if its not

Typos everywhere.

> +			 * empty
> +			*/
> +			if (ret == 1 && out->ds_nr > 0)
> +				ret = 0;
> +			/* otherwise, error heppened, stop */
> +			break;
> +		}

> +
> +		/*
> +		 * If the extent is not a defrag target, skip it.
> +		 * go to next extent if the segment is empty;
> +		 * otherwise return the segment.
> +		 */
> +		if (!defrag_is_target(&mapx)) {
> +			defrag_move_next_extent();
> +			if (out->ds_nr == 0)
> +				continue;
> +			else
> +				break;
> +		}
> +
> +		/* check for segment size limitation */
> +		if (out->ds_length + mapx.bmv_length > g_segment_size_lmt)
> +			break;
> +
> +		/* the segment is empty now, add this extent to it for sure */
> +		if (out->ds_nr == 0) {
> +			out->ds_offset = mapx.bmv_offset;
> +			goto add_ext;
> +		}

So this is essentially a filter for the getbmapx output that strips
away unwritten extents and anything outside/larger than the target
range.

> +
> +		/*
> +		 * the segment is not empty, check for hole since the last exent
> +		 * if a hole exist before this extent, this extent can't be
> +		 * added to the segment. return the segment
> +		 */
> +		if (out->ds_offset + out->ds_length != mapx.bmv_offset)
> +			break;
> +
> +add_ext:

Why do you need a goto for this logic?

		/*
		 * the segment is not empty, check for hole since the last exent
		 * if a hole exist before this extent, this extent can't be
		 * added to the segment. return the segment
		 */
		if (out->ds_nr) {
			if (out->ds_offset + out->ds_length != mapx.bmv_offset)
				break;
		} else {
			out->ds_offset = mapx.bmv_offset;
		}

> +		if (defrag_is_extent_shared(&mapx))
> +			out->ds_shared = true;
> +
> +		out->ds_length += mapx.bmv_length;
> +		out->ds_nr += 1;
> +		defrag_move_next_extent();
> +
> +	} while (true);

> +
> +	return ret;
> +}
> +
>  /*
>   * defragment a file
>   * return 0 if successfully done, 1 otherwise
> @@ -92,6 +277,9 @@ defrag_xfs_defrag(char *file_path) {
>  	struct fsxattr	fsx;
>  	int	ret = 0;
>  
> +	g_offset = 0;
> +	g_ext_next_idx = -1;
> +
>  	fsx.fsx_nextents = 0;
>  	memset(&g_ext_stats, 0, sizeof(g_ext_stats));
>  
> @@ -119,6 +307,22 @@ defrag_xfs_defrag(char *file_path) {
>  		ret = 1;
>  		goto out;
>  	}
> +
> +	do {
> +		struct defrag_segment segment;
> +
> +		ret = defrag_get_next_segment(defrag_fd, &segment);
> +		/* no more segments, we are done */
> +		if (ret == 1) {
> +			ret = 0;
> +			break;
> +		}
> +		/* error happened when reading bmap, stop here */
> +		if (ret == -1) {
> +			ret = 1;
> +			break;
> +		}

ternary return values are nasty. Return a negative errno when an
error occurs, and -ENODATA when there are no more segments.
Then you have

		if (ret < 0) {
			if (ret == -ENODATA)
				exit_value = 0;
			else
				exit_value = 1;
			break;
		}

> +	} while (true);

Not a fan of do {} while(true) loops.

With the above error handling changes, this becomes:

	do {
		struct defrag_segment segment;

		ret = defrag_get_next_segment(defrag_fd, &segment);
	} while (ret == 0);

	if (ret == 0 || ret == -ENODATA)
		exit_value = 0;
	else
		exit_value = 1;


Ok, so this is a linear iteration of all extents in the file that
filters extents for the specific "segment" that is going to be
processed. I still have no idea why fixed length segments are
important, but "linear extent scan for filtering" seems somewhat
expensive.

Indeed, if you used FIEMAP, you can pass a minimum
segment length to filter out all the small extents. Iterating that
extent list means all the ranges you need to defrag are in the holes
of the returned mapping information. This would be much faster
than an entire linear mapping to find all the regions with small
extents that need defrag. The second step could then be doing a
fine grained mapping of each region that we now know either contains
fragmented data or holes....
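
An untested sketch of that first pass, with the minimum-length filtering
done in userspace while walking the returned extents (NR and min_len are
placeholders):

	/* needs <linux/fs.h> and <linux/fiemap.h> */
	struct fiemap	*fm;
	unsigned int	i;

	fm = calloc(1, sizeof(*fm) + NR * sizeof(struct fiemap_extent));
	fm->fm_start = 0;
	fm->fm_length = FIEMAP_MAX_OFFSET;
	fm->fm_flags = FIEMAP_FLAG_SYNC;
	fm->fm_extent_count = NR;

	/* one batch shown; loop until FIEMAP_EXTENT_LAST in real code */
	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
		return -errno;

	for (i = 0; i < fm->fm_mapped_extents; i++) {
		struct fiemap_extent *fe = &fm->fm_extents[i];

		if (fe->fe_flags & FIEMAP_EXTENT_UNWRITTEN)
			continue;
		if (fe->fe_length >= min_len)
			continue;	/* long enough, not a defrag target */
		/* fe_logical/fe_length describe a region needing defrag */
	}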

-Dave.
Wengang Wang July 16, 2024, 8:23 p.m. UTC | #4
> On Jul 15, 2024, at 4:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Tue, Jul 09, 2024 at 12:10:21PM -0700, Wengang Wang wrote:
>> segments are the smallest unit to defragment.
>> 
>> A segment
>> 1. Can't exceed size limit
>> 2. contains some extents
>> 3. the contained extents can't be "unwritten"
>> 4. the contained extents must be contigous in file blocks
>> 
>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
>> ---
>> spaceman/defrag.c | 204 ++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 204 insertions(+)
>> 
>> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
>> index c9732984..175cf461 100644
>> --- a/spaceman/defrag.c
>> +++ b/spaceman/defrag.c
>> @@ -14,6 +14,32 @@
>> #include "space.h"
>> #include "input.h"
>> 
>> +#define MAPSIZE 512
>> +/* used to fetch bmap */
>> +struct getbmapx g_mapx[MAPSIZE];
>> +/* current offset of the file in units of 512 bytes, used to fetch bmap */
>> +static long long  g_offset = 0;
>> +/* index to indentify next extent, used to get next extent */
>> +static int g_ext_next_idx = -1;
> 
> Please do not prefix global variables with "g_". This is not
> useful, and simply makes the code hard to read.
> 
> That said, it is much better to pass these as function parameters so
> they are specific to the mapping context and so are inherently
> thread safe.
> 

OK, I will try to move them to function parameters, though I do see global
variables used elsewhere in xfsprogs.

>> +/*
>> + * segment, the smallest unit to defrag
>> + * it includes some contiguous extents.
>> + * no holes included,
>> + * no unwritten extents included
>> + * the size is limited by g_segment_size_lmt
>> + */
> 
> I have no idea what this comment is trying to tell me.
OK.

> 
>> +struct defrag_segment {
>> + /* segment offset in units of 512 bytes */
>> + long long ds_offset;
>> + /* length of segment in units of 512 bytes */
>> + long long ds_length;
>> + /* number of extents in this segment */
>> + int ds_nr;
>> + /* flag indicating if segment contains shared blocks */
>> + bool ds_shared;
>> +};
>> +
>> /* defrag segment size limit in units of 512 bytes */
>> #define MIN_SEGMENT_SIZE_LIMIT 8192 /* 4MiB */
>> #define DEFAULT_SEGMENT_SIZE_LIMIT 32768 /* 16MiB */
>> @@ -78,6 +104,165 @@ defrag_check_file(char *path)
>> return true;
>> }
>> 
>> +/*
>> + * get next extent in the file.
>> + * Note: next call will get the same extent unless move_next_extent() is called.
>> + * returns:
>> + * -1: error happened.
>> + * 0: extent returned
>> + * 1: no more extent left
>> + */
>> +static int
>> +defrag_get_next_extent(int fd, struct getbmapx *map_out)
>> +{
>> + int err = 0, i;
>> +
>> + /* when no extents are cached in g_mapx, fetch from kernel */
>> + if (g_ext_next_idx == -1) {
>> + g_mapx[0].bmv_offset = g_offset;
>> + g_mapx[0].bmv_length = -1LL;
>> + g_mapx[0].bmv_count = MAPSIZE;
>> + g_mapx[0].bmv_iflags = BMV_IF_NO_HOLES | BMV_IF_PREALLOC;
>> + err = ioctl(fd, XFS_IOC_GETBMAPX, g_mapx);
>> + if (err == -1) {
>> + perror("XFS_IOC_GETBMAPX failed");
>> + goto out;
>> + }
>> + /* for stats */
>> + g_ext_stats.nr_ext_total += g_mapx[0].bmv_entries;
>> +
>> + /* no more extents */
>> + if (g_mapx[0].bmv_entries == 0) {
>> + err = 1;
>> + goto out;
>> + }
>> +
>> + /* for stats */
>> + for (i = 1; i <= g_mapx[0].bmv_entries; i++) {
>> + if (g_mapx[i].bmv_oflags & BMV_OF_PREALLOC)
>> + g_ext_stats.nr_ext_unwritten++;
>> + if (g_mapx[i].bmv_oflags & BMV_OF_SHARED)
>> + g_ext_stats.nr_ext_shared++;
>> + }
>> +
>> + g_ext_next_idx = 1;
>> + g_offset = g_mapx[g_mapx[0].bmv_entries].bmv_offset +
>> + g_mapx[g_mapx[0].bmv_entries].bmv_length;
>> + }
>> +
>> + map_out->bmv_offset = g_mapx[g_ext_next_idx].bmv_offset;
>> + map_out->bmv_length = g_mapx[g_ext_next_idx].bmv_length;
>> + map_out->bmv_oflags = g_mapx[g_ext_next_idx].bmv_oflags;
>> +out:
>> + return err;
>> +}
> 
> Ok, so the global variables are just a bmap cache. That's a problem,
> because this cache is stale the moment XFS_IOC_GETBMAPX returns to
> userspace. Iterating it to decide exactly waht to do next will
> race with ongoing file modifications and so it's not going to be
> accurate....

Yes, there is a race.
But even if we defrag based on stale extents, there is no harm to the file under
defrag, though it might not produce a good defrag result.


> 
>> +
>> +/*
>> + * move to next extent
>> + */
>> +static void
>> +defrag_move_next_extent()
>> +{
>> + if (g_ext_next_idx == g_mapx[0].bmv_entries)
>> + g_ext_next_idx = -1;
>> + else
>> + g_ext_next_idx += 1;
>> +}
>> +
>> +/*
>> + * check if the given extent is a defrag target.
>> + * no need to check for holes as we are using BMV_IF_NO_HOLES
>> + */
>> +static bool
>> +defrag_is_target(struct getbmapx *mapx)
>> +{
>> + /* unwritten */
>> + if (mapx->bmv_oflags & BMV_OF_PREALLOC)
>> + return false;
>> + return mapx->bmv_length < g_segment_size_lmt;
>> +}
>> +
>> +static bool
>> +defrag_is_extent_shared(struct getbmapx *mapx)
>> +{
>> + return !!(mapx->bmv_oflags & BMV_OF_SHARED);
>> +}
>> +
>> +/*
>> + * get next segment to defragment.
>> + * returns:
>> + * -1 error happened.
>> + * 0 segment returned.
>> + * 1 no more segments to return
>> + */
>> +static int
>> +defrag_get_next_segment(int fd, struct defrag_segment *out)
>> +{
>> + struct getbmapx mapx;
>> + int ret;
>> +
>> + out->ds_offset = 0;
>> + out->ds_length = 0;
>> + out->ds_nr = 0;
>> + out->ds_shared = false;
> 
> out->ds_nr is never set to anything but zero in this patch.
> 

It’s set at line 211 in the raw patch.

206 +add_ext:
207 +               if (defrag_is_extent_shared(&mapx))
208 +                       out->ds_shared = true;
209 +
210 +               out->ds_length += mapx.bmv_length;
211 +               out->ds_nr += 1;
212 +               defrag_move_next_extent();


>> +
>> + do {
>> + ret = defrag_get_next_extent(fd, &mapx);
>> + if (ret != 0) {
>> + /*
>> +  * no more extetns, return current segment if its not
> 
> Typos everywhere.

OK.

> 
>> +  * empty
>> + */
>> + if (ret == 1 && out->ds_nr > 0)
>> + ret = 0;
>> + /* otherwise, error heppened, stop */
>> + break;
>> + }
> 
>> +
>> + /*
>> +  * If the extent is not a defrag target, skip it.
>> +  * go to next extent if the segment is empty;
>> +  * otherwise return the segment.
>> +  */
>> + if (!defrag_is_target(&mapx)) {
>> + defrag_move_next_extent();
>> + if (out->ds_nr == 0)
>> + continue;
>> + else
>> + break;
>> + }
>> +
>> + /* check for segment size limitation */
>> + if (out->ds_length + mapx.bmv_length > g_segment_size_lmt)
>> + break;
>> +
>> + /* the segment is empty now, add this extent to it for sure */
>> + if (out->ds_nr == 0) {
>> + out->ds_offset = mapx.bmv_offset;
>> + goto add_ext;
>> + }
> 
> So this is essentially a filter for the getbmapx output that strips
> away unwritten extents and anything outside/larger than the target
> range.

yes.

> 
>> +
>> + /*
>> +  * the segment is not empty, check for hole since the last exent
>> +  * if a hole exist before this extent, this extent can't be
>> +  * added to the segment. return the segment
>> +  */
>> + if (out->ds_offset + out->ds_length != mapx.bmv_offset)
>> + break;
>> +
>> +add_ext:
> 
> Why do you need a goto for this logic?
> 
> /*
>  * the segment is not empty, check for hole since the last exent
>  * if a hole exist before this extent, this extent can't be
>  * added to the segment. return the segment
>  */
> if (out->ds_nr) {
> if (out->ds_offset + out->ds_length != mapx.bmv_offset)
> break;
> } else {
> out->ds_offset = mapx.bmv_offset;
> }
> 

The above code also works.
Using "goto add_ext" saves an "if" inside an "if", which makes the code clearer to me.
But I can change it as you suggested.

>> + if (defrag_is_extent_shared(&mapx))
>> + out->ds_shared = true;
>> +
>> + out->ds_length += mapx.bmv_length;
>> + out->ds_nr += 1;
>> + defrag_move_next_extent();
>> +
>> + } while (true);
> 
>> +
>> + return ret;
>> +}
>> +
>> /*
>>  * defragment a file
>>  * return 0 if successfully done, 1 otherwise
>> @@ -92,6 +277,9 @@ defrag_xfs_defrag(char *file_path) {
>> struct fsxattr fsx;
>> int ret = 0;
>> 
>> + g_offset = 0;
>> + g_ext_next_idx = -1;
>> +
>> fsx.fsx_nextents = 0;
>> memset(&g_ext_stats, 0, sizeof(g_ext_stats));
>> 
>> @@ -119,6 +307,22 @@ defrag_xfs_defrag(char *file_path) {
>> ret = 1;
>> goto out;
>> }
>> +
>> + do {
>> + struct defrag_segment segment;
>> +
>> + ret = defrag_get_next_segment(defrag_fd, &segment);
>> + /* no more segments, we are done */
>> + if (ret == 1) {
>> + ret = 0;
>> + break;
>> + }
>> + /* error happened when reading bmap, stop here */
>> + if (ret == -1) {
>> + ret = 1;
>> + break;
>> + }
> 
> ternary return values are nasty. Return a negative errno when a
> an error occurs, and -ENODATA when there are no more segments.
> Then you have
> 
> if (ret < 0) {
> if (ret == -ENODATA)
> exit_value = 0;
> else
> exit_value = 1;
> break;
> }

Agreed.
I have modified defrag_get_next_segment() and defrag_get_next_extent() for V2,
making them return -1 for error and 0 for success. The "no more extents/segments"
case is detected by checking the returned length. I will send them out.

> 
>> + } while (true);
> 
> Not a fan of do {} while(true) loops.
> 
> WIth the above error handling changes, this becomes:
> 
> do {
> struct defrag_segment segment;
> 
> ret = defrag_get_next_segment(defrag_fd, &segment);
> } while (ret == 0);
> 
> if (ret == 0 || ret == -ENODATA)
> exit_value = 0;
> else
> exit_value = 1;
> 

Yes, I am making "big" changes to defrag_get_next_segment()/defrag_get_next_extent().
Please review them when I send them out.

> 
> Ok, so this is a linear iteration of all extents in the file that
> filters extents for the specific "segment" that is going to be
> processed. I still have no idea why fixed length segments are
> important, but "linear extent scan for filtering" seems somewhat
> expensive.

Hm… they are not actually fixed-length segments; rather, a segment's size can't
exceed the limit, i.e. segment.ds_length <= LIMIT.
Larger segments take longer (with the file locked) to defrag. The segment size limit
is a way to balance the defrag against the latency of parallel IO.


> 
> Indeed, if you used FIEMAP, you can pass a minimum
> segment length to filter out all the small extents. Iterating that
> extent list means all the ranges you need to defrag are in the holes
> of the returned mapping information. This would be much faster
> than an entire linear mapping to find all the regions with small
> extents that need defrag. The second step could then be doing a
> fine grained mapping of each region that we now know either contains
> fragmented data or holes....

Hm… just a question here:
With your approach, say you set the filter length to 2048, so all extents of 2048 or fewer blocks are to be defragmented.
What if the extent layout is like this:

1.    1
2.    2049
3.    2
4.    2050
5.    1
6.    2051

In the above case, do you defrag or not?

As I understand the situation, the performance of defrag itself is not a critical concern here.

Thanks,
Wengang
Dave Chinner July 17, 2024, 4:11 a.m. UTC | #5
On Tue, Jul 16, 2024 at 08:23:35PM +0000, Wengang Wang wrote:
> > Ok, so this is a linear iteration of all extents in the file that
> > filters extents for the specific "segment" that is going to be
> > processed. I still have no idea why fixed length segments are
> > important, but "linear extent scan for filtering" seems somewhat
> > expensive.
> 
> Hm… fixed length segments — actually not fixed length segments, but segment
> size can’t exceed the limitation.  So segment.ds_length <=  LIMIT.

Which is effectively fixed length segments....

> Larger segment take longer time (with filed locked) to defrag. The
> segment size limit is a way to balance the defrag and the parallel
> IO latency.

Yes, I know why you've done it. These were the same arguments made a
while back for a new way of cloning files on XFS. We solved those
problems just with a small change to the locking, and didn't need
new ioctls or lots of new code just to solve the "clone blocks
concurrent IO" problem.

I'm looking at this from exactly the same POV. The code presented is
doing lots of complex, unusable stuff to work around the fact that
UNSHARE blocks concurrent IO. I don't see any difference between
CLONE and UNSHARE from the IO perspective - if anything UNSHARE can
have looser rules than CLONE, because a concurrent write will either
do the COW of a shared block itself, or hit the exclusive block that
has already been unshared.

So if we fix these locking issues in the kernel, then the whole need
for working around the IO concurrency problems with UNSHARE goes
away and the userspace code becomes much, much simpler.

> > Indeed, if you used FIEMAP, you can pass a minimum
> > segment length to filter out all the small extents. Iterating that
> > extent list means all the ranges you need to defrag are in the holes
> > of the returned mapping information. This would be much faster
> > than an entire linear mapping to find all the regions with small
> > extents that need defrag. The second step could then be doing a
> > fine grained mapping of each region that we now know either contains
> > fragmented data or holes....
> 
> Hm… just a question here:
> As your way, say you set the filter length to 2048, all extents with 2048 or less blocks are to defragmented.
> What if the extent layout is like this:
> 
> 1.    1
> 2.    2049
> 3.    2
> 4.    2050
> 5.    1
> 6.    2051
> 
> In above case, do you do defrag or not?

The filtering presented in the patch above will not defrag any of
this with a 2048 block segment size, because the second extent in
each segment extends beyond the configured max segment length. IOWs,
it ends up with a single extent per "2048 block segment", and that
won't get defragged with the current algorithm.

As it is, this really isn't a common fragmentation pattern for a
file that does not contain shared extents, so I wouldn't expect to
ever need to decide if this needs to be defragged or not.

However, it is exactly the layout I would expect to see for cloned
and modified filesystem image files.  That is, the common layout for
such a "cloned from golden image" VM image is this:

1.    1		written
2.    2049	shared
3.    2		written
4.    2050	shared
5.    1		written
6.    2051	shared

i.e. there are large chunks of contiguous shared extents between the
small individual COW block modifications that have been made to
customise the image for the deployed VM.

Either way, if the segment/filter length is 2048 blocks, then this
isn't a pattern that should be defragmented. If the segment/filter
length is 4096 or larger, then yes, this pattern should definitely
be defragmented.

> As I understand the situation, performance of defrag it’s self is
> not a critical concern here.

Sure, but implementing a low performing, high CPU consumption,
entirely single threaded defragmentation model that requires
specific tuning in every different environment it is run in doesn't
seem like the best idea to me.

I'm trying to work out if there is a faster, simpler way of
achieving the same goal....

Cheers,

Dave.
Wengang Wang July 18, 2024, 7:03 p.m. UTC | #6
> On Jul 16, 2024, at 9:11 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Tue, Jul 16, 2024 at 08:23:35PM +0000, Wengang Wang wrote:
>>> Ok, so this is a linear iteration of all extents in the file that
>>> filters extents for the specific "segment" that is going to be
>>> processed. I still have no idea why fixed length segments are
>>> important, but "linear extent scan for filtering" seems somewhat
>>> expensive.
>> 
>> Hm… fixed length segments — actually not fixed length segments, but segment
>> size can’t exceed the limitation.  So segment.ds_length <=  LIMIT.
> 
> Which is effectively fixed length segments....
> 
>> Larger segment take longer time (with filed locked) to defrag. The
>> segment size limit is a way to balance the defrag and the parallel
>> IO latency.
> 
> Yes, I know why you've done it. These were the same arguments made a
> while back for a new way of cloning files on XFS. We solved those
> problems just with a small change to the locking, and didn't need
> new ioctls or lots of new code just to solve the "clone blocks
> concurrent IO" problem.

I didn't check the code history, but I am thinking you solved the problem
by allowing reads to proceed while cloning is in progress? Correct me if I'm wrong.
The problem we hit is a (heartbeat) write timeout.

> 
> I'm looking at this from exactly the same POV. The code presented is
> doing lots of complex, unusable stuff to work around the fact that
> UNSHARE blocks concurrent IO. I don't see any difference between
> CLONE and UNSHARE from the IO perspective - if anything UNSHARE can
> have looser rules than CLONE, because a concurrent write will either
> do the COW of a shared block itself, or hit the exclusive block that
> has already been unshared.
> 
> So if we fix these locking issues in the kernel, then the whole need
> for working around the IO concurrency problems with UNSHARE goes
> away and the userspace code becomes much, much simpler.
> 
>>> Indeed, if you used FIEMAP, you can pass a minimum
>>> segment length to filter out all the small extents. Iterating that
>>> extent list means all the ranges you need to defrag are in the holes
>>> of the returned mapping information. This would be much faster
>>> than an entire linear mapping to find all the regions with small
>>> extents that need defrag. The second step could then be doing a
>>> fine grained mapping of each region that we now know either contains
>>> fragmented data or holes....
>> 
>> Hm… just a question here:
>> As your way, say you set the filter length to 2048, all extents with 2048 or less blocks are to defragmented.
>> What if the extent layout is like this:
>> 
>> 1.    1
>> 2.    2049
>> 3.    2
>> 4.    2050
>> 5.    1
>> 6.    2051
>> 
>> In above case, do you do defrag or not?
> 
> The filtering presenting in the patch above will not defrag any of
> this with a 2048 block segment side, because the second extent in
> each segment extend beyond the configured max segment length. IOWs,
> it ends up with a single extent per "2048 block segment", and that
> won't get defragged with the current algorithm.
> 
> As it is, this really isn't a common fragmentation pattern for a
> file that does not contain shared extents, so I wouldn't expect to
> ever need to decide if this needs to be defragged or not.
> 
> However, it is exactly the layout I would expect to see for cloned
> and modified filesystem image files.  That is, the common layout for
> such a "cloned from golden image" Vm images is this:
> 
> 1.    1 written
> 2.    2049 shared
> 3.    2 written
> 4.    2050 shared
> 5.    1 written
> 6.    2051 shared
> 
> i.e. there are large chunks of contiguous shared extents between the
> small individual COW block modifications that have been made to
> customise the image for the deployed VM.
> 
> Either way, if the segment/filter length is 2048 blocks, then this
> isn't a pattern that should be defragmented. If the segment/filter
> length is 4096 or larger, then yes, this pattern should definitely
> be defragmented.

Yes, true. We should focus on real-world layouts.

> 
>> As I understand the situation, performance of defrag it’s self is
>> not a critical concern here.
> 
> Sure, but implementing a low performing, high CPU consumption,
> entirely single threaded defragmentation model that requires
> specific tuning in every different environment it is run in doesn't
> seem like the best idea to me.
> 
> I'm trying to work out if there is a faster, simpler way of
> achieving the same goal....
> 

Great!

Wengang
Christoph Hellwig July 19, 2024, 4:01 a.m. UTC | #7
On Wed, Jul 17, 2024 at 02:11:16PM +1000, Dave Chinner wrote:
> Yes, I know why you've done it. These were the same arguments made a
> while back for a new way of cloning files on XFS. We solved those
> problems just with a small change to the locking, and didn't need
> new ioctls or lots of new code just to solve the "clone blocks
> concurrent IO" problem.
> 
> I'm looking at this from exactly the same POV. The code presented is
> doing lots of complex, unusable stuff to work around the fact that
> UNSHARE blocks concurrent IO. I don't see any difference between
> CLONE and UNSHARE from the IO perspective - if anything UNSHARE can
> have looser rules than CLONE, because a concurrent write will either
> do the COW of a shared block itself, or hit the exclusive block that
> has already been unshared.
> 
> So if we fix these locking issues in the kernel, then the whole need
> for working around the IO concurrency problems with UNSHARE goes
> away and the userspace code becomes much, much simpler.

Btw, the main problem with unshare isn't just locking, but that it is
extremely inefficient.  It synchronously reads one block at a time,
which makes it very, very slow.  That's purely a kernel implementation
detail, of course.
Dave Chinner July 19, 2024, 4:59 a.m. UTC | #8
On Thu, Jul 18, 2024 at 07:03:40PM +0000, Wengang Wang wrote:
> 
> 
> > On Jul 16, 2024, at 9:11 PM, Dave Chinner <david@fromorbit.com> wrote:
> > 
> > On Tue, Jul 16, 2024 at 08:23:35PM +0000, Wengang Wang wrote:
> >>> Ok, so this is a linear iteration of all extents in the file that
> >>> filters extents for the specific "segment" that is going to be
> >>> processed. I still have no idea why fixed length segments are
> >>> important, but "linear extent scan for filtering" seems somewhat
> >>> expensive.
> >> 
> >> Hm… fixed length segments — actually not fixed length segments, but segment
> >> size can’t exceed the limitation.  So segment.ds_length <=  LIMIT.
> > 
> > Which is effectively fixed length segments....
> > 
> >> Larger segment take longer time (with filed locked) to defrag. The
> >> segment size limit is a way to balance the defrag and the parallel
> >> IO latency.
> > 
> > Yes, I know why you've done it. These were the same arguments made a
> > while back for a new way of cloning files on XFS. We solved those
> > problems just with a small change to the locking, and didn't need
> > new ioctls or lots of new code just to solve the "clone blocks
> > concurrent IO" problem.
> 
> I didn’t check the code history, but I am thinking you solved the problem
> by allow reads to go while cloning is in progress? Correct me if I'm wrong.
> The problem we hit is (heart beat) write timeout.  

The reason this worked (allowing shared reads through and not
writes) was that the VM infrastructure this was being done for uses
a sidecar write channel to redirect writes while a clone is being
done. i.e. writes are not blocked by the clone in progress because
they are being done to a different file.

When the clone completes, those writes are folded back into the
original image file; e.g. see `qemu-img commit -b <backing file>
<file with delta writes>`, which will fold writes made to a sidecar write
file back into the original backing file that was just cloned....

What I'm suggesting is that when you run a backing file
defragmentation, you use the same sidecar write setup as cloning
whilst the defrag is done. Reads go straight through to the backing
file, and writes get written to a delta write file. When the defrag
is done the delta write file gets folded back into the backing file.

But for this to work, UNSHARE needs to use shared read locking so
that read IO can be directed through the file at the same time as
the UNSHARE is running. If this works for CLONE to avoid read and
write blocking whilst the operation is in progress, the same
mechanism should be able to be used for UNSHARE, too. At this point
defrag using CLONE+UNSHARE shouldn't ever block read IO and
shouldn't block write IO for any significant period of time,
either...

-Dave.
Wengang Wang July 24, 2024, 7:22 p.m. UTC | #9
> On Jul 16, 2024, at 9:11 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Tue, Jul 16, 2024 at 08:23:35PM +0000, Wengang Wang wrote:
>>> Ok, so this is a linear iteration of all extents in the file that
>>> filters extents for the specific "segment" that is going to be
>>> processed. I still have no idea why fixed length segments are
>>> important, but "linear extent scan for filtering" seems somewhat
>>> expensive.
>> 
>> Hm… fixed length segments — actually not fixed length segments, but segment
>> size can’t exceed the limitation.  So segment.ds_length <=  LIMIT.
> 
> Which is effectively fixed length segments....
> 
>> Larger segment take longer time (with filed locked) to defrag. The
>> segment size limit is a way to balance the defrag and the parallel
>> IO latency.
> 
> Yes, I know why you've done it. These were the same arguments made a
> while back for a new way of cloning files on XFS. We solved those
> problems just with a small change to the locking, and didn't need
> new ioctls or lots of new code just to solve the "clone blocks
> concurrent IO" problem.
> 
> I'm looking at this from exactly the same POV. The code presented is
> doing lots of complex, unusable stuff to work around the fact that
> UNSHARE blocks concurrent IO. I don't see any difference between
> CLONE and UNSHARE from the IO perspective - if anything UNSHARE can
> have looser rules than CLONE, because a concurrent write will either
> do the COW of a shared block itself, or hit the exclusive block that
> has already been unshared.
> 
> So if we fix these locking issues in the kernel, then the whole need
> for working around the IO concurrency problems with UNSHARE goes
> away and the userspace code becomes much, much simpler.
> 
>>> Indeed, if you used FIEMAP, you can pass a minimum
>>> segment length to filter out all the small extents. Iterating that
>>> extent list means all the ranges you need to defrag are in the holes
>>> of the returned mapping information. This would be much faster
>>> than an entire linear mapping to find all the regions with small
>>> extents that need defrag. The second step could then be doing a
>>> fine grained mapping of each region that we now know either contains
>>> fragmented data or holes....


Where can we pass a minimum segment length to filter out things?
I don’t see that in the fiemap structure:

A fiemap request is encoded within struct fiemap:

	struct fiemap {
		__u64	fm_start;	/* logical offset (inclusive) at
					 * which to start mapping (in) */
		__u64	fm_length;	/* logical length of mapping which
					 * userspace cares about (in) */
		__u32	fm_flags;	/* FIEMAP_FLAG_* flags for request (in/out) */
		__u32	fm_mapped_extents; /* number of extents that were
					    * mapped (out) */
		__u32	fm_extent_count; /* size of fm_extents array (in) */
		__u32	fm_reserved;
		struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */
	};

Thanks,
Wengang

>> 
>> Hm… just a question here:
>> As your way, say you set the filter length to 2048, all extents with 2048 or less blocks are to defragmented.
>> What if the extent layout is like this:
>> 
>> 1.    1
>> 2.    2049
>> 3.    2
>> 4.    2050
>> 5.    1
>> 6.    2051
>> 
>> In above case, do you do defrag or not?
> 
> The filtering presenting in the patch above will not defrag any of
> this with a 2048 block segment side, because the second extent in
> each segment extend beyond the configured max segment length. IOWs,
> it ends up with a single extent per "2048 block segment", and that
> won't get defragged with the current algorithm.
> 
> As it is, this really isn't a common fragmentation pattern for a
> file that does not contain shared extents, so I wouldn't expect to
> ever need to decide if this needs to be defragged or not.
> 
> However, it is exactly the layout I would expect to see for cloned
> and modified filesystem image files.  That is, the common layout for
> such a "cloned from golden image" Vm images is this:
> 
> 1.    1 written
> 2.    2049 shared
> 3.    2 written
> 4.    2050 shared
> 5.    1 written
> 6.    2051 shared
> 
> i.e. there are large chunks of contiguous shared extents between the
> small individual COW block modifications that have been made to
> customise the image for the deployed VM.
> 
> Either way, if the segment/filter length is 2048 blocks, then this
> isn't a pattern that should be defragmented. If the segment/filter
> length is 4096 or larger, then yes, this pattern should definitely
> be defragmented.
> 
>> As I understand the situation, the performance of defrag itself is
>> not a critical concern here.
> 
> Sure, but implementing a low-performing, high-CPU-consumption,
> entirely single-threaded defragmentation model that requires
> specific tuning in every environment it runs in doesn't seem
> like the best idea to me.
> 
> I'm trying to work out if there is a faster, simpler way of
> achieving the same goal....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
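A minimal sketch of the two-pass scan Dave describes above, with the
minimum-length filtering done in userspace (coarse_scan() and the
single-ioctl simplification are illustrative; a real scan would loop
until it sees FIEMAP_EXTENT_LAST):

#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FS_IOC_FIEMAP */
#include <linux/fiemap.h>

#define NR_EXTENTS	512

/*
 * Coarse pass: map the file and keep only extents of at least min_len
 * bytes.  The gaps between the kept extents are the regions that still
 * contain small extents (or holes) and deserve a fine-grained second
 * mapping pass.
 */
static int
coarse_scan(int fd, __u64 min_len)
{
	struct fiemap	*fm;
	__u64		last_end = 0;
	__u32		i;

	fm = calloc(1, sizeof(*fm) +
			NR_EXTENTS * sizeof(struct fiemap_extent));
	if (!fm)
		return -1;

	fm->fm_start = 0;
	fm->fm_length = FIEMAP_MAX_OFFSET;
	fm->fm_flags = FIEMAP_FLAG_SYNC;
	fm->fm_extent_count = NR_EXTENTS;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		perror("FS_IOC_FIEMAP");
		free(fm);
		return -1;
	}

	for (i = 0; i < fm->fm_mapped_extents; i++) {
		struct fiemap_extent *fe = &fm->fm_extents[i];

		/* the length filter has to live in userspace */
		if (fe->fe_length < min_len)
			continue;

		/* the gap before this large extent needs a closer look */
		if (fe->fe_logical > last_end)
			printf("candidate: offset %llu, length %llu\n",
			       (unsigned long long)last_end,
			       (unsigned long long)(fe->fe_logical - last_end));
		last_end = fe->fe_logical + fe->fe_length;
	}
	free(fm);
	return 0;
}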
Dave Chinner July 30, 2024, 10:13 p.m. UTC | #10
On Wed, Jul 24, 2024 at 07:22:25PM +0000, Wengang Wang wrote:
> > On Jul 16, 2024, at 9:11 PM, Dave Chinner <david@fromorbit.com> wrote:
> >>> Indeed, if you used FIEMAP, you can pass a minimum
> >>> segment length to filter out all the small extents. Iterating that
> >>> extent list means all the ranges you need to defrag are in the holes
> >>> of the returned mapping information. This would be much faster
> >>> than an entire linear mapping to find all the regions with small
> >>> extents that need defrag. The second step could then be doing a
> >>> fine grained mapping of each region that we now know either contains
> >>> fragmented data or holes....
> 
> 
> Where can we pass a minimum segment length to filter out things?
> I don’t see that in the fiemap structure:

Oh, sorry, too many similar APIs - it's FITRIM that has a minimum
length filter built into it. It's something we could add to FIEMAP
if we really need it.

-Dave.
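
For reference, the minimum-length filter Dave mentions is the minlen
field of FITRIM's struct fstrim_range. FITRIM filters free-space
ranges for discard rather than file mappings, but it shows the shape
such a filter takes. A minimal sketch (the mount point argument and
the 1MiB threshold are illustrative):

#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FITRIM, struct fstrim_range */

int
main(int argc, char **argv)
{
	struct fstrim_range	range;
	int			fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	memset(&range, 0, sizeof(range));
	range.len = ULLONG_MAX;		/* whole filesystem */
	range.minlen = 1024 * 1024;	/* skip free ranges under 1MiB */

	if (ioctl(fd, FITRIM, &range) < 0) {
		perror("FITRIM");
		close(fd);
		return 1;
	}

	/* on return, len holds the number of bytes actually trimmed */
	printf("trimmed %llu bytes\n", (unsigned long long)range.len);
	close(fd);
	return 0;
}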
diff mbox series

Patch

diff --git a/spaceman/defrag.c b/spaceman/defrag.c
index c9732984..175cf461 100644
--- a/spaceman/defrag.c
+++ b/spaceman/defrag.c
@@ -14,6 +14,32 @@ 
 #include "space.h"
 #include "input.h"
 
+#define MAPSIZE 512
+/* used to fetch bmap */
+struct getbmapx	g_mapx[MAPSIZE];
+/* current offset of the file in units of 512 bytes, used to fetch bmap */
+static long long 	g_offset = 0;
+/* index to identify the next extent, used to get the next extent */
+static int		g_ext_next_idx = -1;
+
+/*
+ * A segment is the smallest unit to defrag.
+ * It contains some contiguous extents:
+ * no holes included,
+ * no unwritten extents included.
+ * Its size is limited by g_segment_size_lmt.
+ */
+struct defrag_segment {
+	/* segment offset in units of 512 bytes */
+	long long	ds_offset;
+	/* length of segment in units of 512 bytes */
+	long long	ds_length;
+	/* number of extents in this segment */
+	int		ds_nr;
+	/* flag indicating if segment contains shared blocks */
+	bool		ds_shared;
+};
+
 /* defrag segment size limit in units of 512 bytes */
 #define MIN_SEGMENT_SIZE_LIMIT 8192 /* 4MiB */
 #define DEFAULT_SEGMENT_SIZE_LIMIT 32768 /* 16MiB */
@@ -78,6 +104,165 @@  defrag_check_file(char *path)
 	return true;
 }
 
+/*
+ * get next extent in the file.
+ * Note: the next call will get the same extent unless defrag_move_next_extent() is called.
+ * returns:
+ * -1:	error happened.
+ * 0:	extent returned
+ * 1:	no more extents left
+ */
+static int
+defrag_get_next_extent(int fd, struct getbmapx *map_out)
+{
+	int err = 0, i;
+
+	/* when no extents are cached in g_mapx, fetch from kernel */
+	if (g_ext_next_idx == -1) {
+		g_mapx[0].bmv_offset = g_offset;
+		g_mapx[0].bmv_length = -1LL;
+		g_mapx[0].bmv_count = MAPSIZE;
+		g_mapx[0].bmv_iflags = BMV_IF_NO_HOLES | BMV_IF_PREALLOC;
+		err = ioctl(fd, XFS_IOC_GETBMAPX, g_mapx);
+		if (err == -1) {
+			perror("XFS_IOC_GETBMAPX failed");
+			goto out;
+		}
+		/* for stats */
+		g_ext_stats.nr_ext_total += g_mapx[0].bmv_entries;
+
+		/* no more extents */
+		if (g_mapx[0].bmv_entries == 0) {
+			err = 1;
+			goto out;
+		}
+
+		/* for stats */
+		for (i = 1; i <= g_mapx[0].bmv_entries; i++) {
+			if (g_mapx[i].bmv_oflags & BMV_OF_PREALLOC)
+				g_ext_stats.nr_ext_unwritten++;
+			if (g_mapx[i].bmv_oflags & BMV_OF_SHARED)
+				g_ext_stats.nr_ext_shared++;
+		}
+
+		g_ext_next_idx = 1;
+		g_offset = g_mapx[g_mapx[0].bmv_entries].bmv_offset +
+				g_mapx[g_mapx[0].bmv_entries].bmv_length;
+	}
+
+	map_out->bmv_offset = g_mapx[g_ext_next_idx].bmv_offset;
+	map_out->bmv_length = g_mapx[g_ext_next_idx].bmv_length;
+	map_out->bmv_oflags = g_mapx[g_ext_next_idx].bmv_oflags;
+out:
+	return err;
+}
+
+/*
+ * move to next extent
+ */
+static void
+defrag_move_next_extent(void)
+{
+	if (g_ext_next_idx == g_mapx[0].bmv_entries)
+		g_ext_next_idx = -1;
+	else
+		g_ext_next_idx += 1;
+}
+
+/*
+ * check if the given extent is a defrag target.
+ * no need to check for holes as we are using BMV_IF_NO_HOLES
+ */
+static bool
+defrag_is_target(struct getbmapx *mapx)
+{
+	/* unwritten */
+	if (mapx->bmv_oflags & BMV_OF_PREALLOC)
+		return false;
+	return mapx->bmv_length < g_segment_size_lmt;
+}
+
+static bool
+defrag_is_extent_shared(struct getbmapx *mapx)
+{
+	return !!(mapx->bmv_oflags & BMV_OF_SHARED);
+}
+
+/*
+ * get next segment to defragment.
+ * returns:
+ * -1	error happened.
+ * 0	segment returned.
+ * 1	no more segments to return
+ */
+static int
+defrag_get_next_segment(int fd, struct defrag_segment *out)
+{
+	struct getbmapx mapx;
+	int	ret;
+
+	out->ds_offset = 0;
+	out->ds_length = 0;
+	out->ds_nr = 0;
+	out->ds_shared = false;
+
+	do {
+		ret = defrag_get_next_extent(fd, &mapx);
+		if (ret != 0) {
+			/*
+			 * no more extents, return the current segment if it's
+			 * not empty
+			 */
+			if (ret == 1 && out->ds_nr > 0)
+				ret = 0;
+			/* otherwise, an error happened, stop */
+			break;
+		}
+
+		/*
+		 * If the extent is not a defrag target, skip it.
+		 * go to next extent if the segment is empty;
+		 * otherwise return the segment.
+		 */
+		if (!defrag_is_target(&mapx)) {
+			defrag_move_next_extent();
+			if (out->ds_nr == 0)
+				continue;
+			else
+				break;
+		}
+
+		/* check for segment size limitation */
+		if (out->ds_length + mapx.bmv_length > g_segment_size_lmt)
+			break;
+
+		/* the segment is empty now, add this extent to it for sure */
+		if (out->ds_nr == 0) {
+			out->ds_offset = mapx.bmv_offset;
+			goto add_ext;
+		}
+
+		/*
+		 * the segment is not empty, check for a hole since the last
+		 * extent. If a hole exists before this extent, it can't be
+		 * added to the segment; return the segment.
+		 */
+		if (out->ds_offset + out->ds_length != mapx.bmv_offset)
+			break;
+
+add_ext:
+		if (defrag_is_extent_shared(&mapx))
+			out->ds_shared = true;
+
+		out->ds_length += mapx.bmv_length;
+		out->ds_nr += 1;
+		defrag_move_next_extent();
+
+	} while (true);
+
+	return ret;
+}
+
 /*
  * defragment a file
  * return 0 if successfully done, 1 otherwise
@@ -92,6 +277,9 @@  defrag_xfs_defrag(char *file_path) {
 	struct fsxattr	fsx;
 	int	ret = 0;
 
+	g_offset = 0;
+	g_ext_next_idx = -1;
+
 	fsx.fsx_nextents = 0;
 	memset(&g_ext_stats, 0, sizeof(g_ext_stats));
 
@@ -119,6 +307,22 @@  defrag_xfs_defrag(char *file_path) {
 		ret = 1;
 		goto out;
 	}
+
+	do {
+		struct defrag_segment segment;
+
+		ret = defrag_get_next_segment(defrag_fd, &segment);
+		/* no more segments, we are done */
+		if (ret == 1) {
+			ret = 0;
+			break;
+		}
+		/* error happened when reading bmap, stop here */
+		if (ret == -1) {
+			ret = 1;
+			break;
+		}
+	} while (true);
 out:
 	if (scratch_fd != -1) {
 		close(scratch_fd);