
[8/9] spaceman/defrag: readahead for better performance

Message ID 20240709191028.2329-9-wen.gang.wang@oracle.com (mailing list archive)
State Accepted, archived
Series: introduce defrag to xfs_spaceman

Commit Message

Wengang Wang July 9, 2024, 7:10 p.m. UTC
Reading ahead holds the file's locks for less time than "unsharing" the file via
ioctl does. Do readahead while defrag sleeps, for better defrag performance and
thus more time left for regular file IO.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
---
 spaceman/defrag.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

Comments

Darrick J. Wong July 9, 2024, 8:27 p.m. UTC | #1
On Tue, Jul 09, 2024 at 12:10:27PM -0700, Wengang Wang wrote:
> Reading ahead take less lock on file compared to "unshare" the file via ioctl.
> Do readahead when defrag sleeps for better defrag performace and thus more
> file IO time.
> 
> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> ---
>  spaceman/defrag.c | 21 ++++++++++++++++++++-
>  1 file changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
> index 415fe9c2..ab8508bb 100644
> --- a/spaceman/defrag.c
> +++ b/spaceman/defrag.c
> @@ -331,6 +331,18 @@ defrag_fs_limit_hit(int fd)
>  }
>  
>  static bool g_enable_first_ext_share = true;
> +static bool g_readahead = false;
> +
> +static void defrag_readahead(int defrag_fd, off64_t offset, size_t count)
> +{
> +	if (!g_readahead || g_idle_time <= 0)
> +		return;
> +
> +	if (readahead(defrag_fd, offset, count) < 0) {
> +		fprintf(stderr, "readahead failed: %s, errno=%d\n",
> +			strerror(errno), errno);

Why is it worth reporting if readahead fails?  Won't the unshare also
fail?  I'm also wondering why we wouldn't want readahead all the time?

--D

> +	}
> +}
>  
>  static int
>  defrag_get_first_real_ext(int fd, struct getbmapx *mapx)
> @@ -578,6 +590,8 @@ defrag_xfs_defrag(char *file_path) {
>  
>  		/* checks for EoF and fix up clone */
>  		stop = defrag_clone_eof(&clone);
> +		defrag_readahead(defrag_fd, seg_off, seg_size);
> +
>  		if (sleep_time_us > 0)
>  			usleep(sleep_time_us);
>  
> @@ -698,6 +712,7 @@ static void defrag_help(void)
>  " -i idle_time       -- time in ms to be idle between segments, 250ms by default\n"
>  " -n                 -- disable the \"share first extent\" featue, it's\n"
>  "                       enabled by default to speed up\n"
> +" -a                 -- do readahead to speed up defrag, disabled by default\n"
>  	));
>  }
>  
> @@ -709,7 +724,7 @@ defrag_f(int argc, char **argv)
>  	int	i;
>  	int	c;
>  
> -	while ((c = getopt(argc, argv, "s:f:ni")) != EOF) {
> +	while ((c = getopt(argc, argv, "s:f:nia")) != EOF) {
>  		switch(c) {
>  		case 's':
>  			g_segment_size_lmt = atoi(optarg) * 1024 * 1024 / 512;
> @@ -731,6 +746,10 @@ defrag_f(int argc, char **argv)
>  			g_idle_time = atoi(optarg) * 1000;
>  			break;
>  
> +		case 'a':
> +			g_readahead = true;
> +			break;
> +
>  		default:
>  			command_usage(&defrag_cmd);
>  			return 1;
> -- 
> 2.39.3 (Apple Git-146)
> 
>
Wengang Wang July 11, 2024, 11:29 p.m. UTC | #2
> On Jul 9, 2024, at 1:27 PM, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> On Tue, Jul 09, 2024 at 12:10:27PM -0700, Wengang Wang wrote:
>> Reading ahead take less lock on file compared to "unshare" the file via ioctl.
>> Do readahead when defrag sleeps for better defrag performace and thus more
>> file IO time.
>> 
>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
>> ---
>> spaceman/defrag.c | 21 ++++++++++++++++++++-
>> 1 file changed, 20 insertions(+), 1 deletion(-)
>> 
>> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
>> index 415fe9c2..ab8508bb 100644
>> --- a/spaceman/defrag.c
>> +++ b/spaceman/defrag.c
>> @@ -331,6 +331,18 @@ defrag_fs_limit_hit(int fd)
>> }
>> 
>> static bool g_enable_first_ext_share = true;
>> +static bool g_readahead = false;
>> +
>> +static void defrag_readahead(int defrag_fd, off64_t offset, size_t count)
>> +{
>> + if (!g_readahead || g_idle_time <= 0)
>> + return;
>> +
>> + if (readahead(defrag_fd, offset, count) < 0) {
>> + fprintf(stderr, "readahead failed: %s, errno=%d\n",
>> + strerror(errno), errno);
> 
> Why is it worth reporting if readahead fails?  Won't the unshare also
> fail?  

Yes, if readahead fails, I think the later unshare should fail too.
I just want to capture the error as soon as possible, even though readahead is not critical.

> I'm also wondering why we wouldn't want readahead all the time?

As per our tests, readahead on NVMe didn’t perform any better.
So I’d make it an option.

Thanks,
Wengang

> 
> --D
> 
>> + }
>> +}
>> 
>> static int
>> defrag_get_first_real_ext(int fd, struct getbmapx *mapx)
>> @@ -578,6 +590,8 @@ defrag_xfs_defrag(char *file_path) {
>> 
>> /* checks for EoF and fix up clone */
>> stop = defrag_clone_eof(&clone);
>> + defrag_readahead(defrag_fd, seg_off, seg_size);
>> +
>> if (sleep_time_us > 0)
>> usleep(sleep_time_us);
>> 
>> @@ -698,6 +712,7 @@ static void defrag_help(void)
>> " -i idle_time       -- time in ms to be idle between segments, 250ms by default\n"
>> " -n                 -- disable the \"share first extent\" featue, it's\n"
>> "                       enabled by default to speed up\n"
>> +" -a                 -- do readahead to speed up defrag, disabled by default\n"
>> ));
>> }
>> 
>> @@ -709,7 +724,7 @@ defrag_f(int argc, char **argv)
>> int i;
>> int c;
>> 
>> - while ((c = getopt(argc, argv, "s:f:ni")) != EOF) {
>> + while ((c = getopt(argc, argv, "s:f:nia")) != EOF) {
>> switch(c) {
>> case 's':
>> g_segment_size_lmt = atoi(optarg) * 1024 * 1024 / 512;
>> @@ -731,6 +746,10 @@ defrag_f(int argc, char **argv)
>> g_idle_time = atoi(optarg) * 1000;
>> break;
>> 
>> + case 'a':
>> + g_readahead = true;
>> + break;
>> +
>> default:
>> command_usage(&defrag_cmd);
>> return 1;
>> -- 
>> 2.39.3 (Apple Git-146)
Dave Chinner July 16, 2024, 12:56 a.m. UTC | #3
On Tue, Jul 09, 2024 at 12:10:27PM -0700, Wengang Wang wrote:
> Reading ahead take less lock on file compared to "unshare" the file via ioctl.
> Do readahead when defrag sleeps for better defrag performace and thus more
> file IO time.
> 
> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> ---
>  spaceman/defrag.c | 21 ++++++++++++++++++++-
>  1 file changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
> index 415fe9c2..ab8508bb 100644
> --- a/spaceman/defrag.c
> +++ b/spaceman/defrag.c
> @@ -331,6 +331,18 @@ defrag_fs_limit_hit(int fd)
>  }
>  
>  static bool g_enable_first_ext_share = true;
> +static bool g_readahead = false;
> +
> +static void defrag_readahead(int defrag_fd, off64_t offset, size_t count)
> +{
> +	if (!g_readahead || g_idle_time <= 0)
> +		return;
> +
> +	if (readahead(defrag_fd, offset, count) < 0) {
> +		fprintf(stderr, "readahead failed: %s, errno=%d\n",
> +			strerror(errno), errno);

This doesn't do what you think it does. readahead() only queues the
first readahead chunk of the range given (a few pages at most). It
does not cause readahead on the entire range, wait for page cache
population, nor report IO errors that might have occurred during
readahead.

There's almost no value to making this syscall, especially if the
app is about to trigger a sequential read for the whole range.
Readahead will occur naturally during that read operation (i.e. the
UNSHARE copy), and the read will return IO errors unlike
readahead().

If you want the page cache pre-populated before the unshare
operation is done, then you need to use mmap() and
madvise(MADV_POPULATE_READ). This will read the whole region into
the page cache as if it was a sequential read, wait for it to
complete and return any IO errors that might have occurred during
the read.
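
For illustration, here is a minimal sketch of that approach. It is not part of
the patch: the helper name and its wiring into defrag.c are assumptions, a real
caller would need to round "offset" down to a page boundary (mmap() requires a
page-aligned file offset), and MADV_POPULATE_READ needs Linux 5.14 or later.

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <sys/types.h>
	#include <errno.h>
	#include <stdio.h>
	#include <string.h>

	/* Pre-populate [offset, offset + count) of fd into the page cache. */
	static int
	defrag_populate_cache(int fd, off64_t offset, size_t count)
	{
		void	*addr;

		addr = mmap(NULL, count, PROT_READ, MAP_SHARED, fd, offset);
		if (addr == MAP_FAILED) {
			fprintf(stderr, "mmap failed: %s\n", strerror(errno));
			return -1;
		}

		/*
		 * Unlike readahead(), this reads the whole range as if it were
		 * a sequential read, waits for completion and reports IO errors.
		 */
		if (madvise(addr, count, MADV_POPULATE_READ) < 0) {
			fprintf(stderr, "madvise(MADV_POPULATE_READ) failed: %s\n",
				strerror(errno));
			munmap(addr, count);
			return -1;
		}

		munmap(addr, count);
		return 0;
	}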

-Dave.
Wengang Wang July 18, 2024, 6:40 p.m. UTC | #4
> On Jul 15, 2024, at 5:56 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Tue, Jul 09, 2024 at 12:10:27PM -0700, Wengang Wang wrote:
>> Reading ahead take less lock on file compared to "unshare" the file via ioctl.
>> Do readahead when defrag sleeps for better defrag performace and thus more
>> file IO time.
>> 
>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
>> ---
>> spaceman/defrag.c | 21 ++++++++++++++++++++-
>> 1 file changed, 20 insertions(+), 1 deletion(-)
>> 
>> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
>> index 415fe9c2..ab8508bb 100644
>> --- a/spaceman/defrag.c
>> +++ b/spaceman/defrag.c
>> @@ -331,6 +331,18 @@ defrag_fs_limit_hit(int fd)
>> }
>> 
>> static bool g_enable_first_ext_share = true;
>> +static bool g_readahead = false;
>> +
>> +static void defrag_readahead(int defrag_fd, off64_t offset, size_t count)
>> +{
>> + if (!g_readahead || g_idle_time <= 0)
>> + return;
>> +
>> + if (readahead(defrag_fd, offset, count) < 0) {
>> + fprintf(stderr, "readahead failed: %s, errno=%d\n",
>> + strerror(errno), errno);
> 
> This doesn't do what you think it does. readahead() only queues the
> first readahead chunk of the range given (a few pages at most). It
> does not cause readahead on the entire range, wait for page cache
> population, nor report IO errors that might have occurred during
> readahead.

Is it a bug? As per the man page it should try to read _count_ bytes:

DESCRIPTION
       readahead() initiates readahead on a file so that subsequent reads from that file will be satisfied from the cache, and not block on disk I/O (assuming the readahead was initiated early enough and that
       other activity on the system did not in the meantime flush pages from the cache).

       The fd argument is a file descriptor identifying the file which is to be read.  The offset argument specifies the starting point from which data is to be read and count specifies the number of bytes to
       be read.  I/O is performed in whole pages, so that offset is effectively rounded down to a page boundary and bytes are read up to the next page boundary greater than or equal to (offset+count).  reada‐
       head() does not read beyond the end of the file.  The file offset of the open file description referred to by fd is left unchanged.

> 
> There's almost no value to making this syscall, especially if the
> app is about to trigger a sequential read for the whole range.
> Readahead will occur naturally during that read operation (i.e. the
> UNSHARE copy), and the read will return IO errors unlike
> readahead().
> 
> If you want the page cache pre-populated before the unshare
> operation is done, then you need to use mmap() and
> madvise(MADV_POPULATE_READ). This will read the whole region into
> the page cache as if it was a sequential read, wait for it to
> complete and return any IO errors that might have occurred during
> the read.

As you know, in the unshare path, fetching data from disk is done while the IO lock is held.
(I am wondering if we can improve that.)
The main purpose of using readahead is to spend less time holding the (IO) lock while fetching
data from disk. Can we achieve that by using mmap() and madvise()?

Thanks,
Wengang
Dave Chinner July 31, 2024, 3:10 a.m. UTC | #5
On Thu, Jul 18, 2024 at 06:40:46PM +0000, Wengang Wang wrote:
> 
> 
> > On Jul 15, 2024, at 5:56 PM, Dave Chinner <david@fromorbit.com> wrote:
> > 
> > On Tue, Jul 09, 2024 at 12:10:27PM -0700, Wengang Wang wrote:
> >> Reading ahead take less lock on file compared to "unshare" the file via ioctl.
> >> Do readahead when defrag sleeps for better defrag performace and thus more
> >> file IO time.
> >> 
> >> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> >> ---
> >> spaceman/defrag.c | 21 ++++++++++++++++++++-
> >> 1 file changed, 20 insertions(+), 1 deletion(-)
> >> 
> >> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
> >> index 415fe9c2..ab8508bb 100644
> >> --- a/spaceman/defrag.c
> >> +++ b/spaceman/defrag.c
> >> @@ -331,6 +331,18 @@ defrag_fs_limit_hit(int fd)
> >> }
> >> 
> >> static bool g_enable_first_ext_share = true;
> >> +static bool g_readahead = false;
> >> +
> >> +static void defrag_readahead(int defrag_fd, off64_t offset, size_t count)
> >> +{
> >> + if (!g_readahead || g_idle_time <= 0)
> >> + return;
> >> +
> >> + if (readahead(defrag_fd, offset, count) < 0) {
> >> + fprintf(stderr, "readahead failed: %s, errno=%d\n",
> >> + strerror(errno), errno);
> > 
> > This doesn't do what you think it does. readahead() only queues the
> > first readahead chunk of the range given (a few pages at most). It
> > does not cause readahead on the entire range, wait for page cache
> > population, nor report IO errors that might have occurred during
> > readahead.
> 
> Is it a bug?

No.

> As per the man page it should try to read _count_ bytes:

No it doesn't. It says:

> 
> DESCRIPTION
>        readahead() initiates readahead on a file
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It says it -initiates- readahead. It doesn't mean it waits for
readahead to complete or that it will readahead the whole range.
It just starts readahead.

> > There's almost no value to making this syscall, especially if the
> > app is about to trigger a sequential read for the whole range.
> > Readahead will occur naturally during that read operation (i.e. the
> > UNSHARE copy), and the read will return IO errors unlike
> > readahead().
> > 
> > If you want the page cache pre-populated before the unshare
> > operation is done, then you need to use mmap() and
> > madvise(MADV_POPULATE_READ). This will read the whole region into
> > the page cache as if it was a sequential read, wait for it to
> > complete and return any IO errors that might have occurred during
> > the read.
> 
> As you know in the unshare path, fetching data from disk is done when IO is locked.
> (I am wondering if we can improve that.)

Christoph pointed that out and some potential fixes back in the
original discussion:

https://lore.kernel.org/linux-xfs/ZXvQ0YDfHBuvLXbY@infradead.org/

> The main purpose of using readahead is that I want less (IO) lock time when fetching
> data from disk. Can we achieve that by using mmap and madvise()?

Maybe, but you're still adding complexity to userspace as a workaround
for a kernel issue we should be fixing.

-Dave.
Wengang Wang Aug. 2, 2024, 6:31 p.m. UTC | #6
> On Jul 30, 2024, at 8:10 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Thu, Jul 18, 2024 at 06:40:46PM +0000, Wengang Wang wrote:
>> 
>> 
>>> On Jul 15, 2024, at 5:56 PM, Dave Chinner <david@fromorbit.com> wrote:
>>> 
>>> On Tue, Jul 09, 2024 at 12:10:27PM -0700, Wengang Wang wrote:
>>>> Reading ahead take less lock on file compared to "unshare" the file via ioctl.
>>>> Do readahead when defrag sleeps for better defrag performace and thus more
>>>> file IO time.
>>>> 
>>>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
>>>> ---
>>>> spaceman/defrag.c | 21 ++++++++++++++++++++-
>>>> 1 file changed, 20 insertions(+), 1 deletion(-)
>>>> 
>>>> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
>>>> index 415fe9c2..ab8508bb 100644
>>>> --- a/spaceman/defrag.c
>>>> +++ b/spaceman/defrag.c
>>>> @@ -331,6 +331,18 @@ defrag_fs_limit_hit(int fd)
>>>> }
>>>> 
>>>> static bool g_enable_first_ext_share = true;
>>>> +static bool g_readahead = false;
>>>> +
>>>> +static void defrag_readahead(int defrag_fd, off64_t offset, size_t count)
>>>> +{
>>>> + if (!g_readahead || g_idle_time <= 0)
>>>> + return;
>>>> +
>>>> + if (readahead(defrag_fd, offset, count) < 0) {
>>>> + fprintf(stderr, "readahead failed: %s, errno=%d\n",
>>>> + strerror(errno), errno);
>>> 
>>> This doesn't do what you think it does. readahead() only queues the
>>> first readahead chunk of the range given (a few pages at most). It
>>> does not cause readahead on the entire range, wait for page cache
>>> population, nor report IO errors that might have occurred during
>>> readahead.
>> 
>> Is it a bug?
> 
> No.
> 
>> As per the man page it should try to read _count_ bytes:
> 
> No it doesn't. It says:
> 
>> 
>> DESCRIPTION
>>       readahead() initiates readahead on a file
>         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> It says it -initiates- readahead. It doesn't mean it waits for
> readahead to complete or that it will readahead the whole range.
> It just starts readahead.
> 

I know it’s an asynchronous operation. But given enough time, it should complete.

# without considering your idea of completing the defrag as fast as possible
In our tests with a 16MiB segment size limit, we were seeing the unshare take about
350 ms, including the disk data reading. And we are using the 250 ms default sleep/idle time.
My thinking is that during the 250 ms sleep time, a lot of the block reads should get done.


>>> There's almost no value to making this syscall, especially if the
>>> app is about to trigger a sequential read for the whole range.
>>> Readahead will occur naturally during that read operation (i.e. the
>>> UNSHARE copy), and the read will return IO errors unlike
>>> readahead().
>>> 
>>> If you want the page cache pre-populated before the unshare
>>> operation is done, then you need to use mmap() and
>>> madvise(MADV_POPULATE_READ). This will read the whole region into
>>> the page cache as if it was a sequential read, wait for it to
>>> complete and return any IO errors that might have occurred during
>>> the read.
>> 
>> As you know in the unshare path, fetching data from disk is done when IO is locked.
>> (I am wondering if we can improve that.)
> 
> Christoph pointed that out and some potential fixes back in the
> original discussion:
> 
> https://lore.kernel.org/linux-xfs/ZXvQ0YDfHBuvLXbY@infradead.org/

Yes, I read that. Thanks Christoph for that.

> 
>> The main purpose of using readahead is that I want less (IO) lock time when fetching
>> data from disk. Can we achieve that by using mmap and madvise()?
> 
> Maybe, but you're still adding complexity to userspace as a work
> around for a kernel issue we should be fixing.
> 

Yes, true. But in the case where the kernel doesn't have the related fixes and we still have to
use defrag, working around it is the way to go, right?

Thanks,
Wengang

Patch

diff --git a/spaceman/defrag.c b/spaceman/defrag.c
index 415fe9c2..ab8508bb 100644
--- a/spaceman/defrag.c
+++ b/spaceman/defrag.c
@@ -331,6 +331,18 @@  defrag_fs_limit_hit(int fd)
 }
 
 static bool g_enable_first_ext_share = true;
+static bool g_readahead = false;
+
+static void defrag_readahead(int defrag_fd, off64_t offset, size_t count)
+{
+	if (!g_readahead || g_idle_time <= 0)
+		return;
+
+	if (readahead(defrag_fd, offset, count) < 0) {
+		fprintf(stderr, "readahead failed: %s, errno=%d\n",
+			strerror(errno), errno);
+	}
+}
 
 static int
 defrag_get_first_real_ext(int fd, struct getbmapx *mapx)
@@ -578,6 +590,8 @@  defrag_xfs_defrag(char *file_path) {
 
 		/* checks for EoF and fix up clone */
 		stop = defrag_clone_eof(&clone);
+		defrag_readahead(defrag_fd, seg_off, seg_size);
+
 		if (sleep_time_us > 0)
 			usleep(sleep_time_us);
 
@@ -698,6 +712,7 @@  static void defrag_help(void)
 " -i idle_time       -- time in ms to be idle between segments, 250ms by default\n"
 " -n                 -- disable the \"share first extent\" featue, it's\n"
 "                       enabled by default to speed up\n"
+" -a                 -- do readahead to speed up defrag, disabled by default\n"
 	));
 }
 
@@ -709,7 +724,7 @@  defrag_f(int argc, char **argv)
 	int	i;
 	int	c;
 
-	while ((c = getopt(argc, argv, "s:f:ni")) != EOF) {
+	while ((c = getopt(argc, argv, "s:f:nia")) != EOF) {
 		switch(c) {
 		case 's':
 			g_segment_size_lmt = atoi(optarg) * 1024 * 1024 / 512;
@@ -731,6 +746,10 @@  defrag_f(int argc, char **argv)
 			g_idle_time = atoi(optarg) * 1000;
 			break;
 
+		case 'a':
+			g_readahead = true;
+			break;
+
 		default:
 			command_usage(&defrag_cmd);
 			return 1;
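
For reference, a hypothetical invocation with the new flag. The exact
command-line layout of the defrag command comes from the earlier patches in
this series and is assumed here, as are the mount point and file name:

	# 16MiB segment size limit, default 250ms idle time between segments,
	# readahead enabled via the new -a option
	$ xfs_spaceman -c "defrag -a -s 16 /mnt/scratch/bigfile" /mnt/scratch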