[v4] ceph: fix blindly expanding the readahead windows

Message ID	20230509005703.155321-1-xiubli@redhat.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <ceph-devel-owner@vger.kernel.org> From: xiubli@redhat.com To: idryomov@gmail.com, ceph-devel@vger.kernel.org Cc: jlayton@kernel.org, vshankar@redhat.com, Xiubo Li <xiubli@redhat.com>, stable@vger.kernel.org, Hu Weiwen <sehuww@mail.scut.edu.cn> Subject: [PATCH v4] ceph: fix blindly expanding the readahead windows Date: Tue, 9 May 2023 08:57:03 +0800 Message-Id: <20230509005703.155321-1-xiubli@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	[v4] ceph: fix blindly expanding the readahead windows \| expand [v4] ceph: fix blindly expanding the readahead windows

Message ID

20230509005703.155321-1-xiubli@redhat.com (mailing list archive)

State

New, archived

Headers

From: xiubli@redhat.com
To: idryomov@gmail.com, ceph-devel@vger.kernel.org
Cc: jlayton@kernel.org, vshankar@redhat.com,
        Xiubo Li <xiubli@redhat.com>, stable@vger.kernel.org,
        Hu Weiwen <sehuww@mail.scut.edu.cn>
Subject: [PATCH v4] ceph: fix blindly expanding the readahead windows
Date: Tue,  9 May 2023 08:57:03 +0800
Message-Id: <20230509005703.155321-1-xiubli@redhat.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

[v4] ceph: fix blindly expanding the readahead windows | expand

Commit Message

Xiubo Li May 9, 2023, 12:57 a.m. UTC

From: Xiubo Li <xiubli@redhat.com>

Blindly expanding the readahead windows will cause unneccessary
pagecache thrashing and also will introdue the network workload.
We should disable expanding the windows if the readahead is disabled
and also shouldn't expand the windows too much.

Expanding forward firstly instead of expanding backward for possible
sequential reads.

Bound `rreq->len` to the actual file size to restore the previous page
cache usage.

Cc: stable@vger.kernel.org
Fixes: 49870056005c ("ceph: convert ceph_readpages to ceph_readahead")
URL: https://lore.kernel.org/ceph-devel/20230504082510.247-1-sehuww@mail.scut.edu.cn
URL: https://www.spinics.net/lists/ceph-users/msg76183.html
Cc: Hu Weiwen <sehuww@mail.scut.edu.cn>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
---

V4:
- two small cleanup from Ilya's comments. Thanks


 fs/ceph/addr.c | 28 +++++++++++++++++++++-------
 1 file changed, 21 insertions(+), 7 deletions(-)

Comments

Jeff Layton May 9, 2023, 2:18 p.m. UTC | #1

On Tue, 2023-05-09 at 08:57 +0800, xiubli@redhat.com wrote:
> From: Xiubo Li <xiubli@redhat.com>
> 
> Blindly expanding the readahead windows will cause unneccessary
> pagecache thrashing and also will introdue the network workload.
> We should disable expanding the windows if the readahead is disabled
> and also shouldn't expand the windows too much.
> 
> Expanding forward firstly instead of expanding backward for possible
> sequential reads.
> 
> Bound `rreq->len` to the actual file size to restore the previous page
> cache usage.
> 
> Cc: stable@vger.kernel.org
> Fixes: 49870056005c ("ceph: convert ceph_readpages to ceph_readahead")
> URL: https://lore.kernel.org/ceph-devel/20230504082510.247-1-sehuww@mail.scut.edu.cn
> URL: https://www.spinics.net/lists/ceph-users/msg76183.html
> Cc: Hu Weiwen <sehuww@mail.scut.edu.cn>
> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> ---
> 
> V4:
> - two small cleanup from Ilya's comments. Thanks
> 
> 

(cc'ing Steve French since he was asking me about ceph readahead
yesterday)

FWIW, the original idea here was to try to read whole OSD objects when
we can. I can see that that may have been overzealous though, so
ramping up the size more slowly makes sense.

>  fs/ceph/addr.c | 28 +++++++++++++++++++++-------
>  1 file changed, 21 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index ca4dc6450887..683ba9fbd590 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -188,16 +188,30 @@ static void ceph_netfs_expand_readahead(struct netfs_io_request *rreq)
>  	struct inode *inode = rreq->inode;
>  	struct ceph_inode_info *ci = ceph_inode(inode);
>  	struct ceph_file_layout *lo = &ci->i_layout;
> +	unsigned long max_pages = inode->i_sb->s_bdi->ra_pages;
> +	unsigned long max_len = max_pages << PAGE_SHIFT;
> +	loff_t end = rreq->start + rreq->len, new_end;
>  	u32 blockoff;
> -	u64 blockno;
>  
> -	/* Expand the start downward */
> -	blockno = div_u64_rem(rreq->start, lo->stripe_unit, &blockoff);
> -	rreq->start = blockno * lo->stripe_unit;
> -	rreq->len += blockoff;
> +	/* Readahead is disabled */
> +	if (!max_pages)
> +		return;
>  
> -	/* Now, round up the length to the next block */
> -	rreq->len = roundup(rreq->len, lo->stripe_unit);
> +	/*
> +	 * Try to expand the length forward by rounding  up it to the next
> +	 * block, but do not exceed the file size, unless the original
> +	 * request already exceeds it.
> +	 */

Do you really need to clamp this to the i_size? Is it ever possible for
the client to be out of date as to the real file size? If so then you
might end up with a short read when there is actually more data.

I guess by doing this, you're trying to avoid having to call the OSD
back after a short read and get a zero-length read when the file is
shorter than the requested read?

> +	new_end = min(round_up(end, lo->stripe_unit), rreq->i_size);
> +	if (new_end > end && new_end <= rreq->start + max_len)
> +		rreq->len = new_end - rreq->start;
> +
> +	/* Try to expand the start downward */
> +	div_u64_rem(rreq->start, lo->stripe_unit, &blockoff);
> +	if (rreq->len + blockoff <= max_len) {
> +		rreq->start -= blockoff;
> +		rreq->len += blockoff;
> +	}
>  }
>  
>  static bool ceph_netfs_clamp_length(struct netfs_io_subrequest *subreq)

Xiubo Li May 10, 2023, 12:48 a.m. UTC | #2

On 5/9/23 22:18, Jeff Layton wrote:
> On Tue, 2023-05-09 at 08:57 +0800, xiubli@redhat.com wrote:
>> From: Xiubo Li <xiubli@redhat.com>
>>
>> Blindly expanding the readahead windows will cause unneccessary
>> pagecache thrashing and also will introdue the network workload.
>> We should disable expanding the windows if the readahead is disabled
>> and also shouldn't expand the windows too much.
>>
>> Expanding forward firstly instead of expanding backward for possible
>> sequential reads.
>>
>> Bound `rreq->len` to the actual file size to restore the previous page
>> cache usage.
>>
>> Cc: stable@vger.kernel.org
>> Fixes: 49870056005c ("ceph: convert ceph_readpages to ceph_readahead")
>> URL: https://lore.kernel.org/ceph-devel/20230504082510.247-1-sehuww@mail.scut.edu.cn
>> URL: https://www.spinics.net/lists/ceph-users/msg76183.html
>> Cc: Hu Weiwen <sehuww@mail.scut.edu.cn>
>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>> ---
>>
>> V4:
>> - two small cleanup from Ilya's comments. Thanks
>>
>>
> (cc'ing Steve French since he was asking me about ceph readahead
> yesterday)
>
> FWIW, the original idea here was to try to read whole OSD objects when
> we can. I can see that that may have been overzealous though, so
> ramping up the size more slowly makes sense.
>
>>   fs/ceph/addr.c | 28 +++++++++++++++++++++-------
>>   1 file changed, 21 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
>> index ca4dc6450887..683ba9fbd590 100644
>> --- a/fs/ceph/addr.c
>> +++ b/fs/ceph/addr.c
>> @@ -188,16 +188,30 @@ static void ceph_netfs_expand_readahead(struct netfs_io_request *rreq)
>>   	struct inode *inode = rreq->inode;
>>   	struct ceph_inode_info *ci = ceph_inode(inode);
>>   	struct ceph_file_layout *lo = &ci->i_layout;
>> +	unsigned long max_pages = inode->i_sb->s_bdi->ra_pages;
>> +	unsigned long max_len = max_pages << PAGE_SHIFT;
>> +	loff_t end = rreq->start + rreq->len, new_end;
>>   	u32 blockoff;
>> -	u64 blockno;
>>   
>> -	/* Expand the start downward */
>> -	blockno = div_u64_rem(rreq->start, lo->stripe_unit, &blockoff);
>> -	rreq->start = blockno * lo->stripe_unit;
>> -	rreq->len += blockoff;
>> +	/* Readahead is disabled */
>> +	if (!max_pages)
>> +		return;
>>   
>> -	/* Now, round up the length to the next block */
>> -	rreq->len = roundup(rreq->len, lo->stripe_unit);
>> +	/*
>> +	 * Try to expand the length forward by rounding  up it to the next
>> +	 * block, but do not exceed the file size, unless the original
>> +	 * request already exceeds it.
>> +	 */

Hi Jeff,


> Do you really need to clamp this to the i_size? Is it ever possible for
> the client to be out of date as to the real file size? If so then you
> might end up with a short read when there is actually more data.
>
> I guess by doing this, you're trying to avoid having to call the OSD
> back after a short read and get a zero-length read when the file is
> shorter than the requested read?

This is folded from Weiwen's another fix 
https://patchwork.kernel.org/project/ceph-devel/patch/20230504082510.247-1-sehuww@mail.scut.edu.cn/.

For small files use case this may really could cause unnecessary network 
workload and inefficient usage of the page cache.

Thanks

- Xiubo

>> +	new_end = min(round_up(end, lo->stripe_unit), rreq->i_size);
>> +	if (new_end > end && new_end <= rreq->start + max_len)
>> +		rreq->len = new_end - rreq->start;
>> +
>> +	/* Try to expand the start downward */
>> +	div_u64_rem(rreq->start, lo->stripe_unit, &blockoff);
>> +	if (rreq->len + blockoff <= max_len) {
>> +		rreq->start -= blockoff;
>> +		rreq->len += blockoff;
>> +	}
>>   }
>>   
>>   static bool ceph_netfs_clamp_length(struct netfs_io_subrequest *subreq)

胡玮文 May 10, 2023, 10:01 a.m. UTC | #3

On Tue, May 09, 2023 at 08:57:03AM +0800, xiubli@redhat.com wrote:
> From: Xiubo Li <xiubli@redhat.com>
> 
> Blindly expanding the readahead windows will cause unneccessary
> pagecache thrashing and also will introdue the network workload.
                                    ^^^^^^^^
                                    introduce

> We should disable expanding the windows if the readahead is disabled
> and also shouldn't expand the windows too much.
> 
> Expanding forward firstly instead of expanding backward for possible
> sequential reads.
> 
> Bound `rreq->len` to the actual file size to restore the previous page
> cache usage.
> 
> Cc: stable@vger.kernel.org
> Fixes: 49870056005c ("ceph: convert ceph_readpages to ceph_readahead")
> URL: https://lore.kernel.org/ceph-devel/20230504082510.247-1-sehuww@mail.scut.edu.cn
> URL: https://www.spinics.net/lists/ceph-users/msg76183.html
> Cc: Hu Weiwen <sehuww@mail.scut.edu.cn>
> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> ---
> 
> V4:
> - two small cleanup from Ilya's comments. Thanks
> 
> 
>  fs/ceph/addr.c | 28 +++++++++++++++++++++-------
>  1 file changed, 21 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index ca4dc6450887..683ba9fbd590 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -188,16 +188,30 @@ static void ceph_netfs_expand_readahead(struct netfs_io_request *rreq)
>  	struct inode *inode = rreq->inode;
>  	struct ceph_inode_info *ci = ceph_inode(inode);
>  	struct ceph_file_layout *lo = &ci->i_layout;
> +	unsigned long max_pages = inode->i_sb->s_bdi->ra_pages;

I think it is better to use `ractl->ra->ra_pages' instead of
`inode->i_sb->s_bdi->ra_pages'.  So that we can consider per-request ra
size config, e.g., posix_fadvise(POSIX_FADV_SEQUENTIAL) will double the
ra_pages.

But `ractl' is not passed to this function.  Can we just add this
argument?  ceph seems to be the only implementation of expand_readahead,
so it should be easy.  Or since this patch will be backported, maybe we
should keep it simple, and write another patch for this?

> +	unsigned long max_len = max_pages << PAGE_SHIFT;
> +	loff_t end = rreq->start + rreq->len, new_end;
>  	u32 blockoff;
> -	u64 blockno;
>  
> -	/* Expand the start downward */
> -	blockno = div_u64_rem(rreq->start, lo->stripe_unit, &blockoff);
> -	rreq->start = blockno * lo->stripe_unit;
> -	rreq->len += blockoff;
> +	/* Readahead is disabled */
> +	if (!max_pages)
> +		return;

If we have access to ractl here, we can also skip expanding on
`ractl->file->f_mode & FMODE_RANDOM', which is set by
`posix_fadvise(POSIX_FADV_RANDOM)'.

>  
> -	/* Now, round up the length to the next block */
> -	rreq->len = roundup(rreq->len, lo->stripe_unit);
> +	/*
> +	 * Try to expand the length forward by rounding  up it to the next
                                                       ^^ an extra space

> +	 * block, but do not exceed the file size, unless the original
> +	 * request already exceeds it.
> +	 */
> +	new_end = min(round_up(end, lo->stripe_unit), rreq->i_size);
> +	if (new_end > end && new_end <= rreq->start + max_len)
> +		rreq->len = new_end - rreq->start;
> +
> +	/* Try to expand the start downward */
> +	div_u64_rem(rreq->start, lo->stripe_unit, &blockoff);
> +	if (rreq->len + blockoff <= max_len) {
> +		rreq->start -= blockoff;
> +		rreq->len += blockoff;
> +	}
>  }
>  
>  static bool ceph_netfs_clamp_length(struct netfs_io_subrequest *subreq)
> -- 
> 2.40.0
>

Xiubo Li May 10, 2023, 11:36 a.m. UTC | #4

On 5/10/23 18:01, 胡玮文 wrote:
> On Tue, May 09, 2023 at 08:57:03AM +0800, xiubli@redhat.com wrote:
>> From: Xiubo Li <xiubli@redhat.com>
>>
>> Blindly expanding the readahead windows will cause unneccessary
>> pagecache thrashing and also will introdue the network workload.
>                                      ^^^^^^^^
>                                      introduce
>
>> We should disable expanding the windows if the readahead is disabled
>> and also shouldn't expand the windows too much.
>>
>> Expanding forward firstly instead of expanding backward for possible
>> sequential reads.
>>
>> Bound `rreq->len` to the actual file size to restore the previous page
>> cache usage.
>>
>> Cc: stable@vger.kernel.org
>> Fixes: 49870056005c ("ceph: convert ceph_readpages to ceph_readahead")
>> URL: https://lore.kernel.org/ceph-devel/20230504082510.247-1-sehuww@mail.scut.edu.cn
>> URL: https://www.spinics.net/lists/ceph-users/msg76183.html
>> Cc: Hu Weiwen <sehuww@mail.scut.edu.cn>
>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>> ---
>>
>> V4:
>> - two small cleanup from Ilya's comments. Thanks
>>
>>
>>   fs/ceph/addr.c | 28 +++++++++++++++++++++-------
>>   1 file changed, 21 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
>> index ca4dc6450887..683ba9fbd590 100644
>> --- a/fs/ceph/addr.c
>> +++ b/fs/ceph/addr.c
>> @@ -188,16 +188,30 @@ static void ceph_netfs_expand_readahead(struct netfs_io_request *rreq)
>>   	struct inode *inode = rreq->inode;
>>   	struct ceph_inode_info *ci = ceph_inode(inode);
>>   	struct ceph_file_layout *lo = &ci->i_layout;
>> +	unsigned long max_pages = inode->i_sb->s_bdi->ra_pages;
> I think it is better to use `ractl->ra->ra_pages' instead of
> `inode->i_sb->s_bdi->ra_pages'.  So that we can consider per-request ra
> size config, e.g., posix_fadvise(POSIX_FADV_SEQUENTIAL) will double the
> ra_pages.

Yeah, good catch.

> But `ractl' is not passed to this function.  Can we just add this
> argument?  ceph seems to be the only implementation of expand_readahead,
> so it should be easy.  Or since this patch will be backported, maybe we
> should keep it simple, and write another patch for this?

I think we should fix it together with this. It should be easy.

We can just store the "file->f_ra->ra_pages" in ceph_init_request() 
instead, because each rreq will be related to a dedicated file.


>> +	unsigned long max_len = max_pages << PAGE_SHIFT;
>> +	loff_t end = rreq->start + rreq->len, new_end;
>>   	u32 blockoff;
>> -	u64 blockno;
>>   
>> -	/* Expand the start downward */
>> -	blockno = div_u64_rem(rreq->start, lo->stripe_unit, &blockoff);
>> -	rreq->start = blockno * lo->stripe_unit;
>> -	rreq->len += blockoff;
>> +	/* Readahead is disabled */
>> +	if (!max_pages)
>> +		return;
> If we have access to ractl here, we can also skip expanding on
> `ractl->file->f_mode & FMODE_RANDOM', which is set by
> `posix_fadvise(POSIX_FADV_RANDOM)'.

Correct, because the "page_cache_sync_ra()' will do the right thing and 
we can skip expanding it here in ceph.

Thanks

- Xiubo

>
>>   
>> -	/* Now, round up the length to the next block */
>> -	rreq->len = roundup(rreq->len, lo->stripe_unit);
>> +	/*
>> +	 * Try to expand the length forward by rounding  up it to the next
>                                                         ^^ an extra space
>
>> +	 * block, but do not exceed the file size, unless the original
>> +	 * request already exceeds it.
>> +	 */
>> +	new_end = min(round_up(end, lo->stripe_unit), rreq->i_size);
>> +	if (new_end > end && new_end <= rreq->start + max_len)
>> +		rreq->len = new_end - rreq->start;
>> +
>> +	/* Try to expand the start downward */
>> +	div_u64_rem(rreq->start, lo->stripe_unit, &blockoff);
>> +	if (rreq->len + blockoff <= max_len) {
>> +		rreq->start -= blockoff;
>> +		rreq->len += blockoff;
>> +	}
>>   }
>>   
>>   static bool ceph_netfs_clamp_length(struct netfs_io_subrequest *subreq)
>> -- 
>> 2.40.0
>>

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index ca4dc6450887..683ba9fbd590 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -188,16 +188,30 @@  static void ceph_netfs_expand_readahead(struct netfs_io_request *rreq)
 	struct inode *inode = rreq->inode;
 	struct ceph_inode_info *ci = ceph_inode(inode);
 	struct ceph_file_layout *lo = &ci->i_layout;
+	unsigned long max_pages = inode->i_sb->s_bdi->ra_pages;
+	unsigned long max_len = max_pages << PAGE_SHIFT;
+	loff_t end = rreq->start + rreq->len, new_end;
 	u32 blockoff;
-	u64 blockno;
 
-	/* Expand the start downward */
-	blockno = div_u64_rem(rreq->start, lo->stripe_unit, &blockoff);
-	rreq->start = blockno * lo->stripe_unit;
-	rreq->len += blockoff;
+	/* Readahead is disabled */
+	if (!max_pages)
+		return;
 
-	/* Now, round up the length to the next block */
-	rreq->len = roundup(rreq->len, lo->stripe_unit);
+	/*
+	 * Try to expand the length forward by rounding  up it to the next
+	 * block, but do not exceed the file size, unless the original
+	 * request already exceeds it.
+	 */
+	new_end = min(round_up(end, lo->stripe_unit), rreq->i_size);
+	if (new_end > end && new_end <= rreq->start + max_len)
+		rreq->len = new_end - rreq->start;
+
+	/* Try to expand the start downward */
+	div_u64_rem(rreq->start, lo->stripe_unit, &blockoff);
+	if (rreq->len + blockoff <= max_len) {
+		rreq->start -= blockoff;
+		rreq->len += blockoff;
+	}
 }
 
 static bool ceph_netfs_clamp_length(struct netfs_io_subrequest *subreq)

[v4] ceph: fix blindly expanding the readahead windows

Commit Message

Comments

Patch