[v2] pnfs: Proper delay for NFS4ERR_RECALLCONFLICT in layout_get_done

Message ID 52D5589A.7090507@panasas.com (mailing list archive)
State New, archived

Commit Message

Boaz Harrosh Jan. 14, 2014, 3:32 p.m. UTC
An NFS4ERR_RECALLCONFLICT is returned by the server from a GET_LAYOUT
only when the server sent a RECALL due to that GET_LAYOUT, or
the RECALL and GET_LAYOUT crossed on the wire.
Either way, this means we want to wait at most until in-flight IO
is finished and the RECALL can be satisfied.

So a proper wait here is more like 1/10 of a second, not 15 seconds
like we have now. (We use NFS4_POLL_RETRY_MIN here)

The current code badly hurts the performance of very large files on
most pnfs-objects layouts, because of how the map changes when the
file has grown beyond a raid group.

CC: Stable Tree <stable@vger.kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
 fs/nfs/nfs4proc.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

Comments

Trond Myklebust Jan. 14, 2014, 7:05 p.m. UTC | #1
On Tue, 2014-01-14 at 17:32 +0200, Boaz Harrosh wrote:
> An NFS4ERR_RECALLCONFLICT is returned by server from a GET_LAYOUT
> only when a Server Sent a RECALL do to that GET_LAYOUT, or
> the RECALL and GET_LAYOUT crossed on the wire.
> In any way this means we want to wait at most until in-flight IO
> is finished and the RECALL can be satisfied.
> 
> So a proper wait here is more like 1/10 of a second, not 15 seconds
> like we have now. (We use NFS4_POLL_RETRY_MIN here)
> 
> Current code totally craps out performance of very large files on
> most pnfs-objects layouts, because of how the map changes when the
> file has grown beyond a raid group.
> 
> CC: Stable Tree <stable@vger.kernel.org>
> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
> ---
>  fs/nfs/nfs4proc.c | 22 +++++++++++++++++++---
>  1 file changed, 19 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index d53d678..3264fca 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -7058,7 +7058,7 @@ static void nfs4_layoutget_done(struct rpc_task *task, void *calldata)
>  	struct nfs4_state *state = NULL;
>  	unsigned long timeo, giveup;
>  
> -	dprintk("--> %s\n", __func__);
> +	dprintk("--> %s tk_status => %d\n", __func__, task->tk_status);
>  
>  	if (!nfs41_sequence_done(task, &lgp->res.seq_res))
>  		goto out;
> @@ -7067,11 +7067,27 @@ static void nfs4_layoutget_done(struct rpc_task *task, void *calldata)
>  	case 0:
>  		goto out;
>  	case -NFS4ERR_LAYOUTTRYLATER:
> +	/* NFS4ERR_RECALLCONFLICT is always a minimal delay (conflict with
> +	 * self)
> +	 * TODO: NFS4ERR_LAYOUTTRYLATER is a conflict with another client
> +	 * (or clients). What we should do is randomize a short delay like on a
> +	 * network broadcast burst, and raise the random max every failure.
> +	 * For now leave it stateless and do this polling.
> +	 */
>  	case -NFS4ERR_RECALLCONFLICT:
>  		timeo = rpc_get_timeout(task->tk_client);
>  		giveup = lgp->args.timestamp + timeo;
> -		if (time_after(giveup, jiffies))
> -			task->tk_status = -NFS4ERR_DELAY;
> +		if (time_after(giveup, jiffies)) {
> +			/* Do a minimum delay, We are actually waiting for our
> +			 * own IO to finish (In most cases)
> +			 */
> +			dprintk("%s: NFS4ERR_RECALLCONFLICT waiting\n",
> +				__func__);
> +			rpc_delay(task, NFS4_POLL_RETRY_MIN);
> +			task->tk_status = 0;
> +			rpc_restart_call_prepare(task);
> +			goto out; /* Do not call nfs4_async_handle_error() */
> +		}
>  

For the default mount option of 'timeo=600', and the default #define
NFS4_POLL_RETRY_MIN==HZ/10, this means we can end up pounding the server
with 600 LAYOUTGET requests within the space of 1 minute, before giving
up. Is that reasonable?
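
A rough sketch of the arithmetic behind this estimate, not code from the patch. It assumes that the `timeo` mount option is given in tenths of a second and that the client retries at a fixed interval with a negligible round-trip time; the function name `max_polls` and its parameters are hypothetical.

```c
#include <assert.h>

/*
 * Upper bound on the number of retries that fit into the give-up window
 * when the client polls at a fixed interval.
 *
 * timeo_deciseconds: the 'timeo' mount option, in tenths of a second
 * hz:                timer ticks per second (HZ in the kernel)
 * retry_ticks:       fixed delay between retries, in ticks (HZ/10 here)
 */
static unsigned long max_polls(unsigned long timeo_deciseconds,
                               unsigned long hz,
                               unsigned long retry_ticks)
{
    unsigned long window_ticks;

    assert(retry_ticks > 0);

    /* Convert the give-up window from tenths of a second to ticks. */
    window_ticks = timeo_deciseconds * hz / 10;

    /* One retry per interval, ignoring the time each request takes. */
    return window_ticks / retry_ticks;
}
```

Under these assumptions, `max_polls(600, 1000, 1000 / 10)` evaluates to 600: a 60 second window divided into 1/10 second steps. The result does not depend on HZ as long as the retry interval is HZ/10.
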
Boaz Harrosh Jan. 14, 2014, 10:21 p.m. UTC | #2
On 01/14/2014 09:05 PM, Trond Myklebust wrote:
> On Tue, 2014-01-14 at 17:32 +0200, Boaz Harrosh wrote:
>>  
> 
> For the default mount option of 'timeo=600', and the default #define
> NFS4_POLL_RETRY_MIN==HZ/10, this means we can end up pounding the server
> with 600 LAYOUTGET requests within the space of 1 minute, before giving
> up. Is that reasonable?
> 

It will never get there; it will always be one or two sends. Usually it is
just so that the layout_get_done sequence is out of the way and the
LAYOUT_RECALL sequence+1 can get through and the layout is released. Then
the next time all will be good and the LAYOUT_GET will succeed.

The worst case is when the client is very busy, with a queue full of IO
on the same busy layout that needs to be released by the recall. Personally,
I found that this never exceeds 40 IOPs in flight. Note that this is not
the amount of total dirty memory but only the amount of already submitted
IO. I guess that on a very slow connection these can take time, but at
regular line speeds I never observed more than 2 retries with this patch.

It is all up to the client. NFS4ERR_RECALLCONFLICT means "the layouts you
have need to be released" (I say released because the forgetful model does
not actually return them). Can you see a critical time when layouts are
held for longer than a second?

Thanks
Boaz

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust Jan. 14, 2014, 10:43 p.m. UTC | #3
On Jan 14, 2014, at 17:21, Boaz Harrosh <bharrosh@panasas.com> wrote:

> On 01/14/2014 09:05 PM, Trond Myklebust wrote:
>> On Tue, 2014-01-14 at 17:32 +0200, Boaz Harrosh wrote:
>>> 
>> 
>> For the default mount option of 'timeo=600', and the default #define
>> NFS4_POLL_RETRY_MIN==HZ/10, this means we can end up pounding the server
>> with 600 LAYOUTGET requests within the space of 1 minute, before giving
>> up. Is that reasonable?
>> 
> 
> It will never get there it will always be 1 or two sends. Usually it is
> just so the sequence of layout_get_done is out of the way and the
> LAYOUT_RECALL sequence+1 can get through and the layout released. Then
> the next time it will all be good and the LAYOUT_GET will succeed.
> 
> Worst case is when the client is very busy with queue full of IO
> on the same busy layout that needs to be released by the recall. Personally
> I found that this never exceeds 40 IOPs in flight. Note that this is not
> the amount of total dirty memory but only the amount of already submitted
> IO. I guess that on a very slow connection these can take time but in
> regular line speeds I never observed more the 2 retries with this patch.
> 
> It is all up to the client. NFS4ERR_RECALLCONFLICT means "the layouts you
> have need to be released" (I say released because the forgetful model does
> not actually returns them). Can you see a critical time when layouts are
> held for longer than a second ?

That will probably depend on the workload and possibly on the layout type.

My point was, however, about the potential for mischief due to the mismatch between the number of retries that the resulting code allows and the fixed period of 1/10 of a second between those retries. Why not rather use something along the lines of "rpc_delay(rpc_task, min(giveup - jiffies, max(jiffies - lgp->args.timestamp, NFS4_POLL_RETRY_MIN)));"? That gives you an initially exponential backoff with a minimum period of NFS4_POLL_RETRY_MIN, and with an expiry date of 'timeo' jiffies after the first attempt.

--
Trond Myklebust
Linux NFS client maintainer

Trond Myklebust Jan. 14, 2014, 10:47 p.m. UTC | #4
On Jan 14, 2014, at 17:43, Trond Myklebust <trond.myklebust@primarydata.com> wrote:

> 
> On Jan 14, 2014, at 17:21, Boaz Harrosh <bharrosh@panasas.com> wrote:
> 
>> On 01/14/2014 09:05 PM, Trond Myklebust wrote:
>>> On Tue, 2014-01-14 at 17:32 +0200, Boaz Harrosh wrote:
>>>> 
>>> 
>>> For the default mount option of 'timeo=600', and the default #define
>>> NFS4_POLL_RETRY_MIN==HZ/10, this means we can end up pounding the server
>>> with 600 LAYOUTGET requests within the space of 1 minute, before giving
>>> up. Is that reasonable?
>>> 
>> 
>> It will never get there it will always be 1 or two sends. Usually it is
>> just so the sequence of layout_get_done is out of the way and the
>> LAYOUT_RECALL sequence+1 can get through and the layout released. Then
>> the next time it will all be good and the LAYOUT_GET will succeed.
>> 
>> Worst case is when the client is very busy with queue full of IO
>> on the same busy layout that needs to be released by the recall. Personally
>> I found that this never exceeds 40 IOPs in flight. Note that this is not
>> the amount of total dirty memory but only the amount of already submitted
>> IO. I guess that on a very slow connection these can take time but in
>> regular line speeds I never observed more the 2 retries with this patch.
>> 
>> It is all up to the client. NFS4ERR_RECALLCONFLICT means "the layouts you
>> have need to be released" (I say released because the forgetful model does
>> not actually returns them). Can you see a critical time when layouts are
>> held for longer than a second ?
> 
> That will probably depend on the workload and possibly on the layout type.
> 
> My point was, however, about the potential for mischief due to the mismatch between the number of retries that the resulting code allows, and the fixed period between those retries of 1/10 seconds. Why not rather use something along the lines of "rpc_delay(rpc_task, min(giveup -jiffies , max(jiffies - lgp->args.timestamp, NFS4_POLL_RETRY_MIN)));”? That gives you an initially exponential back off with a minimum period of NFS4_POLL_RETRY_MIN, and with an expiry date of ‘timeo’ jiffies after the first attempt.

Whoops. That should probably be

max(NFS4_POLL_RETRY_MIN, min(giveup - jiffies, jiffies - lgp->args.timestamp))

so that the time interval is not < NFS4_POLL_RETRY_MIN.
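
To see what schedule this expression produces, here is a minimal user-space sketch. It is not kernel code: times are plain tick counters rather than jiffies (so wraparound is ignored), every reply is assumed to be a conflict, and the names `ticks_t`, `retry_delay`, `count_attempts` and the `rtt` parameter are hypothetical.

```c
#include <assert.h>

/* User-space stand-in for jiffies; wraparound is deliberately ignored. */
typedef unsigned long ticks_t;

static ticks_t min_ticks(ticks_t a, ticks_t b) { return a < b ? a : b; }
static ticks_t max_ticks(ticks_t a, ticks_t b) { return a > b ? a : b; }

/*
 * Delay before the next retry:
 *   max(retry_min, min(giveup - now, now - start))
 * The caller must guarantee start <= now < giveup, which mirrors the
 * time_after(giveup, jiffies) check that guards the retry path.
 */
static ticks_t retry_delay(ticks_t start, ticks_t now, ticks_t giveup,
                           ticks_t retry_min)
{
    assert(start <= now && now < giveup);
    return max_ticks(retry_min, min_ticks(giveup - now, now - start));
}

/*
 * Count how many requests are sent before the give-up time, assuming
 * each reply arrives rtt ticks after the send and is always a conflict.
 */
static unsigned int count_attempts(ticks_t timeo, ticks_t retry_min,
                                   ticks_t rtt)
{
    ticks_t start = 0, giveup = start + timeo, now = start;
    unsigned int attempts = 0;

    for (;;) {
        attempts++;            /* send the request */
        now += rtt;            /* reply comes back */
        if (now >= giveup)     /* past the expiry: stop retrying */
            break;
        now += retry_delay(start, now, giveup, retry_min);
    }
    return attempts;
}
```

With a window of 60000 ticks, a minimum delay of 100 ticks and a round trip of 1 tick, `count_attempts(60000, 100, 1)` returns 12: the delay roughly doubles on each retry until the remaining time to the expiry becomes the smaller term. A fixed delay of 100 ticks over the same window would allow several hundred sends.
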
--
Trond Myklebust
Linux NFS client maintainer

Boaz Harrosh Jan. 14, 2014, 11:41 p.m. UTC | #5
On 01/15/2014 12:47 AM, Trond Myklebust wrote:
> 
> On Jan 14, 2014, at 17:43, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> 
>>
>> On Jan 14, 2014, at 17:21, Boaz Harrosh <bharrosh@panasas.com> wrote:
>>
>>> On 01/14/2014 09:05 PM, Trond Myklebust wrote:
>>>> On Tue, 2014-01-14 at 17:32 +0200, Boaz Harrosh wrote:
>>>>>
>>>>
>>>> For the default mount option of 'timeo=600', and the default #define
>>>> NFS4_POLL_RETRY_MIN==HZ/10, this means we can end up pounding the server
>>>> with 600 LAYOUTGET requests within the space of 1 minute, before giving
>>>> up. Is that reasonable?
>>>>
>>>
>>> It will never get there it will always be 1 or two sends. Usually it is
>>> just so the sequence of layout_get_done is out of the way and the
>>> LAYOUT_RECALL sequence+1 can get through and the layout released. Then
>>> the next time it will all be good and the LAYOUT_GET will succeed.
>>>
>>> Worst case is when the client is very busy with queue full of IO
>>> on the same busy layout that needs to be released by the recall. Personally
>>> I found that this never exceeds 40 IOPs in flight. Note that this is not
>>> the amount of total dirty memory but only the amount of already submitted
>>> IO. I guess that on a very slow connection these can take time but in
>>> regular line speeds I never observed more the 2 retries with this patch.
>>>
>>> It is all up to the client. NFS4ERR_RECALLCONFLICT means "the layouts you
>>> have need to be released" (I say released because the forgetful model does
>>> not actually returns them). Can you see a critical time when layouts are
>>> held for longer than a second ?
>>
>> That will probably depend on the workload and possibly on the layout type.
>>
>> My point was, however, about the potential for mischief due to the mismatch between the number of retries that the resulting code allows, and the fixed period between those retries of 1/10 seconds. Why not rather use something along the lines of "rpc_delay(rpc_task, min(giveup -jiffies , max(jiffies - lgp->args.timestamp, NFS4_POLL_RETRY_MIN)));”? That gives you an initially exponential back off with a minimum period of NFS4_POLL_RETRY_MIN, and with an expiry date of ‘timeo’ jiffies after the first attempt.
> 
> Whoops. That should probably be
> 
> max(NFS4_POLL_RETRY_MIN, min(giveup - jiffies , jiffies - lgp->args.timestamp))
> 
> so that the time interval is not < NFS4_POLL_RETRY_MIN.

OK I'll try that.

Thanks
Boaz



Patch

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index d53d678..3264fca 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -7058,7 +7058,7 @@  static void nfs4_layoutget_done(struct rpc_task *task, void *calldata)
 	struct nfs4_state *state = NULL;
 	unsigned long timeo, giveup;
 
-	dprintk("--> %s\n", __func__);
+	dprintk("--> %s tk_status => %d\n", __func__, task->tk_status);
 
 	if (!nfs41_sequence_done(task, &lgp->res.seq_res))
 		goto out;
@@ -7067,11 +7067,27 @@  static void nfs4_layoutget_done(struct rpc_task *task, void *calldata)
 	case 0:
 		goto out;
 	case -NFS4ERR_LAYOUTTRYLATER:
+	/* NFS4ERR_RECALLCONFLICT is always a minimal delay (conflict with
+	 * self)
+	 * TODO: NFS4ERR_LAYOUTTRYLATER is a conflict with another client
+	 * (or clients). What we should do is randomize a short delay like on a
+	 * network broadcast burst, and raise the random max every failure.
+	 * For now leave it stateless and do this polling.
+	 */
 	case -NFS4ERR_RECALLCONFLICT:
 		timeo = rpc_get_timeout(task->tk_client);
 		giveup = lgp->args.timestamp + timeo;
-		if (time_after(giveup, jiffies))
-			task->tk_status = -NFS4ERR_DELAY;
+		if (time_after(giveup, jiffies)) {
+			/* Do a minimum delay, We are actually waiting for our
+			 * own IO to finish (In most cases)
+			 */
+			dprintk("%s: NFS4ERR_RECALLCONFLICT waiting\n",
+				__func__);
+			rpc_delay(task, NFS4_POLL_RETRY_MIN);
+			task->tk_status = 0;
+			rpc_restart_call_prepare(task);
+			goto out; /* Do not call nfs4_async_handle_error() */
+		}
 		break;
 	case -NFS4ERR_EXPIRED:
 	case -NFS4ERR_BAD_STATEID: