diff mbox

[RFC] nfsd: serialize layout stateid morphing operations

Message ID 1442491104-30080-1-git-send-email-jeff.layton@primarydata.com (mailing list archive)
State New, archived
Headers show

Commit Message

Jeff Layton Sept. 17, 2015, 11:58 a.m. UTC
In order to allow the client to make a sane determination of what
happened with racing LAYOUTGET/LAYOUTRETURN/CB_LAYOUTRECALL calls, we
must ensure that the seqids return accurately represent the order of
operations. The simplest way to do that is to ensure that operations on
a single stateid are serialized.

This patch adds a mutex to the layout stateid, and locks it when
checking the layout stateid's seqid. The mutex is held over the entire
operation and released after the seqid is bumped.

Note that in the case of CB_LAYOUTRECALL we must move the increment of
the seqid and setting into a new cb "prepare" operation. The lease
infrastructure will call the lm_break callback with a spinlock held, so
and we can't take the mutex in that codepath.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
---
 fs/nfsd/nfs4layouts.c | 25 +++++++++++++++++++++----
 fs/nfsd/nfs4proc.c    |  4 ++++
 fs/nfsd/state.h       |  1 +
 3 files changed, 26 insertions(+), 4 deletions(-)

Comments

Christoph Hellwig Oct. 11, 2015, 1:15 p.m. UTC | #1
On Thu, Sep 17, 2015 at 07:58:24AM -0400, Jeff Layton wrote:
> In order to allow the client to make a sane determination of what
> happened with racing LAYOUTGET/LAYOUTRETURN/CB_LAYOUTRECALL calls, we
> must ensure that the seqids return accurately represent the order of
> operations. The simplest way to do that is to ensure that operations on
> a single stateid are serialized.
> 
> This patch adds a mutex to the layout stateid, and locks it when
> checking the layout stateid's seqid. The mutex is held over the entire
> operation and released after the seqid is bumped.
> 
> Note that in the case of CB_LAYOUTRECALL we must move the increment of
> the seqid and setting into a new cb "prepare" operation. The lease
> infrastructure will call the lm_break callback with a spinlock held, so
> and we can't take the mutex in that codepath.

I can't say I like the long running mutex all that much.  What kinds
of reproducers do you have where the current behavior causes problems?
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Layton Oct. 11, 2015, 8:51 p.m. UTC | #2
On Sun, 11 Oct 2015 15:15:27 +0200
Christoph Hellwig <hch@lst.de> wrote:

> On Thu, Sep 17, 2015 at 07:58:24AM -0400, Jeff Layton wrote:
> > In order to allow the client to make a sane determination of what
> > happened with racing LAYOUTGET/LAYOUTRETURN/CB_LAYOUTRECALL calls, we
> > must ensure that the seqids return accurately represent the order of
> > operations. The simplest way to do that is to ensure that operations on
> > a single stateid are serialized.
> > 
> > This patch adds a mutex to the layout stateid, and locks it when
> > checking the layout stateid's seqid. The mutex is held over the entire
> > operation and released after the seqid is bumped.
> > 
> > Note that in the case of CB_LAYOUTRECALL we must move the increment of
> > the seqid and setting into a new cb "prepare" operation. The lease
> > infrastructure will call the lm_break callback with a spinlock held, so
> > and we can't take the mutex in that codepath.
> 
> I can't say I like the long running mutex all that much.  What kinds
> of reproducers do you have where the current behavior causes problems?

I'm not thrilled with it either, though it is per-stateid. It should
only ever be contended when there are concurrent operations that specify
the same stateid.

We did have a report of this problem with open stateids that came about
after the client started parallelizing opens more aggressively. It was
pretty clear that you could hit similar races with lock stateids as
well, given a client that didn't serialize things properly.

I don't have any reports of this problem with layout stateids though.
It may be that the client is good enough at serializing these
operations (or slow enough) that it's never an issue. But, if that's
the case then the mutex is harmless anyway since it'd never be
contended.

In hindsight I probably should have sent this as a RFC patch, I'm fine
with dropping this for now if you think it's potentially harmful.
J. Bruce Fields Oct. 23, 2015, 7:35 p.m. UTC | #3
On Sun, Oct 11, 2015 at 04:51:21PM -0400, Jeff Layton wrote:
> On Sun, 11 Oct 2015 15:15:27 +0200
> Christoph Hellwig <hch@lst.de> wrote:
> 
> > On Thu, Sep 17, 2015 at 07:58:24AM -0400, Jeff Layton wrote:
> > > In order to allow the client to make a sane determination of what
> > > happened with racing LAYOUTGET/LAYOUTRETURN/CB_LAYOUTRECALL calls, we
> > > must ensure that the seqids return accurately represent the order of
> > > operations. The simplest way to do that is to ensure that operations on
> > > a single stateid are serialized.
> > > 
> > > This patch adds a mutex to the layout stateid, and locks it when
> > > checking the layout stateid's seqid. The mutex is held over the entire
> > > operation and released after the seqid is bumped.
> > > 
> > > Note that in the case of CB_LAYOUTRECALL we must move the increment of
> > > the seqid and setting into a new cb "prepare" operation. The lease
> > > infrastructure will call the lm_break callback with a spinlock held, so
> > > and we can't take the mutex in that codepath.
> > 
> > I can't say I like the long running mutex all that much.  What kinds
> > of reproducers do you have where the current behavior causes problems?
> 
> I'm not thrilled with it either, though it is per-stateid. It should
> only ever be contended when there are concurrent operations that specify
> the same stateid.
> 
> We did have a report of this problem with open stateids that came about
> after the client started parallelizing opens more aggressively. It was
> pretty clear that you could hit similar races with lock stateids as
> well, given a client that didn't serialize things properly.
> 
> I don't have any reports of this problem with layout stateids though.
> It may be that the client is good enough at serializing these
> operations (or slow enough) that it's never an issue. But, if that's
> the case then the mutex is harmless anyway since it'd never be
> contended.
> 
> In hindsight I probably should have sent this as a RFC patch, I'm fine
> with dropping this for now if you think it's potentially harmful.

Still looks to me like the code's incorrect without your patch, so I
intend to apply it.

I'll fully admit that I haven't looked at how layout stateid's are
supposed to work closely enough so it's possible Christoph can persaude
me otherwise.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Kinglong Mee Nov. 29, 2015, 4:07 a.m. UTC | #4
I meet two problems with this patch,

1. a dump_stack messages printed in process_one_work() at kernel/workqueue.c.

BUG: workqueue leaked lock or atomic: kworker/u2:4/0x00000000/98#012     last function: nfsd4_run_cb_work [nfsd]
1 lock held by kworker/u2:4/98:
#0:  (&ls->ls_mutex){......}, at: [<ffffffffa0250d34>] nfsd4_cb_layout_prepare+0x24/0x40 [nfsd]
CPU: 0 PID: 98 Comm: kworker/u2:4 Not tainted 4.4.0-rc2+ #333
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
Workqueue: nfsd4_callbacks nfsd4_run_cb_work [nfsd]
ffff8800362b9e40 000000007fe9394f ffff880036353d58 ffffffff8136dc64
ffff880036353dd8 ffffffff810a3f12 ffffffff810a3cbd 000000000000000a
ffffffffa0261d78 ffffffff82902e20 0000000000000000 ffffffffa0259241
Call Trace:
[<ffffffff8136dc64>] dump_stack+0x19/0x25
[<ffffffff810a3f12>] process_one_work+0x3c2/0x4c0
[<ffffffff810a3cbd>] ? process_one_work+0x16d/0x4c0
[<ffffffff810a405a>] worker_thread+0x4a/0x440
[<ffffffff810a4010>] ? process_one_work+0x4c0/0x4c0
[<ffffffff810a4010>] ? process_one_work+0x4c0/0x4c0
[<ffffffff810a91e5>] kthread+0xf5/0x110
[<ffffffff810a90f0>] ? kthread_create_on_node+0x240/0x240
[<ffffffff81738d0f>] ret_from_fork+0x3f/0x70
[<ffffffff810a90f0>] ? kthread_create_on_node+0x240/0x240

2. a mutex race between layoutrecall and layoutcommit, 

callback thread,
nfsd4_cb_layout_prepare 
  --->mutex_lock(&ls->ls_mutex);

layoutcommit thread,
nfsd4_layoutcommit
  ---> nfsd4_preprocess_layout_stateid
     --> mutex_lock(&ls->ls_mutex);   <----------------  hang 

[  600.645142] INFO: task nfsd:11623 blocked for more than 120 seconds.
[  600.646337]       Not tainted 4.4.0-rc2+ #332
[  600.647404] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  600.648546] nfsd            D ffff880064277b80     0 11623      2 0x00000000
[  600.649803]  ffff880064277b80 ffff880064278000 00000000ffffffff ffff88005dd241a8
[  600.651021]  ffffffffa025e77c 0000000000000246 ffff880064277b98 ffffffff81733103
[  600.652255]  ffff880063d7e100 ffff880064277ba8 ffffffff8173330e ffff880064277c28
[  600.653512] Call Trace:
[  600.654765]  [<ffffffffa025e77c>] ? nfsd4_preprocess_layout_stateid+0x37c/0x400 [nfsd]
[  600.656084]  [<ffffffff81733103>] schedule+0x33/0x80
[  600.657405]  [<ffffffff8173330e>] schedule_preempt_disabled+0xe/0x10
[  600.658741]  [<ffffffff817357f5>] mutex_lock_nested+0x145/0x330
[  600.660094]  [<ffffffffa025e77c>] ? nfsd4_preprocess_layout_stateid+0x37c/0x400 [nfsd]
[  600.661696]  [<ffffffffa025e77c>] nfsd4_preprocess_layout_stateid+0x37c/0x400 [nfsd]
[  600.663129]  [<ffffffffa025e405>] ? nfsd4_preprocess_layout_stateid+0x5/0x400 [nfsd]
[  600.664558]  [<ffffffff81173b8f>] ? printk+0x56/0x72
[  600.665990]  [<ffffffffa023e3ec>] nfsd4_layoutcommit+0x13c/0x200 [nfsd]
[  600.667365]  [<ffffffffa023fb98>] nfsd4_proc_compound+0x388/0x660 [nfsd]
[  600.668835]  [<ffffffffa022c148>] nfsd_dispatch+0xb8/0x200 [nfsd]
[  600.670323]  [<ffffffffa0093d89>] svc_process_common+0x409/0x650 [sunrpc]
[  600.671836]  [<ffffffffa0094e04>] svc_process+0xf4/0x190 [sunrpc]
[  600.673328]  [<ffffffffa022bb05>] nfsd+0x135/0x1a0 [nfsd]
[  600.674825]  [<ffffffffa022b9d5>] ? nfsd+0x5/0x1a0 [nfsd]
[  600.676388]  [<ffffffffa022b9d0>] ? nfsd_destroy+0xb0/0xb0 [nfsd]
[  600.677884]  [<ffffffff810a9175>] kthread+0xf5/0x110
[  600.679373]  [<ffffffff810a9080>] ? kthread_create_on_node+0x240/0x240
[  600.680874]  [<ffffffff81738ccf>] ret_from_fork+0x3f/0x70
[  600.682398]  [<ffffffff810a9080>] ? kthread_create_on_node+0x240/0x240
[  600.683893] 1 lock held by nfsd/11623:
[  600.685449]  #0:  (&ls->ls_mutex){......}, at: [<ffffffffa025e77c>] nfsd4_preprocess_layout_stateid+0x37c/0x400 [nfsd]
[  600.688778] Sending NMI to all CPUs:
[  600.690854] NMI backtrace for cpu 0
[  600.691909] CPU: 0 PID: 11 Comm: khungtaskd Not tainted 4.4.0-rc2+ #332
[  600.692523] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  600.693821] task: ffff88007b900000 ti: ffff88007b8fc000 task.ti: ffff88007b8fc000
[  600.694496] RIP: 0010:[<ffffffff81053aca>]  [<ffffffff81053aca>] native_write_msr_safe+0xa/0x10
[  600.695185] RSP: 0018:ffff88007b8ffd70  EFLAGS: 00000046
[  600.695861] RAX: 0000000000000400 RBX: 0000000000000286 RCX: 0000000000000830
[  600.696539] RDX: 0000000000000000 RSI: 0000000000000400 RDI: 0000000000000830
[  600.697204] RBP: ffff88007b8ffd70 R08: 0000000000000400 R09: 0000000000000000
[  600.697862] R10: 00000000000000e4 R11: 0000000000000001 R12: ffff880063d7e100
[  600.698513] R13: 00000000003fff3c R14: ffff880063d7e308 R15: 0000000000000004
[  600.699156] FS:  0000000000000000(0000) GS:ffffffff81c27000(0000) knlGS:0000000000000000
[  600.699823] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  600.700459] CR2: 00007f366afd4000 CR3: 000000005dd56000 CR4: 00000000001406f0
[  600.701106] Stack:
[  600.701745]  ffff88007b8ffd88 ffffffff8104a860 ffffffff81047340 ffff88007b8ffd98
[  600.702404]  ffffffff8104a885 ffff88007b8ffda8 ffffffff8104735b ffff88007b8ffdd8
[  600.703058]  ffffffff813723ad 0000000000000078 ffff880063d7e100 00000000003fff3c
[  600.703712] Call Trace:
[  600.704355]  [<ffffffff8104a860>] __x2apic_send_IPI_mask.isra.2+0x60/0x70
[  600.705017]  [<ffffffff81047340>] ? setup_vector_irq+0x130/0x130
[  600.705676]  [<ffffffff8104a885>] x2apic_send_IPI_mask+0x15/0x20
[  600.706335]  [<ffffffff8104735b>] nmi_raise_cpu_backtrace+0x1b/0x20
[  600.706989]  [<ffffffff813723ad>] nmi_trigger_all_cpu_backtrace+0x14d/0x1c0
[  600.707693]  [<ffffffff810473b9>] arch_trigger_all_cpu_backtrace+0x19/0x20
[  600.708362]  [<ffffffff8112c4cf>] watchdog+0x32f/0x370
[  600.709031]  [<ffffffff8112c221>] ? watchdog+0x81/0x370
[  600.709725]  [<ffffffff8112c1a0>] ? reset_hung_task_detector+0x20/0x20
[  600.710398]  [<ffffffff810a9175>] kthread+0xf5/0x110
[  600.711067]  [<ffffffff810a9080>] ? kthread_create_on_node+0x240/0x240
[  600.711739]  [<ffffffff81738ccf>] ret_from_fork+0x3f/0x70
[  600.712405]  [<ffffffff810a9080>] ? kthread_create_on_node+0x240/0x240
[  600.713073] Code: 00 55 89 f9 48 89 e5 0f 32 45 31 c0 48 c1 e2 20 44 89 06 48 09 d0 5d c3 66 0f 1f 84 00 00 00 00 00 55 89 f0 89 f9 48 89 e5 0f 30 <31> c0 5d c3 66 90 55 89 f9 48 89 e5 0f 33 48 c1 e2 20 48 09 d0
[  600.715196] Kernel panic - not syncing: hung_task: blocked tasks
[  600.715889] CPU: 0 PID: 11 Comm: khungtaskd Not tainted 4.4.0-rc2+ #332
[  600.716540] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  600.717910]  ffffffff81a4d19c 00000000480750be ffff88007b8ffd60 ffffffff8136dbf4
[  600.718610]  ffff88007b8ffde8 ffffffff81173559 ffff880000000008 ffff88007b8ffdf8
[  600.719302]  ffff88007b8ffd90 00000000480750be 0000000000000001 0000000000000001
[  600.719984] Call Trace:
[  600.720646]  [<ffffffff8136dbf4>] dump_stack+0x19/0x25
[  600.721330]  [<ffffffff81173559>] panic+0xd3/0x212
[  600.722009]  [<ffffffff8112c4db>] watchdog+0x33b/0x370
[  600.722686]  [<ffffffff8112c221>] ? watchdog+0x81/0x370
[  600.723213]  [<ffffffff8112c1a0>] ? reset_hung_task_detector+0x20/0x20
[  600.723674]  [<ffffffff810a9175>] kthread+0xf5/0x110
[  600.724107]  [<ffffffff810a9080>] ? kthread_create_on_node+0x240/0x240
[  600.724509]  [<ffffffff81738ccf>] ret_from_fork+0x3f/0x70
[  600.724903]  [<ffffffff810a9080>] ? kthread_create_on_node+0x240/0x240

thanks,
Kinglong Mee

On 9/17/2015 19:58, Jeff Layton wrote:
> In order to allow the client to make a sane determination of what
> happened with racing LAYOUTGET/LAYOUTRETURN/CB_LAYOUTRECALL calls, we
> must ensure that the seqids return accurately represent the order of
> operations. The simplest way to do that is to ensure that operations on
> a single stateid are serialized.
> 
> This patch adds a mutex to the layout stateid, and locks it when
> checking the layout stateid's seqid. The mutex is held over the entire
> operation and released after the seqid is bumped.
> 
> Note that in the case of CB_LAYOUTRECALL we must move the increment of
> the seqid and setting into a new cb "prepare" operation. The lease
> infrastructure will call the lm_break callback with a spinlock held, so
> and we can't take the mutex in that codepath.
> 
> Cc: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
> ---
>  fs/nfsd/nfs4layouts.c | 25 +++++++++++++++++++++----
>  fs/nfsd/nfs4proc.c    |  4 ++++
>  fs/nfsd/state.h       |  1 +
>  3 files changed, 26 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4layouts.c b/fs/nfsd/nfs4layouts.c
> index ebf90e487c75..4a68ab901b4b 100644
> --- a/fs/nfsd/nfs4layouts.c
> +++ b/fs/nfsd/nfs4layouts.c
> @@ -201,6 +201,7 @@ nfsd4_alloc_layout_stateid(struct nfsd4_compound_state *cstate,
>  	INIT_LIST_HEAD(&ls->ls_perfile);
>  	spin_lock_init(&ls->ls_lock);
>  	INIT_LIST_HEAD(&ls->ls_layouts);
> +	mutex_init(&ls->ls_mutex);
>  	ls->ls_layout_type = layout_type;
>  	nfsd4_init_cb(&ls->ls_recall, clp, &nfsd4_cb_layout_ops,
>  			NFSPROC4_CLNT_CB_LAYOUT);
> @@ -262,19 +263,23 @@ nfsd4_preprocess_layout_stateid(struct svc_rqst *rqstp,
>  		status = nfserr_jukebox;
>  		if (!ls)
>  			goto out;
> +		mutex_lock(&ls->ls_mutex);
>  	} else {
>  		ls = container_of(stid, struct nfs4_layout_stateid, ls_stid);
>  
>  		status = nfserr_bad_stateid;
> +		mutex_lock(&ls->ls_mutex);
>  		if (stateid->si_generation > stid->sc_stateid.si_generation)
> -			goto out_put_stid;
> +			goto out_unlock_stid;
>  		if (layout_type != ls->ls_layout_type)
> -			goto out_put_stid;
> +			goto out_unlock_stid;
>  	}
>  
>  	*lsp = ls;
>  	return 0;
>  
> +out_unlock_stid:
> +	mutex_unlock(&ls->ls_mutex);
>  out_put_stid:
>  	nfs4_put_stid(stid);
>  out:
> @@ -296,8 +301,6 @@ nfsd4_recall_file_layout(struct nfs4_layout_stateid *ls)
>  	trace_layout_recall(&ls->ls_stid.sc_stateid);
>  
>  	atomic_inc(&ls->ls_stid.sc_count);
> -	update_stateid(&ls->ls_stid.sc_stateid);
> -	memcpy(&ls->ls_recall_sid, &ls->ls_stid.sc_stateid, sizeof(stateid_t));
>  	nfsd4_run_cb(&ls->ls_recall);
>  
>  out_unlock:
> @@ -494,6 +497,7 @@ nfsd4_return_file_layouts(struct svc_rqst *rqstp,
>  	}
>  	spin_unlock(&ls->ls_lock);
>  
> +	mutex_unlock(&ls->ls_mutex);
>  	nfs4_put_stid(&ls->ls_stid);
>  	nfsd4_free_layouts(&reaplist);
>  	return nfs_ok;
> @@ -608,6 +612,17 @@ nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls)
>  	}
>  }
>  
> +static void
> +nfsd4_cb_layout_prepare(struct nfsd4_callback *cb)
> +{
> +	struct nfs4_layout_stateid *ls =
> +		container_of(cb, struct nfs4_layout_stateid, ls_recall);
> +
> +	mutex_lock(&ls->ls_mutex);
> +	update_stateid(&ls->ls_stid.sc_stateid);
> +	memcpy(&ls->ls_recall_sid, &ls->ls_stid.sc_stateid, sizeof(stateid_t));
> +}
> +
>  static int
>  nfsd4_cb_layout_done(struct nfsd4_callback *cb, struct rpc_task *task)
>  {
> @@ -649,12 +664,14 @@ nfsd4_cb_layout_release(struct nfsd4_callback *cb)
>  
>  	trace_layout_recall_release(&ls->ls_stid.sc_stateid);
>  
> +	mutex_unlock(&ls->ls_mutex);
>  	nfsd4_return_all_layouts(ls, &reaplist);
>  	nfsd4_free_layouts(&reaplist);
>  	nfs4_put_stid(&ls->ls_stid);
>  }
>  
>  static struct nfsd4_callback_ops nfsd4_cb_layout_ops = {
> +	.prepare	= nfsd4_cb_layout_prepare,
>  	.done		= nfsd4_cb_layout_done,
>  	.release	= nfsd4_cb_layout_release,
>  };
> diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
> index 4ce6b97b31ad..a9f096c7e99f 100644
> --- a/fs/nfsd/nfs4proc.c
> +++ b/fs/nfsd/nfs4proc.c
> @@ -1309,6 +1309,7 @@ nfsd4_layoutget(struct svc_rqst *rqstp,
>  	nfserr = nfsd4_insert_layout(lgp, ls);
>  
>  out_put_stid:
> +	mutex_unlock(&ls->ls_mutex);
>  	nfs4_put_stid(&ls->ls_stid);
>  out:
>  	return nfserr;
> @@ -1362,6 +1363,9 @@ nfsd4_layoutcommit(struct svc_rqst *rqstp,
>  		goto out;
>  	}
>  
> +	/* LAYOUTCOMMIT does not require any serialization */
> +	mutex_unlock(&ls->ls_mutex);
> +
>  	if (new_size > i_size_read(inode)) {
>  		lcp->lc_size_chg = 1;
>  		lcp->lc_newsize = new_size;
> diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
> index 31bde12feefe..1fa0f3848d4e 100644
> --- a/fs/nfsd/state.h
> +++ b/fs/nfsd/state.h
> @@ -562,6 +562,7 @@ struct nfs4_layout_stateid {
>  	struct nfsd4_callback		ls_recall;
>  	stateid_t			ls_recall_sid;
>  	bool				ls_recalled;
> +	struct mutex			ls_mutex;
>  };
>  
>  static inline struct nfs4_layout_stateid *layoutstateid(struct nfs4_stid *s)
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/nfsd/nfs4layouts.c b/fs/nfsd/nfs4layouts.c
index ebf90e487c75..4a68ab901b4b 100644
--- a/fs/nfsd/nfs4layouts.c
+++ b/fs/nfsd/nfs4layouts.c
@@ -201,6 +201,7 @@  nfsd4_alloc_layout_stateid(struct nfsd4_compound_state *cstate,
 	INIT_LIST_HEAD(&ls->ls_perfile);
 	spin_lock_init(&ls->ls_lock);
 	INIT_LIST_HEAD(&ls->ls_layouts);
+	mutex_init(&ls->ls_mutex);
 	ls->ls_layout_type = layout_type;
 	nfsd4_init_cb(&ls->ls_recall, clp, &nfsd4_cb_layout_ops,
 			NFSPROC4_CLNT_CB_LAYOUT);
@@ -262,19 +263,23 @@  nfsd4_preprocess_layout_stateid(struct svc_rqst *rqstp,
 		status = nfserr_jukebox;
 		if (!ls)
 			goto out;
+		mutex_lock(&ls->ls_mutex);
 	} else {
 		ls = container_of(stid, struct nfs4_layout_stateid, ls_stid);
 
 		status = nfserr_bad_stateid;
+		mutex_lock(&ls->ls_mutex);
 		if (stateid->si_generation > stid->sc_stateid.si_generation)
-			goto out_put_stid;
+			goto out_unlock_stid;
 		if (layout_type != ls->ls_layout_type)
-			goto out_put_stid;
+			goto out_unlock_stid;
 	}
 
 	*lsp = ls;
 	return 0;
 
+out_unlock_stid:
+	mutex_unlock(&ls->ls_mutex);
 out_put_stid:
 	nfs4_put_stid(stid);
 out:
@@ -296,8 +301,6 @@  nfsd4_recall_file_layout(struct nfs4_layout_stateid *ls)
 	trace_layout_recall(&ls->ls_stid.sc_stateid);
 
 	atomic_inc(&ls->ls_stid.sc_count);
-	update_stateid(&ls->ls_stid.sc_stateid);
-	memcpy(&ls->ls_recall_sid, &ls->ls_stid.sc_stateid, sizeof(stateid_t));
 	nfsd4_run_cb(&ls->ls_recall);
 
 out_unlock:
@@ -494,6 +497,7 @@  nfsd4_return_file_layouts(struct svc_rqst *rqstp,
 	}
 	spin_unlock(&ls->ls_lock);
 
+	mutex_unlock(&ls->ls_mutex);
 	nfs4_put_stid(&ls->ls_stid);
 	nfsd4_free_layouts(&reaplist);
 	return nfs_ok;
@@ -608,6 +612,17 @@  nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls)
 	}
 }
 
+static void
+nfsd4_cb_layout_prepare(struct nfsd4_callback *cb)
+{
+	struct nfs4_layout_stateid *ls =
+		container_of(cb, struct nfs4_layout_stateid, ls_recall);
+
+	mutex_lock(&ls->ls_mutex);
+	update_stateid(&ls->ls_stid.sc_stateid);
+	memcpy(&ls->ls_recall_sid, &ls->ls_stid.sc_stateid, sizeof(stateid_t));
+}
+
 static int
 nfsd4_cb_layout_done(struct nfsd4_callback *cb, struct rpc_task *task)
 {
@@ -649,12 +664,14 @@  nfsd4_cb_layout_release(struct nfsd4_callback *cb)
 
 	trace_layout_recall_release(&ls->ls_stid.sc_stateid);
 
+	mutex_unlock(&ls->ls_mutex);
 	nfsd4_return_all_layouts(ls, &reaplist);
 	nfsd4_free_layouts(&reaplist);
 	nfs4_put_stid(&ls->ls_stid);
 }
 
 static struct nfsd4_callback_ops nfsd4_cb_layout_ops = {
+	.prepare	= nfsd4_cb_layout_prepare,
 	.done		= nfsd4_cb_layout_done,
 	.release	= nfsd4_cb_layout_release,
 };
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 4ce6b97b31ad..a9f096c7e99f 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -1309,6 +1309,7 @@  nfsd4_layoutget(struct svc_rqst *rqstp,
 	nfserr = nfsd4_insert_layout(lgp, ls);
 
 out_put_stid:
+	mutex_unlock(&ls->ls_mutex);
 	nfs4_put_stid(&ls->ls_stid);
 out:
 	return nfserr;
@@ -1362,6 +1363,9 @@  nfsd4_layoutcommit(struct svc_rqst *rqstp,
 		goto out;
 	}
 
+	/* LAYOUTCOMMIT does not require any serialization */
+	mutex_unlock(&ls->ls_mutex);
+
 	if (new_size > i_size_read(inode)) {
 		lcp->lc_size_chg = 1;
 		lcp->lc_newsize = new_size;
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 31bde12feefe..1fa0f3848d4e 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -562,6 +562,7 @@  struct nfs4_layout_stateid {
 	struct nfsd4_callback		ls_recall;
 	stateid_t			ls_recall_sid;
 	bool				ls_recalled;
+	struct mutex			ls_mutex;
 };
 
 static inline struct nfs4_layout_stateid *layoutstateid(struct nfs4_stid *s)