diff mbox

[3/4] NFS: avoid deadlocks with loop-back mounted NFS filesystems.

Message ID 20140918060317.23854.40603.stgit@notabene.brown (mailing list archive)
State New, archived
Headers show

Commit Message

NeilBrown Sept. 18, 2014, 6:03 a.m. UTC
Support for loop-back mounted NFS filesystems is useful when NFS is
used to access shared storage in a high-availability cluster.

If the node running the NFS server fails, some other node can mount the
filesystem and start providing NFS service.  If that node already had
the filesystem NFS mounted, it will now have it loop-back mounted.

nfsd can suffer a deadlock when allocating memory and entering direct
reclaim.
While direct reclaim does not write to the NFS filesystem it can send
and wait for a COMMIT through nfs_release_page().

This patch modifies nfs_release_page() to wait a limited time for the
commit to complete - one second.  If the commit doesn't complete
in this time, nfs_release_page() will fail.  This means it might now
fail in some cases where it wouldn't before.  These cases are only
when 'gfp' includes '__GFP_WAIT'.

nfs_release_page() is only called by try_to_release_page(), and that
can only be called on an NFS page with required 'gfp' flags from
 - page_cache_pipe_buf_steal() in splice.c
 - shrink_page_list() in vmscan.c
 - invalidate_inode_pages2_range() in truncate.c

The first two handle failure quite safely.  The last is only called
after ->launder_page() has been called, and that will have waited
for the commit to finish already.

So aborting if the commit takes longer than 1 second is perfectly safe.

If nfs_release_page() is called on a sequence of pages which are all
in the same file which is blocked on COMMIT, each page could
contribute a 1 second delay which could be come excessive.  I have
seen delays of as much as 208 seconds.

To keep the delay to one second, the bdi is marked as write-congested
if the commit didn't finished.  Once it does finish, the
write-congested flag will be cleared.

With this, the longest total delay in try_to_free_pages that I have
seen in under 3 seconds.  With no waiting in nfs_release_page at all
I have seen delays of nearly 1.5 seconds.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfs/file.c  |   30 ++++++++++++++++++++----------
 fs/nfs/write.c |    7 +++++++
 2 files changed, 27 insertions(+), 10 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Jeff Layton Sept. 18, 2014, 12:01 p.m. UTC | #1
On Thu, 18 Sep 2014 16:03:17 +1000
NeilBrown <neilb@suse.de> wrote:

> Support for loop-back mounted NFS filesystems is useful when NFS is
> used to access shared storage in a high-availability cluster.
> 
> If the node running the NFS server fails, some other node can mount the
> filesystem and start providing NFS service.  If that node already had
> the filesystem NFS mounted, it will now have it loop-back mounted.
> 
> nfsd can suffer a deadlock when allocating memory and entering direct
> reclaim.
> While direct reclaim does not write to the NFS filesystem it can send
> and wait for a COMMIT through nfs_release_page().
> 
> This patch modifies nfs_release_page() to wait a limited time for the
> commit to complete - one second.  If the commit doesn't complete
> in this time, nfs_release_page() will fail.  This means it might now
> fail in some cases where it wouldn't before.  These cases are only
> when 'gfp' includes '__GFP_WAIT'.
> 
> nfs_release_page() is only called by try_to_release_page(), and that
> can only be called on an NFS page with required 'gfp' flags from
>  - page_cache_pipe_buf_steal() in splice.c
>  - shrink_page_list() in vmscan.c
>  - invalidate_inode_pages2_range() in truncate.c
> 
> The first two handle failure quite safely.  The last is only called
> after ->launder_page() has been called, and that will have waited
> for the commit to finish already.
> 
> So aborting if the commit takes longer than 1 second is perfectly safe.
> 
> If nfs_release_page() is called on a sequence of pages which are all
> in the same file which is blocked on COMMIT, each page could
> contribute a 1 second delay which could be come excessive.  I have
> seen delays of as much as 208 seconds.
> 
> To keep the delay to one second, the bdi is marked as write-congested
> if the commit didn't finished.  Once it does finish, the
> write-congested flag will be cleared.
> 
> With this, the longest total delay in try_to_free_pages that I have
> seen in under 3 seconds.  With no waiting in nfs_release_page at all
> I have seen delays of nearly 1.5 seconds.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/nfs/file.c  |   30 ++++++++++++++++++++----------
>  fs/nfs/write.c |    7 +++++++
>  2 files changed, 27 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index 524dd80d1898..febba950d8a6 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -468,17 +468,27 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
>  
>  	dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);
>  
> -	/* Only do I/O if gfp is a superset of GFP_KERNEL, and we're not
> -	 * doing this memory reclaim for a fs-related allocation.
> +	/* Always try to initiate a 'commit' if relevant, but only
> +	 * wait for it if __GFP_WAIT is set and the calling process is
> +	 * allowed to block.  Even then, only wait 1 second and only
> +	 * if the 'bdi' is not congested.
> +	 * Waiting indefinitely can cause deadlocks when the NFS
> +	 * server is on this machine, and there is no particular need
> +	 * to wait extensively here.  A short wait has the benefit
> +	 * that someone else can worry about the freezer.
>  	 */
> -	if (mapping && (gfp & GFP_KERNEL) == GFP_KERNEL &&
> -	    !(current->flags & PF_FSTRANS)) {
> -		int how = FLUSH_SYNC;
> -
> -		/* Don't let kswapd deadlock waiting for OOM RPC calls */
> -		if (current_is_kswapd())
> -			how = 0;
> -		nfs_commit_inode(mapping->host, how);
> +	if (mapping) {
> +		struct nfs_server *nfss = NFS_SERVER(mapping->host);
> +		nfs_commit_inode(mapping->host, 0);
> +		if ((gfp & __GFP_WAIT) &&
> +		    !current_is_kswapd() &&
> +		    !(current->flags & PF_FSTRANS) &&
> +		    !bdi_write_congested(&nfss->backing_dev_info))
> +			wait_on_page_bit_killable_timeout(page, PG_private,
> +							  HZ);
> +		if (PagePrivate(page))
> +			set_bdi_congested(&nfss->backing_dev_info,
> +					  BLK_RW_ASYNC);

I've never had a great feel for the BDI congestion stuff, but won't
this have some unintended effects?

For instance, suppose the VM decides to try to free this page and
passes in a gfp mask that doesn't contain __GFP_WAIT. We issue the
COMMIT, but don't wait for it. The COMMIT is actually going to go
reasonably fast, but we now set the BDI congested because we didn't
wait for it to occur.

That in turn causes writeout for other inodes on this BDI to get
throttled even though there really is no congestion. It just looks that
way due to how releasepage got called.

Am I making mountains out of molehills here?

>  	}
>  	/* If PagePrivate() is set, then the page is not freeable */
>  	if (PagePrivate(page))
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index 175d5d073ccf..3066c7fcb565 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -731,6 +731,8 @@ static void nfs_inode_remove_request(struct nfs_page *req)
>  		if (likely(!PageSwapCache(head->wb_page))) {
>  			set_page_private(head->wb_page, 0);
>  			ClearPagePrivate(head->wb_page);
> +			smp_mb__after_atomic();
> +			wake_up_page(head->wb_page, PG_private);
>  			clear_bit(PG_MAPPED, &head->wb_flags);
>  		}
>  		nfsi->npages--;
> @@ -1636,6 +1638,7 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
>  	struct nfs_page	*req;
>  	int status = data->task.tk_status;
>  	struct nfs_commit_info cinfo;
> +	struct nfs_server *nfss;
>  
>  	while (!list_empty(&data->pages)) {
>  		req = nfs_list_entry(data->pages.next);
> @@ -1669,6 +1672,10 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
>  	next:
>  		nfs_unlock_and_release_request(req);
>  	}
> +	nfss = NFS_SERVER(data->inode);
> +	if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
> +		clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
> +
>  	nfs_init_cinfo(&cinfo, data->inode, data->dreq);
>  	if (atomic_dec_and_test(&cinfo.mds->rpcs_out))
>  		nfs_commit_clear_lock(NFS_I(data->inode));
> 
>
NeilBrown Sept. 22, 2014, 1:37 a.m. UTC | #2
On Thu, 18 Sep 2014 08:01:07 -0400 Jeff Layton <jeff.layton@primarydata.com>
wrote:

> On Thu, 18 Sep 2014 16:03:17 +1000
> NeilBrown <neilb@suse.de> wrote:
> 
> > Support for loop-back mounted NFS filesystems is useful when NFS is
> > used to access shared storage in a high-availability cluster.
> > 
> > If the node running the NFS server fails, some other node can mount the
> > filesystem and start providing NFS service.  If that node already had
> > the filesystem NFS mounted, it will now have it loop-back mounted.
> > 
> > nfsd can suffer a deadlock when allocating memory and entering direct
> > reclaim.
> > While direct reclaim does not write to the NFS filesystem it can send
> > and wait for a COMMIT through nfs_release_page().
> > 
> > This patch modifies nfs_release_page() to wait a limited time for the
> > commit to complete - one second.  If the commit doesn't complete
> > in this time, nfs_release_page() will fail.  This means it might now
> > fail in some cases where it wouldn't before.  These cases are only
> > when 'gfp' includes '__GFP_WAIT'.
> > 
> > nfs_release_page() is only called by try_to_release_page(), and that
> > can only be called on an NFS page with required 'gfp' flags from
> >  - page_cache_pipe_buf_steal() in splice.c
> >  - shrink_page_list() in vmscan.c
> >  - invalidate_inode_pages2_range() in truncate.c
> > 
> > The first two handle failure quite safely.  The last is only called
> > after ->launder_page() has been called, and that will have waited
> > for the commit to finish already.
> > 
> > So aborting if the commit takes longer than 1 second is perfectly safe.
> > 
> > If nfs_release_page() is called on a sequence of pages which are all
> > in the same file which is blocked on COMMIT, each page could
> > contribute a 1 second delay which could be come excessive.  I have
> > seen delays of as much as 208 seconds.
> > 
> > To keep the delay to one second, the bdi is marked as write-congested
> > if the commit didn't finished.  Once it does finish, the
> > write-congested flag will be cleared.
> > 
> > With this, the longest total delay in try_to_free_pages that I have
> > seen in under 3 seconds.  With no waiting in nfs_release_page at all
> > I have seen delays of nearly 1.5 seconds.
> > 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  fs/nfs/file.c  |   30 ++++++++++++++++++++----------
> >  fs/nfs/write.c |    7 +++++++
> >  2 files changed, 27 insertions(+), 10 deletions(-)
> > 
> > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> > index 524dd80d1898..febba950d8a6 100644
> > --- a/fs/nfs/file.c
> > +++ b/fs/nfs/file.c
> > @@ -468,17 +468,27 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
> >  
> >  	dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);
> >  
> > -	/* Only do I/O if gfp is a superset of GFP_KERNEL, and we're not
> > -	 * doing this memory reclaim for a fs-related allocation.
> > +	/* Always try to initiate a 'commit' if relevant, but only
> > +	 * wait for it if __GFP_WAIT is set and the calling process is
> > +	 * allowed to block.  Even then, only wait 1 second and only
> > +	 * if the 'bdi' is not congested.
> > +	 * Waiting indefinitely can cause deadlocks when the NFS
> > +	 * server is on this machine, and there is no particular need
> > +	 * to wait extensively here.  A short wait has the benefit
> > +	 * that someone else can worry about the freezer.
> >  	 */
> > -	if (mapping && (gfp & GFP_KERNEL) == GFP_KERNEL &&
> > -	    !(current->flags & PF_FSTRANS)) {
> > -		int how = FLUSH_SYNC;
> > -
> > -		/* Don't let kswapd deadlock waiting for OOM RPC calls */
> > -		if (current_is_kswapd())
> > -			how = 0;
> > -		nfs_commit_inode(mapping->host, how);
> > +	if (mapping) {
> > +		struct nfs_server *nfss = NFS_SERVER(mapping->host);
> > +		nfs_commit_inode(mapping->host, 0);
> > +		if ((gfp & __GFP_WAIT) &&
> > +		    !current_is_kswapd() &&
> > +		    !(current->flags & PF_FSTRANS) &&
> > +		    !bdi_write_congested(&nfss->backing_dev_info))
> > +			wait_on_page_bit_killable_timeout(page, PG_private,
> > +							  HZ);
> > +		if (PagePrivate(page))
> > +			set_bdi_congested(&nfss->backing_dev_info,
> > +					  BLK_RW_ASYNC);
> 
> I've never had a great feel for the BDI congestion stuff, but won't
> this have some unintended effects?
> 
> For instance, suppose the VM decides to try to free this page and
> passes in a gfp mask that doesn't contain __GFP_WAIT. We issue the
> COMMIT, but don't wait for it. The COMMIT is actually going to go
> reasonably fast, but we now set the BDI congested because we didn't
> wait for it to occur.
> 
> That in turn causes writeout for other inodes on this BDI to get
> throttled even though there really is no congestion. It just looks that
> way due to how releasepage got called.
> 
> Am I making mountains out of molehills here?

Excellent molehill - thanks :-)

I was being lazy.  The 'if (PagePrivate())' should really be inside the
other if statement with the wait_on_page_bit...().  I've moved it there.
Once I get an Ack for the mm bits I'll report it all.

Thanks,
NeilBrown

(Molehills are worse for your suspension than mountains!)


> 
> >  	}
> >  	/* If PagePrivate() is set, then the page is not freeable */
> >  	if (PagePrivate(page))
> > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > index 175d5d073ccf..3066c7fcb565 100644
> > --- a/fs/nfs/write.c
> > +++ b/fs/nfs/write.c
> > @@ -731,6 +731,8 @@ static void nfs_inode_remove_request(struct nfs_page *req)
> >  		if (likely(!PageSwapCache(head->wb_page))) {
> >  			set_page_private(head->wb_page, 0);
> >  			ClearPagePrivate(head->wb_page);
> > +			smp_mb__after_atomic();
> > +			wake_up_page(head->wb_page, PG_private);
> >  			clear_bit(PG_MAPPED, &head->wb_flags);
> >  		}
> >  		nfsi->npages--;
> > @@ -1636,6 +1638,7 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
> >  	struct nfs_page	*req;
> >  	int status = data->task.tk_status;
> >  	struct nfs_commit_info cinfo;
> > +	struct nfs_server *nfss;
> >  
> >  	while (!list_empty(&data->pages)) {
> >  		req = nfs_list_entry(data->pages.next);
> > @@ -1669,6 +1672,10 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
> >  	next:
> >  		nfs_unlock_and_release_request(req);
> >  	}
> > +	nfss = NFS_SERVER(data->inode);
> > +	if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
> > +		clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
> > +
> >  	nfs_init_cinfo(&cinfo, data->inode, data->dreq);
> >  	if (atomic_dec_and_test(&cinfo.mds->rpcs_out))
> >  		nfs_commit_clear_lock(NFS_I(data->inode));
> > 
> > 
> 
>
diff mbox

Patch

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 524dd80d1898..febba950d8a6 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -468,17 +468,27 @@  static int nfs_release_page(struct page *page, gfp_t gfp)
 
 	dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);
 
-	/* Only do I/O if gfp is a superset of GFP_KERNEL, and we're not
-	 * doing this memory reclaim for a fs-related allocation.
+	/* Always try to initiate a 'commit' if relevant, but only
+	 * wait for it if __GFP_WAIT is set and the calling process is
+	 * allowed to block.  Even then, only wait 1 second and only
+	 * if the 'bdi' is not congested.
+	 * Waiting indefinitely can cause deadlocks when the NFS
+	 * server is on this machine, and there is no particular need
+	 * to wait extensively here.  A short wait has the benefit
+	 * that someone else can worry about the freezer.
 	 */
-	if (mapping && (gfp & GFP_KERNEL) == GFP_KERNEL &&
-	    !(current->flags & PF_FSTRANS)) {
-		int how = FLUSH_SYNC;
-
-		/* Don't let kswapd deadlock waiting for OOM RPC calls */
-		if (current_is_kswapd())
-			how = 0;
-		nfs_commit_inode(mapping->host, how);
+	if (mapping) {
+		struct nfs_server *nfss = NFS_SERVER(mapping->host);
+		nfs_commit_inode(mapping->host, 0);
+		if ((gfp & __GFP_WAIT) &&
+		    !current_is_kswapd() &&
+		    !(current->flags & PF_FSTRANS) &&
+		    !bdi_write_congested(&nfss->backing_dev_info))
+			wait_on_page_bit_killable_timeout(page, PG_private,
+							  HZ);
+		if (PagePrivate(page))
+			set_bdi_congested(&nfss->backing_dev_info,
+					  BLK_RW_ASYNC);
 	}
 	/* If PagePrivate() is set, then the page is not freeable */
 	if (PagePrivate(page))
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 175d5d073ccf..3066c7fcb565 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -731,6 +731,8 @@  static void nfs_inode_remove_request(struct nfs_page *req)
 		if (likely(!PageSwapCache(head->wb_page))) {
 			set_page_private(head->wb_page, 0);
 			ClearPagePrivate(head->wb_page);
+			smp_mb__after_atomic();
+			wake_up_page(head->wb_page, PG_private);
 			clear_bit(PG_MAPPED, &head->wb_flags);
 		}
 		nfsi->npages--;
@@ -1636,6 +1638,7 @@  static void nfs_commit_release_pages(struct nfs_commit_data *data)
 	struct nfs_page	*req;
 	int status = data->task.tk_status;
 	struct nfs_commit_info cinfo;
+	struct nfs_server *nfss;
 
 	while (!list_empty(&data->pages)) {
 		req = nfs_list_entry(data->pages.next);
@@ -1669,6 +1672,10 @@  static void nfs_commit_release_pages(struct nfs_commit_data *data)
 	next:
 		nfs_unlock_and_release_request(req);
 	}
+	nfss = NFS_SERVER(data->inode);
+	if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
+		clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
+
 	nfs_init_cinfo(&cinfo, data->inode, data->dreq);
 	if (atomic_dec_and_test(&cinfo.mds->rpcs_out))
 		nfs_commit_clear_lock(NFS_I(data->inode));