diff mbox

dcache: return -ESTALE not -EBUSY on distributed fs race

Message ID 20141217195911.GF9617@fieldses.org (mailing list archive)
State New, archived
Headers show

Commit Message

J. Bruce Fields Dec. 17, 2014, 7:59 p.m. UTC
From: "J. Bruce Fields" <bfields@redhat.com>

On a distributed filesystem it's possible for lookup to discover that a
directory it just found is already cached elsewhere in the directory
heirarchy.  The dcache won't let us keep the directory in both places,
so we have to move the dentry to the new location from the place we
previously had it cached.

If the parent has changed, then this requires all the same locks as we'd
need to do a cross-directory rename.  But we're already in lookup
holding one parent's i_mutex, so it's too late to acquire those locks in
the right order.

The (unreliable) solution in __d_unalias is to trylock() the required
locks and return -EBUSY if it fails.

I see no particular reason for returning -EBUSY, and -ESTALE is already
the result of some other lookup races on NFS.  I think -ESTALE is the
more helpful error return.  It also allows us to take advantage of the
logic Jeff Layton added in c6a9428401c0 "vfs: fix renameat to retry on
ESTALE errors" and ancestors, which hopefully resolves some of these
errors before they're returned to userspace.

I can reproduce these cases using NFS with:

	ssh root@$client '
		mount -olookupcache=pos '$server':'$export' /mnt/
		mkdir /mnt/TO
		mkdir /mnt/DIR
		touch /mnt/DIR/test.txt
		while true; do
			strace -e open cat /mnt/DIR/test.txt 2>&1 | grep EBUSY
		done
	'
	ssh root@$server '
		while true; do
			mv $export/DIR $export/TO/DIR
			mv $export/TO/DIR $export/DIR
		done
	'

It also helps to add some other concurrent use of the directory on the
client (e.g., "ls /mnt/TO").  And you can replace the server-side mv's
by client-side mv's that are repeatedly killed.  (If the client is
interrupted while waiting for the RENAME response then it's left with a
dentry that has to go under one parent or the other, but it doesn't yet
know which.)

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/dcache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

J. Bruce Fields Dec. 17, 2014, 8:01 p.m. UTC | #1
I keep looking back at this problem trying to think of a better idea.
Most recently I remember Jeff's work to retry ESTALE's and wondered if
we could take advantage of that.

It's still kind of a bandaid, but it's the only thing I've thought of
that at least helps a little and isn't a huge pain.  Any other ideas?

--b.

On Wed, Dec 17, 2014 at 02:59:11PM -0500, bfields wrote:
> From: "J. Bruce Fields" <bfields@redhat.com>
> 
> On a distributed filesystem it's possible for lookup to discover that a
> directory it just found is already cached elsewhere in the directory
> heirarchy.  The dcache won't let us keep the directory in both places,
> so we have to move the dentry to the new location from the place we
> previously had it cached.
> 
> If the parent has changed, then this requires all the same locks as we'd
> need to do a cross-directory rename.  But we're already in lookup
> holding one parent's i_mutex, so it's too late to acquire those locks in
> the right order.
> 
> The (unreliable) solution in __d_unalias is to trylock() the required
> locks and return -EBUSY if it fails.
> 
> I see no particular reason for returning -EBUSY, and -ESTALE is already
> the result of some other lookup races on NFS.  I think -ESTALE is the
> more helpful error return.  It also allows us to take advantage of the
> logic Jeff Layton added in c6a9428401c0 "vfs: fix renameat to retry on
> ESTALE errors" and ancestors, which hopefully resolves some of these
> errors before they're returned to userspace.
> 
> I can reproduce these cases using NFS with:
> 
> 	ssh root@$client '
> 		mount -olookupcache=pos '$server':'$export' /mnt/
> 		mkdir /mnt/TO
> 		mkdir /mnt/DIR
> 		touch /mnt/DIR/test.txt
> 		while true; do
> 			strace -e open cat /mnt/DIR/test.txt 2>&1 | grep EBUSY
> 		done
> 	'
> 	ssh root@$server '
> 		while true; do
> 			mv $export/DIR $export/TO/DIR
> 			mv $export/TO/DIR $export/DIR
> 		done
> 	'
> 
> It also helps to add some other concurrent use of the directory on the
> client (e.g., "ls /mnt/TO").  And you can replace the server-side mv's
> by client-side mv's that are repeatedly killed.  (If the client is
> interrupted while waiting for the RENAME response then it's left with a
> dentry that has to go under one parent or the other, but it doesn't yet
> know which.)
> 
> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> ---
>  fs/dcache.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 5bc72b07fde2..b7de7f1ae38f 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -2612,7 +2612,7 @@ static struct dentry *__d_unalias(struct inode *inode,
>  		struct dentry *dentry, struct dentry *alias)
>  {
>  	struct mutex *m1 = NULL, *m2 = NULL;
> -	struct dentry *ret = ERR_PTR(-EBUSY);
> +	struct dentry *ret = ERR_PTR(-ESTALE);
>  
>  	/* If alias and dentry share a parent, then no extra locks required */
>  	if (alias->d_parent == dentry->d_parent)
> -- 
> 1.9.3
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
J. R. Okajima Dec. 18, 2014, 3:50 p.m. UTC | #2
"J. Bruce Fields":
> It's still kind of a bandaid, but it's the only thing I've thought of
> that at least helps a little and isn't a huge pain.  Any other ideas?

How about introducing a tiny inline function which returns either EBUSY
or ESTALE?

static inline int busy_or_stale()
{
	if (!test_distributed_fs())
		return -EBUSY;
	return -ESTALE;
}

For nfs, the test function will be something like this.

int test_nfsd()
{
	int ret;
	struct task_struct *tsk = current;
	char comm[sizeof(tsk->comm)];

	ret = 0;
	if (tsk->flags & PF_KTHREAD) {
		get_task_comm(comm, tsk);
		ret = !strcmp(comm, "nfsd");
	}

	return ret;
}

I know this "by-name" test is not good.


J. R. Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
J. Bruce Fields Dec. 18, 2014, 3:58 p.m. UTC | #3
On Fri, Dec 19, 2014 at 12:50:38AM +0900, J. R. Okajima wrote:
> 
> "J. Bruce Fields":
> > It's still kind of a bandaid, but it's the only thing I've thought of
> > that at least helps a little and isn't a huge pain.  Any other ideas?
> 
> How about introducing a tiny inline function which returns either EBUSY
> or ESTALE?
> 
> static inline int busy_or_stale()
> {
> 	if (!test_distributed_fs())
> 		return -EBUSY;
> 	return -ESTALE;
> }

Why do you think -EBUSY's the right error in the local filesystem case?

This condition really shouldn't happen in the local filesystem case
anyway.  If anything maybe it should be -EIO, since it looks like a sign
your filesystem's corrupted.

> For nfs, the test function will be something like this.
> 
> int test_nfsd()
> {
> 	int ret;
> 	struct task_struct *tsk = current;
> 	char comm[sizeof(tsk->comm)];
> 
> 	ret = 0;
> 	if (tsk->flags & PF_KTHREAD) {
> 		get_task_comm(comm, tsk);
> 		ret = !strcmp(comm, "nfsd");
> 	}
> 
> 	return ret;
> }
> 
> I know this "by-name" test is not good.

Also you'd want to be checking for nfs, not nfsd.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
J. R. Okajima Dec. 18, 2014, 4:27 p.m. UTC | #4
"J. Bruce Fields":
> Why do you think -EBUSY's the right error in the local filesystem case?

This busy_or_stale() is another bandaid, based upon your patch, EBUSY
--> ESTALE.
Because the msg string of ESTALE is "Stale NFS file handle" on many
systems, I don't think it a good idea to return it for local fs. If you
think EIO is better than EBUSY you can change it to eio_or_stale(). If
you think it is surely not happen on every local fs, then this inline
function is not necessary.


J. R. Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Layton Dec. 18, 2014, 5:32 p.m. UTC | #5
On Fri, 19 Dec 2014 01:27:14 +0900
"J. R. Okajima" <hooanon05g@gmail.com> wrote:

> 
> "J. Bruce Fields":
> > Why do you think -EBUSY's the right error in the local filesystem case?
> 
> This busy_or_stale() is another bandaid, based upon your patch, EBUSY
> --> ESTALE.
> Because the msg string of ESTALE is "Stale NFS file handle" on many
> systems, I don't think it a good idea to return it for local fs. If you
> think EIO is better than EBUSY you can change it to eio_or_stale(). If
> you think it is surely not happen on every local fs, then this inline
> function is not necessary.
> 

I think the only way this could happen on a local fs would be for it to
allow hardlinked directories, which is (of course) forbidden. After
all, the comment above __d_unalias specifically mentions that it's
there to handle remotely renamed directories.

I think the patch is almost certainly fine for NFS. The only question
I'd have is whether other distributed filesystems (p9? ceph? gfs2?)
might have an issue here. It seems unlikely though, and most of them
would also benefit from redoing the lookup with LOOKUP_REVAL set in
this situation.

Still, it's probably worth letting this soak in -next for a bit to be
sure we're not missing anything.

Bruce, you can add my:

Acked-by: Jeff Layton <jlayton@primarydata.com>
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Dec. 18, 2014, 11:15 p.m. UTC | #6
On Fri, Dec 19, 2014 at 01:27:14AM +0900, J. R. Okajima wrote:
> 
> "J. Bruce Fields":
> > Why do you think -EBUSY's the right error in the local filesystem case?
> 
> This busy_or_stale() is another bandaid, based upon your patch, EBUSY
> --> ESTALE.
> Because the msg string of ESTALE is "Stale NFS file handle" on many
> systems, I don't think it a good idea to return it for local fs. If you

We use file handles on local filesystems. They have exactly the same
semantics as NFS file handles and so local filesystems have the same
ESTALE exposure as "distributed" filesystems to this problem. i.e:

$ man open_by_handle_at
.....
	ESTALE	The specified handle is not valid.  This error will
		occur if, for example, the file has been deleted.

We've been doing this on XFS since before it was ported to linux 15
years ago... ;)

Cheers,

Dave.
J. R. Okajima Dec. 19, 2014, 2:46 a.m. UTC | #7
Dave Chinner:
> We use file handles on local filesystems. They have exactly the same
> semantics as NFS file handles and so local filesystems have the same
> ESTALE exposure as "distributed" filesystems to this problem. i.e:

It is reasonable.
But do you think ESTALE is good in case of __d_unalias()?


J. R. Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Layton Dec. 19, 2014, 11:09 a.m. UTC | #8
On Fri, 19 Dec 2014 11:46:26 +0900
"J. R. Okajima" <hooanon05g@gmail.com> wrote:

> 
> Dave Chinner:
> > We use file handles on local filesystems. They have exactly the same
> > semantics as NFS file handles and so local filesystems have the same
> > ESTALE exposure as "distributed" filesystems to this problem. i.e:
> 
> It is reasonable.
> But do you think ESTALE is good in case of __d_unalias()?
> 

I don't think you'll ever see __d_unalias called on a local fs, as the
__d_find_any_alias call should always return NULL on a local fs.

With a local fs, we always know that the dcache is correct (which is
why local fs' generally don't supply a d_revalidate method). The only
way to get to __d_unalias is for all of the following to be true:

- you're splicing a new dentry into the tree
- that new dentry is a directory
- the inode for it already has a dentry associated with it that's in a
  different place

...that can only happen in a remote filesystem. The only way for that
to occur on a local fs is for that fs to allow a directory to be
hardlinked, which should of course never happen.

As to whether ESTALE is as good as EBUSY there...I doubt it'll matter
much. Remote filesystems generally have to be able to cope with ESTALE
errors anyway, and I doubt a lot of userland code handles EBUSY
especially for this situation.
J. R. Okajima Dec. 22, 2014, 9:53 a.m. UTC | #9
Jeff Layton:
> ...that can only happen in a remote filesystem. The only way for that
> to occur on a local fs is for that fs to allow a directory to be
> hardlinked, which should of course never happen.
>
> As to whether ESTALE is as good as EBUSY there...I doubt it'll matter
> much. Remote filesystems generally have to be able to cope with ESTALE
> errors anyway, and I doubt a lot of userland code handles EBUSY
> especially for this situation.

Ok, I understand.
Thank for explanation.


J. R. Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/dcache.c b/fs/dcache.c
index 5bc72b07fde2..b7de7f1ae38f 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2612,7 +2612,7 @@  static struct dentry *__d_unalias(struct inode *inode,
 		struct dentry *dentry, struct dentry *alias)
 {
 	struct mutex *m1 = NULL, *m2 = NULL;
-	struct dentry *ret = ERR_PTR(-EBUSY);
+	struct dentry *ret = ERR_PTR(-ESTALE);
 
 	/* If alias and dentry share a parent, then no extra locks required */
 	if (alias->d_parent == dentry->d_parent)