diff mbox

[ceph] locking fun with d_materialise_unique()

Message ID alpine.DEB.2.00.1301291301260.6657@cobra.newdream.net (mailing list archive)
State New, archived
Headers show

Commit Message

Sage Weil Jan. 29, 2013, 9:03 p.m. UTC
Hi Al,

On Tue, 29 Jan 2013, Al Viro wrote:

> On Mon, Jan 28, 2013 at 09:20:34PM -0800, Sage Weil wrote:
> 
> > Yep, that is indeed a problem.  I think we just need to do the r_aborted 
> > and/or r_locked_dir check in the else if condition...
> > 
> > > I'm not sure if we are guaranteed that ceph_readdir_prepopulate() won't
> > > get to its splice_dentry() and d_delete() calls in similar situations -
> > > I hadn't checked that one yet.  If it isn't guaranteed, we have a problem
> > > there as well.
> > 
> > ...and the condition guarding readdir_prepopulate().  :)
> 
> > I think you're reading it correctly.  The main thing to keep in mind here 
> > is that we *do* need to call fill_inode() for the inode metadata on these 
> > requests to keep the mds and client state in sync.  The dentry state is 
> > safe to ignore.
> 
> You mean the parts under
>         if (rinfo->head->is_dentry) {
> and
>         if (rinfo->head->is_target) {
> in there?  Because there's fill_inode() called from readdir_prepopulate()
> and it's a lot more problematic than those two...

Yep, patch below.
 
> > It would be great to have the dir i_mutex rules summarized somewhere, even 
> > if it is just a copy of the below.  It took a fair bit of trial and error 
> > to infer what was going on when writing this code.  :)
> 
> Directory ->i_mutex rules are in part documented - "what VFS guarantees
> to hold" side is in Documentation/filesystems/directory-locking.  It's
> the other side ("what locks are expected to be held by callers of dcache.c
> functions") that is badly missing...
> 
> > Ping me when you've pushed that branch and I'll take a look...
> 
> To gitolite@ra.kernel.org:/pub/scm/linux/kernel/git/viro/vfs.git
>    01a88fa..4056362  master -> master
> 
> with tentative ceph patch in the very end.  Should be on git.kernel.org
> shortly...

We should drop teh mds_client.c hunk from your patch, and then do 
something like the below.  I'll put it in the ceph tree so we can do some 
basic testing.  Unfortunately, the abort case is a bit annoying to 
trigger.. we'll need to write a new test for it.

Thanks-
sage



From 80d70ef02d5277af8e1355db74fffe71f7217e76 Mon Sep 17 00:00:00 2001
From: Sage Weil <sage@inktank.com>
Date: Tue, 29 Jan 2013 13:01:15 -0800
Subject: [PATCH] ceph: prepopulate inodes only when request is aborted

If r_aborted is true, we do not hold the dir i_mutex, and cannot touch
the dcache.  However, we still need to update the inodes with the state
returned by the MDS.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
---
 fs/ceph/inode.c |   36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

Comments

Al Viro Jan. 30, 2013, 2:42 p.m. UTC | #1
On Tue, Jan 29, 2013 at 01:03:23PM -0800, Sage Weil wrote:

> We should drop teh mds_client.c hunk from your patch, and then do 
> something like the below.  I'll put it in the ceph tree so we can do some 
> basic testing.  Unfortunately, the abort case is a bit annoying to 
> trigger.. we'll need to write a new test for it.

hunk dropped, the rest folded into your patch, resulting branch pushed...
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Feb. 1, 2013, 1:14 a.m. UTC | #2
On Wed, Jan 30, 2013 at 02:42:14PM +0000, Al Viro wrote:
> On Tue, Jan 29, 2013 at 01:03:23PM -0800, Sage Weil wrote:
> 
> > We should drop teh mds_client.c hunk from your patch, and then do 
> > something like the below.  I'll put it in the ceph tree so we can do some 
> > basic testing.  Unfortunately, the abort case is a bit annoying to 
> > trigger.. we'll need to write a new test for it.
> 
> hunk dropped, the rest folded into your patch, resulting branch pushed...

BTW, moderately related issue - ceph_get_dentry_parent_inode() uses
look fishy.  In ceph_rename() it seems to be pointless.  We do have
the directory inode of parent of old_dentry - it's old_dir, so it's just
a fancy way to spell igrab(old_directory).  And as far as I can tell, *all*
other callers are racy.
	* link("a/foo", "b/bar") can happen in parallel with
rename("a/foo", "c/splat").  link(2) holds ->i_mutex on a/foo (so it can't race
with unlink) and on b/; rename(2) holds it on a/, c/ and c/splat (if the last
one exists).  It also holds ->s_vfs_rename_sem, so cross-directory renames
are serialized.  For normal filesystems that's enough and there ->link()
couldn't care less about a/; ceph_link() wants the inode of a/ for some
reason.  If it *really* needs the parent of a/foo for the operation, the
current code is SOL - the directory we grab can cease being that parent
just as it's getting returned to ceph_link()
	* ->d_revalidate() can happen in parallel with rename().  You
don't seem to be using the parent inode much in there, so that should
be reasonably easy to deal with.
	* ceph_open() can race with rename(); ->atomic_open() is
called with parent locked, but ->open() isn't.  AFAICS, open() with
O_TRUNC could step into that code...
	* ceph_setattr() *definitely* can race with rename(); we have
the object itself locked, but not its parent (and I'm really surprised
by the need to know that parent for such operation).
	* CEPH_IOC_SET_LAYOUT vs. rename() - no idea what that ioctl is
about, but opened files can be moved around, TYVM, even when they are
in the middle of ioctl(2).
	* ->setxattr() and ->removexattr() - again, can happen in parallel
with rename(), and again I really wonder why do we need the parent directory
for such operations.

What's going on there?
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sage Weil Feb. 5, 2013, 9:59 p.m. UTC | #3
Hi Al-

I've returned and recovered from LCA and have no more excuse for keeping 
my brain turned off:

On Fri, 1 Feb 2013, Al Viro wrote:
> On Wed, Jan 30, 2013 at 02:42:14PM +0000, Al Viro wrote:
> > On Tue, Jan 29, 2013 at 01:03:23PM -0800, Sage Weil wrote:
> > 
> > > We should drop teh mds_client.c hunk from your patch, and then do 
> > > something like the below.  I'll put it in the ceph tree so we can do some 
> > > basic testing.  Unfortunately, the abort case is a bit annoying to 
> > > trigger.. we'll need to write a new test for it.
> > 
> > hunk dropped, the rest folded into your patch, resulting branch pushed...
> 
> BTW, moderately related issue - ceph_get_dentry_parent_inode() uses
> look fishy.  In ceph_rename() it seems to be pointless.  We do have
> the directory inode of parent of old_dentry - it's old_dir, so it's just
> a fancy way to spell igrab(old_directory).  And as far as I can tell, *all*

Yeah, that's an easy fix.

> other callers are racy.
> 	* link("a/foo", "b/bar") can happen in parallel with
> rename("a/foo", "c/splat").  link(2) holds ->i_mutex on a/foo (so it can't race
> with unlink) and on b/; rename(2) holds it on a/, c/ and c/splat (if the last
> one exists).  It also holds ->s_vfs_rename_sem, so cross-directory renames
> are serialized.  For normal filesystems that's enough and there ->link()
> couldn't care less about a/; ceph_link() wants the inode of a/ for some
> reason.  If it *really* needs the parent of a/foo for the operation, the
> current code is SOL - the directory we grab can cease being that parent
> just as it's getting returned to ceph_link()

And I think this use is totally unnecessary.

> 	* ->d_revalidate() can happen in parallel with rename().  You
> don't seem to be using the parent inode much in there, so that should
> be reasonably easy to deal with.

This is the tricky one.  We're looking at the parent because some of the 
directory inode state is effectively a lease over the validity of the 
directory's dentries.  In the general case, this is more efficient than a 
lot of per-dentry state in the client/server protocol.

Probably what we *should* be doing is linking to the parent from the 
child's ceph_dentry_info under the MDS session mutex, so that it is 
properly aligned with the state being passed from the server.  I 
considered this originally but was annoyed about essentially duplicating 
d_parent.  But... the locking being what it is, that is probably the only 
way to make the d_revalidate completely safe.  It'll be a bigger patch to 
change that behavior, though.

> 	* ceph_open() can race with rename(); ->atomic_open() is
> called with parent locked, but ->open() isn't.  AFAICS, open() with
> O_TRUNC could step into that code...
> 	* ceph_setattr() *definitely* can race with rename(); we have
> the object itself locked, but not its parent (and I'm really surprised
> by the need to know that parent for such operation).
> 	* CEPH_IOC_SET_LAYOUT vs. rename() - no idea what that ioctl is
> about, but opened files can be moved around, TYVM, even when they are
> in the middle of ioctl(2).
> 	* ->setxattr() and ->removexattr() - again, can happen in parallel
> with rename(), and again I really wonder why do we need the parent directory
> for such operations.

These are only there so that an fsync(dirfd) will wait for the file 
creation or other metadata to commit.  I suspect I am (heavily) 
over-reading what POSIX says here... and should be concerned only with the 
namespace modifications (link, unlink, rename, O_CREAT).

Assuming that's the case,

	git://github.com/ceph/ceph-client.git #wip-dentries

has a set of patches that fix this up.  (Well, except for d_revalidate, 
which is a bigger change; opened #4023 in the ceph bug tracker.)

Thanks!
sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 77b1b18..b51653e 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1195,6 +1195,39 @@  done:
 /*
  * Prepopulate our cache with readdir results, leases, etc.
  */
+static int readdir_prepopulate_inodes_only(struct ceph_mds_request *req,
+					   struct ceph_mds_session *session)
+{
+	struct ceph_mds_reply_info_parsed *rinfo = &req->r_reply_info;
+	int i, err = 0;
+
+	for (i = 0; i < rinfo->dir_nr; i++) {
+		struct ceph_vino vino;
+		struct inode *in;
+		int rc;
+
+		vino.ino = le64_to_cpu(rinfo->dir_in[i].in->ino);
+		vino.snap = le64_to_cpu(rinfo->dir_in[i].in->snapid);
+
+		in = ceph_get_inode(req->r_dentry->d_sb, vino);
+		if (IS_ERR(in)) {
+			err = PTR_ERR(in);
+			dout("new_inode badness got %d\n", err);
+			continue;
+		}
+		rc = fill_inode(in, &rinfo->dir_in[i], NULL, session,
+				req->r_request_started, -1,
+				&req->r_caps_reservation);
+		if (rc < 0) {
+			pr_err("fill_inode badness on %p got %d\n", in, rc);
+			err = rc;
+			continue;
+		}
+	}
+
+	return err;
+}
+
 int ceph_readdir_prepopulate(struct ceph_mds_request *req,
 			     struct ceph_mds_session *session)
 {
@@ -1209,6 +1242,9 @@  int ceph_readdir_prepopulate(struct ceph_mds_request *req,
 	u64 frag = le32_to_cpu(rhead->args.readdir.frag);
 	struct ceph_dentry_info *di;
 
+	if (req->r_aborted)
+		return readdir_prepopulate_inodes_only(req, session);
+
 	if (le32_to_cpu(rinfo->head->op) == CEPH_MDS_OP_LSSNAP) {
 		snapdir = ceph_get_snapdir(parent->d_inode);
 		parent = d_find_alias(snapdir);