Message ID | alpine.DEB.2.00.1301291301260.6657@cobra.newdream.net (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Tue, Jan 29, 2013 at 01:03:23PM -0800, Sage Weil wrote: > We should drop teh mds_client.c hunk from your patch, and then do > something like the below. I'll put it in the ceph tree so we can do some > basic testing. Unfortunately, the abort case is a bit annoying to > trigger.. we'll need to write a new test for it. hunk dropped, the rest folded into your patch, resulting branch pushed... -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jan 30, 2013 at 02:42:14PM +0000, Al Viro wrote: > On Tue, Jan 29, 2013 at 01:03:23PM -0800, Sage Weil wrote: > > > We should drop teh mds_client.c hunk from your patch, and then do > > something like the below. I'll put it in the ceph tree so we can do some > > basic testing. Unfortunately, the abort case is a bit annoying to > > trigger.. we'll need to write a new test for it. > > hunk dropped, the rest folded into your patch, resulting branch pushed... BTW, moderately related issue - ceph_get_dentry_parent_inode() uses look fishy. In ceph_rename() it seems to be pointless. We do have the directory inode of parent of old_dentry - it's old_dir, so it's just a fancy way to spell igrab(old_directory). And as far as I can tell, *all* other callers are racy. * link("a/foo", "b/bar") can happen in parallel with rename("a/foo", "c/splat"). link(2) holds ->i_mutex on a/foo (so it can't race with unlink) and on b/; rename(2) holds it on a/, c/ and c/splat (if the last one exists). It also holds ->s_vfs_rename_sem, so cross-directory renames are serialized. For normal filesystems that's enough and there ->link() couldn't care less about a/; ceph_link() wants the inode of a/ for some reason. If it *really* needs the parent of a/foo for the operation, the current code is SOL - the directory we grab can cease being that parent just as it's getting returned to ceph_link() * ->d_revalidate() can happen in parallel with rename(). You don't seem to be using the parent inode much in there, so that should be reasonably easy to deal with. * ceph_open() can race with rename(); ->atomic_open() is called with parent locked, but ->open() isn't. AFAICS, open() with O_TRUNC could step into that code... * ceph_setattr() *definitely* can race with rename(); we have the object itself locked, but not its parent (and I'm really surprised by the need to know that parent for such operation). * CEPH_IOC_SET_LAYOUT vs. rename() - no idea what that ioctl is about, but opened files can be moved around, TYVM, even when they are in the middle of ioctl(2). * ->setxattr() and ->removexattr() - again, can happen in parallel with rename(), and again I really wonder why do we need the parent directory for such operations. What's going on there? -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Al- I've returned and recovered from LCA and have no more excuse for keeping my brain turned off: On Fri, 1 Feb 2013, Al Viro wrote: > On Wed, Jan 30, 2013 at 02:42:14PM +0000, Al Viro wrote: > > On Tue, Jan 29, 2013 at 01:03:23PM -0800, Sage Weil wrote: > > > > > We should drop teh mds_client.c hunk from your patch, and then do > > > something like the below. I'll put it in the ceph tree so we can do some > > > basic testing. Unfortunately, the abort case is a bit annoying to > > > trigger.. we'll need to write a new test for it. > > > > hunk dropped, the rest folded into your patch, resulting branch pushed... > > BTW, moderately related issue - ceph_get_dentry_parent_inode() uses > look fishy. In ceph_rename() it seems to be pointless. We do have > the directory inode of parent of old_dentry - it's old_dir, so it's just > a fancy way to spell igrab(old_directory). And as far as I can tell, *all* Yeah, that's an easy fix. > other callers are racy. > * link("a/foo", "b/bar") can happen in parallel with > rename("a/foo", "c/splat"). link(2) holds ->i_mutex on a/foo (so it can't race > with unlink) and on b/; rename(2) holds it on a/, c/ and c/splat (if the last > one exists). It also holds ->s_vfs_rename_sem, so cross-directory renames > are serialized. For normal filesystems that's enough and there ->link() > couldn't care less about a/; ceph_link() wants the inode of a/ for some > reason. If it *really* needs the parent of a/foo for the operation, the > current code is SOL - the directory we grab can cease being that parent > just as it's getting returned to ceph_link() And I think this use is totally unnecessary. > * ->d_revalidate() can happen in parallel with rename(). You > don't seem to be using the parent inode much in there, so that should > be reasonably easy to deal with. This is the tricky one. We're looking at the parent because some of the directory inode state is effectively a lease over the validity of the directory's dentries. In the general case, this is more efficient than a lot of per-dentry state in the client/server protocol. Probably what we *should* be doing is linking to the parent from the child's ceph_dentry_info under the MDS session mutex, so that it is properly aligned with the state being passed from the server. I considered this originally but was annoyed about essentially duplicating d_parent. But... the locking being what it is, that is probably the only way to make the d_revalidate completely safe. It'll be a bigger patch to change that behavior, though. > * ceph_open() can race with rename(); ->atomic_open() is > called with parent locked, but ->open() isn't. AFAICS, open() with > O_TRUNC could step into that code... > * ceph_setattr() *definitely* can race with rename(); we have > the object itself locked, but not its parent (and I'm really surprised > by the need to know that parent for such operation). > * CEPH_IOC_SET_LAYOUT vs. rename() - no idea what that ioctl is > about, but opened files can be moved around, TYVM, even when they are > in the middle of ioctl(2). > * ->setxattr() and ->removexattr() - again, can happen in parallel > with rename(), and again I really wonder why do we need the parent directory > for such operations. These are only there so that an fsync(dirfd) will wait for the file creation or other metadata to commit. I suspect I am (heavily) over-reading what POSIX says here... and should be concerned only with the namespace modifications (link, unlink, rename, O_CREAT). Assuming that's the case, git://github.com/ceph/ceph-client.git #wip-dentries has a set of patches that fix this up. (Well, except for d_revalidate, which is a bigger change; opened #4023 in the ceph bug tracker.) Thanks! sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c index 77b1b18..b51653e 100644 --- a/fs/ceph/inode.c +++ b/fs/ceph/inode.c @@ -1195,6 +1195,39 @@ done: /* * Prepopulate our cache with readdir results, leases, etc. */ +static int readdir_prepopulate_inodes_only(struct ceph_mds_request *req, + struct ceph_mds_session *session) +{ + struct ceph_mds_reply_info_parsed *rinfo = &req->r_reply_info; + int i, err = 0; + + for (i = 0; i < rinfo->dir_nr; i++) { + struct ceph_vino vino; + struct inode *in; + int rc; + + vino.ino = le64_to_cpu(rinfo->dir_in[i].in->ino); + vino.snap = le64_to_cpu(rinfo->dir_in[i].in->snapid); + + in = ceph_get_inode(req->r_dentry->d_sb, vino); + if (IS_ERR(in)) { + err = PTR_ERR(in); + dout("new_inode badness got %d\n", err); + continue; + } + rc = fill_inode(in, &rinfo->dir_in[i], NULL, session, + req->r_request_started, -1, + &req->r_caps_reservation); + if (rc < 0) { + pr_err("fill_inode badness on %p got %d\n", in, rc); + err = rc; + continue; + } + } + + return err; +} + int ceph_readdir_prepopulate(struct ceph_mds_request *req, struct ceph_mds_session *session) { @@ -1209,6 +1242,9 @@ int ceph_readdir_prepopulate(struct ceph_mds_request *req, u64 frag = le32_to_cpu(rhead->args.readdir.frag); struct ceph_dentry_info *di; + if (req->r_aborted) + return readdir_prepopulate_inodes_only(req, session); + if (le32_to_cpu(rinfo->head->op) == CEPH_MDS_OP_LSSNAP) { snapdir = ceph_get_snapdir(parent->d_inode); parent = d_find_alias(snapdir);