diff mbox

[14/20] VFS/namei: add 'inode' arg to put_link().

Message ID 20150323023740.8161.51849.stgit@notabene.brown (mailing list archive)
State New, archived
Headers show

Commit Message

NeilBrown March 23, 2015, 2:37 a.m. UTC
When symlinks are followed in RCU-walk, dentry->d_inode
may have changed between the call to ->follow_link and
the call to ->put_link.
So we need to preserve the inode used in the first instance,
and use it to find the correct put_link.

Note that this means that when RCU-walk is permitted in
->follow_link, dentry->d_inode cannot be used in ->put_link.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 Documentation/filesystems/porting |    4 ++++
 fs/namei.c                        |   20 ++++++++++++--------
 2 files changed, 16 insertions(+), 8 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Al Viro April 17, 2015, 4:25 p.m. UTC | #1
On Mon, Mar 23, 2015 at 01:37:40PM +1100, NeilBrown wrote:
> @@ -1669,13 +1669,14 @@ static inline int nested_symlink(struct path *path, struct nameidata *nd)
>  
>  	do {
>  		struct path link = *path;
> +		struct inode *inode = link.dentry->d_inode;
>  		void *cookie;
>  
>  		res = follow_link(&link, nd, &cookie);
>  		if (res)
>  			break;
>  		res = walk_component(nd, path, LOOKUP_FOLLOW);
> -		put_link(nd, &link, cookie);
> +		put_link(nd, &link, inode, cookie);
>  	} while (res > 0);

That's really unpleasant - it means increased stack footprint in the
recursion.

Damn, maybe it's time to bite the bullet and kill the recursion completely...

What do we really need to save across the recursive call?
	* how far did we get in the previous pathname
	* data needed for put_link:
		cookie
		link body
		dentry of link
		vfsmount (to pin containing fs; non-RCU) or inode (RCU)

We are already saving link body in nameidata, so we could fatten that array.
It would allow flattening link_path_walk() completely - instead of
recursive call we would just save what needed saving and jump to the beginning
and on exits we'd check the depth and either return or restore the saved state
and jump back to just past the place where recursive call used to be.
It would even save quite a bit of space in the worst case.  However, it would
blow the stack footprint in normal cases *and* blow it even worse for the
things that need two struct nameidata instances at once (rename(), basically).
5 pointers instead of 1 pointer per level - extra 32 words on stack, i.e.
extra 256 bytes on 64bit.  Extra 0.5Kb of stack footprint on rename() is
probably too much, especially since this "saved" stuff from its two nameidata
instances will never be used at the same time...

Alternatively, we could just allocate about a page worth of an array when
the depth of nesting goes beyond 2 and put this saved stuff there - at
5 pointers per level it would completely dispose of the depth of nesting
limit, giving us uniform "can't traverse more than 40 symlinks per pathname
resolution".  40 * 5 * sizeof(pointer) is what, at most 1600 bytes?  So
even half a page would suffice for that quite comfortably...

The question is whether we'll be able to avoid blowing the I-cache footprint
of link_path_walk() to hell while doing that; it feels like we should be,
but we'll have to see how well does that work in reality...

I'll try to implement that (with your #3..#7 as the first steps) and see
how well does it work; it's obviously the next cycle fodder, but hopefully
in testable shape by -rc2...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro April 17, 2015, 7:09 p.m. UTC | #2
On Fri, Apr 17, 2015 at 05:25:36PM +0100, Al Viro wrote:
> On Mon, Mar 23, 2015 at 01:37:40PM +1100, NeilBrown wrote:
> > @@ -1669,13 +1669,14 @@ static inline int nested_symlink(struct path *path, struct nameidata *nd)
> >  
> >  	do {
> >  		struct path link = *path;
> > +		struct inode *inode = link.dentry->d_inode;
> >  		void *cookie;
> >  
> >  		res = follow_link(&link, nd, &cookie);
> >  		if (res)
> >  			break;
> >  		res = walk_component(nd, path, LOOKUP_FOLLOW);
> > -		put_link(nd, &link, cookie);
> > +		put_link(nd, &link, inode, cookie);
> >  	} while (res > 0);
> 
> That's really unpleasant - it means increased stack footprint in the
> recursion.
> 
> Damn, maybe it's time to bite the bullet and kill the recursion completely...
> 
> What do we really need to save across the recursive call?
> 	* how far did we get in the previous pathname
> 	* data needed for put_link:
> 		cookie
> 		link body
> 		dentry of link
> 		vfsmount (to pin containing fs; non-RCU) or inode (RCU)
> 
> We are already saving link body in nameidata, so we could fatten that array.
> It would allow flattening link_path_walk() completely - instead of
> recursive call we would just save what needed saving and jump to the beginning
> and on exits we'd check the depth and either return or restore the saved state
> and jump back to just past the place where recursive call used to be.
> It would even save quite a bit of space in the worst case.  However, it would
> blow the stack footprint in normal cases *and* blow it even worse for the
> things that need two struct nameidata instances at once (rename(), basically).
> 5 pointers instead of 1 pointer per level - extra 32 words on stack, i.e.
> extra 256 bytes on 64bit.  Extra 0.5Kb of stack footprint on rename() is
> probably too much, especially since this "saved" stuff from its two nameidata
> instances will never be used at the same time...
> 
> Alternatively, we could just allocate about a page worth of an array when
> the depth of nesting goes beyond 2 and put this saved stuff there - at
> 5 pointers per level it would completely dispose of the depth of nesting
> limit, giving us uniform "can't traverse more than 40 symlinks per pathname
> resolution".  40 * 5 * sizeof(pointer) is what, at most 1600 bytes?  So
> even half a page would suffice for that quite comfortably...
> 
> The question is whether we'll be able to avoid blowing the I-cache footprint
> of link_path_walk() to hell while doing that; it feels like we should be,
> but we'll have to see how well does that work in reality...
> 
> I'll try to implement that (with your #3..#7 as the first steps) and see
> how well does it work; it's obviously the next cycle fodder, but hopefully
> in testable shape by -rc2...

	Hmm...  Actually, right now we have 192 bytes of stack footprint per
nesting level (amd64 allmodconfig).  Which means that simply making the
array fatter would give a clean benefit at the 3rd level of recursion (symlink
encountered while traversing a symlink) for everything other than rename()...
allnoconfig+64bit gives 160 bytes per level, with the same breakeven point.

	Interesting...  It might even make sense to separate that array from
struct nameidata and solve the rename() problem that way (current->nameidata
would be replaced with pointer to that sucker in such variant, of course, and
->depth would move there).  In that variant we do not get rid of nesting limit
completely, but it would probably be simpler than the one above...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro April 18, 2015, 8:09 a.m. UTC | #3
On Fri, Apr 17, 2015 at 08:09:10PM +0100, Al Viro wrote:
> On Fri, Apr 17, 2015 at 05:25:36PM +0100, Al Viro wrote:
> > On Mon, Mar 23, 2015 at 01:37:40PM +1100, NeilBrown wrote:
> > > @@ -1669,13 +1669,14 @@ static inline int nested_symlink(struct path *path, struct nameidata *nd)
> > >  
> > >  	do {
> > >  		struct path link = *path;
> > > +		struct inode *inode = link.dentry->d_inode;
> > >  		void *cookie;
> > >  
> > >  		res = follow_link(&link, nd, &cookie);
> > >  		if (res)
> > >  			break;
> > >  		res = walk_component(nd, path, LOOKUP_FOLLOW);
> > > -		put_link(nd, &link, cookie);
> > > +		put_link(nd, &link, inode, cookie);
> > >  	} while (res > 0);
> > 
> > That's really unpleasant - it means increased stack footprint in the
> > recursion.
> > 
> > Damn, maybe it's time to bite the bullet and kill the recursion completely...
> > 
> > What do we really need to save across the recursive call?
> > 	* how far did we get in the previous pathname
> > 	* data needed for put_link:
> > 		cookie
> > 		link body
> > 		dentry of link
> > 		vfsmount (to pin containing fs; non-RCU) or inode (RCU)
> > 
> > We are already saving link body in nameidata, so we could fatten that array.
> > It would allow flattening link_path_walk() completely - instead of
> > recursive call we would just save what needed saving and jump to the beginning
> > and on exits we'd check the depth and either return or restore the saved state
> > and jump back to just past the place where recursive call used to be.
> > It would even save quite a bit of space in the worst case.  However, it would
> > blow the stack footprint in normal cases *and* blow it even worse for the
> > things that need two struct nameidata instances at once (rename(), basically).
> > 5 pointers instead of 1 pointer per level - extra 32 words on stack, i.e.
> > extra 256 bytes on 64bit.  Extra 0.5Kb of stack footprint on rename() is
> > probably too much, especially since this "saved" stuff from its two nameidata
> > instances will never be used at the same time...
> > 
> > Alternatively, we could just allocate about a page worth of an array when
> > the depth of nesting goes beyond 2 and put this saved stuff there - at
> > 5 pointers per level it would completely dispose of the depth of nesting
> > limit, giving us uniform "can't traverse more than 40 symlinks per pathname
> > resolution".  40 * 5 * sizeof(pointer) is what, at most 1600 bytes?  So
> > even half a page would suffice for that quite comfortably...
> > 
> > The question is whether we'll be able to avoid blowing the I-cache footprint
> > of link_path_walk() to hell while doing that; it feels like we should be,
> > but we'll have to see how well does that work in reality...
> > 
> > I'll try to implement that (with your #3..#7 as the first steps) and see
> > how well does it work; it's obviously the next cycle fodder, but hopefully
> > in testable shape by -rc2...
> 
> 	Hmm...  Actually, right now we have 192 bytes of stack footprint per
> nesting level (amd64 allmodconfig).  Which means that simply making the
> array fatter would give a clean benefit at the 3rd level of recursion (symlink
> encountered while traversing a symlink) for everything other than rename()...
> allnoconfig+64bit gives 160 bytes per level, with the same breakeven point.
> 
> 	Interesting...  It might even make sense to separate that array from
> struct nameidata and solve the rename() problem that way (current->nameidata
> would be replaced with pointer to that sucker in such variant, of course, and
> ->depth would move there).  In that variant we do not get rid of nesting limit
> completely, but it would probably be simpler than the one above...

	OK, right now in my tree recursion is gone, it seems to survive the
testing *and* stack footprint of link_path_walk() is 432 bytes.  Less than the
mainline with two nested symlinks, and I definitely see how to trim that down
by another 64 bytes, which would put us within a spitting distance from what
the mainline gets with a single symlink encountered in the middle of a
pathname.  It still needs more massage (link_path_walk() is ugly as hell
right now), but I see how to clean it up, and porting the rest of Neil's
RCU follow_link series on top of that shouldn't be hard.  Obviously next cycle
fodder, but if everything works out, we'll get serious stack footprint
reduction *and* not falling out of lazy pathwalk whenever we run into a symlink.

	I've dumped the current branch in vfs.git#link_path_walk; more after
I get some sleep...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
index eba8dd0a13e3..09454610515c 100644
--- a/Documentation/filesystems/porting
+++ b/Documentation/filesystems/porting
@@ -490,3 +490,7 @@  in your dentry operations instead.
 	The passed inode must be used rather than dentry->d_inode,
 	particularly if LOOKUP_RCU is set.
 	If s_fs_info is used, it must be freed using RCU.
+--
+[mandatory]
+	If ->follow_link permits RCU-walk, then ->put_link must
+	not access dentry->d_inode as that may have changed.
diff --git a/fs/namei.c b/fs/namei.c
index 224b1495edae..72f5a4f91855 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -763,9 +763,9 @@  static void terminate_walk(struct nameidata *nd)
 	}
 }
 
-static inline void put_link(struct nameidata *nd, struct path *link, void *cookie)
+static inline void put_link(struct nameidata *nd, struct path *link,
+			    struct inode *inode, void *cookie)
 {
-	struct inode *inode = link->dentry->d_inode;
 	if (inode->i_op->put_link)
 		inode->i_op->put_link(link->dentry, nd_get_link(nd), cookie);
 	if (!(nd->flags & LOOKUP_LINK_RCU))
@@ -934,7 +934,7 @@  follow_link(struct path *link, struct nameidata *nd, void **p)
 	if (s) {
 		if (unlikely(IS_ERR(s))) {
 			terminate_walk(nd);
-			put_link(nd, link, *p);
+			put_link(nd, link, inode, *p);
 			return PTR_ERR(s);
 		}
 		if (*s == '/') {
@@ -948,7 +948,7 @@  follow_link(struct path *link, struct nameidata *nd, void **p)
 		nd->inode = nd->path.dentry->d_inode;
 		error = link_path_walk(s, nd);
 		if (unlikely(error))
-			put_link(nd, link, *p);
+			put_link(nd, link, inode, *p);
 	}
 
 	return error;
@@ -1669,13 +1669,14 @@  static inline int nested_symlink(struct path *path, struct nameidata *nd)
 
 	do {
 		struct path link = *path;
+		struct inode *inode = link.dentry->d_inode;
 		void *cookie;
 
 		res = follow_link(&link, nd, &cookie);
 		if (res)
 			break;
 		res = walk_component(nd, path, LOOKUP_FOLLOW);
-		put_link(nd, &link, cookie);
+		put_link(nd, &link, inode, cookie);
 	} while (res > 0);
 
 	nd->link_count--;
@@ -2036,6 +2037,7 @@  static int path_lookupat(int dfd, const char *name,
 		while (err > 0) {
 			void *cookie;
 			struct path link = path;
+			struct inode *inode = link.dentry->d_inode;
 			err = may_follow_link(&link, nd);
 			if (unlikely(err))
 				break;
@@ -2044,7 +2046,7 @@  static int path_lookupat(int dfd, const char *name,
 			if (err)
 				break;
 			err = lookup_last(nd, &path);
-			put_link(nd, &link, cookie);
+			put_link(nd, &link, inode, cookie);
 		}
 	}
 
@@ -2396,6 +2398,7 @@  path_mountpoint(int dfd, const char *name, struct path *path, unsigned int flags
 	while (err > 0) {
 		void *cookie;
 		struct path link = *path;
+		struct inode *inode = link.dentry->d_inode;
 		err = may_follow_link(&link, &nd);
 		if (unlikely(err))
 			break;
@@ -2404,7 +2407,7 @@  path_mountpoint(int dfd, const char *name, struct path *path, unsigned int flags
 		if (err)
 			break;
 		err = mountpoint_last(&nd, path);
-		put_link(&nd, &link, cookie);
+		put_link(&nd, &link, inode, cookie);
 	}
 out:
 	path_cleanup(&nd);
@@ -3281,6 +3284,7 @@  static struct file *path_openat(int dfd, struct filename *pathname,
 	error = do_last(nd, &path, file, op, &opened, pathname);
 	while (unlikely(error > 0)) { /* trailing symlink */
 		struct path link = path;
+		struct inode *inode = link.dentry->d_inode;
 		void *cookie;
 		if (!(nd->flags & LOOKUP_FOLLOW)) {
 			path_to_nameidata(&path, nd);
@@ -3297,7 +3301,7 @@  static struct file *path_openat(int dfd, struct filename *pathname,
 		if (unlikely(error))
 			break;
 		error = do_last(nd, &path, file, op, &opened, pathname);
-		put_link(nd, &link, cookie);
+		put_link(nd, &link, inode, cookie);
 	}
 out:
 	path_cleanup(nd);