diff mbox

[16/24] link_path_walk: kill the recursion

Message ID 1429553588-24764-16-git-send-email-viro@ZenIV.linux.org.uk (mailing list archive)
State New, archived
Headers show

Commit Message

Al Viro April 20, 2015, 6:13 p.m. UTC
From: Al Viro <viro@zeniv.linux.org.uk>

absolutely straightforward now - the only variables we need to preserve
across the recursive call are name, link and cookie, and recursion depth
is limited (and can is equal to nd->depth).  So arrange an array of
triples to hold instances of those and be done with that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namei.c | 40 +++++++++++++++++++++++++++++-----------
 1 file changed, 29 insertions(+), 11 deletions(-)

Comments

Linus Torvalds April 20, 2015, 9:04 p.m. UTC | #1
On Mon, Apr 20, 2015 at 11:13 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>   */
>  static int link_path_walk(const char *name, struct nameidata *nd)
>  {
> +       struct saved {
> +               struct path link;
> +               void *cookie;
> +               const char *name;
> +       } stack[MAX_NESTED_LINKS], *last = stack + nd->depth - 1;
>         struct path next;
>         int err;

I was going to complain about this, and suggest that you move it to
the nameidata, but I see that you did that later.

That said, you then introduce a stack-allocated "struct saved stack[]"
in path_mountpoint[] instead, *and* nameidata is saved on stack, so
this all ends up being very stack-intensive anyway.

I might have missed some patch here, but would it be possible to just
allocate a single per-thread nameidata, and just leave it at that?
Because allocating that thing on the stack when it contains what is
now one kilobyte of array data is *not* acceptable.

Other than that, my quick scan through this looked fine. And maybe
that quick scan missed where you already did that too.

                           Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro April 20, 2015, 9:32 p.m. UTC | #2
On Mon, Apr 20, 2015 at 02:04:53PM -0700, Linus Torvalds wrote:

> That said, you then introduce a stack-allocated "struct saved stack[]"
> in path_mountpoint[] instead, *and* nameidata is saved on stack, so
> this all ends up being very stack-intensive anyway.
> 
> I might have missed some patch here, but would it be possible to just
> allocate a single per-thread nameidata, and just leave it at that?
> Because allocating that thing on the stack when it contains what is
> now one kilobyte of array data is *not* acceptable.

What kilobyte?  It's 9*4 pointers, IOW, 288 bytes total (assuming 64bit box).
And nd->saved_names[] goes away, so scratch 9 pointers we used to have.  Sure,
we can allocate that dynamically (or hold a couple of elements on stack and
allocate when/if we overgrow that), but it's not particularly large win.

Breakeven point is circa the second level of nesting - symlink met when
traversing a symlink...  That - on amd64; on something with fatter stack
frames I would expect the comparison to be even worse for mainline...

We need to preserve 4 pointers on stack per level of nesting.  Seeing that
single link_path_walk() stack frame in mainline is about 5-6 times bigger
than that, "just put enough for all levels into auto array" is an obvious
approach - a couple of link_path_walk() stack frames will be heavier than
that.  For renameat() (the worst-case user of link_path_walk() - there are
two struct nameidata on stack) we end up with breakeven at *one* level of
nesting, what with getting rid of 2*9 pointers in ->saved_names[] of those
nameidata.  And yes, I've measured the actual stack use before and after...

A kilobyte would suffice for 32 levels.  _IF_ we go for "lift the restrictions
on nesting completely", sure, we want to switch to (on-demand) dynamic
allocation.  It's not particularly hard to add and it might be worth doing,
but it's a separate story.  This series leaves the set of accepted pathnames
as-is...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds April 20, 2015, 9:39 p.m. UTC | #3
On Mon, Apr 20, 2015 at 2:32 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> What kilobyte?  It's 9*4 pointers, IOW, 288 bytes total (assuming 64bit box).

You also said that you were going to up the recursion limit to 40.. So
40*3*8 bytes..

                   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds April 20, 2015, 9:41 p.m. UTC | #4
On Mon, Apr 20, 2015 at 2:32 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> A kilobyte would suffice for 32 levels.  _IF_ we go for "lift the restrictions
> on nesting completely", sure, we want to switch to (on-demand) dynamic
> allocation.

And no, we will *never* lift the recursion limit. Not for 1kB, not for
1MB. Never.

Because it's a latency and DoS issue too. We need to react well to
true loops, but also to "very deep" non-loops. It's not about memory
use, it's about users triggering unreasonable CPU resources.

                      Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds April 20, 2015, 9:42 p.m. UTC | #5
On Mon, Apr 20, 2015 at 2:41 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And no, we will *never* lift the recursion limit. Not for 1kB, not for
> 1MB. Never.

Just to clarify: that's for the "remove restrictions completely".
Upping it to 32 or 40 would be fine. But not with allocations on the
stack.

                 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro April 20, 2015, 9:51 p.m. UTC | #6
On Mon, Apr 20, 2015 at 02:39:43PM -0700, Linus Torvalds wrote:
> On Mon, Apr 20, 2015 at 2:32 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > What kilobyte?  It's 9*4 pointers, IOW, 288 bytes total (assuming 64bit box).
> 
> You also said that you were going to up the recursion limit to 40.. So
> 40*3*8 bytes..

Er...  That's exactly what

||        We could reduce it further (see below), but I'm not sure it's worth
|| doing - it's not much extra complexity, but we only squeeze out ~250 bytes
|| that way, with the worst-case footprints (those are triggered by rename())
|| are around 1.4Kb (instead of about twice as much in mainline).  OTOH,
|| that extra complexity would've allowed us to get rid of the nesting limit
|| entirely (i.e. we get ELOOP when we encounter 40 symlinks, no matter in
|| which manner they are nested).  That might be worth considering...

had been about.  And yes, it is easy to implement - new nameidata flag
for "need to kfree() nd->stack",
                        if (unlikely(current->link_count >= MAX_NESTED_LINKS)) {
                                path_put_conditional(&next, nd);
                                path_put(&nd->path);
                                return -ELOOP;
                        }
                        BUG_ON(nd->depth >= MAX_NESTED_LINKS);

                        nd->depth++;
replaced with
			if (nd->depth == 2 && !(nd->flags & LOOKUP_KFREE)) {
				struct saved *p = kmalloc(41 * sizeof(*p));
				if (!p) {
					path_put_conditional(&next, nd);
					path_put(&nd->path);
					return -ENOMEM;
				}
				memcpy(p, nd->stack, 2 * sizeof(*p));
				nd->stack = p;
				nd->flags |= LOOKUP_KFREE;
			}
			nd->depth++;
with obvious logics for freeing that crap afterwards.

I really don't like the idea of putting it into nameidata, BTW - consider
e.g. rename().  We don't need the contents of that thing after the
link_path_walk() returns; no point duplicating it...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro April 20, 2015, 9:52 p.m. UTC | #7
On Mon, Apr 20, 2015 at 02:41:31PM -0700, Linus Torvalds wrote:
> On Mon, Apr 20, 2015 at 2:32 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > A kilobyte would suffice for 32 levels.  _IF_ we go for "lift the restrictions
> > on nesting completely", sure, we want to switch to (on-demand) dynamic
> > allocation.
> 
> And no, we will *never* lift the recursion limit. Not for 1kB, not for
> 1MB. Never.

We still have the "no more than 40 links total" limit, so it would obviously
be limited by that...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro April 20, 2015, 9:59 p.m. UTC | #8
On Mon, Apr 20, 2015 at 02:42:56PM -0700, Linus Torvalds wrote:
> On Mon, Apr 20, 2015 at 2:41 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > And no, we will *never* lift the recursion limit. Not for 1kB, not for
> > 1MB. Never.
> 
> Just to clarify: that's for the "remove restrictions completely".
> Upping it to 32 or 40 would be fine. But not with allocations on the
> stack.

Yep.  There are two restrictions - one on the total amount of symlinks
encountered (40), and another on the depth of nesting (8).  The latter
is due to recursion issues, and with dynamic nd->stack it goes away.
The former, of course, stays, for the reasons you've mentioned - it has
nothing to do with the depth of recursion.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/namei.c b/fs/namei.c
index 6d79e00..4b98f4b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1767,9 +1767,15 @@  static inline u64 hash_name(const char *name)
  */
 static int link_path_walk(const char *name, struct nameidata *nd)
 {
+	struct saved {
+		struct path link;
+		void *cookie;
+		const char *name;
+	} stack[MAX_NESTED_LINKS], *last = stack + nd->depth - 1;
 	struct path next;
 	int err;
-	
+
+start:
 	while (*name=='/')
 		name++;
 	if (!*name)
@@ -1833,8 +1839,6 @@  Walked:
 			goto Err;
 
 		if (err) {
-			struct path link;
-			void *cookie;
 			char *s;
 
 			if (unlikely(nd->link_count >= MAX_NESTED_LINKS)) {
@@ -1847,22 +1851,25 @@  Walked:
 
 			nd->depth++;
 			nd->link_count++;
+			last++;
 
-			link = next;
-			s = get_link(&link, nd, &cookie);
+			last->link = next;
+			s = get_link(&last->link, nd, &last->cookie);
 
 			if (unlikely(IS_ERR(s))) {
 				err = PTR_ERR(s);
 				nd->link_count--;
 				nd->depth--;
+				last--;
 				goto Err;
 			}
 			err = 0;
 			if (unlikely(!s)) {
 				/* jumped */
-				put_link(nd, &link, cookie);
+				put_link(nd, &last->link, last->cookie);
 				nd->link_count--;
 				nd->depth--;
+				last--;
 			} else {
 				if (*s == '/') {
 					if (!nd->root.mnt)
@@ -1873,17 +1880,24 @@  Walked:
 					nd->flags |= LOOKUP_JUMPED;
 				}
 				nd->inode = nd->path.dentry->d_inode;
-				err = link_path_walk(s, nd);
+				last->name = name;
+				name = s;
+				goto start;
+
+back:
+				name = last->name;
 				if (unlikely(err)) {
-					put_link(nd, &link, cookie);
+					put_link(nd, &last->link, last->cookie);
 					nd->link_count--;
 					nd->depth--;
+					last--;
 					goto Err;
 				} else {
 					err = walk_component(nd, &next, LOOKUP_FOLLOW);
-					put_link(nd, &link, cookie);
+					put_link(nd, &last->link, last->cookie);
 					nd->link_count--;
 					nd->depth--;
+					last--;
 					goto Walked;
 				}
 			}
@@ -1895,9 +1909,13 @@  Walked:
 	}
 	terminate_walk(nd);
 Err:
-	return err;
+	if (likely(!nd->depth))
+		return err;
+	goto back;
 OK:
-	return 0;
+	if (likely(!nd->depth))
+		return 0;
+	goto back;
 }
 
 static int path_init(int dfd, const struct filename *name, unsigned int flags,