diff mbox

[RFC,PATCHSET] non-recursive link_path_walk() and reducing stack footprint

Message ID 20150424163534.6eb109eb@notabene.brown (mailing list archive)
State New, archived
Headers show

Commit Message

NeilBrown April 24, 2015, 6:35 a.m. UTC
On Thu, 23 Apr 2015 19:07:55 +0100 Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Thu, Apr 23, 2015 at 05:45:44PM +1000, NeilBrown wrote:
> 
> > follow_link calls  link_path_walk -> walk_component -> lookup_fast which sets
> > nd->seq.  Is that not enough?  I guess not when nd_jump_link is called.  Is
> > that what I missed?
> 
> No.  Potential nd_jump_link() callers are just refusing to go there in lazy
> mode, end of story.  That's not the problem; see below.
> 
> > One thing that is clear to me is that I don't really understand all the
> > handling of 'seq' numbers, making me unable to comment usefully.
> > I'll try to go through the current code and you current patch with that issue
> > in mind and make sure I do understand it.  Then I'll try to comment.
> 
> OK, here's the basic theory behind the lazy pathwalk:
> 
> * during the entire exercise we never drop rcu_read_lock, therefore any
> objects that have all references to them removed before RCU-delayed
> freeing (inodes, dentries, superblocks and vfsmounts are among such)
> that we might find in process won't be freed until after we are done.
> 
> * invariant we are keeping:
> 	at some point between the beginning of walk and now the pathname
> traversed so far would lead to nd->path, with nd->seq and nd->inode equal to
> ->d_seq and ->d_inode of the dentry in question.

Thanks for all of this!
I think one part that was confusing me is that there is a place where nd->seq
does related to nd->path.dentry.
Usually  the two are updated at the same time.
However updated nd->seq, but *doesn't* update nd->path.dentry.
Instead, the dentry is stored in a separate 'path' variable.
This discrepancy is eventually resolved in a call to path_to_nameidata().

Which is a bit confusing until you know what is happening :-)


> 
> * path_init() arranges for that to be true in the beginning of the walk.
> 
> * symlinks aside, walk_component() preserves that.
> 	+ normal name components go through lookup_fast(), where we have
> 	  __d_lookup_rcu() find a child of current nd->path with matching
> 	  name and (atomically) picks ->d_seq of that child, which had the
> 	  name matching our component.  Atomicity is provided by ->d_lock
> 	  on child.  Then we proceed to pick ->d_inode (and verify that

I don't see d_lock being taken  in __d_lookup_rcu.
I think this atomicity is provided by ->d_seq on the child.


> 	  ->d_seq has not changed, thus making sure that ->d_inode value
> 	  at the moment when the name matched had been the same and child
> 	  is still hashed.  Then we verify that parent's ->d_seq has not
> 	  changed, guaranteeing that parent hadn't been moved or unhashed
> 	  from the moment we'd found it until after we'd found its child.
> 	  Assuming nothing's mounted on top of that thing, and there's no
> 	  problem with ->d_revalidate()), that's it - we have new nd->seq,
> 	  nd->path and nd->inode satisfying our invariant for longer
> 	  piece of pathname.
> 	+ "." needs nothing - we just stay where we are
> 	+ ".." is handled by follow_dotdot_rcu(), which in the normal case
> 	  (no mountpoint crossing) picks ->d_parent of where we are,
> 	  fetches its ->d_inode and ->d_seq and verifies that our directory
> 	  still hadn't changed _its_ ->d_seq.  The last part guarantees that
> 	  it hadn't been moved since the time we'd found it, and thus its
> 	  ->d_parent had remained unchanged at least until that verification.
> 	  Therefore, it remained pinned all along, and it ->d_inode had
> 	  remained stable, including the moment when we fetched ->d_seq.
> 	  Which means that the value we had picked *and* its ->d_inode and
> 	  ->d_seq would satisfy the invariant for the longer piece of
> 	  pathname.
> 	+ mountpoint crossing towards leaves is handled in __follow_mount_rcu();
> 	  it is simple (->mnt_root never changes and is always pinned,
> 	  stabilizing its ->d_inode), but we might need to worry about
> 	  automount points *and* need to make sure that we stop right
> 	  there if mount_lock has been bumped.  See commit b37199e6 for
> 	  the details on the last part - basically, making sure that false
> 	  negatives from __lookup_mnt() won't end up with hard error when
> 	  we walk into whatever had been under the mountpoint we'd missed.
> 	+ mountpoint crossing towards root (in follow_dotdot_rcu()) is
> 	  similar, but there we obviously don't care about automounts.
> 	  Looking at it now, it might make sense to recheck mount_lock there
> 	  as well, though - potential danger is to hit the moment when
> 	  mnt_parent and mnt_mountpoint are out of sync, leaving us with
> 	  mismatched vfsmount,dentry pair.  Generally, that will be caught
> 	  when we try to leave lazy mode (legitimize_mnt() will fail) or
> 	  to cross towards leaves, but the next crossing towards root
> 	  might be bogus as well, and we could end up with unwarranted hard
> 	  error.  Should be very hard to hit, but it's easy enough to check
> 	  *and* backport, so it looks like a good idea for -stable.  Linus,
> 	  do you have any objections against adding
>                 if (read_seqretry(&mount_lock, nd->m_seq))
>                         goto failed;
> 	  right after
>                 if (!follow_up_rcu(&nd->path))
>                         break;
> 	  in follow_dotdot_rcu()?  It's a very narrow race with mount --move
> 	  and most of the time it ends up being completely harmless, but
> 	  it's possible to construct a case when we'll get a bogus hard error
> 	  instead of falling back to non-lazy walk...  OTOH, anyone doing
> 	  that will get something inherently timing-dependent as the result,
> 	  lazy mode or not.  I'm in favour of adding that check, but IMO it's
> 	  not a critical problem.
> 
> * if we find that we can't continue in lazy mode because some verification
> fails, we just fail with -ECHILD.  However, in cases when our current position
> might be fine, but the next step can't be done in lazy mode, we attempt to
> fall back to non-lazy without full restart that would be caused by -ECHILD.
> That's what unlazy_walk() is.  Of course, if we reach the end of walk we
> need to leave the lazy mode as well (complete_walk()).  Either can fail,
> and such a failure means restart from scratch in non-lazy mode.  We need
> to grab references on vfsmount and dentry (or two dentries, when we have
> parent and child to deal with).  The interesting part is vfsmount - we need
> to make sure we won't interfere with umount(2), having walked in that sucker
> *after* umount(2) has checked that it's not busy.  See legitimize_mnt() for
> details; basically, we have umount(2) mark the "I've verified it's not busy
> and it's going to be killed no matter what" with MNT_SYNC_UMOUNT and rely
> on RCU delays on that path - if we find one of those, we undo the
> increment of its refcount we'd just done, without having dropped
> rcu_read_lock().  Full-blown mntput() is done only on mismatches that
> had _not_ been marked that way.
> 
> BTW, something similar to the above probably needs to be turned into coherent
> text, either in parallel to Documentation/filesystems/path-lookup.txt, or
> as update to it.

That might be fun.... if I find time.


> 
> The reason why walk_component() in your call chain won't suffice is simple -
> it will fail this
>                 /*
>                  * This sequence count validates that the parent had no
>                  * changes while we did the lookup of the dentry above.
>                  *
>                  * The memory barrier in read_seqcount_begin of child is
>                  *  enough, we can use __read_seqcount_retry here.
>                  */
>                 if (__read_seqcount_retry(&parent->d_seq, nd->seq))
>                         return -ECHILD;
> in lookup_fast(), just before updating nd->seq to new value.  Basically,
> it has no way to tell if parent has been buggered to hell and back.
> 
> It's not _that_ hard to prevent - we just need to stop discarding the parent's
> seq number in "need to follow a symlink" case of walk_component().  Will take
> some massage, but not too much...

So.....
Where I have:

+                       if (nd->flags & LOOKUP_RCU) {
+                               if (!nd->root.mnt)
+                                       set_root_rcu(nd);
+                               nd->path = nd->root;


in the case where the symlink starts '/', I need to set nd->seq

		nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);


but in the case where symlink doesn't start '/', I don't change nd->path,
so nd->seq should still be correct?

Also, just after that (still in follow_link()), we have

		nd->inode = nd->path.dentry->d_inode;

*after* the test for (*s == '/').
Wouldn't is be OK to have that *inside* the if?  In the relative-symlink case
it should be unchanged, and in the jump-symlink case nd_jump_link() has already
done that (and probably NULL was returns for 's' so this code doesn't execute).

So following on top of my previous patches?

Thanks heaps,
NeilBrown

Comments

Al Viro April 24, 2015, 1:42 p.m. UTC | #1
On Fri, Apr 24, 2015 at 04:35:34PM +1000, NeilBrown wrote:
> > * path_init() arranges for that to be true in the beginning of the walk.
> > 
> > * symlinks aside, walk_component() preserves that.
> > 	+ normal name components go through lookup_fast(), where we have
> > 	  __d_lookup_rcu() find a child of current nd->path with matching
> > 	  name and (atomically) picks ->d_seq of that child, which had the
> > 	  name matching our component.  Atomicity is provided by ->d_lock
> > 	  on child.  Then we proceed to pick ->d_inode (and verify that
> 
> I don't see d_lock being taken  in __d_lookup_rcu.

Do'h...  No, it isn't - sorry about the braino.

> I think this atomicity is provided by ->d_seq on the child.

Yes.  It's fetch ->d_seq, compare names, then in caller fetch ->d_inode and
compare ->d_seq.

> So.....
> Where I have:
> 
> +                       if (nd->flags & LOOKUP_RCU) {
> +                               if (!nd->root.mnt)
> +                                       set_root_rcu(nd);
> +                               nd->path = nd->root;
> 
> 
> in the case where the symlink starts '/', I need to set nd->seq
> 
> 		nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
> 
> 
> but in the case where symlink doesn't start '/', I don't change nd->path,
> so nd->seq should still be correct?

It would be correct, if walk_component() hadn't _already_ flipped it to
match the symlink.  By that point it's too late - we'd already lost the
old value.  This is the area where it hits the fan:
                if (__read_seqcount_retry(&parent->d_seq, nd->seq))
                        return -ECHILD;
OK, parent is still valid
                nd->seq = seq;
... and we lose its seq number, replacing it with that of a child.

                if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE)) {
                        status = d_revalidate(dentry, nd->flags);
                        if (unlikely(status <= 0)) {
                                if (status != -ECHILD)
                                        need_reval = 0;
                                goto unlazy;
                        }
                }
                path->mnt = mnt;
                path->dentry = dentry;
set path to (nd->path.mnt, dentry of child)
                if (likely(__follow_mount_rcu(nd, path, inode)))
see if we need to (and can) cross the mountpoint here.  That can change
(vfsmount,dentry) pair *and* nd->seq - making it match whatever path->dentry
ends us being.
                        return 0;
if everything's fine, we return to caller, still in RCU mode.
unlazy:
                if (unlazy_walk(nd, dentry))
... and here we attempt to fall back to non-lazy mode, expecting nd->seq to
match dentry.  Win or lose, that's the last point where nd->seq will be looked
at.
                        return -ECHILD;

See the problem?  We have discarded the nd->seq value for parent and we can't
re-pick it without opening a race.

AFAICS, the simplest solution is to have nd->next_seq and set _that_ in
lookup_fast() and __follow_mount_rcu(), using it in unlazy_walk().  And
in callers have it copied into nd->seq after we'd established that it's not
a symlink to be followed.

Handling of absolute symlinks also needs some care - we have no idea whether
nd->root still makes any sense.  It might have become negative by that point.
unlazy_walk() done first would've checked it's still our process' root, but
if we stay in lazy mode, we obviously can't rely on that check.

One possible solution is to do the same kind of check we do in unlazy_walk() - 
        if (!(nd->flags & LOOKUP_ROOT)) {
                spin_lock(&fs->lock);
                if (nd->root.mnt != fs->root.mnt || nd->root.dentry != fs->root.dentry)
			bugger off
		fetch ->d_inode and ->d_seq
                spin_unlock(&fs->lock);
	} else {
		fetch ->d_inode and ->d_seq
	}
Another - to store the ->d_seq of root in nd->seq_root, have set_root_rcu()
set that instead of (or, better, in addition to) returning it to caller,
have path_init() initialize it explicitly in LOOKUP_ROOT case and have this
"reset to root" in following an absolute symlink just do
	set_root_rcu(nd);
	nd->path = nd->root;
	nd->seq = nd->seq_root;
	nd->inode = nd->path.dentry->d_inode;
	check if nd->path.dentry is stale, bugger off if it is

That avoids this spin_lock() on each absolute symlink at the price of extra
32 bits in struct nameidata.  It looks like doing on-demand reallocation
of nd->stack is the right way to go anyway, so the pressure on nameidata size
is going to be weaker and that might be the right way to go...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro May 4, 2015, 5:11 a.m. UTC | #2
On Fri, Apr 24, 2015 at 02:42:03PM +0100, Al Viro wrote:

> That avoids this spin_lock() on each absolute symlink at the price of extra
> 32 bits in struct nameidata.  It looks like doing on-demand reallocation
> of nd->stack is the right way to go anyway, so the pressure on nameidata size
> is going to be weaker and that might be the right way to go...

OK, on-demand reallocation is done.  What I have right now is
	* flat MAXSYMLINKS 40, no matter what kind of nesting there might
be.
	* purely iterative link_path_walk().
	* no damn nameidata on stack for generic_readlink()
	* stack footprint of the entire thing independent from the nesting
depth, and about on par with "no symlinks at all" case in mainline.
	* some massage towards RCU follow_link done (in the end of queue),
but quite a bit more work remains.

What I've got so far is in vfs.git#link_path_walk; I'm not too happy about
posting a 70-chunk mailbomb, but it really will need review and testing.
It survives xfstests and LTP with no regressions, but it will need
serious profiling, etc., along with RTFS.  I tried to keep it in reasonably
small pieces, but there's a lot of them ;-/

FWIW, I've a bit more reorganization plotted out, but it's not far from
where we need to be for RCU follow_link.  Some notes:
	* I don't believe we want to pass flags to ->follow_link() - it's
much simpler to give the damn thing NULL for dentry in RCU case.  In *all*
cases where we might have a change to get the symlink body without blocking
we can do that by inode alone.  We obviously want to pass dentry and inode
separately (and in case of fast symlinks we don't hit the filesystem at
all), but that's it - flags isn't needed.
	* terminate_walk() should do bulk put_link().  So should the
failure cases of complete_walk().  _Success_ of complete_walk() should
be careful about legitimizing links - it *can* be called with one link
on stack, and be followed by access to link body.  Yes, really - do_last()
in O_CREAT case.
	* do_last(), lookup_last() and mountpoint_last() ought to
have put_link() done when called on non-empty stack (thus turning the loops
into something like
                while ((err = lookup_last(nd)) > 0) {
                        err = trailing_symlink(nd);
                        if (err)
                                break;
                }
_After_ the point where they don't need to look at the last component of
name, obviously.
	* I think we should leave terminate_walk() to callers in failure
cases of walk_component() and handle_dots(), as well as get_link().  Makes
life simpler in callers, actually.  I'll play with that a bit more.
	* it might make sense to add the second flag to walk_component(),
in addition to LOOKUP_FOLLOW, meaning "do put_link() once you are done looking
at the name".  In principle, it promises simpler logics with unlazy_walk(),
but that's one area I'm still not entirely sure about.  Will need to
experiment a bit...
	* nd->seq clobbering will need to be dealt with, as discussed upthread.
	* I _really_ hate your "let's use the LSB of struct page * to tell
if we need to kunmap()" approach.  It's too damn ugly to live.  _And_ it's
trivial to avoid - all we need is to have (non-lazy) page_follow_link_light()
and page_symlink() to remove __GFP_HIGHMEM from inode->i_mapping before
ever asking to allocate pages there.  That'll suffice, and it makes sense
regardless of RCU work - that kmap/kunmap with potential for minutes in
between (while waiting for stuck NFS server in the middle of symlink traversal)
is simply wrong.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
NeilBrown May 4, 2015, 7:30 a.m. UTC | #3
On Mon, 4 May 2015 06:11:29 +0100 Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Fri, Apr 24, 2015 at 02:42:03PM +0100, Al Viro wrote:
> 
> > That avoids this spin_lock() on each absolute symlink at the price of extra
> > 32 bits in struct nameidata.  It looks like doing on-demand reallocation
> > of nd->stack is the right way to go anyway, so the pressure on nameidata size
> > is going to be weaker and that might be the right way to go...
> 
> OK, on-demand reallocation is done.  What I have right now is
> 	* flat MAXSYMLINKS 40, no matter what kind of nesting there might
> be.
> 	* purely iterative link_path_walk().
> 	* no damn nameidata on stack for generic_readlink()
> 	* stack footprint of the entire thing independent from the nesting
> depth, and about on par with "no symlinks at all" case in mainline.
> 	* some massage towards RCU follow_link done (in the end of queue),
> but quite a bit more work remains.
> 
> What I've got so far is in vfs.git#link_path_walk; I'm not too happy about
> posting a 70-chunk mailbomb, but it really will need review and testing.
> It survives xfstests and LTP with no regressions, but it will need
> serious profiling, etc., along with RTFS.  I tried to keep it in reasonably
> small pieces, but there's a lot of them ;-/
> 
> FWIW, I've a bit more reorganization plotted out, but it's not far from
> where we need to be for RCU follow_link.  Some notes:
> 	* I don't believe we want to pass flags to ->follow_link() - it's
> much simpler to give the damn thing NULL for dentry in RCU case.  In *all*
> cases where we might have a change to get the symlink body without blocking
> we can do that by inode alone.  We obviously want to pass dentry and inode
> separately (and in case of fast symlinks we don't hit the filesystem at
> all), but that's it - flags isn't needed.
> 	* terminate_walk() should do bulk put_link().  So should the
> failure cases of complete_walk().  _Success_ of complete_walk() should
> be careful about legitimizing links - it *can* be called with one link
> on stack, and be followed by access to link body.  Yes, really - do_last()
> in O_CREAT case.
> 	* do_last(), lookup_last() and mountpoint_last() ought to
> have put_link() done when called on non-empty stack (thus turning the loops
> into something like
>                 while ((err = lookup_last(nd)) > 0) {
>                         err = trailing_symlink(nd);
>                         if (err)
>                                 break;
>                 }
> _After_ the point where they don't need to look at the last component of
> name, obviously.
> 	* I think we should leave terminate_walk() to callers in failure
> cases of walk_component() and handle_dots(), as well as get_link().  Makes
> life simpler in callers, actually.  I'll play with that a bit more.
> 	* it might make sense to add the second flag to walk_component(),
> in addition to LOOKUP_FOLLOW, meaning "do put_link() once you are done looking
> at the name".  In principle, it promises simpler logics with unlazy_walk(),
> but that's one area I'm still not entirely sure about.  Will need to
> experiment a bit...
> 	* nd->seq clobbering will need to be dealt with, as discussed upthread.
> 	* I _really_ hate your "let's use the LSB of struct page * to tell
> if we need to kunmap()" approach.  It's too damn ugly to live.  _And_ it's
> trivial to avoid - all we need is to have (non-lazy) page_follow_link_light()
> and page_symlink() to remove __GFP_HIGHMEM from inode->i_mapping before
> ever asking to allocate pages there.  That'll suffice, and it makes sense
> regardless of RCU work - that kmap/kunmap with potential for minutes in
> between (while waiting for stuck NFS server in the middle of symlink traversal)
> is simply wrong.


Thanks!
I'll have another look and see about adding what is needed for RCU symlink
support.

NeilBrown
diff mbox

Patch

diff --git a/fs/namei.c b/fs/namei.c
index d13b4315447f..ce6387d5317c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -947,6 +947,8 @@  follow_link(struct path *link, struct nameidata *nd, void **p)
 				if (!nd->root.mnt)
 					set_root_rcu(nd);
 				nd->path = nd->root;
+				nd->seq = read_seqcount_begin(
+					&nd->path->dentry->d_seq);
 			} else {
 				if (!nd->root.mnt)
 					set_root(nd);
@@ -954,9 +956,9 @@  follow_link(struct path *link, struct nameidata *nd, void **p)
 				nd->path = nd->root;
 				path_get(&nd->root);
 			}
+			nd->inode = nd->path.dentry->d_inode;
 			nd->flags |= LOOKUP_JUMPED;
 		}
-		nd->inode = nd->path.dentry->d_inode;
 		error = link_path_walk(s, nd);
 		if (unlikely(error))
 			put_link(nd, link, inode, *p);