diff mbox

[review,4/6] mnt: Track when a directory escapes a bind mount

Message ID 87egjk9i61.fsf_-_@x220.int.ebiederm.org (mailing list archive)
State New, archived
Headers show

Commit Message

Eric W. Biederman Aug. 3, 2015, 9:27 p.m. UTC
When bind mounts are in use, and there is another path to the
filesystem it is possible to rename files or directories from a path
underneath the root of the bind mount to a path that is not underneath
the root of the bind mount.

When a directory is moved out from under the root of a bind mount path
name lookups that go up the directory tree potentially allow accessing
the entire dentry tree of the filesystem.  This is not expected, not
what is desired and winds up being a secruity problem for userspace.

Augment d_move, d_exchange, and __d_unalias with matching calls to
lock_namespace_rename and unlock_namespace_rename, to mark mounts
that directories have escaped from.

A few notes on the implementation:

- The escape count on struct mount must be incremented both before the
  rename and after.  If the count is not incremented before the rename
  it is possible to hit a scenario where the rename happens the code
  walks up the directory tree to somewhere outside of the bind mount
  before the count is touched.  Similary without a count after the
  rename it is possible for the code to look at the escape count
  validate a path is connected before the rename and assume cache the
  escape count, leading to not retesting the path is ok.

- A lock either namespace_sem or mount_lock needs to be held across
  the duration of renames where a directory could be escaping to
  guarantee pairing of the escape_count increments and to ensure
  that a mount is not added, escaped, and missed during the rename.

- The locking order must be mount_lock outside of rename_lock
  as prepend_path already takes the locks in this order.

- I have audited all callers of d_move and d_exchange and in every
  instance it appears safe for d_move and d_exchange to start
  sleeping.  d_splice_alias already sleeps in security_d_instantiate
  so no audit was needed for it to begin sleeping.

  As I can just take mount_lock I don't use that freedom in this
  change, but it can be relevant to small changes to the locking in
  this code.

- The largest change is in d_unalias, where the two cases are split
  apart so they can be handled separately.  In the easy case of a
  rename within the same directory all that is needed is __d_move
  (escaping a mount is impossible in that case).  In the more involved
  case mutexes need to be acquired, and now the spin locks need to be
  dropped so that proper lock aquisition order around __d_move can be
  arranged.

  As I read the code inode->i_lock needs to be held until
  ailas->d_parent->d_parent->i_mutex is taken.  The only case I can
  see that removes an inode from a dentry is d_delete called from a
  path like vfs_unlink.  Those paths all take the parent directories
  inode mutex.  Thus once the parent directories inode mutex is held
  it becomes unnecessary to hold inode->i_lock to ensure the alias
  remains an alias.

  Similarly the rename_lock does not need to be held once the
  s_vfs_rename_mutex is taken.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/dcache.c    |  46 +++++++++++++++++----
 fs/mount.h     |  18 +++++++++
 fs/namespace.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 180 insertions(+), 7 deletions(-)

Comments

Al Viro Aug. 10, 2015, 4:36 a.m. UTC | #1
On Mon, Aug 03, 2015 at 04:27:34PM -0500, Eric W. Biederman wrote:

> - The escape count on struct mount must be incremented both before the
>   rename and after.  If the count is not incremented before the rename
>   it is possible to hit a scenario where the rename happens the code
>   walks up the directory tree to somewhere outside of the bind mount
>   before the count is touched.  Similary without a count after the
>   rename it is possible for the code to look at the escape count
>   validate a path is connected before the rename and assume cache the
>   escape count, leading to not retesting the path is ok.

Umm...  I wonder if you are overcomplicating the things here.  Sure,
I understand wanting to reduce the checks on "..", but...  It costs you
considerable complexity (especially when it comes to 64bit counts),
it's really brittle (you need to be very careful about the places where
you zero the cached values in fs/namei.c and missing one will lead to
really unpleasant effects there) _and_ it's all for the benefit of 
a very rare case.  With check you are optimizing away not being all that
costly anyway.
 
> - The largest change is in d_unalias, where the two cases are split
>   apart so they can be handled separately.  In the easy case of a
>   rename within the same directory all that is needed is __d_move
>   (escaping a mount is impossible in that case).  In the more involved
>   case mutexes need to be acquired, and now the spin locks need to be
>   dropped so that proper lock aquisition order around __d_move can be
>   arranged.

I _really_ hate that part.  Could you explain WTF is wrong with simply
taking mount_lock in that case of __d_splice_alias() just outside of
rename_lock?

> +	unlock = lock_namespace_rename(dentry, target, false);
> +
>  	write_seqlock(&rename_lock);
>  	__d_move(dentry, target, false);
>  	write_sequnlock(&rename_lock);
> +
> +	if (unlock)
> +		unlock_namespace_rename(unlock, dentry, target, false);
> +

Your unlock_namespace_rename() should've been a static inline.
With the check of unlock != NULL done in there.  Two such inlines,
actually, and to hell with the boolean argument.  Same split for the
lock counterpart, of course.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Aug. 10, 2015, 4:43 a.m. UTC | #2
On Mon, Aug 10, 2015 at 05:36:37AM +0100, Al Viro wrote:

> Your unlock_namespace_rename() should've been a static inline.
> With the check of unlock != NULL done in there.  Two such inlines,
> actually, and to hell with the boolean argument.  Same split for the
> lock counterpart, of course.

PS: that thing should be in fs/dcache.c, at least in the part that
deals with finding the common ancestor, etc.  And __d_move() (and
dentry_lock_for_move()) games with d_ancestor() should be redundant now.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman Aug. 14, 2015, 4:10 a.m. UTC | #3
Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Mon, Aug 03, 2015 at 04:27:34PM -0500, Eric W. Biederman wrote:
>
>> - The escape count on struct mount must be incremented both before the
>>   rename and after.  If the count is not incremented before the rename
>>   it is possible to hit a scenario where the rename happens the code
>>   walks up the directory tree to somewhere outside of the bind mount
>>   before the count is touched.  Similary without a count after the
>>   rename it is possible for the code to look at the escape count
>>   validate a path is connected before the rename and assume cache the
>>   escape count, leading to not retesting the path is ok.
>
> Umm...  I wonder if you are overcomplicating the things here.  Sure,
> I understand wanting to reduce the checks on "..", but...  It costs you
> considerable complexity (especially when it comes to 64bit counts),
> it's really brittle (you need to be very careful about the places where
> you zero the cached values in fs/namei.c and missing one will lead to
> really unpleasant effects there) _and_ it's all for the benefit of 
> a very rare case.  With check you are optimizing away not being all that
> costly anyway.

I had to give this a long hard think.  Algorithms going to O(N^2) when
it is uncessarry really bother me.  I ran some numbers for really deep
directory trees, slow memory, etc and I could not come up with a
scenario where even in it's worst case d_ancestor would take anywhere
near as long as a one disk seek, and most of the d_ancestor would be
much quicker.

So it appears to me that in the worst case a pathname lookup consisting
of a ridiculous number of .. components starting with a cold cache, on a
mount where a directory has escaped is likely to be faster than a
similar lookup going down the tree with many disk seeks.

I don't think the 64bit counts and the zeroing the cache values are
quite as bad as you make out.  There are much trickier things already in
path name lookup code.  But I do agree that it is easy to get wrong
because nothing will show up in testing, and getting it wrong will have
really unpleasant effects.

I also can't see a scenario where a directory would escape a subtree
that is mounted somewhere without it being a misconfiguration.

So I agree it is not worth it to optimize the code so that there
are an absolute minimum number of d_ancestor calls during pathname
lookup.

Further replacing mnt_escape_count with a mnt_flag makes the code much
simpler.  Which I very much appreciate.

>> - The largest change is in d_unalias, where the two cases are split
>>   apart so they can be handled separately.  In the easy case of a
>>   rename within the same directory all that is needed is __d_move
>>   (escaping a mount is impossible in that case).  In the more involved
>>   case mutexes need to be acquired, and now the spin locks need to be
>>   dropped so that proper lock aquisition order around __d_move can be
>>   arranged.
>
> I _really_ hate that part.  Could you explain WTF is wrong with simply
> taking mount_lock in that case of __d_splice_alias() just outside of
> rename_lock?

Me too.  So upon realizing the that inode->i_lock is held longer than
necessary in d_splice_alias I reworked the locking in d_splice_alias.

Updated patches to follow in a little bit.

> PS: that thing should be in fs/dcache.c, at least in the part that
> deals with finding the common ancestor, etc.  And __d_move() (and
> dentry_lock_for_move()) games with d_ancestor() should be redundant
> now.

It does seem reasonable that the BUG_ONs in __d_move that call
d_ancestor can be removed, or simplified by passing the common ancestor
into __d_move.

I don't know the code in dentry_lock_for_move well enough to say
anything except that the d_ancestor call in dentry_lock_for_move looks
reasonable.

Doing anything inside of __d_move or dentry_lock_for_move appears
to be a detour to the cause of preventing escaping from bind mounts.
So while I have no problems with the with the kinds of changes I hear
you suggesting, but unless I encounter something that makes changing
__d_move or dentry_lock_for_more relevant to the work of preventing
escaping from bind mounts I don't plan on touching them while that
is my focus.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman Aug. 14, 2015, 4:29 a.m. UTC | #4
It is possible in some situations to rename a file or directory through
one mount point such that it can start out inside of a bind mount and
after the rename wind up outside of the bind mount.  Unfortunately with
user namespaces these conditions can be trivially created by creating a
bind mount under an existing bind mount.

I have identified four situations in which this may be a problem.
- __d_path and d_absolute_path need to error on disconnected paths
  that can not reach some root directory or lsm path based security
  checks can incorrectly succeed.

- Normal path name resolution following .. can lead to a directory
  that is outside of the original loopback mount.

- file handle reconsititution aka exportfs_decode_fh can yield a dentry
  from which d_parent can be followed up to mnt->sb->s_root, but
  d_parent can not be followed up to mnt->mnt_root.

- Mounts on a path that has been renamed outside of a loopback mount
  become unreachable, as there is no possible path that can be passed
  to umount to unmount them.

My strategy:

o File handle reconsitituion problems can be prevented by enabling
  the nfsd subtree checks for nfs exports, and open_by_handle_at
  requires capable(CAP_DAC_READ_SEARCH) so is only usable by the global
  root.  This makes any problems difficult if not impossible to exploit
  in practice so I have not yet written code to address that issue.

o The functions __d_path and d_absolute_path are agumented so that the
  security modules will not be fed a problematic path to work with.

o Following of .. has been agumented to test that after d_parent has
  been resolved the original  directory is connected, and if not
  an error of -ENOENT is returned.

o I do not worry about mounts that are disconnected from their bind
  mount as these mounts can always be freed by either umount -l on
  the bind mount they have escaped from, or by freeing the mount
  namespace.  So I do not believe there is an actual problem.

Pathname resolution is a common fast path and most of the code in this
patchset to support keeping .. from becoming expensive in the common
case.

After hearing the Al's feedback and running some numbers I have given
up attempting to keeping the number of d_ancestor calls during pathname
resolution to an absolute minimum.  It appears that simply preventing
calls d_ancestor unless a directory has escaped is good enough.  This
change in approach has significantly simplified the code.

The big implementation change to note is that I have rewritten
d_splice_alias and made some significant progress in cleaning up how the
locks are dealt with.  The only limitation now is that
dentry->d_parent->d_inode->i_mutex is taken in lookup held when
d_splice_alias is called.    If that ever goes away my new
d_splice_alias can easily take all of the locks it needs to rename
a directory in the proper order.

Does anyone see anything significant that I have missed?

These changes are all against v4.2-rc1. 

For those who like to see everything in a single tree the code is at:

     git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing

Eric W. Biederman (8):
      dcache: Handle escaped paths in prepend_path
      dcache: Reduce the scope of i_lock in d_splice_alias
      dcache: Clearly separate the two directory rename cases in d_splice_alias
      mnt: Track which mounts use a dentry as root.
      dcache: Implement d_common_ancestor
      dcache: Only read d_flags once is d_is_dir
      mnt: Track when a directory escapes a bind mount
      vfs: Test for and handle paths that are unreachable from their mnt_root

 fs/dcache.c            | 193 +++++++++++++++++++++++++++++++++++++------------
 fs/mount.h             |   9 +++
 fs/namei.c             |  26 ++++++-
 fs/namespace.c         | 152 +++++++++++++++++++++++++++++++++++++-
 include/linux/dcache.h |  11 ++-
 include/linux/mount.h  |   1 +
 6 files changed, 338 insertions(+), 54 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/dcache.c b/fs/dcache.c
index 9f4de1007a8d..c25ef7ef8e7f 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2707,9 +2707,17 @@  static void __d_move(struct dentry *dentry, struct dentry *target,
  */
 void d_move(struct dentry *dentry, struct dentry *target)
 {
+	const struct dentry *unlock;
+
+	unlock = lock_namespace_rename(dentry, target, false);
+
 	write_seqlock(&rename_lock);
 	__d_move(dentry, target, false);
 	write_sequnlock(&rename_lock);
+
+	if (unlock)
+		unlock_namespace_rename(unlock, dentry, target, false);
+
 }
 EXPORT_SYMBOL(d_move);
 
@@ -2720,6 +2728,10 @@  EXPORT_SYMBOL(d_move);
  */
 void d_exchange(struct dentry *dentry1, struct dentry *dentry2)
 {
+	const struct dentry *unlock;
+
+	unlock = lock_namespace_rename(dentry1, dentry2, true);
+
 	write_seqlock(&rename_lock);
 
 	WARN_ON(!dentry1->d_inode);
@@ -2730,6 +2742,9 @@  void d_exchange(struct dentry *dentry1, struct dentry *dentry2)
 	__d_move(dentry1, dentry2, true);
 
 	write_sequnlock(&rename_lock);
+
+	if (unlock)
+		unlock_namespace_rename(unlock, dentry1, dentry2, true);
 }
 
 /**
@@ -2764,11 +2779,15 @@  static int __d_unalias(struct inode *inode,
 		struct dentry *dentry, struct dentry *alias)
 {
 	struct mutex *m1 = NULL, *m2 = NULL;
-	int ret = -ESTALE;
+	const struct dentry *unlock;
 
 	/* If alias and dentry share a parent, then no extra locks required */
-	if (alias->d_parent == dentry->d_parent)
-		goto out_unalias;
+	if (alias->d_parent == dentry->d_parent) {
+		__d_move(alias, dentry, false);
+		spin_unlock(&inode->i_lock);
+		write_sequnlock(&rename_lock);
+		return 0;
+	}
 
 	/* See lock_rename() */
 	if (!mutex_trylock(&dentry->d_sb->s_vfs_rename_mutex))
@@ -2777,16 +2796,30 @@  static int __d_unalias(struct inode *inode,
 	if (!mutex_trylock(&alias->d_parent->d_inode->i_mutex))
 		goto out_err;
 	m2 = &alias->d_parent->d_inode->i_mutex;
-out_unalias:
+
+	spin_unlock(&inode->i_lock);
+	write_sequnlock(&rename_lock);
+
+	unlock = lock_namespace_rename(alias, dentry, false);
+
+	write_seqlock(&rename_lock);
 	__d_move(alias, dentry, false);
-	ret = 0;
+	write_sequnlock(&rename_lock);
+
+	if (unlock)
+		unlock_namespace_rename(unlock, alias, dentry, false);
+
+	mutex_unlock(m2);
+	mutex_unlock(m1);
+	return 0;
 out_err:
 	spin_unlock(&inode->i_lock);
+	write_sequnlock(&rename_lock);
 	if (m2)
 		mutex_unlock(m2);
 	if (m1)
 		mutex_unlock(m1);
-	return ret;
+	return -ESTALE;
 }
 
 /**
@@ -2841,7 +2874,6 @@  struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
 					inode->i_sb->s_id);
 			} else if (!IS_ROOT(new)) {
 				int err = __d_unalias(inode, dentry, new);
-				write_sequnlock(&rename_lock);
 				if (err) {
 					dput(new);
 					new = ERR_PTR(err);
diff --git a/fs/mount.h b/fs/mount.h
index e8f22970fe59..d32d074cc0d4 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -38,6 +38,7 @@  struct mount {
 	struct mount *mnt_parent;
 	struct dentry *mnt_mountpoint;
 	struct vfsmount mnt;
+	unsigned mnt_escape_count;
 	union {
 		struct rcu_head mnt_rcu;
 		struct llist_node mnt_llist;
@@ -107,6 +108,23 @@  static inline void detach_mounts(struct dentry *dentry)
 	__detach_mounts(dentry);
 }
 
+extern const struct dentry *lock_namespace_rename(struct dentry *, struct dentry *, bool);
+extern void unlock_namespace_rename(const struct dentry *, struct dentry *, struct dentry *, bool);
+
+static inline unsigned read_mnt_escape_count(struct vfsmount *vfsmount)
+{
+	struct mount *mnt = real_mount(vfsmount);
+	unsigned ret = READ_ONCE(mnt->mnt_escape_count);
+	smp_rmb();
+	return ret;
+}
+
+static inline void cache_mnt_escape_count(unsigned *cache, unsigned escape_count)
+{
+	if (likely(escape_count & 1) == 0)
+		*cache = escape_count;
+}
+
 static inline void get_mnt_ns(struct mnt_namespace *ns)
 {
 	atomic_inc(&ns->count);
diff --git a/fs/namespace.c b/fs/namespace.c
index 2ce987af9afa..9faec24f3f23 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1681,6 +1681,129 @@  out_unlock:
 	namespace_unlock();
 }
 
+static void lock_escaped_mounts_begin(struct dentry *root)
+{
+	struct mountroot *mr;
+	struct mount *mnt;
+
+	mr = lookup_mountroot(root);
+	if (mr) {
+		/* Mark each mount from which a directory is escaping.
+		 */
+		hlist_for_each_entry(mnt, &mr->r_list, mnt_mr_list) {
+			/* Don't return to 0 if the couunt wraps */
+			if (unlikely(mnt->mnt_escape_count == (0U - 2)))
+				mnt->mnt_escape_count = 1;
+			else
+				mnt->mnt_escape_count++;
+			smp_wmb();
+		}
+	}
+}
+
+static void lock_escaped_mounts_end(struct dentry *root)
+{
+	struct mountroot *mr;
+	struct mount *mnt;
+
+	mr = lookup_mountroot(root);
+	if (mr) {
+		/* Mark each mount from which a directory is escaping.
+		 */
+		hlist_for_each_entry(mnt, &mr->r_list, mnt_mr_list) {
+			smp_wmb();
+			mnt->mnt_escape_count++;
+		}
+	}
+}
+
+static void handle_mount_escapes_begin(const struct dentry *ancestor,
+				       struct dentry *escapee)
+{
+	struct dentry *dentry;
+
+	/* Don't look for non-directory escapes */
+	if (!d_is_dir(escapee))
+		return;
+
+	for (dentry = escapee->d_parent; dentry != ancestor;
+	     dentry = dentry->d_parent) {
+
+		if (d_mountroot(dentry))
+			lock_escaped_mounts_begin(dentry);
+
+		/* In case there is no common ancestor */
+		if (IS_ROOT(dentry))
+			break;
+	}
+}
+
+static void handle_mount_escapes_end(const struct dentry *ancestor, struct dentry *escapee)
+{
+	struct dentry *dentry;
+
+	/* Don't look for non-directory escapes */
+	if (!d_is_dir(escapee))
+		return;
+
+	for (dentry = escapee->d_parent; dentry != ancestor;
+	     dentry = dentry->d_parent) {
+
+		if (d_mountroot(dentry))
+			lock_escaped_mounts_end(dentry);
+
+		/* In case there is no common ancestor */
+		if (IS_ROOT(dentry))
+			break;
+	}
+	return;
+}
+
+const struct dentry *lock_namespace_rename(struct dentry *dentry1,
+					   struct dentry *dentry2, bool exchange)
+{
+	const struct dentry *ancestor;
+
+	if (dentry1->d_parent == dentry2->d_parent)
+		return NULL;
+
+	if (!d_is_dir(dentry1) && (!exchange || !d_is_dir(dentry2)))
+		return NULL;
+
+	ancestor = d_common_ancestor(dentry1, dentry2);
+
+	read_seqlock_excl(&mount_lock);
+	if (!exchange) {
+		handle_mount_escapes_begin(ancestor, dentry1);
+	} else {
+		handle_mount_escapes_begin(ancestor, dentry1);
+		handle_mount_escapes_begin(ancestor, dentry2);
+	}
+
+	if (ancestor == NULL)
+		ancestor = ERR_PTR(-ENOENT);
+
+	return ancestor;
+}
+
+void unlock_namespace_rename(const struct dentry *ancestor, struct dentry *dentry1,
+			     struct dentry *dentry2, bool exchange)
+{
+	if (!ancestor)
+		return;
+
+	if (ancestor == ERR_PTR(-ENOENT))
+		ancestor = NULL;
+
+	if (!exchange) {
+		handle_mount_escapes_end(ancestor, dentry2);
+	} else {
+		handle_mount_escapes_end(ancestor, dentry2);
+		handle_mount_escapes_end(ancestor, dentry1);
+	}
+	read_sequnlock_excl(&mount_lock);
+}
+
 /* 
  * Is the caller allowed to modify his namespace?
  */