diff mbox series

[RFC] VFS: lock source directory for link to avoid rename race.

Message ID 166330881189.15759.13499931397891560275@noble.neil.brown.name (mailing list archive)
State New, archived
Headers show
Series [RFC] VFS: lock source directory for link to avoid rename race. | expand

Commit Message

NeilBrown Sept. 16, 2022, 6:13 a.m. UTC
rename(2) is documented as

       If newpath already exists, it will be atomically replaced, so
       that there is no point at which another process attempting to
       access newpath will find it missing.

However link(2) from a given path can race with rename renaming to that
path so that link gets -ENOENT because the path has already been unlinked
by rename, and creating a link to an unlinked file is not permitted.

This can be fixed by locking the source directory before performing the
lookup of the final component.  The lock blocks rename from changing
that component.

We already lock the target directory, so when they are different we need
to be careful about deadlocks.  In the worst case we can use the same
strategy as lock_rename() however the cost of s_vfs_rename_mutex is not
always needed and is best avoided.

Firstly we lock the target and if the source is different we try-lock
that.  This cannot deadlock as we never block while holding a lock.  If
the trylock fails, we drop the first lock, take s_vfs_rename_mutex, and
follow the same pattern as rename_lock().  We only take a shared lock on
the source directory.

->link() functions cannot expect the source directory to be locked as
some callers of vfs_link() already have a dentry and do not perform the
lookup that can race with rename.  nfsd is a clear example - if there is
a race it can happen on the client or between clients, and NFS as
specified cannot avoid that.

The handling of AT_EMPTY_PATH is a little inelegant.

Reported-by: Xavier Roche <xavier.roche@algolia.com>
Link: https://lore.kernel.org/all/20220214210708.GA2167841@xavier-xps/
Fixes: aae8a97d3ec3 ("fs: Don't allow to create hardlink for deleted file")
Reported-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
 .../filesystems/directory-locking.rst         |  25 +++-
 Documentation/filesystems/locking.rst         |   6 +-
 fs/namei.c                                    | 119 ++++++++++++++----
 3 files changed, 124 insertions(+), 26 deletions(-)

Comments

Miklos Szeredi Sept. 16, 2022, 6:28 a.m. UTC | #1
On Fri, 16 Sept 2022 at 08:13, NeilBrown <neilb@suse.de> wrote:

> @@ -4554,44 +4590,83 @@ int do_linkat(int olddfd, struct filename *old, int newdfd,
>         if (flags & AT_SYMLINK_FOLLOW)
>                 how |= LOOKUP_FOLLOW;
>  retry:
> -       error = filename_lookup(olddfd, old, how, &old_path, NULL);
> +       err2 = 0;
> +       error = filename_parentat(olddfd, old, how, &old_path,
> +                                 &old_last, &old_type);
>         if (error)
>                 goto out_putnames;
> +       error = -EISDIR;
> +       if (old_type != LAST_NORM && !(flags & AT_EMPTY_PATH))
> +               goto out_putnames;
> +       error = filename_parentat(newdfd, new, (how & LOOKUP_REVAL), &new_path,
> +                                 &new_last, &new_type);
> +       if (error)
> +               goto out_putoldpath;
>
> -       new_dentry = filename_create(newdfd, new, &new_path,
> -                                       (how & LOOKUP_REVAL));
> -       error = PTR_ERR(new_dentry);
> -       if (IS_ERR(new_dentry))
> -               goto out_putpath;
> +       err2 = mnt_want_write(new_path.mnt);
>
>         error = -EXDEV;
>         if (old_path.mnt != new_path.mnt)
> -               goto out_dput;
> +               goto out_putnewpath;
> +       lock_link(new_path.dentry, old_path.dentry, flags);
> +
> +       new_dentry = __lookup_hash(&new_last, new_path.dentry, how & LOOKUP_REVAL);
> +       error = PTR_ERR(new_dentry);
> +       if (IS_ERR(new_dentry))
> +               goto out_unlock;
> +       error = -EEXIST;
> +       if (d_is_positive(new_dentry))
> +               goto out_dput_new;
> +       if (new_type != LAST_NORM)
> +               goto out_dput_new;
> +
> +       error = err2;
> +       if (error)
> +               goto out_dput_new;
> +
> +       if (flags & AT_EMPTY_PATH)
> +               old_dentry = dget(old_path.dentry);
> +       else
> +               old_dentry = __lookup_hash(&old_last, old_path.dentry, how);

This will break AT_SYMLINK_FOLLOW.

And yes, we can add all the lookup logic to do_linkat() at which point
it will about 10x more complex than it was.

Thanks,
Miklos
NeilBrown Sept. 16, 2022, 6:45 a.m. UTC | #2
On Fri, 16 Sep 2022, Miklos Szeredi wrote:
> 
> This will break AT_SYMLINK_FOLLOW.
> 
> And yes, we can add all the lookup logic to do_linkat() at which point
> it will about 10x more complex than it was.

Excellent point.  I'll give that some thought.  Thanks.

NeilBrown
Al Viro Sept. 16, 2022, 6:49 a.m. UTC | #3
On Fri, Sep 16, 2022 at 08:28:06AM +0200, Miklos Szeredi wrote:

> This will break AT_SYMLINK_FOLLOW.

Right you are.

> And yes, we can add all the lookup logic to do_linkat() at which point
> it will about 10x more complex than it was.

Especially since you can't reject an apparent cross-fs link until you'v
looked the fucker up, since it just might be a symlink to be followed.
Which means it would have to be something like
	find parents
again:
	if on different mounts
		if !follow
			fuck off
		lock old parent
		look the last component up
		if not an existing symlink
			fuck off
		unlock the parent and try to follow that symlink
		goto again
	lock parents
	look the last components up
	if symlink to be followed
		unlock parents
		try to follow symlink
		goto again
	proceed

Not exactly fatal, but...
Amir Goldstein Sept. 16, 2022, 2:32 p.m. UTC | #4
On Fri, Sep 16, 2022 at 9:26 AM NeilBrown <neilb@suse.de> wrote:
>
>
> rename(2) is documented as
>
>        If newpath already exists, it will be atomically replaced, so
>        that there is no point at which another process attempting to
>        access newpath will find it missing.
>
> However link(2) from a given path can race with rename renaming to that
> path so that link gets -ENOENT because the path has already been unlinked
> by rename, and creating a link to an unlinked file is not permitted.
>

I have to ask. Is this a real problem or just a matter of respecting
the laws of this man page?

If we manage to return EBUSY in that case to link(2)
will everyone be happy and we can avoid trying to make link(2)
atomic w.r.t. rename(2)?

Thanks,
Amir.
Christian Brauner Sept. 19, 2022, 8:28 a.m. UTC | #5
On Fri, Sep 16, 2022 at 05:32:45PM +0300, Amir Goldstein wrote:
> On Fri, Sep 16, 2022 at 9:26 AM NeilBrown <neilb@suse.de> wrote:
> >
> >
> > rename(2) is documented as
> >
> >        If newpath already exists, it will be atomically replaced, so
> >        that there is no point at which another process attempting to
> >        access newpath will find it missing.
> >
> > However link(2) from a given path can race with rename renaming to that
> > path so that link gets -ENOENT because the path has already been unlinked
> > by rename, and creating a link to an unlinked file is not permitted.
> >
> 
> I have to ask. Is this a real problem or just a matter of respecting
> the laws of this man page?

I have to say that I have the same reaction. The commit message doesn't
really explain where the current behavior becomes an issue and whether
there are any users seeing issues with this. And the patch makes
do_linkat() way more complex than it was before.
NeilBrown Sept. 19, 2022, 10:56 p.m. UTC | #6
On Mon, 19 Sep 2022, Christian Brauner wrote:
> On Fri, Sep 16, 2022 at 05:32:45PM +0300, Amir Goldstein wrote:
> > On Fri, Sep 16, 2022 at 9:26 AM NeilBrown <neilb@suse.de> wrote:
> > >
> > >
> > > rename(2) is documented as
> > >
> > >        If newpath already exists, it will be atomically replaced, so
> > >        that there is no point at which another process attempting to
> > >        access newpath will find it missing.
> > >
> > > However link(2) from a given path can race with rename renaming to that
> > > path so that link gets -ENOENT because the path has already been unlinked
> > > by rename, and creating a link to an unlinked file is not permitted.
> > >
> > 
> > I have to ask. Is this a real problem or just a matter of respecting
> > the laws of this man page?
> 
> I have to say that I have the same reaction. The commit message doesn't
> really explain where the current behavior becomes an issue and whether
> there are any users seeing issues with this. And the patch makes
> do_linkat() way more complex than it was before.
> 

A bug is a bug .... and in this case it is an intriguing puzzle too.

Yes, the commit message could say a bit more about context.

The patch also isn't correct, so the complexity is not relevant in this
case.  Some complexity will likely been needed (I do have a really
simple patch that just retries the whole op, but I don't think that is
safe), and we do need to balance the complexity against the value.
Ideally we could end up making the code simpler ...  I'm not sure I can
manage that though :-)

Thanks,
NeilBrown
kernel test robot Sept. 23, 2022, 3:02 a.m. UTC | #7
Greeting,

FYI, we noticed the following commit (built with gcc-11):

commit: 3fb4ec6faac286d97e27f48715f2cda56a701cd3 ("[PATCH RFC] VFS: lock source directory for link to avoid rename race.")
url: https://github.com/intel-lab-lkp/linux/commits/NeilBrown/VFS-lock-source-directory-for-link-to-avoid-rename-race/20220916-141546
base: https://git.kernel.org/cgit/linux/kernel/git/viro/vfs.git for-next
patch link: https://lore.kernel.org/linux-fsdevel/166330881189.15759.13499931397891560275@noble.neil.brown.name

in testcase: ltp
version: ltp-x86_64-14c1f76-1_20220829
with following parameters:

	disk: 1HDD
	fs: ext4
	test: syscalls-01

test-description: The LTP testsuite contains a collection of tools for testing the Linux kernel and related features.
test-url: http://linux-test-project.github.io/


on test machine: 4 threads Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz (Skylake) with 32G memory

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):




If you fix the issue, kindly add following tag
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Link: https://lore.kernel.org/r/202209231042.4f771ffe-oliver.sang@intel.com



<<<test_start>>>
tag=linkat02 stime=1663772091
cmdline="linkat02"
contacts=""
analysis=exit
<<<test_output>>>
mke2fs 1.46.5 (30-Dec-2021)
linkat02    0  TINFO  :  Found free device 0 '/dev/loop0'
linkat02    0  TINFO  :  Formatting /dev/loop0 with ext2 opts='' extra opts=''
linkat02    0  TINFO  :  the maximum number of hard links to emlink_dir/testfile0 is hit: 65000
linkat02    1  TPASS  :  linkat failed as expected: TEST_ERRNO=ENAMETOOLONG(36): File name too long
linkat02    2  TPASS  :  linkat failed as expected: TEST_ERRNO=ENAMETOOLONG(36): File name too long
linkat02    3  TPASS  :  linkat failed as expected: TEST_ERRNO=EEXIST(17): File exists
linkat02    4  TFAIL  :  linkat02.c:132: linkat failed unexpectedly; expected: 40 - Too many levels of symbolic links: TEST_ERRNO=EEXIST(17): File exists
linkat02    5  TPASS  :  linkat failed as expected: TEST_ERRNO=EACCES(13): Permission denied
linkat02    6  TFAIL  :  linkat02.c:132: linkat failed unexpectedly; expected: 30 - Read-only file system: TEST_ERRNO=EXDEV(18): Invalid cross-device link
linkat02    7  TPASS  :  linkat failed as expected: TEST_ERRNO=EMLINK(31): Too many links



To reproduce:

        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        sudo bin/lkp install job.yaml           # job file is attached in this email
        bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
        sudo bin/lkp run generated-yaml-file

        # if come across any failure that blocks the test,
        # please remove ~/.lkp and /lkp dir to run from a clean state.
diff mbox series

Patch

diff --git a/Documentation/filesystems/directory-locking.rst b/Documentation/filesystems/directory-locking.rst
index 504ba940c36c..da6fa5eff81d 100644
--- a/Documentation/filesystems/directory-locking.rst
+++ b/Documentation/filesystems/directory-locking.rst
@@ -11,7 +11,7 @@  When taking the i_rwsem on multiple non-directory objects, we
 always acquire the locks in order by increasing address.  We'll call
 that "inode pointer" order in the following.
 
-For our purposes all operations fall in 5 classes:
+For our purposes all operations fall in 7 classes:
 
 1) read access.  Locking rules: caller locks directory we are accessing.
 The lock is taken shared.
@@ -31,9 +31,9 @@  Then call the method.  All locks are exclusive.
 NB: we might get away with locking the source (and target in exchange
 case) shared.
 
-5) link creation.  Locking rules:
+5) link creation - source and target in name directory.  Locking rules:
 
-	* lock parent
+	* lock parent before looking up base names
 	* check that source is not a directory
 	* lock source
 	* call the method.
@@ -58,6 +58,22 @@  rules:
 All ->i_rwsem are taken exclusive.  Again, we might get away with locking
 the source (and target in exchange case) shared.
 
+7) cross-directory link.  This requires the source directory to be locked
+so the source cannot be the target of a rename and so be unlinked before
+the link happens (creating a link to an unlinked file is illegal).
+
+Same rules as cross-directory rename can be used (with different errors).
+Locking the filesystem is expensive and often unnecessary so we have a 
+fast path that avoids it.  Locking rules:
+
+	* lock target parent
+	* trylock source parent.  If this fails we unlock target parent
+	  and fall back to full rename locking, then unlock filesystem once
+          directory locks are held.
+	* lookup base names
+
+Lock on source may be shared.
+
 The rules above obviously guarantee that all directories that are going to be
 read, modified or removed by method will be locked by caller.
 
@@ -101,6 +117,9 @@  non-directory objects are not included in the set of contended locks.
 Thus link creation can't be a part of deadlock - it can't be
 blocked on source and it means that it doesn't hold any locks.
 
+The fast-path in link create cannot deadlock as it never blocks while 
+holding a lock.
+
 Any contended object is either held by cross-directory rename or
 has a child that is also contended.  Indeed, suppose that it is held by
 operation other than cross-directory rename.  Then the lock this operation
diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index 4bb2627026ec..3190bb18f1c2 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -92,7 +92,7 @@  ops		i_rwsem(inode)
 =============	=============================================
 lookup:		shared
 create:		exclusive
-link:		exclusive (both)
+link:		exclusive (both) possibly shared on source dir
 mknod:		exclusive
 symlink:	exclusive
 mkdir:		exclusive
@@ -117,7 +117,11 @@  fileattr_set:	exclusive
 
 	Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_rwsem
 	exclusive on victim.
+	->rename() has ->i_rwsem on target if it exists, and also on
+        source if it is a non-directory.
 	cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
+	->link() may have shared ->i_rwsem on source directory if it is
+	different from target directory.
 
 See Documentation/filesystems/directory-locking.rst for more detailed discussion
 of the locking scheme for directory operations.
diff --git a/fs/namei.c b/fs/namei.c
index 53b4bc094db2..877cac4e2e63 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -4518,6 +4518,39 @@  int vfs_link(struct dentry *old_dentry, struct user_namespace *mnt_userns,
 }
 EXPORT_SYMBOL(vfs_link);
 
+static void lock_link(struct dentry *dest, struct dentry *source, int flags)
+{
+	inode_lock_nested(dest->d_inode, I_MUTEX_PARENT);
+	if (dest == source || (flags & AT_EMPTY_PATH))
+		return;
+	if (inode_trylock_shared(source->d_inode))
+		return;
+
+	/* Need rename mutex */
+	inode_unlock(dest->d_inode);
+
+	mutex_lock(&dest->d_sb->s_vfs_rename_mutex);
+
+	if (d_ancestor(dest, source)) {
+		inode_lock_nested(dest->d_inode, I_MUTEX_PARENT);
+		inode_lock_shared_nested(source->d_inode, I_MUTEX_CHILD);
+	} else if (d_ancestor(source, dest)) {
+		inode_lock_shared_nested(source->d_inode, I_MUTEX_PARENT);
+		inode_lock_nested(dest->d_inode, I_MUTEX_CHILD);
+	} else {
+		inode_lock_nested(dest->d_inode, I_MUTEX_PARENT);
+		inode_lock_shared_nested(source->d_inode, I_MUTEX_PARENT2);
+	}
+	mutex_unlock(&dest->d_sb->s_vfs_rename_mutex);
+}
+
+static void unlock_link(struct dentry *dest, struct dentry *source, int flags)
+{
+	if (source != dest && !(flags & AT_EMPTY_PATH))
+		inode_unlock_shared(source->d_inode);
+	inode_unlock(dest->d_inode);
+}
+
 /*
  * Hardlinks are often used in delicate situations.  We avoid
  * security-related surprises by not following symlinks on the
@@ -4531,11 +4564,14 @@  int do_linkat(int olddfd, struct filename *old, int newdfd,
 	      struct filename *new, int flags)
 {
 	struct user_namespace *mnt_userns;
-	struct dentry *new_dentry;
-	struct path old_path, new_path;
+	struct dentry *old_dentry, *new_dentry;
+	struct path old_path, new_path, link_path;
+	struct qstr old_last, new_last;
+	int old_type, new_type;
 	struct inode *delegated_inode = NULL;
 	int how = 0;
 	int error;
+	int err2;
 
 	if ((flags & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH)) != 0) {
 		error = -EINVAL;
@@ -4554,44 +4590,83 @@  int do_linkat(int olddfd, struct filename *old, int newdfd,
 	if (flags & AT_SYMLINK_FOLLOW)
 		how |= LOOKUP_FOLLOW;
 retry:
-	error = filename_lookup(olddfd, old, how, &old_path, NULL);
+	err2 = 0;
+	error = filename_parentat(olddfd, old, how, &old_path,
+				  &old_last, &old_type);
 	if (error)
 		goto out_putnames;
+	error = -EISDIR;
+	if (old_type != LAST_NORM && !(flags & AT_EMPTY_PATH))
+		goto out_putnames;
+	error = filename_parentat(newdfd, new, (how & LOOKUP_REVAL), &new_path,
+				  &new_last, &new_type);
+	if (error)
+		goto out_putoldpath;
 
-	new_dentry = filename_create(newdfd, new, &new_path,
-					(how & LOOKUP_REVAL));
-	error = PTR_ERR(new_dentry);
-	if (IS_ERR(new_dentry))
-		goto out_putpath;
+	err2 = mnt_want_write(new_path.mnt);
 
 	error = -EXDEV;
 	if (old_path.mnt != new_path.mnt)
-		goto out_dput;
+		goto out_putnewpath;
+	lock_link(new_path.dentry, old_path.dentry, flags);
+
+	new_dentry = __lookup_hash(&new_last, new_path.dentry, how & LOOKUP_REVAL);
+	error = PTR_ERR(new_dentry);
+	if (IS_ERR(new_dentry))
+		goto out_unlock;
+	error = -EEXIST;
+	if (d_is_positive(new_dentry))
+		goto out_dput_new;
+	if (new_type != LAST_NORM)
+		goto out_dput_new;
+
+	error = err2;
+	if (error)
+		goto out_dput_new;
+
+	if (flags & AT_EMPTY_PATH)
+		old_dentry = dget(old_path.dentry);
+	else
+		old_dentry = __lookup_hash(&old_last, old_path.dentry, how);
+	error = PTR_ERR(old_dentry);
+	if (IS_ERR(old_dentry))
+		goto out_dput_new;
+	error = -ENOENT;
+	if (d_is_negative(old_dentry))
+		goto out_dput_old;
+
 	mnt_userns = mnt_user_ns(new_path.mnt);
-	error = may_linkat(mnt_userns, &old_path);
+	link_path.mnt = old_path.mnt;
+	link_path.dentry = old_dentry;
+	error = may_linkat(mnt_userns, &link_path);
 	if (unlikely(error))
-		goto out_dput;
-	error = security_path_link(old_path.dentry, &new_path, new_dentry);
+		goto out_dput_old;
+	error = security_path_link(old_dentry, &new_path, new_dentry);
 	if (error)
-		goto out_dput;
-	error = vfs_link(old_path.dentry, mnt_userns, new_path.dentry->d_inode,
+		goto out_dput_old;
+	error = vfs_link(old_dentry, mnt_userns, new_path.dentry->d_inode,
 			 new_dentry, &delegated_inode);
-out_dput:
-	done_path_create(&new_path, new_dentry);
+out_dput_old:
+	dput(old_dentry);
+out_dput_new:
+	dput(new_dentry);
+out_unlock:
+	unlock_link(new_path.dentry, old_path.dentry, flags);
+out_putnewpath:
+	if (!err2)
+		mnt_drop_write(new_path.mnt);
+	path_put(&new_path);
+out_putoldpath:
+	path_put(&old_path);
 	if (delegated_inode) {
 		error = break_deleg_wait(&delegated_inode);
-		if (!error) {
-			path_put(&old_path);
+		if (!error)
 			goto retry;
-		}
 	}
 	if (retry_estale(error, how)) {
-		path_put(&old_path);
 		how |= LOOKUP_REVAL;
 		goto retry;
 	}
-out_putpath:
-	path_put(&old_path);
 out_putnames:
 	putname(old);
 	putname(new);