From patchwork Fri Sep 16 06:13:31 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: NeilBrown X-Patchwork-Id: 12978157 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C7239ECAAD8 for ; Fri, 16 Sep 2022 06:13:42 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229876AbiIPGNl (ORCPT ); Fri, 16 Sep 2022 02:13:41 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:47818 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229503AbiIPGNl (ORCPT ); Fri, 16 Sep 2022 02:13:41 -0400 Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.220.28]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DC74FA1D5D for ; Thu, 15 Sep 2022 23:13:39 -0700 (PDT) Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out1.suse.de (Postfix) with ESMTPS id 8509A3389A; Fri, 16 Sep 2022 06:13:38 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_rsa; t=1663308818; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=d5GXe9vUqt+1UZ4q+/F3nx5Z5U5Uq7mDoX2V3REVxJU=; b=nNWUllJt2gSnfHpeI/T2oNXt2MpsPJWnTqae8cnwI+bP0+PmHe1d1ztKYpaM0Pcr0J7rvj HozF/HlmmG3gn/CCKKS8pYSAQj1LqL9HPDl3uyGtvQFmRgTEIOlqfSpehofj0XUD6PeVX6 Jy+ZjOjL9AcvaI4HgDs7dDxsY5ohMuY= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_ed25519; t=1663308818; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=d5GXe9vUqt+1UZ4q+/F3nx5Z5U5Uq7mDoX2V3REVxJU=; b=P7aQRxjqMxjHzFgTSDDnbfBnz2YOSYXHor6caiMK3/Jehmcj4HyGzcVascOsMIBJuSlph/ RN4fICsYTQEGrQCA== Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id C4AA11346B; Fri, 16 Sep 2022 06:13:36 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id w273HhAUJGO5EAAAMHmgww (envelope-from ); Fri, 16 Sep 2022 06:13:36 +0000 MIME-Version: 1.0 From: "NeilBrown" To: "Al Viro" Cc: "Miklos Szeredi" , "Xavier Roche" , linux-fsdevel@vger.kernel.org Subject: [PATCH RFC] VFS: lock source directory for link to avoid rename race. In-reply-to: References: <20220221082002.508392-1-mszeredi@redhat.com>, <166304411168.30452.12018495245762529070@noble.neil.brown.name>, , <166311315747.20483.5039023553379547679@noble.neil.brown.name>, Date: Fri, 16 Sep 2022 16:13:31 +1000 Message-id: <166330881189.15759.13499931397891560275@noble.neil.brown.name> Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org rename(2) is documented as If newpath already exists, it will be atomically replaced, so that there is no point at which another process attempting to access newpath will find it missing. However link(2) from a given path can race with rename renaming to that path so that link gets -ENOENT because the path has already been unlinked by rename, and creating a link to an unlinked file is not permitted. This can be fixed by locking the source directory before performing the lookup of the final component. The lock blocks rename from changing that component. We already lock the target directory, so when they are different we need to be careful about deadlocks. In the worst case we can use the same strategy as lock_rename() however the cost of s_vfs_rename_mutex is not always needed and is best avoided. Firstly we lock the target and if the source is different we try-lock that. This cannot deadlock as we never block while holding a lock. If the trylock fails, we drop the first lock, take s_vfs_rename_mutex, and follow the same pattern as rename_lock(). We only take a shared lock on the source directory. ->link() functions cannot expect the source directory to be locked as some callers of vfs_link() already have a dentry and do not perform the lookup that can race with rename. nfsd is a clear example - if there is a race it can happen on the client or between clients, and NFS as specified cannot avoid that. The handling of AT_EMPTY_PATH is a little inelegant. Reported-by: Xavier Roche Link: https://lore.kernel.org/all/20220214210708.GA2167841@xavier-xps/ Fixes: aae8a97d3ec3 ("fs: Don't allow to create hardlink for deleted file") Reported-by: Miklos Szeredi Signed-off-by: NeilBrown --- .../filesystems/directory-locking.rst | 25 +++- Documentation/filesystems/locking.rst | 6 +- fs/namei.c | 119 ++++++++++++++---- 3 files changed, 124 insertions(+), 26 deletions(-) diff --git a/Documentation/filesystems/directory-locking.rst b/Documentation/filesystems/directory-locking.rst index 504ba940c36c..da6fa5eff81d 100644 --- a/Documentation/filesystems/directory-locking.rst +++ b/Documentation/filesystems/directory-locking.rst @@ -11,7 +11,7 @@ When taking the i_rwsem on multiple non-directory objects, we always acquire the locks in order by increasing address. We'll call that "inode pointer" order in the following. -For our purposes all operations fall in 5 classes: +For our purposes all operations fall in 7 classes: 1) read access. Locking rules: caller locks directory we are accessing. The lock is taken shared. @@ -31,9 +31,9 @@ Then call the method. All locks are exclusive. NB: we might get away with locking the source (and target in exchange case) shared. -5) link creation. Locking rules: +5) link creation - source and target in name directory. Locking rules: - * lock parent + * lock parent before looking up base names * check that source is not a directory * lock source * call the method. @@ -58,6 +58,22 @@ rules: All ->i_rwsem are taken exclusive. Again, we might get away with locking the source (and target in exchange case) shared. +7) cross-directory link. This requires the source directory to be locked +so the source cannot be the target of a rename and so be unlinked before +the link happens (creating a link to an unlinked file is illegal). + +Same rules as cross-directory rename can be used (with different errors). +Locking the filesystem is expensive and often unnecessary so we have a +fast path that avoids it. Locking rules: + + * lock target parent + * trylock source parent. If this fails we unlock target parent + and fall back to full rename locking, then unlock filesystem once + directory locks are held. + * lookup base names + +Lock on source may be shared. + The rules above obviously guarantee that all directories that are going to be read, modified or removed by method will be locked by caller. @@ -101,6 +117,9 @@ non-directory objects are not included in the set of contended locks. Thus link creation can't be a part of deadlock - it can't be blocked on source and it means that it doesn't hold any locks. +The fast-path in link create cannot deadlock as it never blocks while +holding a lock. + Any contended object is either held by cross-directory rename or has a child that is also contended. Indeed, suppose that it is held by operation other than cross-directory rename. Then the lock this operation diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst index 4bb2627026ec..3190bb18f1c2 100644 --- a/Documentation/filesystems/locking.rst +++ b/Documentation/filesystems/locking.rst @@ -92,7 +92,7 @@ ops i_rwsem(inode) ============= ============================================= lookup: shared create: exclusive -link: exclusive (both) +link: exclusive (both) possibly shared on source dir mknod: exclusive symlink: exclusive mkdir: exclusive @@ -117,7 +117,11 @@ fileattr_set: exclusive Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_rwsem exclusive on victim. + ->rename() has ->i_rwsem on target if it exists, and also on + source if it is a non-directory. cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem. + ->link() may have shared ->i_rwsem on source directory if it is + different from target directory. See Documentation/filesystems/directory-locking.rst for more detailed discussion of the locking scheme for directory operations. diff --git a/fs/namei.c b/fs/namei.c index 53b4bc094db2..877cac4e2e63 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -4518,6 +4518,39 @@ int vfs_link(struct dentry *old_dentry, struct user_namespace *mnt_userns, } EXPORT_SYMBOL(vfs_link); +static void lock_link(struct dentry *dest, struct dentry *source, int flags) +{ + inode_lock_nested(dest->d_inode, I_MUTEX_PARENT); + if (dest == source || (flags & AT_EMPTY_PATH)) + return; + if (inode_trylock_shared(source->d_inode)) + return; + + /* Need rename mutex */ + inode_unlock(dest->d_inode); + + mutex_lock(&dest->d_sb->s_vfs_rename_mutex); + + if (d_ancestor(dest, source)) { + inode_lock_nested(dest->d_inode, I_MUTEX_PARENT); + inode_lock_shared_nested(source->d_inode, I_MUTEX_CHILD); + } else if (d_ancestor(source, dest)) { + inode_lock_shared_nested(source->d_inode, I_MUTEX_PARENT); + inode_lock_nested(dest->d_inode, I_MUTEX_CHILD); + } else { + inode_lock_nested(dest->d_inode, I_MUTEX_PARENT); + inode_lock_shared_nested(source->d_inode, I_MUTEX_PARENT2); + } + mutex_unlock(&dest->d_sb->s_vfs_rename_mutex); +} + +static void unlock_link(struct dentry *dest, struct dentry *source, int flags) +{ + if (source != dest && !(flags & AT_EMPTY_PATH)) + inode_unlock_shared(source->d_inode); + inode_unlock(dest->d_inode); +} + /* * Hardlinks are often used in delicate situations. We avoid * security-related surprises by not following symlinks on the @@ -4531,11 +4564,14 @@ int do_linkat(int olddfd, struct filename *old, int newdfd, struct filename *new, int flags) { struct user_namespace *mnt_userns; - struct dentry *new_dentry; - struct path old_path, new_path; + struct dentry *old_dentry, *new_dentry; + struct path old_path, new_path, link_path; + struct qstr old_last, new_last; + int old_type, new_type; struct inode *delegated_inode = NULL; int how = 0; int error; + int err2; if ((flags & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH)) != 0) { error = -EINVAL; @@ -4554,44 +4590,83 @@ int do_linkat(int olddfd, struct filename *old, int newdfd, if (flags & AT_SYMLINK_FOLLOW) how |= LOOKUP_FOLLOW; retry: - error = filename_lookup(olddfd, old, how, &old_path, NULL); + err2 = 0; + error = filename_parentat(olddfd, old, how, &old_path, + &old_last, &old_type); if (error) goto out_putnames; + error = -EISDIR; + if (old_type != LAST_NORM && !(flags & AT_EMPTY_PATH)) + goto out_putnames; + error = filename_parentat(newdfd, new, (how & LOOKUP_REVAL), &new_path, + &new_last, &new_type); + if (error) + goto out_putoldpath; - new_dentry = filename_create(newdfd, new, &new_path, - (how & LOOKUP_REVAL)); - error = PTR_ERR(new_dentry); - if (IS_ERR(new_dentry)) - goto out_putpath; + err2 = mnt_want_write(new_path.mnt); error = -EXDEV; if (old_path.mnt != new_path.mnt) - goto out_dput; + goto out_putnewpath; + lock_link(new_path.dentry, old_path.dentry, flags); + + new_dentry = __lookup_hash(&new_last, new_path.dentry, how & LOOKUP_REVAL); + error = PTR_ERR(new_dentry); + if (IS_ERR(new_dentry)) + goto out_unlock; + error = -EEXIST; + if (d_is_positive(new_dentry)) + goto out_dput_new; + if (new_type != LAST_NORM) + goto out_dput_new; + + error = err2; + if (error) + goto out_dput_new; + + if (flags & AT_EMPTY_PATH) + old_dentry = dget(old_path.dentry); + else + old_dentry = __lookup_hash(&old_last, old_path.dentry, how); + error = PTR_ERR(old_dentry); + if (IS_ERR(old_dentry)) + goto out_dput_new; + error = -ENOENT; + if (d_is_negative(old_dentry)) + goto out_dput_old; + mnt_userns = mnt_user_ns(new_path.mnt); - error = may_linkat(mnt_userns, &old_path); + link_path.mnt = old_path.mnt; + link_path.dentry = old_dentry; + error = may_linkat(mnt_userns, &link_path); if (unlikely(error)) - goto out_dput; - error = security_path_link(old_path.dentry, &new_path, new_dentry); + goto out_dput_old; + error = security_path_link(old_dentry, &new_path, new_dentry); if (error) - goto out_dput; - error = vfs_link(old_path.dentry, mnt_userns, new_path.dentry->d_inode, + goto out_dput_old; + error = vfs_link(old_dentry, mnt_userns, new_path.dentry->d_inode, new_dentry, &delegated_inode); -out_dput: - done_path_create(&new_path, new_dentry); +out_dput_old: + dput(old_dentry); +out_dput_new: + dput(new_dentry); +out_unlock: + unlock_link(new_path.dentry, old_path.dentry, flags); +out_putnewpath: + if (!err2) + mnt_drop_write(new_path.mnt); + path_put(&new_path); +out_putoldpath: + path_put(&old_path); if (delegated_inode) { error = break_deleg_wait(&delegated_inode); - if (!error) { - path_put(&old_path); + if (!error) goto retry; - } } if (retry_estale(error, how)) { - path_put(&old_path); how |= LOOKUP_REVAL; goto retry; } -out_putpath: - path_put(&old_path); out_putnames: putname(old); putname(new);