namei: implemented RENAME_NEWER flag for renameat2() conditional replace

RENAME_NEWER is a new userspace-visible flag for renameat2(), and
stands alongside existing flags such as RENAME_NOREPLACE,
RENAME_EXCHANGE, and RENAME_WHITEOUT.

RENAME_NEWER is a conditional variation on RENAME_NOREPLACE, and
indicates that if the target of the rename exists, the rename will
only succeed if the source file is newer than the target (i.e. source
mtime > target mtime).  Otherwise, the rename will fail with -EEXIST
instead of replacing the target.  When the target doesn't exist,
RENAME_NEWER does a plain rename like RENAME_NOREPLACE.
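
The decision described above can be modelled in plain userspace C.
This is only an illustrative sketch, not the kernel code: the helper
names mtime_compare() and rename_newer_outcome() are hypothetical, and
the model ignores everything except target existence and the two
mtimes.

```c
#include <errno.h>
#include <stdbool.h>
#include <time.h>

/*
 * Userspace model only. Returns <0, 0, or >0 when a is older than,
 * equal to, or newer than b (seconds first, then nanoseconds).
 */
static int mtime_compare(const struct timespec *a, const struct timespec *b)
{
	if (a->tv_sec != b->tv_sec)
		return a->tv_sec < b->tv_sec ? -1 : 1;
	if (a->tv_nsec != b->tv_nsec)
		return a->tv_nsec < b->tv_nsec ? -1 : 1;
	return 0;
}

/*
 * Models the RENAME_NEWER outcome: 0 means the rename proceeds,
 * -EEXIST means the existing target is kept.
 */
static int rename_newer_outcome(bool target_exists,
				const struct timespec *src_mtime,
				const struct timespec *dst_mtime)
{
	if (!target_exists)
		return 0;		/* plain rename, nothing to replace */
	if (mtime_compare(src_mtime, dst_mtime) <= 0)
		return -EEXIST;		/* source is not strictly newer */
	return 0;			/* source is newer: replace target */
}
```

Note that equal mtimes count as "not newer", so the target is kept in
that case as well.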

RENAME_NEWER is very useful in distributed systems that mirror a
directory structure, or use a directory as a key/value store, and need
to guarantee that files will only be overwritten by newer files, and
that all updates are atomic.

While this patch may appear large at first glance, most of the changes
deal with renameat2() flags validation, and the core logic is only
5 lines in the do_renameat2() function in fs/namei.c:

	if ((flags & RENAME_NEWER)
	    && d_is_positive(new_dentry)
	    && timespec64_compare(&d_backing_inode(old_dentry)->i_mtime,
				  &d_backing_inode(new_dentry)->i_mtime) <= 0)
		goto exit5;

It is notable that a new atomic file operation can be implemented in
just 5 lines of code.  This is thanks to the existing locking
infrastructure around file rename/move, which makes such operations
almost trivial.  Unfortunately, every fs must explicitly accept a new
renameat2() flag, which bloats the patch a bit.

One question to ask is whether this functionality could be implemented
in userspace without adding a new renameat2() flag.  I think you
could attempt it with iterative RENAME_EXCHANGE, but it's hackish,
inefficient, and not atomic, because races could cause temporary
mtime backtracks.  How about using file locking?  Probably not,
because the problem we want to solve is maintaining file/directory
atomicity for readers by creating files out-of-directory, setting
their mtime, and atomically moving them into place.  Locking such an
operation would require more complex locking methods than are
generally exposed to userspace.  And if you are using inotify
on the directory to notify readers of changes, it certainly makes
sense to reduce unnecessary churn by preventing a move operation
based on the mtime check.
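
To make the atomicity problem concrete, here is a minimal sketch of
the naive userspace approach (stat both files, then rename).  It is
not part of the patch; create_with_mtime() and rename_if_newer_racy()
are hypothetical helper names.  The comment marks the window in which
a concurrent writer can defeat the check.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Create (or truncate) a file and set its atime/mtime to 'sec'. */
static int create_with_mtime(const char *path, time_t sec)
{
	struct timespec ts[2] = { { sec, 0 }, { sec, 0 } };
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;
	if (futimens(fd, ts) != 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}

/*
 * NON-ATOMIC approximation of RENAME_NEWER.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int rename_if_newer_racy(const char *src, const char *dst)
{
	struct stat s, d;

	if (stat(src, &s) != 0)
		return -1;
	if (stat(dst, &d) == 0) {
		int newer = s.st_mtim.tv_sec > d.st_mtim.tv_sec ||
			    (s.st_mtim.tv_sec == d.st_mtim.tv_sec &&
			     s.st_mtim.tv_nsec > d.st_mtim.tv_nsec);
		if (!newer) {
			errno = EEXIST;
			return -1;
		}
	} else if (errno != ENOENT) {
		return -1;
	}
	/*
	 * RACE WINDOW: another writer may install an even newer dst
	 * here, and the rename below would then move mtime backwards.
	 */
	return rename(src, dst);
}
```

Doing the comparison inside the kernel, under the locks that rename
already holds, is what closes this window.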

While some people might question the utility of adding features to
filesystems to make them more like databases, there is real value
in the performance, atomicity, consistent VFS interface, multi-thread
safety, and async-notify capabilities of modern filesystems.  These
start to blur the line, and actually make filesystem-based key-value
stores a win for many applications.

Like RENAME_NOREPLACE, the RENAME_NEWER implementation lives in
the VFS.  However, the individual fs implementations do strict flags
checking and will return -EINVAL for any flag they don't recognize.
For this reason, my general approach with flags is to accept
RENAME_NEWER wherever RENAME_NOREPLACE is also accepted, since
RENAME_NEWER is simply a conditional variant of RENAME_NOREPLACE.
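
As a rough illustration of the per-fs change, the strict flag check
can be modelled as below.  This is a userspace sketch, not code taken
from the patch: the function names are hypothetical, and the value
used for RENAME_NEWER is an assumption (simply the next free bit after
the existing flags).

```c
#include <errno.h>

/* Existing renameat2() flag values from include/uapi/linux/fs.h. */
#define RENAME_NOREPLACE	(1 << 0)
#define RENAME_EXCHANGE		(1 << 1)
#define RENAME_WHITEOUT		(1 << 2)
/* ASSUMPTION: illustrative value only, not confirmed by the text. */
#define RENAME_NEWER		(1 << 3)

/* Typical strict check in a fs that only knows RENAME_NOREPLACE. */
static int fs_check_flags_before(unsigned int flags)
{
	if (flags & ~RENAME_NOREPLACE)
		return -EINVAL;
	return 0;
}

/* Same check, extended to accept RENAME_NEWER as well. */
static int fs_check_flags_after(unsigned int flags)
{
	if (flags & ~(RENAME_NOREPLACE | RENAME_NEWER))
		return -EINVAL;
	return 0;
}
```

A one-line mask change of this kind per filesystem would be consistent
with the many two-line entries in the diffstat.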

I noticed only one fs driver (cifs) that treats RENAME_NOREPLACE
in a non-generic way: because no-replace is the natural behavior
for CIFS, it treats RENAME_NOREPLACE as a hint that no replacement
can occur.  Aside from this special case, it seems
safe to assume that any fs that supports RENAME_NOREPLACE should
also be able to support RENAME_NEWER out of the box.

I did not notice a general self-test for renameat2() at the VFS
layer (outside of fs-specific tests), so I created one, though
at the moment it only exercises RENAME_NEWER.  Build and run with:

  make -C tools/testing/selftests TARGETS=renameat2 run_tests

Signed-off-by: James Yonan <james@openvpn.net>
---
 Documentation/filesystems/vfs.rst             |   5 +
 fs/affs/namei.c                               |   2 +-
 fs/bfs/dir.c                                  |   2 +-
 fs/btrfs/inode.c                              |   2 +-
 fs/cifs/inode.c                               |   2 +-
 fs/exfat/namei.c                              |   7 +-
 fs/ext2/namei.c                               |   2 +-
 fs/ext4/namei.c                               |   2 +-
 fs/f2fs/namei.c                               |   2 +-
 fs/fat/namei_msdos.c                          |   2 +-
 fs/fat/namei_vfat.c                           |   2 +-
 fs/fuse/dir.c                                 |   2 +-
 fs/gfs2/inode.c                               |   2 +-
 fs/hfs/dir.c                                  |   2 +-
 fs/hfsplus/dir.c                              |   2 +-
 fs/hostfs/hostfs_kern.c                       |   2 +-
 fs/hpfs/namei.c                               |   2 +-
 fs/jffs2/dir.c                                |   2 +-
 fs/jfs/namei.c                                |   2 +-
 fs/libfs.c                                    |   2 +-
 fs/minix/namei.c                              |   2 +-
 fs/namei.c                                    |  10 +-
 fs/nilfs2/namei.c                             |   2 +-
 fs/ntfs3/namei.c                              |   2 +-
 fs/omfs/dir.c                                 |   2 +-
 fs/overlayfs/dir.c                            |   4 +-
 fs/reiserfs/namei.c                           |   2 +-
 fs/sysv/namei.c                               |   2 +-
 fs/ubifs/dir.c                                |   2 +-
 fs/udf/namei.c                                |   2 +-
 fs/ufs/namei.c                                |   2 +-
 fs/xfs/xfs_iops.c                             |   2 +-
 include/uapi/linux/fs.h                       |   1 +
 mm/shmem.c                                    |   2 +-
 tools/include/uapi/linux/fs.h                 |   1 +
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/renameat2/.gitignore  |   1 +
 tools/testing/selftests/renameat2/Makefile    |  10 ++
 .../selftests/renameat2/renameat2_tests.c     | 142 ++++++++++++++++++
 39 files changed, 204 insertions(+), 36 deletions(-)
 create mode 100644 tools/testing/selftests/renameat2/.gitignore
 create mode 100644 tools/testing/selftests/renameat2/Makefile
 create mode 100644 tools/testing/selftests/renameat2/renameat2_tests.c

Message-Id: <20220627221107.176495-1-james@openvpn.net>
From: James Yonan <james@openvpn.net>
To: linux-fsdevel@vger.kernel.org
Subject: [PATCH] namei: implemented RENAME_NEWER flag for renameat2() conditional replace
Date: Mon, 27 Jun 2022 16:11:07 -0600
