diff mbox series

[v2,RESEND] namei: clear nd->root.mnt before O_CREAT unlazy

Message ID 20220923140334.514276-1-bfoster@redhat.com (mailing list archive)
State New, archived

Commit Message

Brian Foster Sept. 23, 2022, 2:03 p.m. UTC
The unlazy sequence of an rcuwalk lookup occurs a bit earlier than
normal for O_CREAT lookups (i.e. in open_last_lookups()). The create
logic here historically invoked complete_walk(), which clears the
nd->root.mnt pointer when appropriate before the unlazy.  This
changed in commit 72287417abd1 ("open_last_lookups(): don't abuse
complete_walk() when all we want is unlazy"), which refactored the
create path to invoke unlazy_walk() and not consider nd->root.mnt.

This change hurts performance on a concurrent open(O_CREAT) workload
spread across multiple independent mounts beneath the root
directory. The slowdown is attributable to increased spinlock
contention on the root dentry via legitimize_root(), to the point
where that spinlock becomes the primary bottleneck rather than the
directory inode rwsem of the individual submounts. For example, the
completion rate of a 32k thread aim7 create/close benchmark that
repeatedly passes O_CREAT to open preexisting files drops from over
700k "jobs per minute" to 30, increasing the overall test time from
a few minutes to over an hour.

A similar but simpler test that runs a set of opener tasks across a
set of submounts demonstrates the problem more quickly.
For example, consider sets of 100 open/close tasks each running
against 64 independent filesystem mounts (i.e. 6400 tasks total),
with each task completing 10k iterations before it exits. On an
80xcpu box running v5.16.0-rc2, this test completes in 50-55s. With
this patch applied, the same test completes in 10-15s.
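
For reference, the simplified test described above can be approximated
with something like the following. This is only a sketch and not the
program used for the numbers quoted here: the function names, the
choice of fork() over threads, and the file mode are all assumptions
made for illustration. The caller is expected to pass absolute paths to
files that already exist, one per submount.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: open an existing file with O_CREAT and close it
 * again, iters times. Returns 0 on success, -1 if any open() fails.
 */
static int open_close_loop(const char *path, int iters)
{
	assert(path != NULL);

	for (int i = 0; i < iters; i++) {
		int fd = open(path, O_CREAT | O_RDWR, 0644);

		if (fd < 0)
			return -1;
		close(fd);
	}
	return 0;
}

/*
 * Hypothetical driver: fork tasks_per_path children for each path and
 * wait for all of them. Returns 0 if every child succeeded, else -1.
 */
static int run_openers(const char *const *paths, int npaths,
		       int tasks_per_path, int iters)
{
	int failed = 0, spawned = 0;

	for (int p = 0; p < npaths; p++) {
		for (int t = 0; t < tasks_per_path; t++) {
			pid_t pid = fork();

			if (pid < 0) {
				failed = 1;
				continue;
			}
			if (pid == 0)
				_exit(open_close_loop(paths[p], iters) ? 1 : 0);
			spawned++;
		}
	}

	/* Reap every child that was actually started. */
	while (spawned-- > 0) {
		int status;

		if (wait(&status) < 0 || !WIFEXITED(status) ||
		    WEXITSTATUS(status))
			failed = 1;
	}
	return failed ? -1 : 0;
}
```

With 64 paths, 100 tasks per path and 10000 iterations, a call such as
run_openers(paths, 64, 100, 10000) would match the task counts given
above, assuming each path lives on its own mount.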

This is not the most realistic workload in the world as it factors
out inode allocation in the filesystem. The contention can also be
avoided by more selective use of O_CREAT or via use of relative
pathnames. That said, this regression appears to be an unintentional
side effect of code cleanup and might be unexpected for users.
Restore original behavior prior to commit 72287417abd1 by factoring
the nd->root handling logic from complete_walk() into a new helper
and invoke that from both places.
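
As an aside, the relative pathname workaround mentioned above could
look roughly like this from user space. It is a sketch under the
assumption that the application can keep a directory file descriptor
open per submount; the helper names are made up for illustration, and
whether this sidesteps the root dentry contention in a given setup
rests on the reasoning in this commit message rather than on anything
the code itself demonstrates.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open a directory once so it can be reused. */
static int open_dir(const char *dir)
{
	assert(dir != NULL);
	return open(dir, O_RDONLY | O_DIRECTORY);
}

/*
 * Hypothetical helper: open (creating if necessary) a file by a name
 * relative to dirfd, so the lookup does not start from an absolute
 * path. Returns the new file descriptor or -1 on error.
 */
static int open_create_relative(int dirfd, const char *name)
{
	assert(name != NULL && name[0] != '/');
	return openat(dirfd, name, O_CREAT | O_RDWR, 0644);
}
```

A task would call open_dir() once for its submount and then use
open_create_relative() in its open/close loop in place of open() with
an absolute path.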

Note that the LOOKUP_CACHED logic is not required here because it is
incompatible with O_CREAT. Otherwise the tradeoff for this change is
that this may impact behavior when an absolute path O_CREAT lookup
lands on a symlink that contains another absolute path. The unlazy
sequence of the create lookup now clears the nd->root mount pointer,
which means that once we read said link via step_into(), the
subsequent nd_jump_root() calls into set_root() to grab the mount
pointer again (from refwalk mode). This is historical behavior for
O_CREAT and less common than the current behavior of a typical
create lookup unnecessarily legitimizing the root dentry.
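
To make the tradeoff easier to follow, here is a toy user-space model
of the sequence described above. It is not kernel code: every model_*
name, the flag values, and the call counter are invented for
illustration, and the condition under which the root is legitimized is
my reading of legitimize_root() rather than something this patch
changes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values only; the real kernel definitions differ. */
#define MODEL_LOOKUP_RCU	0x1u
#define MODEL_LOOKUP_IS_SCOPED	0x2u
#define MODEL_ND_ROOT_PRESET	0x1u

struct model_mount { int id; };

struct model_nameidata {
	unsigned int flags;		/* stands in for nd->flags */
	unsigned int state;		/* stands in for nd->state */
	struct model_mount *root_mnt;	/* stands in for nd->root.mnt */
};

/* Counts how often the model would have to legitimize the root. */
static unsigned long model_legitimize_calls;

/*
 * Stand-in for try_to_unlazy(): the root only needs legitimizing when
 * the pointer is set and not externally managed (assumed behavior).
 */
static bool model_try_to_unlazy(struct model_nameidata *nd)
{
	if (nd->root_mnt && !(nd->state & MODEL_ND_ROOT_PRESET))
		model_legitimize_calls++;
	nd->flags &= ~MODEL_LOOKUP_RCU;
	return true;
}

/* Mirrors the shape of the helper added by this patch. */
static bool model_reset_root_and_unlazy(struct model_nameidata *nd)
{
	assert(nd != NULL);

	if (!(nd->state & MODEL_ND_ROOT_PRESET) &&
	    !(nd->flags & MODEL_LOOKUP_IS_SCOPED))
		nd->root_mnt = NULL;
	return model_try_to_unlazy(nd);
}

/*
 * Stand-in for the nd_jump_root() path: fetch the root again only if
 * it is not currently set, as happens for an absolute symlink target.
 */
static void model_jump_root(struct model_nameidata *nd,
			    struct model_mount *current_root)
{
	if (!nd->root_mnt)
		nd->root_mnt = current_root;
}
```

In this model an ordinary lookup leaves model_legitimize_calls
untouched and has the root pointer cleared, and only a later
model_jump_root() call brings it back. Scoped and preset lookups keep
their root pointer.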

Signed-off-by: Brian Foster <bfoster@redhat.com>
---

Al,

It looks like this one fell through the cracks from the last time I
posted it [1]. IIRC, the change in v2 was to try and address your
concern around the factoring and unclear helper naming of the original
v1 [2]. Any chance of getting this version pulled? Thanks.

Brian

[1] https://lore.kernel.org/linux-fsdevel/20220112142459.544276-1-bfoster@redhat.com/
[2] https://lore.kernel.org/linux-fsdevel/20220105180259.115760-1-bfoster@redhat.com/

 fs/namei.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

Comments

Al Viro Oct. 2, 2022, 12:06 a.m. UTC | #1
On Fri, Sep 23, 2022 at 10:03:34AM -0400, Brian Foster wrote:

> incompatible with O_CREAT. Otherwise the tradeoff for this change is
> that this may impact behavior when an absolute path O_CREAT lookup
> lands on a symlink that contains another absolute path. The unlazy
> sequence of the create lookup now clears the nd->root mount pointer,
> which means that once we read said link via step_into(), the
> subsequent nd_jump_root() calls into set_root() to grab the mount
> pointer again (from refwalk mode). This is historical behavior for
> O_CREAT and less common than the current behavior of a typical
> create lookup unnecessarily legitimizing the root dentry.

I'm not worried about the overhead of retrieving the root again;
using the different values for beginning and the end of pathwalk,
OTOH...

It's probably OK, but it makes analysis harder.  Do we have any real-world
testcases where the contention would be observable?

Brian Foster Oct. 4, 2022, 5:09 p.m. UTC | #2
On Sun, Oct 02, 2022 at 01:06:22AM +0100, Al Viro wrote:
> On Fri, Sep 23, 2022 at 10:03:34AM -0400, Brian Foster wrote:
> 
> > incompatible with O_CREAT. Otherwise the tradeoff for this change is
> > that this may impact behavior when an absolute path O_CREAT lookup
> > lands on a symlink that contains another absolute path. The unlazy
> > sequence of the create lookup now clears the nd->root mount pointer,
> > which means that once we read said link via step_into(), the
> > subsequent nd_jump_root() calls into set_root() to grab the mount
> > pointer again (from refwalk mode). This is historical behavior for
> > O_CREAT and less common than the current behavior of a typical
> > create lookup unnecessarily legitimizing the root dentry.
> 
> I'm not worried about the overhead of retrieving the root again;
> using the different values for beginning and the end of pathwalk,
> OTOH...
> 
> It's probably OK, but it makes analysis harder.  Do we have any real-world
> testcases where the contention would be observable?
> 

The reproducer was an old aim7 benchmark doing open(O_CREAT) and
close() calls. The only way I was able to reproduce it at the time was
to scale out open(O_CREAT) calls on preexisting files across many
different submounts, which ended up being limited by the root dentry of
the rootfs. If I try to run a sustained file allocation workload in a
similar environment, then the underlying filesystems tend to bottleneck
before this particular dentry lock and it's not really noticeable from
what I can see (though I don't think my storage is as fast as the
original reporter's).

My thought process for this patch was not so much that the workload was
critical, but rather that the regression seemed an unintentional side
effect of refactoring and easy enough to avoid.

Brian

Patch

diff --git a/fs/namei.c b/fs/namei.c
index 53b4bc094db2..083b8b6bc566 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -858,6 +858,18 @@  static inline int d_revalidate(struct dentry *dentry, unsigned int flags)
 		return 1;
 }
 
+static inline bool nd_reset_root_and_unlazy(struct nameidata *nd)
+{
+	/*
+	 * We don't want to zero nd->root for scoped-lookups or
+	 * externally-managed nd->root.
+	 */
+	if (!(nd->state & ND_ROOT_PRESET))
+		if (!(nd->flags & LOOKUP_IS_SCOPED))
+			nd->root.mnt = NULL;
+	return try_to_unlazy(nd);
+}
+
 /**
  * complete_walk - successful completion of path walk
  * @nd:  pointer nameidata
@@ -874,15 +886,8 @@  static int complete_walk(struct nameidata *nd)
 	int status;
 
 	if (nd->flags & LOOKUP_RCU) {
-		/*
-		 * We don't want to zero nd->root for scoped-lookups or
-		 * externally-managed nd->root.
-		 */
-		if (!(nd->state & ND_ROOT_PRESET))
-			if (!(nd->flags & LOOKUP_IS_SCOPED))
-				nd->root.mnt = NULL;
 		nd->flags &= ~LOOKUP_CACHED;
-		if (!try_to_unlazy(nd))
+		if (!nd_reset_root_and_unlazy(nd))
 			return -ECHILD;
 	}
 
@@ -3457,7 +3462,7 @@  static const char *open_last_lookups(struct nameidata *nd,
 	} else {
 		/* create side of things */
 		if (nd->flags & LOOKUP_RCU) {
-			if (!try_to_unlazy(nd))
+			if (!nd_reset_root_and_unlazy(nd))
 				return ERR_PTR(-ECHILD);
 		}
 		audit_inode(nd->name, dir, AUDIT_INODE_PARENT);