diff mbox series

[2/2] fs: expose LOOKUP_NONBLOCK through openat2() RESOLVE_NONBLOCK

Message ID 20201210200114.525026-3-axboe@kernel.dk (mailing list archive)
State New, archived
Headers show
Series fs: Support for LOOKUP_NONBLOCK / RESOLVE_NONBLOCK | expand

Commit Message

Jens Axboe Dec. 10, 2020, 8:01 p.m. UTC
Now that we support non-blocking path resolution internally, expose it
via openat2() in the struct open_how ->resolve flags. This allows
applications using openat2() to limit path resolution to the extent that
it is already cached.

If the lookup cannot be satisfied in a non-blocking manner, openat2(2)
will return -1/-EAGAIN.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/open.c                    | 2 ++
 include/linux/fcntl.h        | 2 +-
 include/uapi/linux/openat2.h | 2 ++
 3 files changed, 5 insertions(+), 1 deletion(-)

Comments

Dave Chinner Dec. 10, 2020, 10:29 p.m. UTC | #1
On Thu, Dec 10, 2020 at 01:01:14PM -0700, Jens Axboe wrote:
> Now that we support non-blocking path resolution internally, expose it
> via openat2() in the struct open_how ->resolve flags. This allows
> applications using openat2() to limit path resolution to the extent that
> it is already cached.
> 
> If the lookup cannot be satisfied in a non-blocking manner, openat2(2)
> will return -1/-EAGAIN.
> 
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  fs/open.c                    | 2 ++
>  include/linux/fcntl.h        | 2 +-
>  include/uapi/linux/openat2.h | 2 ++
>  3 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/open.c b/fs/open.c
> index 9af548fb841b..07dc9f3d1628 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -1087,6 +1087,8 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
>  		lookup_flags |= LOOKUP_BENEATH;
>  	if (how->resolve & RESOLVE_IN_ROOT)
>  		lookup_flags |= LOOKUP_IN_ROOT;
> +	if (how->resolve & RESOLVE_NONBLOCK)
> +		lookup_flags |= LOOKUP_NONBLOCK;
>  
>  	op->lookup_flags = lookup_flags;
>  	return 0;
> diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
> index 921e750843e6..919a13c9317c 100644
> --- a/include/linux/fcntl.h
> +++ b/include/linux/fcntl.h
> @@ -19,7 +19,7 @@
>  /* List of all valid flags for the how->resolve argument: */
>  #define VALID_RESOLVE_FLAGS \
>  	(RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS | \
> -	 RESOLVE_BENEATH | RESOLVE_IN_ROOT)
> +	 RESOLVE_BENEATH | RESOLVE_IN_ROOT | RESOLVE_NONBLOCK)
>  
>  /* List of all open_how "versions". */
>  #define OPEN_HOW_SIZE_VER0	24 /* sizeof first published struct */
> diff --git a/include/uapi/linux/openat2.h b/include/uapi/linux/openat2.h
> index 58b1eb711360..ddbf0796841a 100644
> --- a/include/uapi/linux/openat2.h
> +++ b/include/uapi/linux/openat2.h
> @@ -35,5 +35,7 @@ struct open_how {
>  #define RESOLVE_IN_ROOT		0x10 /* Make all jumps to "/" and ".."
>  					be scoped inside the dirfd
>  					(similar to chroot(2)). */
> +#define RESOLVE_NONBLOCK	0x20 /* Only complete if resolution can be
> +					done without IO */

I don't think this describes the implementation correctly - it has
nothing to actually do with whether IO is needed, just whether the
lookup can be done without taking blocking locks. The slow path can
complete without doing IO - it might miss the dentry cache but hit
the filesystem buffer cache on lookup and the inode cache when
retrieving the inode. And it may not even block anywhere doing this.

So, really, this isn't avoiding IO at all - it's avoiding the
possibility of running a lookup path that might blocking on
something.

This also needs a openat2(2) man page update explaining exactly what
behaviour/semantics this flag provides and that userspace can rely
on when this flag is set...

We've been failing to define the behaviour of our interfaces clearly,
especially around non-blocking IO behaviour in recent times. We need
to fix that, not make matters worse by adding new, poorly defined
non-blocking behaviours...

I'd also like to know how we actually test this is working- a
reliable regression test for fstests would be very useful for
ensuring that the behaviour as defined by the man page is not broken
accidentally by future changes...

Cheers,

Dave.
Jens Axboe Dec. 10, 2020, 11:12 p.m. UTC | #2
On 12/10/20 3:29 PM, Dave Chinner wrote:
> On Thu, Dec 10, 2020 at 01:01:14PM -0700, Jens Axboe wrote:
>> Now that we support non-blocking path resolution internally, expose it
>> via openat2() in the struct open_how ->resolve flags. This allows
>> applications using openat2() to limit path resolution to the extent that
>> it is already cached.
>>
>> If the lookup cannot be satisfied in a non-blocking manner, openat2(2)
>> will return -1/-EAGAIN.
>>
>> Cc: Al Viro <viro@zeniv.linux.org.uk>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  fs/open.c                    | 2 ++
>>  include/linux/fcntl.h        | 2 +-
>>  include/uapi/linux/openat2.h | 2 ++
>>  3 files changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/open.c b/fs/open.c
>> index 9af548fb841b..07dc9f3d1628 100644
>> --- a/fs/open.c
>> +++ b/fs/open.c
>> @@ -1087,6 +1087,8 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
>>  		lookup_flags |= LOOKUP_BENEATH;
>>  	if (how->resolve & RESOLVE_IN_ROOT)
>>  		lookup_flags |= LOOKUP_IN_ROOT;
>> +	if (how->resolve & RESOLVE_NONBLOCK)
>> +		lookup_flags |= LOOKUP_NONBLOCK;
>>  
>>  	op->lookup_flags = lookup_flags;
>>  	return 0;
>> diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
>> index 921e750843e6..919a13c9317c 100644
>> --- a/include/linux/fcntl.h
>> +++ b/include/linux/fcntl.h
>> @@ -19,7 +19,7 @@
>>  /* List of all valid flags for the how->resolve argument: */
>>  #define VALID_RESOLVE_FLAGS \
>>  	(RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS | \
>> -	 RESOLVE_BENEATH | RESOLVE_IN_ROOT)
>> +	 RESOLVE_BENEATH | RESOLVE_IN_ROOT | RESOLVE_NONBLOCK)
>>  
>>  /* List of all open_how "versions". */
>>  #define OPEN_HOW_SIZE_VER0	24 /* sizeof first published struct */
>> diff --git a/include/uapi/linux/openat2.h b/include/uapi/linux/openat2.h
>> index 58b1eb711360..ddbf0796841a 100644
>> --- a/include/uapi/linux/openat2.h
>> +++ b/include/uapi/linux/openat2.h
>> @@ -35,5 +35,7 @@ struct open_how {
>>  #define RESOLVE_IN_ROOT		0x10 /* Make all jumps to "/" and ".."
>>  					be scoped inside the dirfd
>>  					(similar to chroot(2)). */
>> +#define RESOLVE_NONBLOCK	0x20 /* Only complete if resolution can be
>> +					done without IO */
> 
> I don't think this describes the implementation correctly - it has
> nothing to actually do with whether IO is needed, just whether the
> lookup can be done without taking blocking locks. The slow path can
> complete without doing IO - it might miss the dentry cache but hit
> the filesystem buffer cache on lookup and the inode cache when
> retrieving the inode. And it may not even block anywhere doing this.
> 
> So, really, this isn't avoiding IO at all - it's avoiding the
> possibility of running a lookup path that might blocking on
> something.

Right, it's about not blocking, as the commit message says. That could
be IO, either directly or indirectly, or it could be just locking. I'll
update this comment.

> This also needs a openat2(2) man page update explaining exactly what
> behaviour/semantics this flag provides and that userspace can rely
> on when this flag is set...

Agree, I'll add that.

> We've been failing to define the behaviour of our interfaces clearly,
> especially around non-blocking IO behaviour in recent times. We need
> to fix that, not make matters worse by adding new, poorly defined
> non-blocking behaviours...

Also agree on that! It doesn't help that different folks have different
criteria of what nowait/nonblock means...

> I'd also like to know how we actually test this is working- a
> reliable regression test for fstests would be very useful for
> ensuring that the behaviour as defined by the man page is not broken
> accidentally by future changes...

Definitely. On the io_uring side, I generally run with a debug patch
that triggers if anything blocks off the submission path. That won't
really work for this one, however.

FWIW, I'm quite fine deferring this patch, I obviously care more about
the io_uring side of things. It seems like a no-brainer to support for
openat2 though, as it would allow applications to decide when to punt
open to another thread. Since this one ties in very closely with
LOOKUP_RCU, I'm not _too_ worried about this one in particular. But it
would be great to have something that we could use with eg RWF_NOWAIT as
well, and similar constructs. I'll give it some thought.
Linus Torvalds Dec. 10, 2020, 11:29 p.m. UTC | #3
On Thu, Dec 10, 2020 at 2:29 PM Dave Chinner <david@fromorbit.com> wrote:
>
> So, really, this isn't avoiding IO at all - it's avoiding the
> possibility of running a lookup path that might blocking on
> something.

For pathname lookup, the only case that matters is the RCU lockless lookup.

That cache hits basically 100% of the time except for the first
lookup, or under memory pressure.

And honestly, from a performance perspective, it's the lockless path
lookup that matters most. By the time you have to go to the
filesystem, take the directory locks etc, you've already lost.

So we're never going to bother with some kind of "lockless lookup for
actual filesystems", because it's only extra work for no actual gain.

End result: LOOKUP_NONBLOCK is about not just avoiding IO, but about
avoiding the filesystem and the inevitable locking that causes.

              Linus
Dave Chinner Dec. 11, 2020, 12:58 a.m. UTC | #4
On Thu, Dec 10, 2020 at 03:29:23PM -0800, Linus Torvalds wrote:
> On Thu, Dec 10, 2020 at 2:29 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > So, really, this isn't avoiding IO at all - it's avoiding the
> > possibility of running a lookup path that might blocking on
> > something.
> 
> For pathname lookup, the only case that matters is the RCU lockless lookup.
> 
> That cache hits basically 100% of the time except for the first
> lookup, or under memory pressure.
> 
> And honestly, from a performance perspective, it's the lockless path
> lookup that matters most. By the time you have to go to the
> filesystem, take the directory locks etc, you've already lost.
> 
> So we're never going to bother with some kind of "lockless lookup for
> actual filesystems", because it's only extra work for no actual gain.
> 
> End result: LOOKUP_NONBLOCK is about not just avoiding IO, but about
> avoiding the filesystem and the inevitable locking that causes.

Umm, yes, that is _exactly_ what I just said. :/

Cheers,

Dave.
Linus Torvalds Dec. 11, 2020, 1:01 a.m. UTC | #5
On Thu, Dec 10, 2020 at 4:58 PM Dave Chinner <david@fromorbit.com> wrote:
>
> Umm, yes, that is _exactly_ what I just said. :/

.,. but it _sounded_ like you would actually want to do the whole
filesystem thing, since why would you have piped up otherwise. I just
wanted to clarify that the onle sane model is the one that patch
actually implements.

Otherwise, your email was just nit-picking about a single word in a
comment in a header file.

Was that really what you wanted to do?

            Linus
Dave Chinner Dec. 11, 2020, 3:45 a.m. UTC | #6
On Thu, Dec 10, 2020 at 05:01:44PM -0800, Linus Torvalds wrote:
> On Thu, Dec 10, 2020 at 4:58 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > Umm, yes, that is _exactly_ what I just said. :/
> 
> .,. but it _sounded_ like you would actually want to do the whole
> filesystem thing, since why would you have piped up otherwise. I just
> wanted to clarify that the onle sane model is the one that patch
> actually implements.

<sigh>

Is that really what you think motivates me, Linus? It sounds like
you've created an Evil Dave strawman and you simply cannot see past
the taint Evil Dave has created in your head. :/

I commented because Jens has recently found several issues with
inconsistencies in "non-blocking" APIs that we have in the IO path.
He's triggered bugs in the non-blocking behaviour in filesystem code
through io_uring that we've had to fix (e.g. see commit 883a790a8440
("xfs: don't allow NOWAIT DIO across extent boundaries"). Then there
are the active discussions about the limited ability to use
IOCB_NOWAIT for full stack non-blocking IO behaviour w/ REQ_NOWAIT
in the block layer because the semantics of IOCB_NOWAIT are directly
tied to the requirements of the RWF_NOWAIT preadv2/pwritev2 flag and
using REQ_NOWAIT in the block layer will break them.

Part of the problem we have with the non-blocking behaviour is that
the user interfaces have been under specified, poorly reviewed and
targetted a single specific use case on a single filesystem rather
than generic behaviour. And mostly they lack the necessary test
coverage to ensure all filesystems behave the same way and to inform
us of a regression that *would break userspace applications*.

Yes, I recognise and accept that some of the problems are partially
my fault. I also have a habit of trying to learn from the mistakes
I've made and then take steps to ensure that *we do not make those
same mistakes again*.

> Otherwise, your email was just nit-picking about a single word in a
> comment in a header file.
> 
> Was that really what you wanted to do?

So for all your talk about "don't break userspace", you think that
actively taking steps during reviews to avoid a poor userspace API
is "nit-picking"? FYI, having a reviewer ask for a userspace API
modification to:

	- have clearly specified and documented behaviour,
	- be provided with user documentation, and
	- be submitted with regression tests

is not at all unusual or unexpected. Asking for these things during
review on -fsdevel and various filesystem lists is a normal part of
the process for getting changes to user APIs reviewed and merged.
The fact that Jens replied with "yep, no problems, let's make sure
we nail down the semantics" and Al has replied "what does
RESOLVE_NONBLOCK actually mean for all the blocking stuff that open
does /after/ the pathwalk?" shows that these semantics really do
matter Hence they need to be defined, specified, documented and
carefully exercised by regression tests. i.e. the patch that
introduces the RESOLVE_NONBLOCK flag is the -easy part-, filling in
the rest of the blanks is where all the hard work is...

Hence calling these requests "nit picking" sets entirely the wrong
tone for the wider community. You may not care about things like
properly documenting interfaces, but application developers and
users sure do and hence it's something we need to pay attention to
and encourage.

Leaders are supposed to encourage and support good development
practices, not be arseholes to the people who ask for good practices
to be followed.

Please start behaving more like a leader should when I'm around,
Linus.

-Dave.
(Not really Evil.)
Linus Torvalds Dec. 11, 2020, 6:07 p.m. UTC | #7
On Thu, Dec 10, 2020 at 7:45 PM Dave Chinner <david@fromorbit.com> wrote:
>
> Part of the problem we have with the non-blocking behaviour is that
> the user interfaces have been under specified, poorly reviewed and
> targetted a single specific use case on a single filesystem rather
> than generic behaviour. And mostly they lack the necessary test
> coverage to ensure all filesystems behave the same way and to inform
> us of a regression that *would break userspace applications*.

Fair enough. I just didn't really see much a concern here, exactly
because this ends up not being a hard guarantee in the first place
(but the reason I suggested then adding that RESOLVE_NONBLOCK is so
that it could be tested without having to rely on timing and io_uring
that _should_ get the same result regardless in the end).

But the second reason I don't see much concern is exactly because it
wouldn't affect individual filesystems. There's nothing new going on
as far as  filesystem is concerned.

> Yes, I recognise and accept that some of the problems are partially
> my fault. I also have a habit of trying to learn from the mistakes
> I've made and then take steps to ensure that *we do not make those
> same mistakes again*.

So the third reason I reacted was because we have a history, and you
have traditionally not ever really cared unless it's about xfs and IO.
Which this thing would very explicitly not be about. The low-level
filesystem would never see the semantics at all, and could never get
it wrong.

Well, a filesystem could "get it wrong" in the same sense that it can
get the current LOOKUP_RCU wrong, of course.

But that would be either an outright bug and a correctness problem -
sleeping in RCU context - or be a performance problem - returning
ECHILD very easily due to other reasons. And it would be entirely
unrelated to the nonblocking path open, because it would be a
performance issue even _normally_, just not visible as semantics.

And that's the second reason I like this, and would like to see this,
and see RESOLVE_NONBLOCK: exactly because we have _had_ those subtle
issues that aren't actually correctness issues, but only "oh, this
situation always takes us out of RCU lookup, and it's a very subtle
performance problem".

For example, it used to be that whenever we saw a symlink, we'd throw
up our hands and say "we can't do this in RCU lookup" and give up.
That wasn't a low-level filesystem problem, it was literally at the
VFS layer, because the RCU lookup was fairly limited.

Or some of the security models (well, _all_ of them) used to just say
"oh, I can't do this check in RCU mode" and forced the slow path.

It was very noticeable from a performance angle under certain loads,
because RCU lookup is just _so_ much faster when you otherwise get
locking and reference counting cache conflicts. Yes, yes, Al fixed the
symlinks long ago, but we know RCU lookup still gives up fairly easily
in other circumstances.

And RCU lookup giving up eagerly is fine, of course. The whole point
is that it's an optimistic "let's see if we can do this really
quickly" interface that just works _most_ of the time.

But that also makes it easy to miss things, because it's so hard to
test when it's purely about latency and all the operations will retry
and get the right result in the end. The "noticeable performance
issues" are generally not very noticeable at the level of an
individual operation - you need to do a lot of them, and often in
parallel, to see the _big_ benefits.

So RESOLVE_NONBLOCK would be a nice way to perhaps add automated
testing for "did we screw up RCU pathname lookup", in addition to
perhaps making it easier to find the cases where we currently give up
too quickly just because it was _fairly_ rare and nobody had much
visibility into that case.

And we have had that "oh, RCU lookup broke" a couple of times by
mistake - and then it takes a while to notice, because it's purely
visible as a performance bug and not necessarily all _that_ obvious -
exactly because it's purely about performance, and the semantic end
result is the same.

           Linus
diff mbox series

Patch

diff --git a/fs/open.c b/fs/open.c
index 9af548fb841b..07dc9f3d1628 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1087,6 +1087,8 @@  inline int build_open_flags(const struct open_how *how, struct open_flags *op)
 		lookup_flags |= LOOKUP_BENEATH;
 	if (how->resolve & RESOLVE_IN_ROOT)
 		lookup_flags |= LOOKUP_IN_ROOT;
+	if (how->resolve & RESOLVE_NONBLOCK)
+		lookup_flags |= LOOKUP_NONBLOCK;
 
 	op->lookup_flags = lookup_flags;
 	return 0;
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 921e750843e6..919a13c9317c 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -19,7 +19,7 @@ 
 /* List of all valid flags for the how->resolve argument: */
 #define VALID_RESOLVE_FLAGS \
 	(RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS | \
-	 RESOLVE_BENEATH | RESOLVE_IN_ROOT)
+	 RESOLVE_BENEATH | RESOLVE_IN_ROOT | RESOLVE_NONBLOCK)
 
 /* List of all open_how "versions". */
 #define OPEN_HOW_SIZE_VER0	24 /* sizeof first published struct */
diff --git a/include/uapi/linux/openat2.h b/include/uapi/linux/openat2.h
index 58b1eb711360..ddbf0796841a 100644
--- a/include/uapi/linux/openat2.h
+++ b/include/uapi/linux/openat2.h
@@ -35,5 +35,7 @@  struct open_how {
 #define RESOLVE_IN_ROOT		0x10 /* Make all jumps to "/" and ".."
 					be scoped inside the dirfd
 					(similar to chroot(2)). */
+#define RESOLVE_NONBLOCK	0x20 /* Only complete if resolution can be
+					done without IO */
 
 #endif /* _UAPI_LINUX_OPENAT2_H */