
[2/4] fs: add support for LOOKUP_NONBLOCK

Message ID 20201214191323.173773-3-axboe@kernel.dk (mailing list archive)
State New, archived
Series fs: Support for LOOKUP_NONBLOCK / RESOLVE_NONBLOCK

Commit Message

Jens Axboe Dec. 14, 2020, 7:13 p.m. UTC
io_uring always punts opens to async context, since there's no control
over whether the lookup blocks or not. Add LOOKUP_NONBLOCK to support
just doing the fast RCU based lookups, which we know will not block. If
we can do a cached path resolution of the filename, then we don't have
to always punt lookups to a worker.

We explicitly disallow O_CREAT | O_TRUNC opens, as those will require
blocking, and O_TMPFILE as that requires filesystem interactions and
there's currently no way to pass down an attempt to do nonblocking
operations there. This basically boils down to whether or not we can
do the fast path of open. If we can't, then return -EAGAIN and
let the caller retry from an appropriate context that can handle
blocking.

During path resolution, we always do LOOKUP_RCU first. If that fails and
we terminate LOOKUP_RCU, then fail a LOOKUP_NONBLOCK attempt as well.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/namei.c            | 27 ++++++++++++++++++++++++++-
 include/linux/namei.h |  1 +
 2 files changed, 27 insertions(+), 1 deletion(-)
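
For illustration, a minimal userspace sketch of the retry pattern described
above, assuming the RESOLVE_NONBLOCK openat2() flag that patch 4/4 of this
series adds. The flag value and the helper below are assumptions for the
sake of the example, not part of this patch:

#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/openat2.h>

#ifndef RESOLVE_NONBLOCK
#define RESOLVE_NONBLOCK	0x20	/* assumed value, see patch 4/4 */
#endif

/* Try a purely cached (RCU) open first; fall back to a blocking open. */
static int open_cached_or_fallback(const char *path)
{
	struct open_how how;
	int fd;

	memset(&how, 0, sizeof(how));
	how.flags = O_RDONLY;
	how.resolve = RESOLVE_NONBLOCK;

	fd = syscall(__NR_openat2, AT_FDCWD, path, &how, sizeof(how));
	if (fd >= 0 || errno != EAGAIN)
		return fd;

	/* -EAGAIN: lookup would block, retry from a context that can block */
	return open(path, O_RDONLY);
}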

Comments

Matthew Wilcox Dec. 15, 2020, 12:24 p.m. UTC | #1
On Mon, Dec 14, 2020 at 12:13:22PM -0700, Jens Axboe wrote:
> +++ b/fs/namei.c
> @@ -686,6 +686,8 @@ static bool try_to_unlazy(struct nameidata *nd)
>  	BUG_ON(!(nd->flags & LOOKUP_RCU));
>  
>  	nd->flags &= ~LOOKUP_RCU;
> +	if (nd->flags & LOOKUP_NONBLOCK)
> +		goto out1;

If we try a walk in a non-blocking context, it fails, then we punt to
a thread, do we want to prohibit that thread trying an RCU walk first?
I can see arguments both ways -- this may only be a temporary RCU walk
failure, or we may never be able to RCU walk this path.
Jens Axboe Dec. 15, 2020, 3:29 p.m. UTC | #2
On 12/15/20 5:24 AM, Matthew Wilcox wrote:
> On Mon, Dec 14, 2020 at 12:13:22PM -0700, Jens Axboe wrote:
>> +++ b/fs/namei.c
>> @@ -686,6 +686,8 @@ static bool try_to_unlazy(struct nameidata *nd)
>>  	BUG_ON(!(nd->flags & LOOKUP_RCU));
>>  
>>  	nd->flags &= ~LOOKUP_RCU;
>> +	if (nd->flags & LOOKUP_NONBLOCK)
>> +		goto out1;
> 
> If we try a walk in a non-blocking context, it fails, then we punt to
> a thread, do we want to prohibit that thread trying an RCU walk first?
> I can see arguments both ways -- this may only be a temporary RCU walk
> failure, or we may never be able to RCU walk this path.

In my opinion, it's not worth trying to overcomplicate matters by
handling the retry side differently. Better to just keep them the
same. We'd need a lookup anyway to avoid aliasing.
Matthew Wilcox Dec. 15, 2020, 3:33 p.m. UTC | #3
On Tue, Dec 15, 2020 at 08:29:40AM -0700, Jens Axboe wrote:
> On 12/15/20 5:24 AM, Matthew Wilcox wrote:
> > On Mon, Dec 14, 2020 at 12:13:22PM -0700, Jens Axboe wrote:
> >> +++ b/fs/namei.c
> >> @@ -686,6 +686,8 @@ static bool try_to_unlazy(struct nameidata *nd)
> >>  	BUG_ON(!(nd->flags & LOOKUP_RCU));
> >>  
> >>  	nd->flags &= ~LOOKUP_RCU;
> >> +	if (nd->flags & LOOKUP_NONBLOCK)
> >> +		goto out1;
> > 
> > If we try a walk in a non-blocking context, it fails, then we punt to
> > a thread, do we want to prohibit that thread trying an RCU walk first?
> > I can see arguments both ways -- this may only be a temporary RCU walk
> > failure, or we may never be able to RCU walk this path.
> 
> In my opinion, it's not worth trying to overcomplicate matters by
> handling the retry side differently. Better to just keep them the
> same. We'd need a lookup anyway to avoid aliasing.

But by clearing LOOKUP_RCU here, aren't you making the retry handle
things differently? Maybe I got lost.
Jens Axboe Dec. 15, 2020, 3:37 p.m. UTC | #4
On 12/15/20 8:33 AM, Matthew Wilcox wrote:
> On Tue, Dec 15, 2020 at 08:29:40AM -0700, Jens Axboe wrote:
>> On 12/15/20 5:24 AM, Matthew Wilcox wrote:
>>> On Mon, Dec 14, 2020 at 12:13:22PM -0700, Jens Axboe wrote:
>>>> +++ b/fs/namei.c
>>>> @@ -686,6 +686,8 @@ static bool try_to_unlazy(struct nameidata *nd)
>>>>  	BUG_ON(!(nd->flags & LOOKUP_RCU));
>>>>  
>>>>  	nd->flags &= ~LOOKUP_RCU;
>>>> +	if (nd->flags & LOOKUP_NONBLOCK)
>>>> +		goto out1;
>>>
>>> If we try a walk in a non-blocking context, it fails, then we punt to
>>> a thread, do we want to prohibit that thread trying an RCU walk first?
>>> I can see arguments both ways -- this may only be a temporary RCU walk
>>> failure, or we may never be able to RCU walk this path.
>>
>> In my opinion, it's not worth trying to overcomplicate matters by
>> handling the retry side differently. Better to just keep them the
>> same. We'd need a lookup anyway to avoid aliasing.
> 
> But by clearing LOOKUP_RCU here, aren't you making the retry handle
> things differently? Maybe I got lost.

That's already how it works, I'm just clearing LOOKUP_NONBLOCK (which
relies on LOOKUP_RCU) when we're clearing LOOKUP_RCU. I can try and
benchmark skipping LOOKUP_RCU when we do the blocking retry, but my gut
tells me it'll be noise.
Jens Axboe Dec. 15, 2020, 4:08 p.m. UTC | #5
On 12/15/20 8:37 AM, Jens Axboe wrote:
> On 12/15/20 8:33 AM, Matthew Wilcox wrote:
>> On Tue, Dec 15, 2020 at 08:29:40AM -0700, Jens Axboe wrote:
>>> On 12/15/20 5:24 AM, Matthew Wilcox wrote:
>>>> On Mon, Dec 14, 2020 at 12:13:22PM -0700, Jens Axboe wrote:
>>>>> +++ b/fs/namei.c
>>>>> @@ -686,6 +686,8 @@ static bool try_to_unlazy(struct nameidata *nd)
>>>>>  	BUG_ON(!(nd->flags & LOOKUP_RCU));
>>>>>  
>>>>>  	nd->flags &= ~LOOKUP_RCU;
>>>>> +	if (nd->flags & LOOKUP_NONBLOCK)
>>>>> +		goto out1;
>>>>
>>>> If we try a walk in a non-blocking context, it fails, then we punt to
>>>> a thread, do we want to prohibit that thread trying an RCU walk first?
>>>> I can see arguments both ways -- this may only be a temporary RCU walk
>>>> failure, or we may never be able to RCU walk this path.
>>>
>>> In my opinion, it's not worth trying to overcomplicate matters by
>>> handling the retry side differently. Better to just keep them the
>>> same. We'd need a lookup anyway to avoid aliasing.
>>
>> But by clearing LOOKUP_RCU here, aren't you making the retry handle
>> things differently? Maybe I got lost.
> 
> That's already how it works, I'm just clearing LOOKUP_NONBLOCK (which
> relies on LOOKUP_RCU) when we're clearing LOOKUP_RCU. I can try and
> benchmark skipping LOOKUP_RCU when we do the blocking retry, but my gut
> tells me it'll be noise.

OK, ran some numbers. The test app benchmarks opening X files; I just
used /usr on my test box. That's 182,677 files. To mimic a real-world
kind of setup, 33% of the files can be looked up hot, so LOOKUP_NONBLOCK
will succeed.

Patchset as posted:

Method		Time (usec)
---------------------------
openat		2,268,930
openat		2,274,256
openat		2,274,256
io_uring	  917,813
io_uring	  921,448 
io_uring	  915,233

And with a LOOKUP_NO_RCU flag, which io_uring sets when it has to do
retry, and which will make namei skip the first LOOKUP_RCU for path
resolution:

Method		Time (usec)
---------------------------
io_uring	  902,410
io_uring	  902,725
io_uring	  896,289

Definitely not faster - whether that's just reboot noise, or if it's
significant, I'd need to look deeper to figure out.
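
For reference, a rough sketch of what such an open-everything loop looks
like when driven through io_uring with liburing; the queue depth,
one-at-a-time submission, and error handling are simplified assumptions,
not the actual test app:

#include <fcntl.h>
#include <unistd.h>
#include <liburing.h>

/* Open and immediately close every path in the list via io_uring. */
static long open_all_uring(char **paths, int nr)
{
	struct io_uring ring;
	long opened = 0;
	int i;

	if (io_uring_queue_init(64, &ring, 0) < 0)
		return -1;

	for (i = 0; i < nr; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
		struct io_uring_cqe *cqe;

		if (!sqe)
			break;
		io_uring_prep_openat(sqe, AT_FDCWD, paths[i], O_RDONLY, 0);
		io_uring_submit(&ring);
		if (io_uring_wait_cqe(&ring, &cqe) < 0)
			break;
		if (cqe->res >= 0) {
			close(cqe->res);
			opened++;
		}
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	return opened;
}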
Jens Axboe Dec. 15, 2020, 4:14 p.m. UTC | #6
On 12/15/20 9:08 AM, Jens Axboe wrote:
> On 12/15/20 8:37 AM, Jens Axboe wrote:
>> On 12/15/20 8:33 AM, Matthew Wilcox wrote:
>>> On Tue, Dec 15, 2020 at 08:29:40AM -0700, Jens Axboe wrote:
>>>> On 12/15/20 5:24 AM, Matthew Wilcox wrote:
>>>>> On Mon, Dec 14, 2020 at 12:13:22PM -0700, Jens Axboe wrote:
>>>>>> +++ b/fs/namei.c
>>>>>> @@ -686,6 +686,8 @@ static bool try_to_unlazy(struct nameidata *nd)
>>>>>>  	BUG_ON(!(nd->flags & LOOKUP_RCU));
>>>>>>  
>>>>>>  	nd->flags &= ~LOOKUP_RCU;
>>>>>> +	if (nd->flags & LOOKUP_NONBLOCK)
>>>>>> +		goto out1;
>>>>>
>>>>> If we try a walk in a non-blocking context, it fails, then we punt to
>>>>> a thread, do we want to prohibit that thread trying an RCU walk first?
>>>>> I can see arguments both ways -- this may only be a temporary RCU walk
>>>>> failure, or we may never be able to RCU walk this path.
>>>>
>>>> In my opinion, it's not worth trying to overcomplicate matters by
>>>> handling the retry side differently. Better to just keep them the
>>>> same. We'd need a lookup anyway to avoid aliasing.
>>>
>>> But by clearing LOOKUP_RCU here, aren't you making the retry handle
>>> things differently? Maybe I got lost.
>>
>> That's already how it works, I'm just clearing LOOKUP_NONBLOCK (which
>> relies on LOOKUP_RCU) when we're clearing LOOKUP_RCU. I can try and
>> benchmark skipping LOOKUP_RCU when we do the blocking retry, but my gut
>> tells me it'll be noise.
> 
> OK, ran some numbers. The test app benchmarks opening X files; I just
> used /usr on my test box. That's 182,677 files. To mimic a real-world
> kind of setup, 33% of the files can be looked up hot, so LOOKUP_NONBLOCK
> will succeed.
> 
> Patchset as posted:
> 
> Method		Time (usec)
> ---------------------------
> openat		2,268,930
> openat		2,274,256
> openat		2,274,256
> io_uring	  917,813
> io_uring	  921,448 
> io_uring	  915,233
> 
> And with a LOOKUP_NO_RCU flag, which io_uring sets when it has to do
> retry, and which will make namei skip the first LOOKUP_RCU for path
> resolution:
> 
> Method		Time (usec)
> ---------------------------
> io_uring	  902,410
> io_uring	  902,725
> io_uring	  896,289
> 
> Definitely not faster - whether that's just reboot noise, or if it's
> significant, I'd need to look deeper to figure out.

If you're puzzled by the conclusion given the numbers, there's a good
reason: the first table is actually io_uring + LOOKUP_NO_RCU for retry,
and the second table is io_uring as posted. I mistakenly swapped the
numbers around...

So the conclusion still stands; I just pasted the sets in under the wrong
table headings.
Linus Torvalds Dec. 15, 2020, 6:29 p.m. UTC | #7
On Tue, Dec 15, 2020 at 8:08 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> OK, ran some numbers. The test app benchmarks opening X files; I just
> used /usr on my test box. That's 182,677 files. To mimic a real-world
> kind of setup, 33% of the files can be looked up hot, so LOOKUP_NONBLOCK
> will succeed.

Perhaps more interestingly, what difference does the patchset as posted
make for just io_uring?

IOW, does the synchronous LOOKUP_NONBLOCK actually help?

I'm obviously a big believer in the whole "avoid thread setup costs if
not necessary", so I'd _expect_ it to help, but maybe the possible
extra parallelism is enough to overcome the thread setup and
synchronization costs even for a fast cached RCU lookup.

(I also suspect the reality is often much closer to 100% cached
lookups than just 33%, but who knows - there are things like just
concurrent renames that can cause the RCU lookup to fail even if it
_was_ cached, so it's not purely about whether things are in the
dcache or not).

              Linus
Jens Axboe Dec. 15, 2020, 6:44 p.m. UTC | #8
On 12/15/20 11:29 AM, Linus Torvalds wrote:
> On Tue, Dec 15, 2020 at 8:08 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> OK, ran some numbers. The test app benchmarks opening X files; I just
>> used /usr on my test box. That's 182,677 files. To mimic a real-world
>> kind of setup, 33% of the files can be looked up hot, so LOOKUP_NONBLOCK
>> will succeed.
> 
> Perhaps more interestingly, what difference does the patchset as posted
> make for just io_uring?
> 
> IOW, does the synchronous LOOKUP_NONBLOCK actually help?
> 
> I'm obviously a big believer in the whole "avoid thread setup costs if
> not necessary", so I'd _expect_ it to help, but maybe the possible
> extra parallelism is enough to overcome the thread setup and
> synchronization costs even for a fast cached RCU lookup.

For basically all cases on the io_uring side where I've ended up being
able to do the hot/fast path inline, it's been a nice win. The only real
exception to that rule is buffered reads that are fully cached, where
having multiple async workers copy the data is obviously going to be
faster at some point due to the extra parallelism and memory bandwidth.
So yes, I too am a big believer in being able to perform operations
inline if at all possible, even if for some things it turns into a full
retry when we fail. The hot path more than makes up for it.

> (I also suspect the reality is often much closer to 100% cached
> lookups than just 33%, but who knows - there are things like just
> concurrent renames that can cause the RCU lookup to fail even if it
> _was_ cached, so it's not purely about whether things are in the
> dcache or not).

In usecs again, same test, this time just using io_uring:

Cached		5.10-git	5.10-git+LOOKUP_NONBLOCK
--------------------------------------------------------
33%		1,014,975	900,474
100%		435,636		151,475

As expected, the closer we get to fully cached, the better off we are
with using LOOKUP_NONBLOCK. It's a nice win even at just 33% cached.
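
(For reference, those numbers work out to roughly a 1.13x win at 33%
cached and close to a 2.9x win at 100% cached.)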
Linus Torvalds Dec. 15, 2020, 6:47 p.m. UTC | #9
On Tue, Dec 15, 2020 at 10:44 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> In usecs again, same test, this time just using io_uring:
>
> Cached          5.10-git        5.10-git+LOOKUP_NONBLOCK
> --------------------------------------------------------
> 33%             1,014,975       900,474
> 100%            435,636         151,475

Ok, that's quite convincing. Thanks,

            Linus
Jens Axboe Dec. 15, 2020, 7:03 p.m. UTC | #10
On 12/15/20 11:47 AM, Linus Torvalds wrote:
> On Tue, Dec 15, 2020 at 10:44 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> In usecs again, same test, this time just using io_uring:
>>
>> Cached          5.10-git        5.10-git+LOOKUP_NONBLOCK
>> --------------------------------------------------------
>> 33%             1,014,975       900,474
>> 100%            435,636         151,475
> 
> Ok, that's quite convincing. Thanks,

Just for completeness, here's 89% (the tool is pretty basic, so that's
the closest I can get to 90%):

Cached          5.10-git        5.10-git+LOOKUP_NONBLOCK
--------------------------------------------------------
89%             545,466         292,937

which is about a 1.8x win, where 100% is a 2.8x win.

And for comparison, doing the same thing with the regular openat2()
system call takes 867,897 usec.
Linus Torvalds Dec. 15, 2020, 7:32 p.m. UTC | #11
On Tue, Dec 15, 2020 at 11:03 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> And for comparison, doing the same thing with the regular openat2()
> system call takes 867,897 usec.

Well, you could always thread it. That's what git does for its
'stat()' loop for 'git diff' and friends (the whole "check that the
index is up-to-date with all the files in the working tree" is
basically just a lot of lstat() calls).

           Linus
Jens Axboe Dec. 15, 2020, 7:38 p.m. UTC | #12
On 12/15/20 12:32 PM, Linus Torvalds wrote:
> On Tue, Dec 15, 2020 at 11:03 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> And for comparison, doing the same thing with the regular openat2()
>> system call takes 867,897 usec.
> 
> Well, you could always thread it. That's what git does for its
> 'stat()' loop for 'git diff' and friends (the whole "check that the
> index is up-to-date with all the files in the working tree" is
> basically just a lot of lstat() calls).

Sure, but imho the whole point of LOOKUP_NONBLOCK + io_uring is that you
don't have to worry about (or deal with) threading. As the offload
numbers prove, offloading when you don't have to is always going to be a
big loss. I like the model where the fast path is just done the right
way, and the slow path is handled automatically, a whole lot more than
chunking it off like that.

Granted, that's less of an issue if the workload is "stat this crap ton
of files"; it's more difficult when you're doing it in a more piecemeal
fashion as needed.

Using io_uring stat with git has actually been on my list, but I needed
to get the LOOKUP_NONBLOCK done first...
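
To make that concrete, a rough sketch of batching lstat()-style checks
through io_uring using liburing's statx helper; the batch size, mask, and
single-ring assumption are illustrative, not taken from any actual tool:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <liburing.h>

/*
 * Issue one statx per path in a single batch; assumes nr fits one ring.
 * Results land directly in stx[] in submission order; completion order
 * is not tracked here, only the number of completions reaped.
 */
static int statx_batch(char **paths, struct statx *stx, int nr)
{
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	int i, done = 0;

	if (io_uring_queue_init(nr, &ring, 0) < 0)
		return -1;

	for (i = 0; i < nr; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		if (!sqe)
			break;
		io_uring_prep_statx(sqe, AT_FDCWD, paths[i],
				    AT_SYMLINK_NOFOLLOW, STATX_BASIC_STATS,
				    &stx[i]);
	}
	io_uring_submit(&ring);

	for (; done < i; done++) {
		if (io_uring_wait_cqe(&ring, &cqe) < 0)
			break;
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	return done;
}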
Al Viro Dec. 16, 2020, 2:36 a.m. UTC | #13
On Mon, Dec 14, 2020 at 12:13:22PM -0700, Jens Axboe wrote:
> io_uring always punts opens to async context, since there's no control
> over whether the lookup blocks or not. Add LOOKUP_NONBLOCK to support
> just doing the fast RCU based lookups, which we know will not block. If
> we can do a cached path resolution of the filename, then we don't have
> to always punt lookups to a worker.
> 
> We explicitly disallow O_CREAT | O_TRUNC opens, as those will require
> blocking, and O_TMPFILE as that requires filesystem interactions and
> there's currently no way to pass down an attempt to do nonblocking
> operations there. This basically boils down to whether or not we can
> do the fast path of open. If we can't, then return -EAGAIN and
> let the caller retry from an appropriate context that can handle
> blocking.
> 
> During path resolution, we always do LOOKUP_RCU first. If that fails and
> we terminate LOOKUP_RCU, then fail a LOOKUP_NONBLOCK attempt as well.

Ho-hum...  FWIW, I'm tempted to do the same change of calling conventions
for unlazy_child() (try_to_unlazy_child(), true on success).  OTOH, the
call site is right next to removal of unlikely(status == -ECHILD) suggested
a few days ago...

Mind if I take your first commit + that removal of unlikely + change of calling
conventions for unlazy_child() into #work.namei (based at 5.10), so that
the rest of your series gets rebased on top of that?

> @@ -3299,7 +3315,16 @@ static int do_tmpfile(struct nameidata *nd, unsigned flags,
>  {
>  	struct dentry *child;
>  	struct path path;
> -	int error = path_lookupat(nd, flags | LOOKUP_DIRECTORY, &path);
> +	int error;
> +
> +	/*
> +	 * We can't guarantee that the fs doesn't block further down, so
> +	 * just disallow nonblock attempts at O_TMPFILE for now.
> +	 */
> +	if (flags & LOOKUP_NONBLOCK)
> +		return -EAGAIN;

Not sure I like it here, TBH...
Al Viro Dec. 16, 2020, 2:43 a.m. UTC | #14
On Mon, Dec 14, 2020 at 12:13:22PM -0700, Jens Axboe wrote:
> @@ -3140,6 +3149,12 @@ static const char *open_last_lookups(struct nameidata *nd,
>  			return ERR_CAST(dentry);
>  		if (likely(dentry))
>  			goto finish_lookup;
> +		/*
> +		 * We can't guarantee nonblocking semantics beyond this, if
> +		 * the fast lookup fails.
> +		 */
> +		if (nd->flags & LOOKUP_NONBLOCK)
> +			return ERR_PTR(-EAGAIN);
>  
>  		BUG_ON(nd->flags & LOOKUP_RCU);

That can't be right - we already must have removed LOOKUP_RCU here
(see BUG_ON() right after that point).  What is that test supposed
to catch?

What am I missing here?
Jens Axboe Dec. 16, 2020, 3:30 a.m. UTC | #15
On 12/15/20 7:36 PM, Al Viro wrote:
> On Mon, Dec 14, 2020 at 12:13:22PM -0700, Jens Axboe wrote:
>> io_uring always punts opens to async context, since there's no control
>> over whether the lookup blocks or not. Add LOOKUP_NONBLOCK to support
>> just doing the fast RCU based lookups, which we know will not block. If
>> we can do a cached path resolution of the filename, then we don't have
>> to always punt lookups to a worker.
>>
>> We explicitly disallow O_CREAT | O_TRUNC opens, as those will require
>> blocking, and O_TMPFILE as that requires filesystem interactions and
>> there's currently no way to pass down an attempt to do nonblocking
>> operations there. This basically boils down to whether or not we can
>> do the fast path of open. If we can't, then return -EAGAIN and
>> let the caller retry from an appropriate context that can handle
>> blocking.
>>
>> During path resolution, we always do LOOKUP_RCU first. If that fails and
>> we terminate LOOKUP_RCU, then fail a LOOKUP_NONBLOCK attempt as well.
> 
> Ho-hum...  FWIW, I'm tempted to do the same change of calling
> conventions for unlazy_child() (try_to_unlazy_child(), true on
> success).  OTOH, the call site is right next to removal of
> unlikely(status == -ECHILD) suggested a few days ago...
> 
> Mind if I take your first commit + that removal of unlikely + change
> of calling conventions for unlazy_child() into #work.namei (based at
> 5.10), so that the rest of your series gets rebased on top of that?

Of course, go ahead.

>> @@ -3299,7 +3315,16 @@ static int do_tmpfile(struct nameidata *nd, unsigned flags,
>>  {
>>  	struct dentry *child;
>>  	struct path path;
>> -	int error = path_lookupat(nd, flags | LOOKUP_DIRECTORY, &path);
>> +	int error;
>> +
>> +	/*
>> +	 * We can't guarantee that the fs doesn't block further down, so
>> +	 * just disallow nonblock attempts at O_TMPFILE for now.
>> +	 */
>> +	if (flags & LOOKUP_NONBLOCK)
>> +		return -EAGAIN;
> 
> Not sure I like it here, TBH...

This ties in with the later email, so you'd prefer to gate this upfront
instead of putting it in here? I'm fine with that.
Jens Axboe Dec. 16, 2020, 3:32 a.m. UTC | #16
On 12/15/20 7:43 PM, Al Viro wrote:
> On Mon, Dec 14, 2020 at 12:13:22PM -0700, Jens Axboe wrote:
>> @@ -3140,6 +3149,12 @@ static const char *open_last_lookups(struct nameidata *nd,
>>  			return ERR_CAST(dentry);
>>  		if (likely(dentry))
>>  			goto finish_lookup;
>> +		/*
>> +		 * We can't guarantee nonblocking semantics beyond this, if
>> +		 * the fast lookup fails.
>> +		 */
>> +		if (nd->flags & LOOKUP_NONBLOCK)
>> +			return ERR_PTR(-EAGAIN);
>>  
>>  		BUG_ON(nd->flags & LOOKUP_RCU);
> 
> That can't be right - we already must have removed LOOKUP_RCU here
> (see BUG_ON() right after that point).  What is that test supposed
> to catch?
> 
> What am I missing here?

Nothing, I think - it doesn't look like it's needed. If we don't return
a valid dentry under LOOKUP_RCU, we will indeed have unlazied at
this point. So this hunk can go.

Patch

diff --git a/fs/namei.c b/fs/namei.c
index 7eb7830da298..83a7f7866232 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -686,6 +686,8 @@  static bool try_to_unlazy(struct nameidata *nd)
 	BUG_ON(!(nd->flags & LOOKUP_RCU));
 
 	nd->flags &= ~LOOKUP_RCU;
+	if (nd->flags & LOOKUP_NONBLOCK)
+		goto out1;
 	if (unlikely(!legitimize_links(nd)))
 		goto out1;
 	if (unlikely(!legitimize_path(nd, &nd->path, nd->seq)))
@@ -722,6 +724,8 @@  static int unlazy_child(struct nameidata *nd, struct dentry *dentry, unsigned se
 	BUG_ON(!(nd->flags & LOOKUP_RCU));
 
 	nd->flags &= ~LOOKUP_RCU;
+	if (nd->flags & LOOKUP_NONBLOCK)
+		goto out2;
 	if (unlikely(!legitimize_links(nd)))
 		goto out2;
 	if (unlikely(!legitimize_mnt(nd->path.mnt, nd->m_seq)))
@@ -792,6 +796,7 @@  static int complete_walk(struct nameidata *nd)
 		 */
 		if (!(nd->flags & (LOOKUP_ROOT | LOOKUP_IS_SCOPED)))
 			nd->root.mnt = NULL;
+		nd->flags &= ~LOOKUP_NONBLOCK;
 		if (!try_to_unlazy(nd))
 			return -ECHILD;
 	}
@@ -2202,6 +2207,10 @@  static const char *path_init(struct nameidata *nd, unsigned flags)
 	int error;
 	const char *s = nd->name->name;
 
+	/* LOOKUP_NONBLOCK requires RCU, ask caller to retry */
+	if ((flags & (LOOKUP_RCU | LOOKUP_NONBLOCK)) == LOOKUP_NONBLOCK)
+		return ERR_PTR(-EAGAIN);
+
 	if (!*s)
 		flags &= ~LOOKUP_RCU;
 	if (flags & LOOKUP_RCU)
@@ -3140,6 +3149,12 @@  static const char *open_last_lookups(struct nameidata *nd,
 			return ERR_CAST(dentry);
 		if (likely(dentry))
 			goto finish_lookup;
+		/*
+		 * We can't guarantee nonblocking semantics beyond this, if
+		 * the fast lookup fails.
+		 */
+		if (nd->flags & LOOKUP_NONBLOCK)
+			return ERR_PTR(-EAGAIN);
 
 		BUG_ON(nd->flags & LOOKUP_RCU);
 	} else {
@@ -3233,6 +3248,7 @@  static int do_open(struct nameidata *nd,
 		open_flag &= ~O_TRUNC;
 		acc_mode = 0;
 	} else if (d_is_reg(nd->path.dentry) && open_flag & O_TRUNC) {
+		WARN_ON_ONCE(nd->flags & LOOKUP_NONBLOCK);
 		error = mnt_want_write(nd->path.mnt);
 		if (error)
 			return error;
@@ -3299,7 +3315,16 @@  static int do_tmpfile(struct nameidata *nd, unsigned flags,
 {
 	struct dentry *child;
 	struct path path;
-	int error = path_lookupat(nd, flags | LOOKUP_DIRECTORY, &path);
+	int error;
+
+	/*
+	 * We can't guarantee that the fs doesn't block further down, so
+	 * just disallow nonblock attempts at O_TMPFILE for now.
+	 */
+	if (flags & LOOKUP_NONBLOCK)
+		return -EAGAIN;
+
+	error = path_lookupat(nd, flags | LOOKUP_DIRECTORY, &path);
 	if (unlikely(error))
 		return error;
 	error = mnt_want_write(path.mnt);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index a4bb992623c4..c36c4e0805fc 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -46,6 +46,7 @@  enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT};
 #define LOOKUP_NO_XDEV		0x040000 /* No mountpoint crossing. */
 #define LOOKUP_BENEATH		0x080000 /* No escaping from starting point. */
 #define LOOKUP_IN_ROOT		0x100000 /* Treat dirfd as fs root. */
+#define LOOKUP_NONBLOCK		0x200000 /* don't block for lookup */
 /* LOOKUP_* flags which do scope-related checks based on the dirfd. */
 #define LOOKUP_IS_SCOPED (LOOKUP_BENEATH | LOOKUP_IN_ROOT)