
[5/6] io_uring: add support for futex wake and wait

Message ID 20230609183125.673140-6-axboe@kernel.dk (mailing list archive)
State New
Series Add io_uring support for futex wait/wake

Commit Message

Jens Axboe June 9, 2023, 6:31 p.m. UTC
Add support for FUTEX_WAKE/WAIT primitives.

IORING_OP_FUTEX_WAKE is a mix of FUTEX_WAKE and FUTEX_WAKE_BITSET, as
it does support passing in a bitset.

Similarly, IORING_OP_FUTEX_WAIT is a mix of FUTEX_WAIT and
FUTEX_WAIT_BITSET.

FUTEX_WAKE is straightforward, as we can always just do those inline.
FUTEX_WAIT will queue the futex with an appropriate callback, and
that callback will in turn post a CQE when it has triggered.
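
As a rough illustration of how userspace would drive this, here is a
sketch based on the SQE field mapping in io_futex_prep() further down.
The helper names are made up, and it assumes headers carrying this
series' uapi additions, since liburing has no futex support yet:

   #include <string.h>
   #include <stdint.h>
   #include <linux/futex.h>
   #include <liburing.h>

   /* sketch only: fields per io_futex_prep() in this patch */
   static void sketch_prep_futex_wait(struct io_uring_sqe *sqe, uint32_t *uaddr,
                                      uint32_t expected, uint32_t mask)
   {
           memset(sqe, 0, sizeof(*sqe));
           sqe->opcode = IORING_OP_FUTEX_WAIT;
           sqe->addr = (unsigned long) uaddr;  /* futex word */
           sqe->len = expected;                /* value the word must still hold */
           sqe->file_index = mask;             /* wait bitset, e.g. FUTEX_BITSET_MATCH_ANY */
           sqe->futex_flags = 0;               /* FUTEX_CMD_MASK bits are rejected */
           sqe->addr2 = 0;                     /* optional __kernel_timespec timeout */
   }

   static void sketch_prep_futex_wake(struct io_uring_sqe *sqe, uint32_t *uaddr,
                                      uint32_t nr_wake, uint32_t mask)
   {
           memset(sqe, 0, sizeof(*sqe));
           sqe->opcode = IORING_OP_FUTEX_WAKE;
           sqe->addr = (unsigned long) uaddr;
           sqe->len = nr_wake;                 /* how many waiters to wake */
           sqe->file_index = mask;
   }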

Cancelations are supported, both from the application point-of-view
and to be able to cancel pending waits if the ring exits before
all events have occurred.

This is just the barebones wait/wake support. Features to be added
later:

- We do not support the PI or requeue operations. The immediate use
  case doesn't need them; unsure if future support for these would be
  useful.

- Should we support futex wait with timeout? Not clear if this is
  necessary, as the usual io_uring linked timeouts could fill this
  purpose.

- Would be nice to support registered futexes, just like we do buffers.
  This would avoid mapping in user memory for each operation.

- Probably lots more that I just didn't think of.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/io_uring_types.h |   2 +
 include/uapi/linux/io_uring.h  |   3 +
 io_uring/Makefile              |   4 +-
 io_uring/cancel.c              |   5 +
 io_uring/cancel.h              |   4 +
 io_uring/futex.c               | 194 +++++++++++++++++++++++++++++++++
 io_uring/futex.h               |  26 +++++
 io_uring/io_uring.c            |   3 +
 io_uring/opdef.c               |  25 ++++-
 9 files changed, 264 insertions(+), 2 deletions(-)
 create mode 100644 io_uring/futex.c
 create mode 100644 io_uring/futex.h

Comments

Gabriel Krisman Bertazi June 12, 2023, 4:06 p.m. UTC | #1
Jens Axboe <axboe@kernel.dk> writes:

> Add support for FUTEX_WAKE/WAIT primitives.

This is great.  I was so sure io_uring had this support already for some
reason.  I might have dreamed it.

The semantics are tricky, though. You might want to CC peterZ and tglx
directly.

> IORING_OP_FUTEX_WAKE is a mix of FUTEX_WAKE and FUTEX_WAKE_BITSET, as
> it does support passing in a bitset.

As far as I know, the _BITSET variants are not commonly used in the
current interface.  I haven't seen any code that really benefits from
it.

> Similarly, IORING_OP_FUTEX_WAIT is a mix of FUTEX_WAIT and
> FUTEX_WAIT_BITSET.

But it is definitely safe to have a single one, basically with the
_BITSET semantics.

> FUTEX_WAKE is straightforward, as we can always just do those inline.
> FUTEX_WAIT will queue the futex with an appropriate callback, and
> that callback will in turn post a CQE when it has triggered.

Even with an asynchronous model, it might make sense to halt execution
of further queued operations until futex completes.  I think
IOSQE_IO_DRAIN is a barrier only against the submission part, so it
wouldn't help.  Is there a way to ensure this ordering?

I know, it goes against the asynchronous nature of io_uring, but I think
it might be a valid use case. Say we extend FUTEX_WAIT with a way to
acquire the futex in kernel space.  Then, when the CQE returns, we know
the lock is acquired.  If we can queue dependencies on that (stronger
than the link semantics), we could queue operations to be executed once
the lock is taken. Makes sense?

> Cancelations are supported, both from the application point-of-view
> and to be able to cancel pending waits if the ring exits before
> all events have occurred.
>
> This is just the barebones wait/wake support. Features to be added
> later:

One item high on my wishlist would be the futexv semantics (wait on any
of a set of futexes).  It cannot be implemented by issuing several
FUTEX_WAIT.
Jens Axboe June 12, 2023, 8:37 p.m. UTC | #2
On 6/12/23 10:06 AM, Gabriel Krisman Bertazi wrote:
> Jens Axboe <axboe@kernel.dk> writes:
> 
>> Add support for FUTEX_WAKE/WAIT primitives.
> 
> This is great.  I was so sure io_uring had this support already for some
> reason.  I might have dreamed it.

I think you did :-)

> The semantics are tricky, though. You might want to CC peterZ and tglx
> directly.

For sure, I'll take it wider soon enough. Just wanted to iron out
io_uring details first.

>> IORING_OP_FUTEX_WAKE is a mix of FUTEX_WAKE and FUTEX_WAKE_BITSET, as
>> it does support passing in a bitset.
> 
> As far as I know, the _BITSET variants are not commonly used in the
> current interface.  I haven't seen any code that really benefits from
> it.

Since FUTEX_WAKE is a strict subset of FUTEX_WAKE_BITSET, it makes
little sense not to just support both imho.
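
(For reference, a plain wake is the bitset wake with an all-ones mask,
which is why a single opcode can cover both. Rough sketch at the raw
syscall level, with uaddr and n as placeholders:)

   #include <linux/futex.h>
   #include <sys/syscall.h>
   #include <unistd.h>

   /* these two wake up to 'n' waiters in the same way */
   syscall(SYS_futex, uaddr, FUTEX_WAKE, n, NULL, NULL, 0);
   syscall(SYS_futex, uaddr, FUTEX_WAKE_BITSET, n, NULL, NULL, FUTEX_BITSET_MATCH_ANY);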

>> Similarly, IORING_OP_FUTEX_WAIT is a mix of FUTEX_WAIT and
>> FUTEX_WAIT_BITSET.
> 
> But it is definitely safe to have a single one, basically with the
> _BITSET semantics.

Yep I think so.

>> FUTEX_WAKE is straightforward, as we can always just do those inline.
>> FUTEX_WAIT will queue the futex with an appropriate callback, and
>> that callback will in turn post a CQE when it has triggered.
> 
> Even with an asynchronous model, it might make sense to halt execution
> of further queued operations until futex completes.  I think
> IOSQE_IO_DRAIN is a barrier only against the submission part, so it
> wouldn't help.  Is there a way to ensure this ordering?

You'd use link for that - link whatever depends on the wake to the futex
wait. Or just queue it up once you reap the wait completion, when that
is posted because we got woken.
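
Roughly, with the sketch helpers from the commit message and assuming
'ring', 'futex_word', 'fd' and 'buf' are already set up, that could
look like:

   struct io_uring_sqe *sqe;

   sqe = io_uring_get_sqe(&ring);
   sketch_prep_futex_wait(sqe, futex_word, 0, FUTEX_BITSET_MATCH_ANY);
   sqe->flags |= IOSQE_IO_LINK;    /* next SQE is held until this one completes */
   sqe->user_data = 1;

   sqe = io_uring_get_sqe(&ring);
   io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
   sqe->user_data = 2;

   io_uring_submit(&ring);         /* the read only runs once the wake has happened */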

> I know, it goes against the asynchronous nature of io_uring, but I think
> it might be a valid use case. Say we extend FUTEX_WAIT with a way to
> acquire the futex in kernel space.  Then, when the CQE returns, we know
> the lock is acquired.  If we can queue dependencies on that (stronger
> than the link semantics), we could queue operations to be executed once
> the lock is taken. Makes sense?

It does, and acquiring it _may_ make sense indeed. But I'd rather punt
that to a later thing, and focus on getting the standard (and smaller)
primitives done first.

>> Cancelations are supported, both from the application point-of-view
>> and to be able to cancel pending waits if the ring exits before
>> all events have occurred.
>>
>> This is just the barebones wait/wake support. Features to be added
>> later:
> 
> One item high on my wishlist would be the futexv semantics (wait on any
> of a set of futexes).  It cannot be implemented by issuing several
> FUTEX_WAIT.

Yep, I do think that one is interesting enough to consider upfront.
Unfortunately the internal implementation of that does not look that
great, though I'm sure we can make that work. But would likely require
some futexv refactoring to make it work. I can take a look at it.

You could obviously do futexv with this patchset, just posting N futex
waits and canceling N-1 when you get woken by one. Though that's of
course not very pretty or nice to use, but design wise it would totally
work as you don't actually block on these with io_uring.
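
A sketch of that approach, using the made-up sketch_prep_futex_wait()
helper from the commit message and liburing's io_uring_prep_cancel64(),
with 'ring', 'futexes[]' and NR_FUTEXES as placeholders:

   struct io_uring_cqe *cqe;
   uint64_t woken;

   for (uint64_t i = 0; i < NR_FUTEXES; i++) {
           struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

           sketch_prep_futex_wait(sqe, &futexes[i], 0, FUTEX_BITSET_MATCH_ANY);
           sqe->user_data = i;
   }
   io_uring_submit(&ring);

   io_uring_wait_cqe(&ring, &cqe);         /* one of the futexes got woken */
   woken = cqe->user_data;
   io_uring_cqe_seen(&ring, cqe);

   for (uint64_t i = 0; i < NR_FUTEXES; i++) {
           struct io_uring_sqe *sqe;

           if (i == woken)
                   continue;
           sqe = io_uring_get_sqe(&ring);
           io_uring_prep_cancel64(sqe, i, 0);  /* cancel op posts its own CQE; the wait completes with -ECANCELED */
           sqe->user_data = NR_FUTEXES + i;
   }
   io_uring_submit(&ring);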
Gabriel Krisman Bertazi June 12, 2023, 11 p.m. UTC | #3
Jens Axboe <axboe@kernel.dk> writes:

> On 6/12/23 10:06 AM, Gabriel Krisman Bertazi wrote:
>> Jens Axboe <axboe@kernel.dk> writes:
>> 
>>> Add support for FUTEX_WAKE/WAIT primitives.
>> 
>> This is great.  I was so sure io_uring had this support already for some
>> reason.  I might have dreamed it.
>
> I think you did :-)

Premonitory!  Still, there should be better things to dream about than
kernel code.

>> Even with an asynchronous model, it might make sense to halt execution
>> of further queued operations until futex completes.  I think
>> IOSQE_IO_DRAIN is a barrier only against the submission part, so it
>> wouldn't help.  Is there a way to ensure this ordering?
>
> You'd use link for that - link whatever depends on the wake to the futex
> wait. Or just queue it up once you reap the wait completion, when that
> is posted because we got woken.

The challenge of linked requests, in my opinion, is that once a link
chain starts, everything needs to be linked together, and a single error
fails everything, which is ok when operations are related, but
not so much when doing IO to different files from the same ring.

>>> Cancelations are supported, both from the application point-of-view
>>> and to be able to cancel pending waits if the ring exits before
>>> all events have occurred.
>>>
>>> This is just the barebones wait/wake support. Features to be added
>>> later:
>> 
>> One item high on my wishlist would be the futexv semantics (wait on any
>> of a set of futexes).  It cannot be implemented by issuing several
>> FUTEX_WAIT.
>
> Yep, I do think that one is interesting enough to consider upfront.
> Unfortunately the internal implementation of that does not look that
> great, though I'm sure we can make that work. But would likely require
> some futexv refactoring to make it work. I can take a look at it.

No disagreement here.  To be fair, the main challenge was making the new
interface compatible with a futex being waited on/waked with the
original interface. At some point, we had a really nice design for a
single object, but we spent two years bikeshedding over the interface
and ended
up merging something pretty much similar to the proposal from two years
prior.

> You could obviously do futexv with this patchset, just posting N futex
> waits and canceling N-1 when you get woken by one. Though that's of
> course not very pretty or nice to use, but design wise it would totally
> work as you don't actually block on these with io_uring.

Yes, but at that point, I guess it'd make more sense to implement the
same semantics by polling over a set of eventfds or having a single
futex and doing dispatch in userspace.

thanks,
Jens Axboe June 13, 2023, 1:09 a.m. UTC | #4
On 6/12/23 5:00 PM, Gabriel Krisman Bertazi wrote:
> Jens Axboe <axboe@kernel.dk> writes:
> 
>> On 6/12/23 10:06 AM, Gabriel Krisman Bertazi wrote:
>>> Jens Axboe <axboe@kernel.dk> writes:
>>>
>>>> Add support for FUTEX_WAKE/WAIT primitives.
>>>
>>> This is great.  I was so sure io_uring had this support already for some
>>> reason.  I might have dreamed it.
>>
>> I think you did :-)
> 
> Premonitory!  Still, there should be better things to dream about than
> kernel code.

I dunno, if it's io_uring related I'm a supporter.

>>> Even with an asynchronous model, it might make sense to halt execution
>>> of further queued operations until futex completes.  I think
>>> IOSQE_IO_DRAIN is a barrier only against the submission part, so it
>>> wouldn't help.  Is there a way to ensure this ordering?
>>
>> You'd use link for that - link whatever depends on the wake to the futex
>> wait. Or just queue it up once you reap the wait completion, when that
>> is posted because we got woken.
> 
> The challenge of linked requests, in my opinion, is that once a link
> chain starts, everything needs to be linked together, and a single error
> fails everything, which is ok when operations are related, but
> not so much when doing IO to different files from the same ring.

Not quite sure if you're misunderstanding links, or just have a
different use case in mind. You can certainly have several independent
chains of links.
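
For example, something along these lines submits two unrelated chains
in one go; 'ring', fd_a/fd_b and buf_a/buf_b are placeholders:

   struct io_uring_sqe *sqe;

   /* chain A: ordered read-then-write on fd_a */
   sqe = io_uring_get_sqe(&ring);
   io_uring_prep_read(sqe, fd_a, buf_a, sizeof(buf_a), 0);
   sqe->flags |= IOSQE_IO_LINK;
   sqe = io_uring_get_sqe(&ring);
   io_uring_prep_write(sqe, fd_a, buf_a, sizeof(buf_a), 0);   /* last link, no flag */

   /* chain B: a separate ordering domain; a failure in A does not touch it */
   sqe = io_uring_get_sqe(&ring);
   io_uring_prep_read(sqe, fd_b, buf_b, sizeof(buf_b), 0);
   sqe->flags |= IOSQE_IO_LINK;
   sqe = io_uring_get_sqe(&ring);
   io_uring_prep_write(sqe, fd_b, buf_b, sizeof(buf_b), 0);

   io_uring_submit(&ring);            /* only ordering within each chain is enforced */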

>>>> Cancelations are supported, both from the application point-of-view
>>>> and to be able to cancel pending waits if the ring exits before
>>>> all events have occurred.
>>>>
>>>> This is just the barebones wait/wake support. Features to be added
>>>> later:
>>>
>>> One item high on my wishlist would be the futexv semantics (wait on any
>>> of a set of futexes).  It cannot be implemented by issuing several
>>> FUTEX_WAIT.
>>
>> Yep, I do think that one is interesting enough to consider upfront.
>> Unfortunately the internal implementation of that does not look that
>> great, though I'm sure we can make that work. But would likely
>> require some futexv refactoring to make it work. I can take a look at
>> it.
> 
> No disagreement here.  To be fair, the main challenge was making the new
> interface compatible with a futex being waited on/waked with the
> original interface. At some point, we had a really nice design for a
> single object, but we spent two years bikeshedding over the interface
> and ended
> up merging something pretty much similar to the proposal from two years
> prior.

It turned out not to be too bad - here's a poc:

https://git.kernel.dk/cgit/linux/commit/?h=io_uring-futex&id=421b12df4ed0bb25c53afe496370bc2b70b04e15

needs a bit of splitting and cleaning, notably I think I need to redo
the futex_q->wake_data bit to make that cleaner with the current use
case and the async use case. With that, everything can just use
futex_queue() and the only difference really is that the sync variants
will do timer setup upfront and then sleep at the bottom, whereas the
async part just calls the meat of the function.

>> You could obviously do futexv with this patchset, just posting N futex
>> waits and canceling N-1 when you get woken by one. Though that's of
>> course not very pretty or nice to use, but design wise it would totally
>> work as you don't actually block on these with io_uring.
> 
> Yes, but at that point, I guess it'd make more sense to implement the
> same semantics by polling over a set of eventfds or having a single
> futex and doing dispatch in userspace.

Oh yeah, would not recommend the above approach. Just saying that you
COULD do that if you really wanted to, which is not something you could
do with futex before waitv. But kind of moot now that there's at least a
prototype.
Gabriel Krisman Bertazi June 13, 2023, 2:55 a.m. UTC | #5
Jens Axboe <axboe@kernel.dk> writes:

> On 6/12/23 5:00 PM, Gabriel Krisman Bertazi wrote:

>>>> Even with an asynchronous model, it might make sense to halt execution
>>>> of further queued operations until futex completes.  I think
>>>> IOSQE_IO_DRAIN is a barrier only against the submission part, so it
>>>> wouldn't help.  Is there a way to ensure this ordering?
>>>
>>> You'd use link for that - link whatever depends on the wake to the futex
>>> wait. Or just queue it up once you reap the wait completion, when that
>>> is posted because we got woken.
>> 
>> The challenge of linked requests, in my opinion, is that once a link
>> chain starts, everything needs to be linked together, and a single error
>> fails everything, which is ok when operations are related, but
>> not so much when doing IO to different files from the same ring.
>
> Not quite sure if you're misunderstanding links, or just have a
> different use case in mind. You can certainly have several independent
> chains of links.

I might be. Or my use case might be bogus. Please, correct me if either
is the case.

My understanding is that a link is a sequence of commands all carrying
the IOSQE_IO_LINK flag.  io_uring guarantees the ordering within the
link and, if a previous command fails, the rest of the link chain is
aborted.

But, if I have independent commands to submit in between, i.e. on a
different fd, I might want an intermediary operation to not be dependent
on the rest of the link without breaking the chain.  Most of the time I
know ahead of time the entire chain, and I can batch the operations
together.  But, I think it might be a problem specifically for some
unbounded commands like FUTEX_WAIT and recv.  I want a specific
operation to depend on a recv, but I might not be able to specify ahead
of time all of the dependent operations. I'd need to wait for a recv
command to complete and only then issue the dependency, to guarantee
ordering, or I make sure that everything I put on the ring in the
meantime is part of one big link submitted sequentially.

A related issue/example that comes to mind would be two recvs/sends
against the same socket.  When doing a syscall, I know the first recv
will return ahead of the second because it is, well, synchronous.  On
io_uring, I think it must be a link.  I might be recv'ing a huge stream
from the network, and I can't tell if the packet is done on a single
recv.  I could have to issue a second recv but I either make it linked
ahead of time, or I need to wait for the first recv to complete, to only
then submit the second one.  The problem is the ordering of recvs; from
my understanding of the code, I cannot assure the first recv will
complete before the second, without a link.
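
(For concreteness, the linked form of that would look roughly like the
sketch below, with 'ring', 'sock' and the buffers as placeholders; the
second recv is only started once the first completes:)

   struct io_uring_sqe *sqe;

   sqe = io_uring_get_sqe(&ring);
   io_uring_prep_recv(sqe, sock, buf1, sizeof(buf1), 0);
   sqe->flags |= IOSQE_IO_LINK;       /* order the second recv behind this one */
   sqe->user_data = 1;

   sqe = io_uring_get_sqe(&ring);
   io_uring_prep_recv(sqe, sock, buf2, sizeof(buf2), 0);
   sqe->user_data = 2;

   io_uring_submit(&ring);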

Sorry if I'm wrong and there are ways around it, but it is a point I'm
struggling with at the moment when using io_uring.
Andres Freund June 23, 2023, 7:04 p.m. UTC | #6
Hi,

I'd been chatting with Jens about this, so obviously I'm interested in the
feature...

On 2023-06-09 12:31:24 -0600, Jens Axboe wrote:
> Add support for FUTEX_WAKE/WAIT primitives.
> 
> IORING_OP_FUTEX_WAKE is a mix of FUTEX_WAKE and FUTEX_WAKE_BITSET, as
> it does support passing in a bitset.
> 
> Similarly, IORING_OP_FUTEX_WAIT is a mix of FUTEX_WAIT and
> FUTEX_WAIT_BITSET.

One thing I was wondering about is what happens when there are multiple
OP_FUTEX_WAITs queued for the same futex, and that futex gets woken up. I
don't really have an opinion about what would be best, just that it'd be
helpful to specify the behaviour.

Greetings,

Andres Freund
Jens Axboe June 23, 2023, 7:07 p.m. UTC | #7
On 6/23/23 1:04 PM, Andres Freund wrote:
> Hi,
> 
> I'd been chatting with Jens about this, so obviously I'm interested in the
> feature...
> 
> On 2023-06-09 12:31:24 -0600, Jens Axboe wrote:
>> Add support for FUTEX_WAKE/WAIT primitives.
>>
>> IORING_OP_FUTEX_WAKE is a mix of FUTEX_WAKE and FUTEX_WAKE_BITSET, as
>> it does support passing in a bitset.
>>
>> Similarly, IORING_OP_FUTEX_WAIT is a mix of FUTEX_WAIT and
>> FUTEX_WAIT_BITSET.
> 
> One thing I was wondering about is what happens when there are multiple
> OP_FUTEX_WAITs queued for the same futex, and that futex gets woken up. I
> don't really have an opinion about what would be best, just that it'd be
> helpful to specify the behaviour.

Not sure I follow the question, can you elaborate?

If you have N futex waits on the same futex and someone does a wake
(with wakenum >= N), then they'd all wake and post a CQE. If less are
woken because the caller asked for less than N, then that number should
be woken.

IOW, should have the same semantics as "normal" futex waits.

Or maybe I'm totally missing what is being asked here...
Andres Freund June 23, 2023, 7:34 p.m. UTC | #8
Hi,

On 2023-06-23 13:07:12 -0600, Jens Axboe wrote:
> On 6/23/23 1:04 PM, Andres Freund wrote:
> > Hi,
> >
> > I'd been chatting with Jens about this, so obviously I'm interested in the
> > feature...
> >
> > On 2023-06-09 12:31:24 -0600, Jens Axboe wrote:
> >> Add support for FUTEX_WAKE/WAIT primitives.
> >>
>>> IORING_OP_FUTEX_WAKE is a mix of FUTEX_WAKE and FUTEX_WAKE_BITSET, as
> >> it does support passing in a bitset.
> >>
>>> Similarly, IORING_OP_FUTEX_WAIT is a mix of FUTEX_WAIT and
> >> FUTEX_WAIT_BITSET.
> >
> > One thing I was wondering about is what happens when there are multiple
> > OP_FUTEX_WAITs queued for the same futex, and that futex gets woken up. I
> > don't really have an opinion about what would be best, just that it'd be
> > helpful to specify the behaviour.
>
> Not sure I follow the question, can you elaborate?
>
> If you have N futex waits on the same futex and someone does a wake
> (with wakenum >= N), then they'd all wake and post a CQE. If less are
> woken because the caller asked for less than N, then that number should
> be woken.
>
> IOW, should have the same semantics as "normal" futex waits.

With a normal futex wait you can't wait multiple times on the same futex in
one thread. But with the proposed io_uring interface, one can.

Basically, what is the defined behaviour for:

   sqe = io_uring_get_sqe(ring);
   io_uring_prep_futex_wait(sqe, futex, 0, FUTEX_BITSET_MATCH_ANY);

   sqe = io_uring_get_sqe(ring);
   io_uring_prep_futex_wait(sqe, futex, 0, FUTEX_BITSET_MATCH_ANY);

   io_uring_submit(ring)

when someone does:
   futex(FUTEX_WAKE, futex, 1, 0, 0, 0);
   or
   futex(FUTEX_WAKE, futex, INT_MAX, 0, 0, 0);

or the equivalent io_uring operation.

Is it an error? Will there always be two cqes queued? Will it depend on the
number of wakeups specified by the waker?  I'd assume the latter, but it'd be
good to specify that.

Greetings,

Andres Freund
Jens Axboe June 23, 2023, 7:46 p.m. UTC | #9
On 6/23/23 1:34 PM, Andres Freund wrote:
> Hi,
> 
> On 2023-06-23 13:07:12 -0600, Jens Axboe wrote:
>> On 6/23/23 1:04 PM, Andres Freund wrote:
>>> Hi,
>>>
>>> I'd been chatting with Jens about this, so obviously I'm interested in the
>>> feature...
>>>
>>> On 2023-06-09 12:31:24 -0600, Jens Axboe wrote:
>>>> Add support for FUTEX_WAKE/WAIT primitives.
>>>>
>>>> IORING_OP_FUTEX_WAKE is a mix of FUTEX_WAKE and FUTEX_WAKE_BITSET, as
>>>> it does support passing in a bitset.
>>>>
>>>> Similarly, IORING_OP_FUTEX_WAIT is a mix of FUTEX_WAIT and
>>>> FUTEX_WAIT_BITSET.
>>>
>>> One thing I was wondering about is what happens when there are multiple
>>> OP_FUTEX_WAITs queued for the same futex, and that futex gets woken up. I
>>> don't really have an opinion about what would be best, just that it'd be
>>> helpful to specify the behaviour.
>>
>> Not sure I follow the question, can you elaborate?
>>
>> If you have N futex waits on the same futex and someone does a wake
>> (with wakenum >= N), then they'd all wake and post a CQE. If less are
>> woken because the caller asked for less than N, then that number should
>> be woken.
>>
>> IOW, should have the same semantics as "normal" futex waits.
> 
> With a normal futex wait you can't wait multiple times on the same futex in
> one thread. But with the proposed io_uring interface, one can.

Right, but you could have N threads waiting on the same futex.

> Basically, what is the defined behaviour for:
> 
>    sqe = io_uring_get_sqe(ring);
>    io_uring_prep_futex_wait(sqe, futex, 0, FUTEX_BITSET_MATCH_ANY);
> 
>    sqe = io_uring_get_sqe(ring);
>    io_uring_prep_futex_wait(sqe, futex, 0, FUTEX_BITSET_MATCH_ANY);
> 
>    io_uring_submit(ring)
> 
> when someone does:
>    futex(FUTEX_WAKE, futex, 1, 0, 0, 0);
>    or
>    futex(FUTEX_WAKE, futex, INT_MAX, 0, 0, 0);
> 
> or the equivalent io_uring operation.

Waking with num=1 should wake just one of them; which one is really down
to the futex ordering, which depends on task priority (the same here)
and is FIFO after that. So the first wake should wake the first sqe
queued.

Second one will wake all of them, in that order.

I'll put that in the test case.

> Is it an error? Will there always be two cqes queued? Will it depend
> on the number of wakeups specified by the waker?  I'd assume the
> latter, but it'd be good to specify that.

It's not an error; I would not want to police that. It will purely
depend on the number of wakes specified by the wake operation. If it's
1, one will be triggered. If it's INT_MAX, then both of them will
trigger. First case will generate one CQE, second one will generate both
CQEs.
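
In other words, continuing the two-wait example from above (sketch;
assumes the usual syscall/futex/limits headers, with 'ring' and 'futex'
as before):

   struct io_uring_cqe *cqe;

   syscall(SYS_futex, futex, FUTEX_WAKE, 1, NULL, NULL, 0);
   io_uring_wait_cqe(ring, &cqe);          /* exactly one wait completes, res == 0 */
   io_uring_cqe_seen(ring, cqe);

   syscall(SYS_futex, futex, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
   io_uring_wait_cqe(ring, &cqe);          /* the remaining wait completes now */
   io_uring_cqe_seen(ring, cqe);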

No documentation has been written for the io_uring bits yet. But the
current branch is ready for wider posting, so I should get that written
up too...

Patch

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index f04ce513fadb..d796b578c129 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -273,6 +273,8 @@  struct io_ring_ctx {
 	struct io_wq_work_list	locked_free_list;
 	unsigned int		locked_free_nr;
 
+	struct hlist_head	futex_list;
+
 	const struct cred	*sq_creds;	/* cred used for __io_sq_thread() */
 	struct io_sq_data	*sq_data;	/* if using sq thread polling */
 
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index f222d263bc55..b1a151ab8150 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -65,6 +65,7 @@  struct io_uring_sqe {
 		__u32		xattr_flags;
 		__u32		msg_ring_flags;
 		__u32		uring_cmd_flags;
+		__u32		futex_flags;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
 	/* pack this to avoid bogus arm OABI complaints */
@@ -235,6 +236,8 @@  enum io_uring_op {
 	IORING_OP_URING_CMD,
 	IORING_OP_SEND_ZC,
 	IORING_OP_SENDMSG_ZC,
+	IORING_OP_FUTEX_WAIT,
+	IORING_OP_FUTEX_WAKE,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
diff --git a/io_uring/Makefile b/io_uring/Makefile
index 8cc8e5387a75..2e4779bc550c 100644
--- a/io_uring/Makefile
+++ b/io_uring/Makefile
@@ -7,5 +7,7 @@  obj-$(CONFIG_IO_URING)		+= io_uring.o xattr.o nop.o fs.o splice.o \
 					openclose.o uring_cmd.o epoll.o \
 					statx.o net.o msg_ring.o timeout.o \
 					sqpoll.o fdinfo.o tctx.o poll.o \
-					cancel.o kbuf.o rsrc.o rw.o opdef.o notif.o
+					cancel.o kbuf.o rsrc.o rw.o opdef.o \
+					notif.o
 obj-$(CONFIG_IO_WQ)		+= io-wq.o
+obj-$(CONFIG_FUTEX)		+= futex.o
diff --git a/io_uring/cancel.c b/io_uring/cancel.c
index b4f5dfacc0c3..280fb83145d3 100644
--- a/io_uring/cancel.c
+++ b/io_uring/cancel.c
@@ -15,6 +15,7 @@ 
 #include "tctx.h"
 #include "poll.h"
 #include "timeout.h"
+#include "futex.h"
 #include "cancel.h"
 
 struct io_cancel {
@@ -98,6 +99,10 @@  int io_try_cancel(struct io_uring_task *tctx, struct io_cancel_data *cd,
 	if (ret != -ENOENT)
 		return ret;
 
+	ret = io_futex_cancel(ctx, cd, issue_flags);
+	if (ret != -ENOENT)
+		return ret;
+
 	spin_lock(&ctx->completion_lock);
 	if (!(cd->flags & IORING_ASYNC_CANCEL_FD))
 		ret = io_timeout_cancel(ctx, cd);
diff --git a/io_uring/cancel.h b/io_uring/cancel.h
index 6a59ee484d0c..6a2a38df7159 100644
--- a/io_uring/cancel.h
+++ b/io_uring/cancel.h
@@ -1,4 +1,6 @@ 
 // SPDX-License-Identifier: GPL-2.0
+#ifndef IORING_CANCEL_H
+#define IORING_CANCEL_H
 
 #include <linux/io_uring_types.h>
 
@@ -21,3 +23,5 @@  int io_try_cancel(struct io_uring_task *tctx, struct io_cancel_data *cd,
 void init_hash_table(struct io_hash_table *table, unsigned size);
 
 int io_sync_cancel(struct io_ring_ctx *ctx, void __user *arg);
+
+#endif
diff --git a/io_uring/futex.c b/io_uring/futex.c
new file mode 100644
index 000000000000..a1d50145927a
--- /dev/null
+++ b/io_uring/futex.c
@@ -0,0 +1,194 @@ 
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/io_uring.h>
+
+#include <uapi/linux/io_uring.h>
+
+#include "../kernel/futex/futex.h"
+#include "io_uring.h"
+#include "futex.h"
+
+struct io_futex {
+	struct file	*file;
+	u32 __user	*uaddr;
+	int		futex_op;
+	unsigned int	futex_val;
+	unsigned int	futex_flags;
+	unsigned int	futex_mask;
+	bool		has_timeout;
+	ktime_t		timeout;
+};
+
+static void io_futex_complete(struct io_kiocb *req, struct io_tw_state *ts)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+
+	kfree(req->async_data);
+	io_tw_lock(ctx, ts);
+	hlist_del_init(&req->hash_node);
+	io_req_task_complete(req, ts);
+}
+
+static bool __io_futex_cancel(struct io_ring_ctx *ctx, struct io_kiocb *req)
+{
+	struct futex_q *q = req->async_data;
+
+	/* futex wake already done or in progress */
+	if (!futex_unqueue(q))
+		return false;
+
+	hlist_del_init(&req->hash_node);
+	io_req_set_res(req, -ECANCELED, 0);
+	req->io_task_work.func = io_futex_complete;
+	io_req_task_work_add(req);
+	return true;
+}
+
+int io_futex_cancel(struct io_ring_ctx *ctx, struct io_cancel_data *cd,
+		    unsigned int issue_flags)
+{
+	struct hlist_node *tmp;
+	struct io_kiocb *req;
+	int nr = 0;
+
+	if (cd->flags & (IORING_ASYNC_CANCEL_FD|IORING_ASYNC_CANCEL_FD_FIXED))
+		return 0;
+
+	io_ring_submit_lock(ctx, issue_flags);
+	hlist_for_each_entry_safe(req, tmp, &ctx->futex_list, hash_node) {
+		if (req->cqe.user_data != cd->data &&
+		    !(cd->flags & IORING_ASYNC_CANCEL_ANY))
+			continue;
+		if (__io_futex_cancel(ctx, req))
+			nr++;
+		nr++;
+		if (!(cd->flags & IORING_ASYNC_CANCEL_ALL))
+			break;
+	}
+	io_ring_submit_unlock(ctx, issue_flags);
+
+	if (nr)
+		return nr;
+
+	return -ENOENT;
+}
+
+bool io_futex_remove_all(struct io_ring_ctx *ctx, struct task_struct *task,
+			 bool cancel_all)
+{
+	struct hlist_node *tmp;
+	struct io_kiocb *req;
+	bool found = false;
+
+	lockdep_assert_held(&ctx->uring_lock);
+
+	hlist_for_each_entry_safe(req, tmp, &ctx->futex_list, hash_node) {
+		if (!io_match_task_safe(req, task, cancel_all))
+			continue;
+		__io_futex_cancel(ctx, req);
+		found = true;
+	}
+
+	return found;
+}
+
+int io_futex_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_futex *iof = io_kiocb_to_cmd(req, struct io_futex);
+	struct __kernel_timespec __user *utime;
+	struct timespec64 t;
+
+	iof->futex_op = READ_ONCE(sqe->fd);
+	iof->uaddr = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	iof->futex_val = READ_ONCE(sqe->len);
+	iof->has_timeout = false;
+	iof->futex_mask = READ_ONCE(sqe->file_index);
+	utime = u64_to_user_ptr(READ_ONCE(sqe->addr2));
+	if (utime) {
+		if (get_timespec64(&t, utime))
+			return -EFAULT;
+		iof->timeout = timespec64_to_ktime(t);
+		iof->timeout = ktime_add_safe(ktime_get(), iof->timeout);
+		iof->has_timeout = true;
+	}
+	iof->futex_flags = READ_ONCE(sqe->futex_flags);
+	if (iof->futex_flags & FUTEX_CMD_MASK)
+		return -EINVAL;
+
+	return 0;
+}
+
+static void io_futex_wake_fn(struct wake_q_head *wake_q, struct futex_q *q)
+{
+	struct io_kiocb *req = q->wake_data;
+
+	__futex_unqueue(q);
+	smp_store_release(&q->lock_ptr, NULL);
+
+	io_req_set_res(req, 0, 0);
+	req->io_task_work.func = io_futex_complete;
+	io_req_task_work_add(req);
+}
+
+int io_futex_wait(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_futex *iof = io_kiocb_to_cmd(req, struct io_futex);
+	struct io_ring_ctx *ctx = req->ctx;
+	unsigned int flags = 0;
+	struct futex_q *q;
+	int ret;
+
+	if (!futex_op_to_flags(FUTEX_WAIT, iof->futex_flags, &flags)) {
+		ret = -ENOSYS;
+		goto done;
+	}
+
+	q = kmalloc(sizeof(*q), GFP_NOWAIT);
+	if (!q) {
+		ret = -ENOMEM;
+		goto done;
+	}
+
+	req->async_data = q;
+	*q = futex_q_init;
+	q->bitset = iof->futex_mask;
+	q->wake = io_futex_wake_fn;
+	q->wake_data = req;
+
+	io_ring_submit_lock(ctx, issue_flags);
+	hlist_add_head(&req->hash_node, &ctx->futex_list);
+	io_ring_submit_unlock(ctx, issue_flags);
+
+	ret = futex_queue_wait(q, iof->uaddr, flags, iof->futex_val);
+	if (ret)
+		goto done;
+
+	return IOU_ISSUE_SKIP_COMPLETE;
+done:
+	if (ret < 0)
+		req_set_fail(req);
+	io_req_set_res(req, ret, 0);
+	return IOU_OK;
+}
+
+int io_futex_wake(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_futex *iof = io_kiocb_to_cmd(req, struct io_futex);
+	unsigned int flags = 0;
+	int ret;
+
+	if (!futex_op_to_flags(FUTEX_WAKE, iof->futex_flags, &flags)) {
+		ret = -ENOSYS;
+		goto done;
+	}
+
+	ret = futex_wake(iof->uaddr, flags, iof->futex_val, iof->futex_mask);
+done:
+	if (ret < 0)
+		req_set_fail(req);
+	io_req_set_res(req, ret, 0);
+	return IOU_OK;
+}
diff --git a/io_uring/futex.h b/io_uring/futex.h
new file mode 100644
index 000000000000..16add2c069cc
--- /dev/null
+++ b/io_uring/futex.h
@@ -0,0 +1,26 @@ 
+// SPDX-License-Identifier: GPL-2.0
+
+#include "cancel.h"
+
+int io_futex_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_futex_wait(struct io_kiocb *req, unsigned int issue_flags);
+int io_futex_wake(struct io_kiocb *req, unsigned int issue_flags);
+
+#if defined(CONFIG_FUTEX)
+int io_futex_cancel(struct io_ring_ctx *ctx, struct io_cancel_data *cd,
+		    unsigned int issue_flags);
+bool io_futex_remove_all(struct io_ring_ctx *ctx, struct task_struct *task,
+			 bool cancel_all);
+#else
+static inline int io_futex_cancel(struct io_ring_ctx *ctx,
+				  struct io_cancel_data *cd,
+				  unsigned int issue_flags)
+{
+	return 0;
+}
+static inline bool io_futex_remove_all(struct io_ring_ctx *ctx,
+				       struct task_struct *task, bool cancel_all)
+{
+	return false;
+}
+#endif
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index a467064da1af..8270f37c312d 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -92,6 +92,7 @@ 
 #include "cancel.h"
 #include "net.h"
 #include "notif.h"
+#include "futex.h"
 
 #include "timeout.h"
 #include "poll.h"
@@ -336,6 +337,7 @@  static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	INIT_LIST_HEAD(&ctx->tctx_list);
 	ctx->submit_state.free_list.next = NULL;
 	INIT_WQ_LIST(&ctx->locked_free_list);
+	INIT_HLIST_HEAD(&ctx->futex_list);
 	INIT_DELAYED_WORK(&ctx->fallback_work, io_fallback_req_func);
 	INIT_WQ_LIST(&ctx->submit_state.compl_reqs);
 	return ctx;
@@ -3309,6 +3311,7 @@  static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 	ret |= io_cancel_defer_files(ctx, task, cancel_all);
 	mutex_lock(&ctx->uring_lock);
 	ret |= io_poll_remove_all(ctx, task, cancel_all);
+	ret |= io_futex_remove_all(ctx, task, cancel_all);
 	mutex_unlock(&ctx->uring_lock);
 	ret |= io_kill_timeouts(ctx, task, cancel_all);
 	if (task)
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 3b9c6489b8b6..e6b03d7f82e5 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -33,6 +33,7 @@ 
 #include "poll.h"
 #include "cancel.h"
 #include "rw.h"
+#include "futex.h"
 
 static int io_no_issue(struct io_kiocb *req, unsigned int issue_flags)
 {
@@ -426,11 +427,26 @@  const struct io_issue_def io_issue_defs[] = {
 		.issue			= io_sendmsg_zc,
 #else
 		.prep			= io_eopnotsupp_prep,
+#endif
+	},
+	[IORING_OP_FUTEX_WAIT] = {
+#if defined(CONFIG_FUTEX)
+		.prep			= io_futex_prep,
+		.issue			= io_futex_wait,
+#else
+		.prep			= io_eopnotsupp_prep,
+#endif
+	},
+	[IORING_OP_FUTEX_WAKE] = {
+#if defined(CONFIG_FUTEX)
+		.prep			= io_futex_prep,
+		.issue			= io_futex_wake,
+#else
+		.prep			= io_eopnotsupp_prep,
 #endif
 	},
 };
 
-
 const struct io_cold_def io_cold_defs[] = {
 	[IORING_OP_NOP] = {
 		.name			= "NOP",
@@ -648,6 +664,13 @@  const struct io_cold_def io_cold_defs[] = {
 		.fail			= io_sendrecv_fail,
 #endif
 	},
+	[IORING_OP_FUTEX_WAIT] = {
+		.name			= "FUTEX_WAIT",
+	},
+
+	[IORING_OP_FUTEX_WAKE] = {
+		.name			= "FUTEX_WAKE",
+	},
 };
 
 const char *io_uring_get_opcode(u8 opcode)