Message ID | 20230609183125.673140-6-axboe@kernel.dk (mailing list archive)
---|---
State | New
Series | Add io_uring support for futex wait/wake
Jens Axboe <axboe@kernel.dk> writes:

> Add support for FUTEX_WAKE/WAIT primitives.

This is great. I was so sure io_uring had this support already for some
reason. I might have dreamed it.

The semantics are tricky, though. You might want to CC peterZ and tglx
directly.

> IORING_OP_FUTEX_WAKE is a mix of FUTEX_WAKE and FUTEX_WAKE_BITSET, as
> it does support passing in a bitset.

As far as I know, the _BITSET variants are not commonly used in the
current interface. I haven't seen any code that really benefits from
them.

> Similarly, IORING_OP_FUTEX_WAIT is a mix of FUTEX_WAIT and
> FUTEX_WAIT_BITSET.

But it is definitely safe to have a single one, basically with the
_BITSET semantics.

> FUTEX_WAKE is straightforward, as we can always just do those inline.
> FUTEX_WAIT will queue the futex with an appropriate callback, and
> that callback will in turn post a CQE when it has triggered.

Even with an asynchronous model, it might make sense to halt execution
of further queued operations until the futex completes. I think
IOSQE_IO_DRAIN is a barrier only against the submission part, so it
wouldn't help. Is there a way to ensure this ordering?

I know it goes against the asynchronous nature of io_uring, but I think
it might be a valid use case. Say we extend FUTEX_WAIT with a way to
acquire the futex in kernel space. Then, when the CQE returns, we know
the lock is acquired. If we can queue dependencies on that (stronger
than the link semantics), we could queue operations to be executed once
the lock is taken. Makes sense?

> Cancelations are supported, both from the application point-of-view,
> but also to be able to cancel pending waits if the ring exits before
> all events have occurred.
>
> This is just the barebones wait/wake support. Features to be added
> later:

One item high on my wishlist would be the futexv semantics (wait on any
of a set of futexes). It cannot be implemented by issuing several
FUTEX_WAITs.
On 6/12/23 10:06 AM, Gabriel Krisman Bertazi wrote:
> Jens Axboe <axboe@kernel.dk> writes:
>
>> Add support for FUTEX_WAKE/WAIT primitives.
>
> This is great. I was so sure io_uring had this support already for some
> reason. I might have dreamed it.

I think you did :-)

> The semantics are tricky, though. You might want to CC peterZ and tglx
> directly.

For sure, I'll take it wider soon enough. Just wanted to iron out the
io_uring details first.

>> IORING_OP_FUTEX_WAKE is a mix of FUTEX_WAKE and FUTEX_WAKE_BITSET, as
>> it does support passing in a bitset.
>
> As far as I know, the _BITSET variants are not commonly used in the
> current interface. I haven't seen any code that really benefits from
> them.

Since FUTEX_WAKE is a strict subset of FUTEX_WAKE_BITSET, it makes
little sense not to just support both, imho.

>> Similarly, IORING_OP_FUTEX_WAIT is a mix of FUTEX_WAIT and
>> FUTEX_WAIT_BITSET.
>
> But it is definitely safe to have a single one, basically with the
> _BITSET semantics.

Yep, I think so.

>> FUTEX_WAKE is straightforward, as we can always just do those inline.
>> FUTEX_WAIT will queue the futex with an appropriate callback, and
>> that callback will in turn post a CQE when it has triggered.
>
> Even with an asynchronous model, it might make sense to halt execution
> of further queued operations until the futex completes. I think
> IOSQE_IO_DRAIN is a barrier only against the submission part, so it
> wouldn't help. Is there a way to ensure this ordering?

You'd use a link for that - link whatever depends on the wake to the
futex wait. Or just queue it up once you reap the wait completion, when
that is posted because we got woken.

> I know it goes against the asynchronous nature of io_uring, but I think
> it might be a valid use case. Say we extend FUTEX_WAIT with a way to
> acquire the futex in kernel space. Then, when the CQE returns, we know
> the lock is acquired. If we can queue dependencies on that (stronger
> than the link semantics), we could queue operations to be executed once
> the lock is taken. Makes sense?

It does, and acquiring it _may_ make sense indeed. But I'd rather punt
that to a later thing, and focus on getting the standard (and smaller)
primitives done first.

>> Cancelations are supported, both from the application point-of-view,
>> but also to be able to cancel pending waits if the ring exits before
>> all events have occurred.
>>
>> This is just the barebones wait/wake support. Features to be added
>> later:
>
> One item high on my wishlist would be the futexv semantics (wait on any
> of a set of futexes). It cannot be implemented by issuing several
> FUTEX_WAITs.

Yep, I do think that one is interesting enough to consider upfront.
Unfortunately the internal implementation of that does not look that
great, though I'm sure we can make that work. But it would likely
require some futexv refactoring to make it work. I can take a look at
it.

You could obviously do futexv with this patchset, just posting N futex
waits and canceling N-1 when you get woken by one. Though that's of
course not very pretty or nice to use, but design wise it would totally
work, as you don't actually block on these with io_uring.
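To make the link suggestion above concrete, here is a minimal sketch (not
from the thread) of ordering a write after a futex wait. It assumes the
io_uring_prep_futex_wait() helper with the signature Andres uses later in
this thread; ring, futex, fd, buf and buf_len are placeholders.

struct io_uring_sqe *sqe;

/* wait for the futex to be woken; the linked write below is not
 * started until this wait has completed */
sqe = io_uring_get_sqe(ring);
io_uring_prep_futex_wait(sqe, futex, 0, FUTEX_BITSET_MATCH_ANY);
sqe->flags |= IOSQE_IO_LINK;
io_uring_sqe_set_data64(sqe, 1);

/* dependent operation: only runs once the wait has completed */
sqe = io_uring_get_sqe(ring);
io_uring_prep_write(sqe, fd, buf, buf_len, 0);
io_uring_sqe_set_data64(sqe, 2);

io_uring_submit(ring);

The alternative Jens mentions, reaping the wait CQE first and only then
queueing the dependent operation, avoids the link entirely at the cost of
an extra submission round trip.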
Jens Axboe <axboe@kernel.dk> writes:

> On 6/12/23 10:06 AM, Gabriel Krisman Bertazi wrote:
>> Jens Axboe <axboe@kernel.dk> writes:
>>
>>> Add support for FUTEX_WAKE/WAIT primitives.
>>
>> This is great. I was so sure io_uring had this support already for some
>> reason. I might have dreamed it.
>
> I think you did :-)

Premonitory! Still, there should be better things to dream about than
kernel code.

>> Even with an asynchronous model, it might make sense to halt execution
>> of further queued operations until the futex completes. I think
>> IOSQE_IO_DRAIN is a barrier only against the submission part, so it
>> wouldn't help. Is there a way to ensure this ordering?
>
> You'd use a link for that - link whatever depends on the wake to the
> futex wait. Or just queue it up once you reap the wait completion, when
> that is posted because we got woken.

The challenge of linked requests, in my opinion, is that once a link
chain starts, everything needs to be linked together, and a single error
fails everything. That is fine when the operations are related, but not
so much when doing IO to different files from the same ring.

>>> Cancelations are supported, both from the application point-of-view,
>>> but also to be able to cancel pending waits if the ring exits before
>>> all events have occurred.
>>>
>>> This is just the barebones wait/wake support. Features to be added
>>> later:
>>
>> One item high on my wishlist would be the futexv semantics (wait on any
>> of a set of futexes). It cannot be implemented by issuing several
>> FUTEX_WAITs.
>
> Yep, I do think that one is interesting enough to consider upfront.
> Unfortunately the internal implementation of that does not look that
> great, though I'm sure we can make that work. But it would likely
> require some futexv refactoring to make it work. I can take a look at
> it.

No disagreement here. To be fair, the main challenge was making the new
interface compatible with a futex being waited on/woken through the
original interface. At some point we had a really nice design for a
single object, but we spent two years bikeshedding over the interface
and ended up merging something pretty close to the proposal from two
years prior.

> You could obviously do futexv with this patchset, just posting N futex
> waits and canceling N-1 when you get woken by one. Though that's of
> course not very pretty or nice to use, but design wise it would totally
> work, as you don't actually block on these with io_uring.

Yes, but at that point I guess it'd make more sense to implement the
same semantics by polling over a set of eventfds, or by having a single
futex and doing the dispatch in userspace.

thanks,
On 6/12/23 5:00 PM, Gabriel Krisman Bertazi wrote:
> Jens Axboe <axboe@kernel.dk> writes:
>
>> On 6/12/23 10:06 AM, Gabriel Krisman Bertazi wrote:
>>> Jens Axboe <axboe@kernel.dk> writes:
>>>
>>>> Add support for FUTEX_WAKE/WAIT primitives.
>>>
>>> This is great. I was so sure io_uring had this support already for some
>>> reason. I might have dreamed it.
>>
>> I think you did :-)
>
> Premonitory! Still, there should be better things to dream about than
> kernel code.

I dunno, if it's io_uring related I'm a supporter.

>>> Even with an asynchronous model, it might make sense to halt execution
>>> of further queued operations until the futex completes. I think
>>> IOSQE_IO_DRAIN is a barrier only against the submission part, so it
>>> wouldn't help. Is there a way to ensure this ordering?
>>
>> You'd use a link for that - link whatever depends on the wake to the
>> futex wait. Or just queue it up once you reap the wait completion, when
>> that is posted because we got woken.
>
> The challenge of linked requests, in my opinion, is that once a link
> chain starts, everything needs to be linked together, and a single error
> fails everything. That is fine when the operations are related, but not
> so much when doing IO to different files from the same ring.

Not quite sure if you're misunderstanding links, or just have a
different use case in mind. You can certainly have several independent
chains of links.

>>>> Cancelations are supported, both from the application point-of-view,
>>>> but also to be able to cancel pending waits if the ring exits before
>>>> all events have occurred.
>>>>
>>>> This is just the barebones wait/wake support. Features to be added
>>>> later:
>>>
>>> One item high on my wishlist would be the futexv semantics (wait on any
>>> of a set of futexes). It cannot be implemented by issuing several
>>> FUTEX_WAITs.
>>
>> Yep, I do think that one is interesting enough to consider upfront.
>> Unfortunately the internal implementation of that does not look that
>> great, though I'm sure we can make that work. But it would likely
>> require some futexv refactoring to make it work. I can take a look at
>> it.
>
> No disagreement here. To be fair, the main challenge was making the new
> interface compatible with a futex being waited on/woken through the
> original interface. At some point we had a really nice design for a
> single object, but we spent two years bikeshedding over the interface
> and ended up merging something pretty close to the proposal from two
> years prior.

It turned out not to be too bad - here's a PoC:

https://git.kernel.dk/cgit/linux/commit/?h=io_uring-futex&id=421b12df4ed0bb25c53afe496370bc2b70b04e15

It needs a bit of splitting and cleaning; notably, I think I need to
redo the futex_q->wake_data bit to make that cleaner with the current
use case and the async use case. With that, everything can just use
futex_queue(), and the only difference really is that the sync variants
will do timer setup upfront and then sleep at the bottom, where the
async part just calls the meat of the function.

>> You could obviously do futexv with this patchset, just posting N futex
>> waits and canceling N-1 when you get woken by one. Though that's of
>> course not very pretty or nice to use, but design wise it would totally
>> work, as you don't actually block on these with io_uring.
>
> Yes, but at that point I guess it'd make more sense to implement the
> same semantics by polling over a set of eventfds, or by having a single
> futex and doing the dispatch in userspace.
Oh yeah, would not recommend the above approach. Just saying that you COULD do that if you really wanted to, which is not something you could do with futex before waitv. But kind of moot now that there's at least a prototype.
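For illustration only, a sketch of what the "post N waits, cancel N-1"
emulation described (and advised against) above might look like. It
assumes the io_uring_prep_futex_wait() helper used later in this thread;
ring, futexes[] and nr_futexes are placeholders, and the canceled waits
would still post -ECANCELED CQEs that the application has to reap.

struct io_uring_sqe *sqe;
struct io_uring_cqe *cqe;

/* queue one wait per futex, tagged with its index */
for (unsigned i = 0; i < nr_futexes; i++) {
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_futex_wait(sqe, futexes[i], 0, FUTEX_BITSET_MATCH_ANY);
	io_uring_sqe_set_data64(sqe, i);
}
io_uring_submit(ring);

/* block until one of them is woken */
io_uring_wait_cqe(ring, &cqe);
uint64_t woken = cqe->user_data;
io_uring_cqe_seen(ring, cqe);

/* cancel the remaining N-1 waits by user_data */
for (uint64_t i = 0; i < nr_futexes; i++) {
	if (i == woken)
		continue;
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_cancel64(sqe, i, 0);
}
io_uring_submit(ring);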
Jens Axboe <axboe@kernel.dk> writes:

> On 6/12/23 5:00 PM, Gabriel Krisman Bertazi wrote:
>>>> Even with an asynchronous model, it might make sense to halt execution
>>>> of further queued operations until the futex completes. I think
>>>> IOSQE_IO_DRAIN is a barrier only against the submission part, so it
>>>> wouldn't help. Is there a way to ensure this ordering?
>>>
>>> You'd use a link for that - link whatever depends on the wake to the
>>> futex wait. Or just queue it up once you reap the wait completion, when
>>> that is posted because we got woken.
>>
>> The challenge of linked requests, in my opinion, is that once a link
>> chain starts, everything needs to be linked together, and a single error
>> fails everything. That is fine when the operations are related, but not
>> so much when doing IO to different files from the same ring.
>
> Not quite sure if you're misunderstanding links, or just have a
> different use case in mind. You can certainly have several independent
> chains of links.

I might be. Or my use case might be bogus. Please correct me if either
is the case.

My understanding is that a link is a sequence of commands all carrying
the IOSQE_IO_LINK flag. io_uring guarantees the ordering within the
link and, if a previous command fails, the rest of the link chain is
aborted. But if I have independent commands to submit in between, i.e.
on a different fd, I might want that intermediate operation not to
depend on the rest of the link, without breaking the chain.

Most of the time I know the entire chain ahead of time and I can batch
the operations together. But I think it might be a problem specifically
for some unbounded commands like FUTEX_WAIT and recv. I want a specific
operation to depend on a recv, but I might not be able to specify all of
the dependent operations ahead of time. I'd need to wait for the recv
command to complete and only then issue the dependency to guarantee
ordering, or make sure that everything I put on the ring in the meantime
is part of one big link submitted sequentially.

A related issue/example that comes to mind would be two recvs/sends
against the same socket. When doing a syscall, I know the first recv
will return ahead of the second because it is, well, synchronous. On
io_uring, I think it must be a link. I might be recv'ing a huge stream
from the network, and I can't tell whether the data is done in a single
recv. I might have to issue a second recv, but I either link it ahead of
time or wait for the first recv to complete before submitting the second
one. The problem is the ordering of the recvs; from my understanding of
the code, I cannot ensure the first recv will complete before the second
without a link.

Sorry if I'm wrong and there are ways around it, but it is a sticking
point for me at the moment with using io_uring.
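A sketch of the "independent chains" Jens refers to above, using stock
liburing calls: the two recvs on sock_fd form their own link and stay
ordered, while the read on file_fd is submitted alongside them without
joining the chain. ring, the fds and the buffers are placeholders.

struct io_uring_sqe *sqe;

/* chain: two ordered recvs on the same socket */
sqe = io_uring_get_sqe(ring);
io_uring_prep_recv(sqe, sock_fd, buf1, sizeof(buf1), 0);
sqe->flags |= IOSQE_IO_LINK;		/* recv #2 starts only after recv #1 */
io_uring_sqe_set_data64(sqe, 1);

sqe = io_uring_get_sqe(ring);
io_uring_prep_recv(sqe, sock_fd, buf2, sizeof(buf2), 0);	/* ends the chain */
io_uring_sqe_set_data64(sqe, 2);

/* independent request on another fd: not part of the chain, and not
 * affected if the recv chain fails */
sqe = io_uring_get_sqe(ring);
io_uring_prep_read(sqe, file_fd, buf3, sizeof(buf3), 0);
io_uring_sqe_set_data64(sqe, 3);

io_uring_submit(ring);

Note that this does not address the short-read concern in the message
above: a short first recv still completes successfully, so the linked
second recv will run regardless.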
Hi,

I'd been chatting with Jens about this, so obviously I'm interested in
the feature...

On 2023-06-09 12:31:24 -0600, Jens Axboe wrote:
> Add support for FUTEX_WAKE/WAIT primitives.
>
> IORING_OP_FUTEX_WAKE is a mix of FUTEX_WAKE and FUTEX_WAKE_BITSET, as
> it does support passing in a bitset.
>
> Similarly, IORING_OP_FUTEX_WAIT is a mix of FUTEX_WAIT and
> FUTEX_WAIT_BITSET.

One thing I was wondering about is what happens when there are multiple
OP_FUTEX_WAITs queued for the same futex, and that futex gets woken up.
I don't really have an opinion about what would be best, just that it'd
be helpful to specify the behaviour.

Greetings,

Andres Freund
On 6/23/23 1:04 PM, Andres Freund wrote:
> Hi,
>
> I'd been chatting with Jens about this, so obviously I'm interested in
> the feature...
>
> On 2023-06-09 12:31:24 -0600, Jens Axboe wrote:
>> Add support for FUTEX_WAKE/WAIT primitives.
>>
>> IORING_OP_FUTEX_WAKE is a mix of FUTEX_WAKE and FUTEX_WAKE_BITSET, as
>> it does support passing in a bitset.
>>
>> Similarly, IORING_OP_FUTEX_WAIT is a mix of FUTEX_WAIT and
>> FUTEX_WAIT_BITSET.
>
> One thing I was wondering about is what happens when there are multiple
> OP_FUTEX_WAITs queued for the same futex, and that futex gets woken up.
> I don't really have an opinion about what would be best, just that it'd
> be helpful to specify the behaviour.

Not sure I follow the question, can you elaborate?

If you have N futex waits on the same futex and someone does a wake
(with wakenum >= N), then they'd all wake and post a CQE. If fewer are
woken because the caller asked for fewer than N, then that number should
be woken.

IOW, should have the same semantics as "normal" futex waits. Or maybe
I'm totally missing what is being asked here...
Hi,

On 2023-06-23 13:07:12 -0600, Jens Axboe wrote:
> On 6/23/23 1:04 PM, Andres Freund wrote:
>> Hi,
>>
>> I'd been chatting with Jens about this, so obviously I'm interested in
>> the feature...
>>
>> On 2023-06-09 12:31:24 -0600, Jens Axboe wrote:
>>> Add support for FUTEX_WAKE/WAIT primitives.
>>>
>>> IORING_OP_FUTEX_WAKE is a mix of FUTEX_WAKE and FUTEX_WAKE_BITSET, as
>>> it does support passing in a bitset.
>>>
>>> Similarly, IORING_OP_FUTEX_WAIT is a mix of FUTEX_WAIT and
>>> FUTEX_WAIT_BITSET.
>>
>> One thing I was wondering about is what happens when there are multiple
>> OP_FUTEX_WAITs queued for the same futex, and that futex gets woken up.
>> I don't really have an opinion about what would be best, just that it'd
>> be helpful to specify the behaviour.
>
> Not sure I follow the question, can you elaborate?
>
> If you have N futex waits on the same futex and someone does a wake
> (with wakenum >= N), then they'd all wake and post a CQE. If fewer are
> woken because the caller asked for fewer than N, then that number should
> be woken.
>
> IOW, should have the same semantics as "normal" futex waits.

With a normal futex wait you can't wait multiple times on the same futex
in one thread. But with the proposed io_uring interface, one can.

Basically, what is the defined behaviour for:

sqe = io_uring_get_sqe(ring);
io_uring_prep_futex_wait(sqe, futex, 0, FUTEX_BITSET_MATCH_ANY);

sqe = io_uring_get_sqe(ring);
io_uring_prep_futex_wait(sqe, futex, 0, FUTEX_BITSET_MATCH_ANY);

io_uring_submit(ring);

when someone does:

futex(FUTEX_WAKE, futex, 1, 0, 0, 0);

or

futex(FUTEX_WAKE, futex, INT_MAX, 0, 0, 0);

or the equivalent io_uring operation.

Is it an error? Will there always be two cqes queued? Will it depend on
the number of wakeups specified by the waker? I'd assume the latter, but
it'd be good to specify that.

Greetings,

Andres Freund
On 6/23/23 1:34 PM, Andres Freund wrote:
> Hi,
>
> On 2023-06-23 13:07:12 -0600, Jens Axboe wrote:
>> On 6/23/23 1:04 PM, Andres Freund wrote:
>>> Hi,
>>>
>>> I'd been chatting with Jens about this, so obviously I'm interested in
>>> the feature...
>>>
>>> On 2023-06-09 12:31:24 -0600, Jens Axboe wrote:
>>>> Add support for FUTEX_WAKE/WAIT primitives.
>>>>
>>>> IORING_OP_FUTEX_WAKE is a mix of FUTEX_WAKE and FUTEX_WAKE_BITSET, as
>>>> it does support passing in a bitset.
>>>>
>>>> Similarly, IORING_OP_FUTEX_WAIT is a mix of FUTEX_WAIT and
>>>> FUTEX_WAIT_BITSET.
>>>
>>> One thing I was wondering about is what happens when there are multiple
>>> OP_FUTEX_WAITs queued for the same futex, and that futex gets woken up.
>>> I don't really have an opinion about what would be best, just that it'd
>>> be helpful to specify the behaviour.
>>
>> Not sure I follow the question, can you elaborate?
>>
>> If you have N futex waits on the same futex and someone does a wake
>> (with wakenum >= N), then they'd all wake and post a CQE. If fewer are
>> woken because the caller asked for fewer than N, then that number should
>> be woken.
>>
>> IOW, should have the same semantics as "normal" futex waits.
>
> With a normal futex wait you can't wait multiple times on the same futex
> in one thread. But with the proposed io_uring interface, one can.

Right, but you could have N threads waiting on the same futex.

> Basically, what is the defined behaviour for:
>
> sqe = io_uring_get_sqe(ring);
> io_uring_prep_futex_wait(sqe, futex, 0, FUTEX_BITSET_MATCH_ANY);
>
> sqe = io_uring_get_sqe(ring);
> io_uring_prep_futex_wait(sqe, futex, 0, FUTEX_BITSET_MATCH_ANY);
>
> io_uring_submit(ring);
>
> when someone does:
>
> futex(FUTEX_WAKE, futex, 1, 0, 0, 0);
>
> or
>
> futex(FUTEX_WAKE, futex, INT_MAX, 0, 0, 0);
>
> or the equivalent io_uring operation.

Waking with num=1 should wake just one of them; which one is really down
to the futex ordering, which depends on task priority (which would be
the same here) and on queue order after that. So the first one should
wake the first sqe queued. The second one will wake all of them, in that
order. I'll put that in the test case.

> Is it an error? Will there always be two cqes queued? Will it depend on
> the number of wakeups specified by the waker? I'd assume the latter, but
> it'd be good to specify that.

It's not an error, I would not want to police that. It will purely
depend on the number of wakes specified by the wake operation. If it's
1, one will be triggered. If it's INT_MAX, then both of them will
trigger. The first case will generate one CQE, the second will generate
both CQEs.

No documentation has been written for the io_uring bits yet. But the
current branch is ready for wider posting, so I should get that written
up too...
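A sketch of the waker side of those semantics from io_uring itself,
assuming a hypothetical io_uring_prep_futex_wake() helper that mirrors
the wait helper used above (the eventual liburing API may differ); with
the two waits from Andres' example already queued:

struct io_uring_sqe *sqe;

/* wake at most one waiter: exactly one of the two queued wait CQEs
 * above is posted */
sqe = io_uring_get_sqe(ring);
io_uring_prep_futex_wake(sqe, futex, 1, FUTEX_BITSET_MATCH_ANY);
io_uring_submit(ring);

/* wake everyone still queued: any remaining wait CQEs are posted */
sqe = io_uring_get_sqe(ring);
io_uring_prep_futex_wake(sqe, futex, INT_MAX, FUTEX_BITSET_MATCH_ANY);
io_uring_submit(ring);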
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index f04ce513fadb..d796b578c129 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -273,6 +273,8 @@ struct io_ring_ctx {
 	struct io_wq_work_list	locked_free_list;
 	unsigned int		locked_free_nr;
 
+	struct hlist_head	futex_list;
+
 	const struct cred	*sq_creds;	/* cred used for __io_sq_thread() */
 	struct io_sq_data	*sq_data;	/* if using sq thread polling */
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index f222d263bc55..b1a151ab8150 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -65,6 +65,7 @@ struct io_uring_sqe {
 		__u32		xattr_flags;
 		__u32		msg_ring_flags;
 		__u32		uring_cmd_flags;
+		__u32		futex_flags;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
 	/* pack this to avoid bogus arm OABI complaints */
@@ -235,6 +236,8 @@ enum io_uring_op {
 	IORING_OP_URING_CMD,
 	IORING_OP_SEND_ZC,
 	IORING_OP_SENDMSG_ZC,
+	IORING_OP_FUTEX_WAIT,
+	IORING_OP_FUTEX_WAKE,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
diff --git a/io_uring/Makefile b/io_uring/Makefile
index 8cc8e5387a75..2e4779bc550c 100644
--- a/io_uring/Makefile
+++ b/io_uring/Makefile
@@ -7,5 +7,7 @@ obj-$(CONFIG_IO_URING)		+= io_uring.o xattr.o nop.o fs.o splice.o \
 					openclose.o uring_cmd.o epoll.o \
 					statx.o net.o msg_ring.o timeout.o \
 					sqpoll.o fdinfo.o tctx.o poll.o \
-					cancel.o kbuf.o rsrc.o rw.o opdef.o notif.o
+					cancel.o kbuf.o rsrc.o rw.o opdef.o \
+					notif.o
 obj-$(CONFIG_IO_WQ)		+= io-wq.o
+obj-$(CONFIG_FUTEX)		+= futex.o
diff --git a/io_uring/cancel.c b/io_uring/cancel.c
index b4f5dfacc0c3..280fb83145d3 100644
--- a/io_uring/cancel.c
+++ b/io_uring/cancel.c
@@ -15,6 +15,7 @@
 #include "tctx.h"
 #include "poll.h"
 #include "timeout.h"
+#include "futex.h"
 #include "cancel.h"
 
 struct io_cancel {
@@ -98,6 +99,10 @@ int io_try_cancel(struct io_uring_task *tctx, struct io_cancel_data *cd,
 	if (ret != -ENOENT)
 		return ret;
 
+	ret = io_futex_cancel(ctx, cd, issue_flags);
+	if (ret != -ENOENT)
+		return ret;
+
 	spin_lock(&ctx->completion_lock);
 	if (!(cd->flags & IORING_ASYNC_CANCEL_FD))
 		ret = io_timeout_cancel(ctx, cd);
diff --git a/io_uring/cancel.h b/io_uring/cancel.h
index 6a59ee484d0c..6a2a38df7159 100644
--- a/io_uring/cancel.h
+++ b/io_uring/cancel.h
@@ -1,4 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
+#ifndef IORING_CANCEL_H
+#define IORING_CANCEL_H
 
 #include <linux/io_uring_types.h>
 
@@ -21,3 +23,5 @@ int io_try_cancel(struct io_uring_task *tctx, struct io_cancel_data *cd,
 void init_hash_table(struct io_hash_table *table, unsigned size);
 
 int io_sync_cancel(struct io_ring_ctx *ctx, void __user *arg);
+
+#endif
diff --git a/io_uring/futex.c b/io_uring/futex.c
new file mode 100644
index 000000000000..a1d50145927a
--- /dev/null
+++ b/io_uring/futex.c
@@ -0,0 +1,194 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/io_uring.h>
+
+#include <uapi/linux/io_uring.h>
+
+#include "../kernel/futex/futex.h"
+#include "io_uring.h"
+#include "futex.h"
+
+struct io_futex {
+	struct file	*file;
+	u32 __user	*uaddr;
+	int		futex_op;
+	unsigned int	futex_val;
+	unsigned int	futex_flags;
+	unsigned int	futex_mask;
+	bool		has_timeout;
+	ktime_t		timeout;
+};
+
+static void io_futex_complete(struct io_kiocb *req, struct io_tw_state *ts)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+
+	kfree(req->async_data);
+	io_tw_lock(ctx, ts);
+	hlist_del_init(&req->hash_node);
+	io_req_task_complete(req, ts);
+}
+
+static bool __io_futex_cancel(struct io_ring_ctx *ctx, struct io_kiocb *req)
+{
+	struct futex_q *q = req->async_data;
+
+	/* futex wake already done or in progress */
+	if (!futex_unqueue(q))
+		return false;
+
+	hlist_del_init(&req->hash_node);
+	io_req_set_res(req, -ECANCELED, 0);
+	req->io_task_work.func = io_futex_complete;
+	io_req_task_work_add(req);
+	return true;
+}
+
+int io_futex_cancel(struct io_ring_ctx *ctx, struct io_cancel_data *cd,
+		    unsigned int issue_flags)
+{
+	struct hlist_node *tmp;
+	struct io_kiocb *req;
+	int nr = 0;
+
+	if (cd->flags & (IORING_ASYNC_CANCEL_FD|IORING_ASYNC_CANCEL_FD_FIXED))
+		return 0;
+
+	io_ring_submit_lock(ctx, issue_flags);
+	hlist_for_each_entry_safe(req, tmp, &ctx->futex_list, hash_node) {
+		if (req->cqe.user_data != cd->data &&
+		    !(cd->flags & IORING_ASYNC_CANCEL_ANY))
+			continue;
+		if (__io_futex_cancel(ctx, req))
+			nr++;
+		if (!(cd->flags & IORING_ASYNC_CANCEL_ALL))
+			break;
+	}
+	io_ring_submit_unlock(ctx, issue_flags);
+
+	if (nr)
+		return nr;
+
+	return -ENOENT;
+}
+
+bool io_futex_remove_all(struct io_ring_ctx *ctx, struct task_struct *task,
+			 bool cancel_all)
+{
+	struct hlist_node *tmp;
+	struct io_kiocb *req;
+	bool found = false;
+
+	lockdep_assert_held(&ctx->uring_lock);
+
+	hlist_for_each_entry_safe(req, tmp, &ctx->futex_list, hash_node) {
+		if (!io_match_task_safe(req, task, cancel_all))
+			continue;
+		__io_futex_cancel(ctx, req);
+		found = true;
+	}
+
+	return found;
+}
+
+int io_futex_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_futex *iof = io_kiocb_to_cmd(req, struct io_futex);
+	struct __kernel_timespec __user *utime;
+	struct timespec64 t;
+
+	iof->futex_op = READ_ONCE(sqe->fd);
+	iof->uaddr = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	iof->futex_val = READ_ONCE(sqe->len);
+	iof->has_timeout = false;
+	iof->futex_mask = READ_ONCE(sqe->file_index);
+	utime = u64_to_user_ptr(READ_ONCE(sqe->addr2));
+	if (utime) {
+		if (get_timespec64(&t, utime))
+			return -EFAULT;
+		iof->timeout = timespec64_to_ktime(t);
+		iof->timeout = ktime_add_safe(ktime_get(), iof->timeout);
+		iof->has_timeout = true;
+	}
+	iof->futex_flags = READ_ONCE(sqe->futex_flags);
+	if (iof->futex_flags & FUTEX_CMD_MASK)
+		return -EINVAL;
+
+	return 0;
+}
+
+static void io_futex_wake_fn(struct wake_q_head *wake_q, struct futex_q *q)
+{
+	struct io_kiocb *req = q->wake_data;
+
+	__futex_unqueue(q);
+	smp_store_release(&q->lock_ptr, NULL);
+
+	io_req_set_res(req, 0, 0);
+	req->io_task_work.func = io_futex_complete;
+	io_req_task_work_add(req);
+}
+
+int io_futex_wait(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_futex *iof = io_kiocb_to_cmd(req, struct io_futex);
+	struct io_ring_ctx *ctx = req->ctx;
+	unsigned int flags = 0;
+	struct futex_q *q;
+	int ret;
+
+	if (!futex_op_to_flags(FUTEX_WAIT, iof->futex_flags, &flags)) {
+		ret = -ENOSYS;
+		goto done;
+	}
+
+	q = kmalloc(sizeof(*q), GFP_NOWAIT);
+	if (!q) {
+		ret = -ENOMEM;
+		goto done;
+	}
+
+	req->async_data = q;
+	*q = futex_q_init;
+	q->bitset = iof->futex_mask;
+	q->wake = io_futex_wake_fn;
+	q->wake_data = req;
+
+	io_ring_submit_lock(ctx, issue_flags);
+	hlist_add_head(&req->hash_node, &ctx->futex_list);
+	io_ring_submit_unlock(ctx, issue_flags);
+
+	ret = futex_queue_wait(q, iof->uaddr, flags, iof->futex_val);
+	if (ret)
+		goto done;
+
+	return IOU_ISSUE_SKIP_COMPLETE;
+done:
+	if (ret < 0)
+		req_set_fail(req);
+	io_req_set_res(req, ret, 0);
+	return IOU_OK;
+}
+
+int io_futex_wake(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_futex *iof = io_kiocb_to_cmd(req, struct io_futex);
+	unsigned int flags = 0;
+	int ret;
+
+	if (!futex_op_to_flags(FUTEX_WAKE, iof->futex_flags, &flags)) {
+		ret = -ENOSYS;
+		goto done;
+	}
+
+	ret = futex_wake(iof->uaddr, flags, iof->futex_val, iof->futex_mask);
+done:
+	if (ret < 0)
+		req_set_fail(req);
+	io_req_set_res(req, ret, 0);
+	return IOU_OK;
+}
diff --git a/io_uring/futex.h b/io_uring/futex.h
new file mode 100644
index 000000000000..16add2c069cc
--- /dev/null
+++ b/io_uring/futex.h
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "cancel.h"
+
+int io_futex_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_futex_wait(struct io_kiocb *req, unsigned int issue_flags);
+int io_futex_wake(struct io_kiocb *req, unsigned int issue_flags);
+
+#if defined(CONFIG_FUTEX)
+int io_futex_cancel(struct io_ring_ctx *ctx, struct io_cancel_data *cd,
+		    unsigned int issue_flags);
+bool io_futex_remove_all(struct io_ring_ctx *ctx, struct task_struct *task,
+			 bool cancel_all);
+#else
+static inline int io_futex_cancel(struct io_ring_ctx *ctx,
+				  struct io_cancel_data *cd,
+				  unsigned int issue_flags)
+{
+	return 0;
+}
+static inline bool io_futex_remove_all(struct io_ring_ctx *ctx,
+				       struct task_struct *task, bool cancel_all)
+{
+	return false;
+}
+#endif
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index a467064da1af..8270f37c312d 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -92,6 +92,7 @@
 #include "cancel.h"
 #include "net.h"
 #include "notif.h"
+#include "futex.h"
 #include "timeout.h"
 #include "poll.h"
 
@@ -336,6 +337,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	INIT_LIST_HEAD(&ctx->tctx_list);
 	ctx->submit_state.free_list.next = NULL;
 	INIT_WQ_LIST(&ctx->locked_free_list);
+	INIT_HLIST_HEAD(&ctx->futex_list);
 	INIT_DELAYED_WORK(&ctx->fallback_work, io_fallback_req_func);
 	INIT_WQ_LIST(&ctx->submit_state.compl_reqs);
 	return ctx;
@@ -3309,6 +3311,7 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 	ret |= io_cancel_defer_files(ctx, task, cancel_all);
 	mutex_lock(&ctx->uring_lock);
 	ret |= io_poll_remove_all(ctx, task, cancel_all);
+	ret |= io_futex_remove_all(ctx, task, cancel_all);
 	mutex_unlock(&ctx->uring_lock);
 	ret |= io_kill_timeouts(ctx, task, cancel_all);
 	if (task)
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 3b9c6489b8b6..e6b03d7f82e5 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -33,6 +33,7 @@
 #include "poll.h"
 #include "cancel.h"
 #include "rw.h"
+#include "futex.h"
 
 static int io_no_issue(struct io_kiocb *req, unsigned int issue_flags)
 {
@@ -426,11 +427,26 @@ const struct io_issue_def io_issue_defs[] = {
 		.issue			= io_sendmsg_zc,
 #else
 		.prep			= io_eopnotsupp_prep,
+#endif
+	},
+	[IORING_OP_FUTEX_WAIT] = {
+#if defined(CONFIG_FUTEX)
+		.prep			= io_futex_prep,
+		.issue			= io_futex_wait,
+#else
+		.prep			= io_eopnotsupp_prep,
+#endif
+	},
+	[IORING_OP_FUTEX_WAKE] = {
+#if defined(CONFIG_FUTEX)
+		.prep			= io_futex_prep,
+		.issue			= io_futex_wake,
+#else
+		.prep			= io_eopnotsupp_prep,
 #endif
 	},
 };
-
 const struct io_cold_def io_cold_defs[] = {
 	[IORING_OP_NOP] = {
 		.name			= "NOP",
@@ -648,6 +664,13 @@ const struct io_cold_def io_cold_defs[] = {
 		.fail			= io_sendrecv_fail,
 #endif
 	},
+	[IORING_OP_FUTEX_WAIT] = {
+		.name			= "FUTEX_WAIT",
+	},
+
+	[IORING_OP_FUTEX_WAKE] = {
+		.name			= "FUTEX_WAKE",
+	},
 };
 
 const char *io_uring_get_opcode(u8 opcode)
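For readers mapping the uapi, the SQE field layout can be read off
io_futex_prep() above. A sketch of filling a wait SQE by hand for this
version of the patch (a hypothetical helper, not part of the series;
field placement may change in later revisions):

static void prep_futex_wait_raw(struct io_uring_sqe *sqe, uint32_t *uaddr,
				uint32_t val, uint32_t mask)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_FUTEX_WAIT;
	sqe->fd = FUTEX_WAIT;			/* stored as futex_op by prep */
	sqe->addr = (unsigned long)uaddr;	/* futex word to wait on */
	sqe->len = val;				/* value *uaddr is expected to hold */
	sqe->file_index = mask;			/* futex_mask, e.g. FUTEX_BITSET_MATCH_ANY */
	sqe->addr2 = 0;				/* optional relative __kernel_timespec pointer */
	sqe->futex_flags = 0;			/* any command bit in FUTEX_CMD_MASK is rejected */
}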
Add support for FUTEX_WAKE/WAIT primitives.

IORING_OP_FUTEX_WAKE is a mix of FUTEX_WAKE and FUTEX_WAKE_BITSET, as
it does support passing in a bitset.

Similarly, IORING_OP_FUTEX_WAIT is a mix of FUTEX_WAIT and
FUTEX_WAIT_BITSET.

FUTEX_WAKE is straightforward, as we can always just do those inline.
FUTEX_WAIT will queue the futex with an appropriate callback, and
that callback will in turn post a CQE when it has triggered.

Cancelations are supported, both from the application point-of-view,
but also to be able to cancel pending waits if the ring exits before
all events have occurred.

This is just the barebones wait/wake support. Features to be added
later:

- We do not support the PI or requeue operations. The immediate use
  cases don't need them; unsure if future support for these would be
  useful.

- Should we support futex wait with timeout? Not clear if this is
  necessary, as the usual io_uring linked timeouts could fill this
  purpose (see the sketch below).

- Would be nice to support registered futexes, just like we do buffers.
  This would avoid mapping in user memory for each operation.

- Probably lots more that I just didn't think of.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/io_uring_types.h |   2 +
 include/uapi/linux/io_uring.h  |   3 +
 io_uring/Makefile              |   4 +-
 io_uring/cancel.c              |   5 +
 io_uring/cancel.h              |   4 +
 io_uring/futex.c               | 194 +++++++++++++++++++++++++++++++++
 io_uring/futex.h               |  26 +++++
 io_uring/io_uring.c            |   3 +
 io_uring/opdef.c               |  25 ++++-
 9 files changed, 264 insertions(+), 2 deletions(-)
 create mode 100644 io_uring/futex.c
 create mode 100644 io_uring/futex.h
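A sketch of the linked-timeout alternative mentioned in the list above,
using the stock io_uring_prep_link_timeout() helper together with the
futex wait helper used elsewhere in this thread; ring and futex are
placeholders.

struct __kernel_timespec ts = { .tv_sec = 1 };
struct io_uring_sqe *sqe;

sqe = io_uring_get_sqe(ring);
io_uring_prep_futex_wait(sqe, futex, 0, FUTEX_BITSET_MATCH_ANY);
sqe->flags |= IOSQE_IO_LINK;		/* arm the timeout against this wait */

sqe = io_uring_get_sqe(ring);
io_uring_prep_link_timeout(sqe, &ts, 0);

io_uring_submit(ring);
/* if no wake arrives within 1s, the wait is canceled and its CQE
 * typically completes with -ECANCELED */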