Message ID | 0eb3f55722540a11b036d3c90771220eb082d65e.1710514702.git.asml.silence@gmail.com |
---|---|
State | New |
Series | remove aux CQE caches |
On 3/15/24 9:30 AM, Pavel Begunkov wrote:
> io_post_aux_cqe(), which is used for multishot requests, delays
> completions by putting CQEs into a temporary array for the purpose
> completion lock/flush batching.
>
> DEFER_TASKRUN doesn't need any locking, so for it we can put completions
> directly into the CQ and defer post completion handling with a flag.
> That leaves !DEFER_TASKRUN, which is not that interesting / hot for
> multishot requests, so have conditional locking with deferred flush
> for them.

This breaks the read-mshot test case, looking into what is going on
there.
On 3/15/24 16:20, Jens Axboe wrote:
> On 3/15/24 9:30 AM, Pavel Begunkov wrote:
>> io_post_aux_cqe(), which is used for multishot requests, delays
>> completions by putting CQEs into a temporary array for the purpose
>> completion lock/flush batching.
>>
>> DEFER_TASKRUN doesn't need any locking, so for it we can put completions
>> directly into the CQ and defer post completion handling with a flag.
>> That leaves !DEFER_TASKRUN, which is not that interesting / hot for
>> multishot requests, so have conditional locking with deferred flush
>> for them.
>
> This breaks the read-mshot test case, looking into what is going on
> there.

I forgot to mention, yes it does, the test makes odd assumptions about
overflows, IIRC it expects that the kernel allows one and only one aux
CQE to be overflown. Let me double check
On 3/15/24 10:23 AM, Pavel Begunkov wrote:
> On 3/15/24 16:20, Jens Axboe wrote:
>> On 3/15/24 9:30 AM, Pavel Begunkov wrote:
>>> io_post_aux_cqe(), which is used for multishot requests, delays
>>> completions by putting CQEs into a temporary array for the purpose
>>> completion lock/flush batching.
>>>
>>> DEFER_TASKRUN doesn't need any locking, so for it we can put completions
>>> directly into the CQ and defer post completion handling with a flag.
>>> That leaves !DEFER_TASKRUN, which is not that interesting / hot for
>>> multishot requests, so have conditional locking with deferred flush
>>> for them.
>>
>> This breaks the read-mshot test case, looking into what is going on
>> there.
>
> I forgot to mention, yes it does, the test makes odd assumptions about
> overflows, IIRC it expects that the kernel allows one and only one aux
> CQE to be overflown. Let me double check

Yeah this is very possible, the overflow checking could be broken in
there. I'll poke at it and report back.
On 3/15/24 10:25 AM, Jens Axboe wrote:
> On 3/15/24 10:23 AM, Pavel Begunkov wrote:
>> On 3/15/24 16:20, Jens Axboe wrote:
>>> On 3/15/24 9:30 AM, Pavel Begunkov wrote:
>>>> io_post_aux_cqe(), which is used for multishot requests, delays
>>>> completions by putting CQEs into a temporary array for the purpose
>>>> completion lock/flush batching.
>>>>
>>>> DEFER_TASKRUN doesn't need any locking, so for it we can put completions
>>>> directly into the CQ and defer post completion handling with a flag.
>>>> That leaves !DEFER_TASKRUN, which is not that interesting / hot for
>>>> multishot requests, so have conditional locking with deferred flush
>>>> for them.
>>>
>>> This breaks the read-mshot test case, looking into what is going on
>>> there.
>>
>> I forgot to mention, yes it does, the test makes odd assumptions about
>> overflows, IIRC it expects that the kernel allows one and only one aux
>> CQE to be overflown. Let me double check
>
> Yeah this is very possible, the overflow checking could be broken in
> there. I'll poke at it and report back.

It does, this should fix it:


diff --git a/test/read-mshot.c b/test/read-mshot.c
index 8fcb79857bf0..501ca69a98dc 100644
--- a/test/read-mshot.c
+++ b/test/read-mshot.c
@@ -236,7 +236,7 @@ static int test(int first_good, int async, int overflow)
         }
         if (!(cqe->flags & IORING_CQE_F_MORE)) {
             /* we expect this on overflow */
-            if (overflow && (i - 1 == NR_OVERFLOW))
+            if (overflow && i >= NR_OVERFLOW)
                 break;
             fprintf(stderr, "no more cqes\n");
             return 1;
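(For context: with the relaxed expectation the fix above encodes, a CQE reaping
loop would look roughly like the sketch below. This is not the actual
test/read-mshot.c code; the NR_OVERFLOW value, the ring setup and reap_cqes()
are assumptions for illustration. It only shows treating a missing
IORING_CQE_F_MORE at or past the overflow point as expected termination
rather than an error.)

#include <liburing.h>
#include <stdio.h>

#define NR_OVERFLOW 32    /* assumed value, stands in for the test's constant */

static int reap_cqes(struct io_uring *ring, int nr_expected, int overflow)
{
    struct io_uring_cqe *cqe;
    int i, more;

    for (i = 0; i < nr_expected; i++) {
        if (io_uring_wait_cqe(ring, &cqe)) {
            fprintf(stderr, "wait cqe failed\n");
            return 1;
        }
        more = cqe->flags & IORING_CQE_F_MORE;
        io_uring_cqe_seen(ring, cqe);
        if (!more) {
            /*
             * Once the CQ has overflowed, the multishot read may
             * terminate at any completion at or past NR_OVERFLOW.
             */
            if (overflow && i >= NR_OVERFLOW)
                break;
            fprintf(stderr, "no more cqes\n");
            return 1;
        }
    }
    return 0;
}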
On 3/15/24 16:25, Jens Axboe wrote:
> On 3/15/24 10:23 AM, Pavel Begunkov wrote:
>> On 3/15/24 16:20, Jens Axboe wrote:
>>> On 3/15/24 9:30 AM, Pavel Begunkov wrote:
>>>> io_post_aux_cqe(), which is used for multishot requests, delays
>>>> completions by putting CQEs into a temporary array for the purpose
>>>> completion lock/flush batching.
>>>>
>>>> DEFER_TASKRUN doesn't need any locking, so for it we can put completions
>>>> directly into the CQ and defer post completion handling with a flag.
>>>> That leaves !DEFER_TASKRUN, which is not that interesting / hot for
>>>> multishot requests, so have conditional locking with deferred flush
>>>> for them.
>>>
>>> This breaks the read-mshot test case, looking into what is going on
>>> there.
>>
>> I forgot to mention, yes it does, the test makes odd assumptions about
>> overflows, IIRC it expects that the kernel allows one and only one aux
>> CQE to be overflown. Let me double check
>
> Yeah this is very possible, the overflow checking could be broken in
> there. I'll poke at it and report back.

test() {
    if (!(cqe->flags & IORING_CQE_F_MORE)) {
        /* we expect this on overflow */
        if (overflow && (i - 1 == NR_OVERFLOW))
            break;
        fprintf(stderr, "no more cqes\n");
        return 1;
    }
    ...
}

It's this chunk. I think I silenced it with

s/i - 1 == NR_OVERFLOW/i == NR_OVERFLOW/

but it should probably be i >= NR_OVERFLOW or so
On 3/15/24 10:29 AM, Pavel Begunkov wrote:
> On 3/15/24 16:25, Jens Axboe wrote:
>> On 3/15/24 10:23 AM, Pavel Begunkov wrote:
>>> On 3/15/24 16:20, Jens Axboe wrote:
>>>> On 3/15/24 9:30 AM, Pavel Begunkov wrote:
>>>>> io_post_aux_cqe(), which is used for multishot requests, delays
>>>>> completions by putting CQEs into a temporary array for the purpose
>>>>> completion lock/flush batching.
>>>>>
>>>>> DEFER_TASKRUN doesn't need any locking, so for it we can put completions
>>>>> directly into the CQ and defer post completion handling with a flag.
>>>>> That leaves !DEFER_TASKRUN, which is not that interesting / hot for
>>>>> multishot requests, so have conditional locking with deferred flush
>>>>> for them.
>>>>
>>>> This breaks the read-mshot test case, looking into what is going on
>>>> there.
>>>
>>> I forgot to mention, yes it does, the test makes odd assumptions about
>>> overflows, IIRC it expects that the kernel allows one and only one aux
>>> CQE to be overflown. Let me double check
>>
>> Yeah this is very possible, the overflow checking could be broken in
>> there. I'll poke at it and report back.
>
> test() {
>     if (!(cqe->flags & IORING_CQE_F_MORE)) {
>         /* we expect this on overflow */
>         if (overflow && (i - 1 == NR_OVERFLOW))
>             break;
>         fprintf(stderr, "no more cqes\n");
>         return 1;
>     }
>     ...
> }
>
> It's this chunk. I think I silenced it with
>
> s/i - 1 == NR_OVERFLOW/i == NR_OVERFLOW/
>
> but it should probably be i >= NR_OVERFLOW or so

Yeah see other email, I did the latter. It's pushed out.
On 3/15/24 16:27, Jens Axboe wrote:
> On 3/15/24 10:25 AM, Jens Axboe wrote:
>> On 3/15/24 10:23 AM, Pavel Begunkov wrote:
>>> On 3/15/24 16:20, Jens Axboe wrote:
>>>> On 3/15/24 9:30 AM, Pavel Begunkov wrote:
>>>>> io_post_aux_cqe(), which is used for multishot requests, delays
>>>>> completions by putting CQEs into a temporary array for the purpose
>>>>> completion lock/flush batching.
>>>>>
>>>>> DEFER_TASKRUN doesn't need any locking, so for it we can put completions
>>>>> directly into the CQ and defer post completion handling with a flag.
>>>>> That leaves !DEFER_TASKRUN, which is not that interesting / hot for
>>>>> multishot requests, so have conditional locking with deferred flush
>>>>> for them.
>>>>
>>>> This breaks the read-mshot test case, looking into what is going on
>>>> there.
>>>
>>> I forgot to mention, yes it does, the test makes odd assumptions about
>>> overflows, IIRC it expects that the kernel allows one and only one aux
>>> CQE to be overflown. Let me double check
>>
>> Yeah this is very possible, the overflow checking could be broken in
>> there. I'll poke at it and report back.
>
> It does, this should fix it:
>
>
> diff --git a/test/read-mshot.c b/test/read-mshot.c
> index 8fcb79857bf0..501ca69a98dc 100644
> --- a/test/read-mshot.c
> +++ b/test/read-mshot.c
> @@ -236,7 +236,7 @@ static int test(int first_good, int async, int overflow)
>          }
>          if (!(cqe->flags & IORING_CQE_F_MORE)) {
>              /* we expect this on overflow */
> -            if (overflow && (i - 1 == NR_OVERFLOW))
> +            if (overflow && i >= NR_OVERFLOW)

Which is not ideal either, e.g. I wouldn't mind if the kernel stops
one entry before CQ is full, so that the request can complete w/o
overflowing. Not proposing the change because it's a marginal
case, but we shouldn't limit ourselves.
On 3/15/24 10:44 AM, Pavel Begunkov wrote:
> On 3/15/24 16:27, Jens Axboe wrote:
>> On 3/15/24 10:25 AM, Jens Axboe wrote:
>>> On 3/15/24 10:23 AM, Pavel Begunkov wrote:
>>>> On 3/15/24 16:20, Jens Axboe wrote:
>>>>> On 3/15/24 9:30 AM, Pavel Begunkov wrote:
>>>>>> io_post_aux_cqe(), which is used for multishot requests, delays
>>>>>> completions by putting CQEs into a temporary array for the purpose
>>>>>> completion lock/flush batching.
>>>>>>
>>>>>> DEFER_TASKRUN doesn't need any locking, so for it we can put completions
>>>>>> directly into the CQ and defer post completion handling with a flag.
>>>>>> That leaves !DEFER_TASKRUN, which is not that interesting / hot for
>>>>>> multishot requests, so have conditional locking with deferred flush
>>>>>> for them.
>>>>>
>>>>> This breaks the read-mshot test case, looking into what is going on
>>>>> there.
>>>>
>>>> I forgot to mention, yes it does, the test makes odd assumptions about
>>>> overflows, IIRC it expects that the kernel allows one and only one aux
>>>> CQE to be overflown. Let me double check
>>>
>>> Yeah this is very possible, the overflow checking could be broken in
>>> there. I'll poke at it and report back.
>>
>> It does, this should fix it:
>>
>>
>> diff --git a/test/read-mshot.c b/test/read-mshot.c
>> index 8fcb79857bf0..501ca69a98dc 100644
>> --- a/test/read-mshot.c
>> +++ b/test/read-mshot.c
>> @@ -236,7 +236,7 @@ static int test(int first_good, int async, int overflow)
>>          }
>>          if (!(cqe->flags & IORING_CQE_F_MORE)) {
>>              /* we expect this on overflow */
>> -            if (overflow && (i - 1 == NR_OVERFLOW))
>> +            if (overflow && i >= NR_OVERFLOW)
>
> Which is not ideal either, e.g. I wouldn't mind if the kernel stops
> one entry before CQ is full, so that the request can complete w/o
> overflowing. Not supposing the change because it's a marginal
> case, but we shouldn't limit ourselves.

But if the event keeps triggering we have to keep posting CQEs,
otherwise we could get stuck. As far as I'm concerned, the behavior with
the patch looks correct. The last CQE is overflown, and that terminates
it, and it doesn't have MORE set. The one before that has MORE set, but
it has to, unless you aborted it early. But that seems impossible,
because what if that was indeed the last current CQE, and we reap CQEs
before the next one is posted.

So unless I'm missing something, I don't think we can be doing any
better.
On 3/15/24 16:49, Jens Axboe wrote:
> On 3/15/24 10:44 AM, Pavel Begunkov wrote:
>> On 3/15/24 16:27, Jens Axboe wrote:
>>> On 3/15/24 10:25 AM, Jens Axboe wrote:
>>>> On 3/15/24 10:23 AM, Pavel Begunkov wrote:
>>>>> On 3/15/24 16:20, Jens Axboe wrote:
>>>>>> On 3/15/24 9:30 AM, Pavel Begunkov wrote:
>>>>>>> io_post_aux_cqe(), which is used for multishot requests, delays
>>>>>>> completions by putting CQEs into a temporary array for the purpose
>>>>>>> completion lock/flush batching.
>>>>>>>
>>>>>>> DEFER_TASKRUN doesn't need any locking, so for it we can put completions
>>>>>>> directly into the CQ and defer post completion handling with a flag.
>>>>>>> That leaves !DEFER_TASKRUN, which is not that interesting / hot for
>>>>>>> multishot requests, so have conditional locking with deferred flush
>>>>>>> for them.
>>>>>>
>>>>>> This breaks the read-mshot test case, looking into what is going on
>>>>>> there.
>>>>>
>>>>> I forgot to mention, yes it does, the test makes odd assumptions about
>>>>> overflows, IIRC it expects that the kernel allows one and only one aux
>>>>> CQE to be overflown. Let me double check
>>>>
>>>> Yeah this is very possible, the overflow checking could be broken in
>>>> there. I'll poke at it and report back.
>>>
>>> It does, this should fix it:
>>>
>>>
>>> diff --git a/test/read-mshot.c b/test/read-mshot.c
>>> index 8fcb79857bf0..501ca69a98dc 100644
>>> --- a/test/read-mshot.c
>>> +++ b/test/read-mshot.c
>>> @@ -236,7 +236,7 @@ static int test(int first_good, int async, int overflow)
>>>          }
>>>          if (!(cqe->flags & IORING_CQE_F_MORE)) {
>>>              /* we expect this on overflow */
>>> -            if (overflow && (i - 1 == NR_OVERFLOW))
>>> +            if (overflow && i >= NR_OVERFLOW)
>>
>> Which is not ideal either, e.g. I wouldn't mind if the kernel stops
>> one entry before CQ is full, so that the request can complete w/o
>> overflowing. Not supposing the change because it's a marginal
>> case, but we shouldn't limit ourselves.
>
> But if the event keeps triggering we have to keep posting CQEs,
> otherwise we could get stuck.

Or we can complete the request, then the user consumes CQEs
and restarts as usual

> As far as I'm concerned, the behavior with
> the patch looks correct. The last CQE is overflown, and that terminates
> it, and it doesn't have MORE set. The one before that has MORE set, but
> it has to, unless you aborted it early. But that seems impossible,
> because what if that was indeed the last current CQE, and we reap CQEs
> before the next one is posted.
>
> So unless I'm missing something, I don't think we can be doing any
> better.

You can opportunistically try to avoid overflows, unreliably

bool io_post_cqe() {
    // Not enough space in the CQ left, so if there is a next
    // completion pending we'd have to overflow. Avoid that by
    // terminating it now.
    //
    // If there are no more CQEs after this one, we might
    // terminate a bit earlier, but that's better because
    // overflows are so expensive and unhandy and so on.
    if (cq_space_left() <= 1)
        return false;
    fill_cqe();
    return true;
}

some_multishot_function(req) {
    if (!io_post_cqe(res))
        complete_req(req, res);
}

Again, not suggesting the change for all the obvious reasons, but
I think semantically we should be able to do it.
On 3/15/24 11:26 AM, Pavel Begunkov wrote:
> On 3/15/24 16:49, Jens Axboe wrote:
>> On 3/15/24 10:44 AM, Pavel Begunkov wrote:
>>> On 3/15/24 16:27, Jens Axboe wrote:
>>>> On 3/15/24 10:25 AM, Jens Axboe wrote:
>>>>> On 3/15/24 10:23 AM, Pavel Begunkov wrote:
>>>>>> On 3/15/24 16:20, Jens Axboe wrote:
>>>>>>> On 3/15/24 9:30 AM, Pavel Begunkov wrote:
>>>>>>>> io_post_aux_cqe(), which is used for multishot requests, delays
>>>>>>>> completions by putting CQEs into a temporary array for the purpose
>>>>>>>> completion lock/flush batching.
>>>>>>>>
>>>>>>>> DEFER_TASKRUN doesn't need any locking, so for it we can put completions
>>>>>>>> directly into the CQ and defer post completion handling with a flag.
>>>>>>>> That leaves !DEFER_TASKRUN, which is not that interesting / hot for
>>>>>>>> multishot requests, so have conditional locking with deferred flush
>>>>>>>> for them.
>>>>>>>
>>>>>>> This breaks the read-mshot test case, looking into what is going on
>>>>>>> there.
>>>>>>
>>>>>> I forgot to mention, yes it does, the test makes odd assumptions about
>>>>>> overflows, IIRC it expects that the kernel allows one and only one aux
>>>>>> CQE to be overflown. Let me double check
>>>>>
>>>>> Yeah this is very possible, the overflow checking could be broken in
>>>>> there. I'll poke at it and report back.
>>>>
>>>> It does, this should fix it:
>>>>
>>>>
>>>> diff --git a/test/read-mshot.c b/test/read-mshot.c
>>>> index 8fcb79857bf0..501ca69a98dc 100644
>>>> --- a/test/read-mshot.c
>>>> +++ b/test/read-mshot.c
>>>> @@ -236,7 +236,7 @@ static int test(int first_good, int async, int overflow)
>>>>          }
>>>>          if (!(cqe->flags & IORING_CQE_F_MORE)) {
>>>>              /* we expect this on overflow */
>>>> -            if (overflow && (i - 1 == NR_OVERFLOW))
>>>> +            if (overflow && i >= NR_OVERFLOW)
>>>
>>> Which is not ideal either, e.g. I wouldn't mind if the kernel stops
>>> one entry before CQ is full, so that the request can complete w/o
>>> overflowing. Not supposing the change because it's a marginal
>>> case, but we shouldn't limit ourselves.
>>
>> But if the event keeps triggering we have to keep posting CQEs,
>> otherwise we could get stuck.
>
> Or we can complete the request, then the user consumes CQEs
> and restarts as usual

So you'd want to track if we'd overflow, wait for overflow to clear, and
then restart that request? I think that sounds a bit involved, no?
Particularly for a case like overflow, which generally should not occur.
If it does, just terminate it, and have the user re-issue it. That seems
like the simpler and better solution to me.

>> As far as I'm concerned, the behavior with
>> the patch looks correct. The last CQE is overflown, and that terminates
>> it, and it doesn't have MORE set. The one before that has MORE set, but
>> it has to, unless you aborted it early. But that seems impossible,
>> because what if that was indeed the last current CQE, and we reap CQEs
>> before the next one is posted.
>>
>> So unless I'm missing something, I don't think we can be doing any
>> better.
>
> You can opportunistically try to avoid overflows, unreliably
>
> bool io_post_cqe() {
>     // Not enough space in the CQ left, so if there is a next
>     // completion pending we'd have to overflow. Avoid that by
>     // terminating it now.
>     //
>     // If there are no more CQEs after this one, we might
>     // terminate a bit earlier, but that better because
>     // overflows are so expensive and unhandy and so on.
>     if (cq_space_left() <= 1)
>         return false;
>     fill_cqe();
>     return true;
> }
>
> some_multishot_function(req) {
>     if (!io_post_cqe(res))
>         complete_req(req, res);
> }
>
> Again, not suggesting the change for all the obvious reasons, but
> I think semantically we should be able to do it.

Yeah not convinced this is worth looking at. If it was the case that the
hot path would often see overflows and it'd help to avoid it, then
probably it'd make sense. But I don't think that's the case.
On 3/15/24 18:26, Jens Axboe wrote:
> On 3/15/24 11:26 AM, Pavel Begunkov wrote:
>> On 3/15/24 16:49, Jens Axboe wrote:
>>> On 3/15/24 10:44 AM, Pavel Begunkov wrote:
>>>> On 3/15/24 16:27, Jens Axboe wrote:
>>>>> On 3/15/24 10:25 AM, Jens Axboe wrote:
>>>>>> On 3/15/24 10:23 AM, Pavel Begunkov wrote:
>>>>>>> On 3/15/24 16:20, Jens Axboe wrote:
>>>>>>>> On 3/15/24 9:30 AM, Pavel Begunkov wrote:
>>>>>>>>> io_post_aux_cqe(), which is used for multishot requests, delays
>>>>>>>>> completions by putting CQEs into a temporary array for the purpose
>>>>>>>>> completion lock/flush batching.
>>>>>>>>>
>>>>>>>>> DEFER_TASKRUN doesn't need any locking, so for it we can put completions
>>>>>>>>> directly into the CQ and defer post completion handling with a flag.
>>>>>>>>> That leaves !DEFER_TASKRUN, which is not that interesting / hot for
>>>>>>>>> multishot requests, so have conditional locking with deferred flush
>>>>>>>>> for them.
>>>>>>>>
>>>>>>>> This breaks the read-mshot test case, looking into what is going on
>>>>>>>> there.
>>>>>>>
>>>>>>> I forgot to mention, yes it does, the test makes odd assumptions about
>>>>>>> overflows, IIRC it expects that the kernel allows one and only one aux
>>>>>>> CQE to be overflown. Let me double check
>>>>>>
>>>>>> Yeah this is very possible, the overflow checking could be broken in
>>>>>> there. I'll poke at it and report back.
>>>>>
>>>>> It does, this should fix it:
>>>>>
>>>>>
>>>>> diff --git a/test/read-mshot.c b/test/read-mshot.c
>>>>> index 8fcb79857bf0..501ca69a98dc 100644
>>>>> --- a/test/read-mshot.c
>>>>> +++ b/test/read-mshot.c
>>>>> @@ -236,7 +236,7 @@ static int test(int first_good, int async, int overflow)
>>>>>          }
>>>>>          if (!(cqe->flags & IORING_CQE_F_MORE)) {
>>>>>              /* we expect this on overflow */
>>>>> -            if (overflow && (i - 1 == NR_OVERFLOW))
>>>>> +            if (overflow && i >= NR_OVERFLOW)
>>>>
>>>> Which is not ideal either, e.g. I wouldn't mind if the kernel stops
>>>> one entry before CQ is full, so that the request can complete w/o
>>>> overflowing. Not supposing the change because it's a marginal
>>>> case, but we shouldn't limit ourselves.
>>>
>>> But if the event keeps triggering we have to keep posting CQEs,
>>> otherwise we could get stuck.
>>
>> Or we can complete the request, then the user consumes CQEs
>> and restarts as usual
>
> So you'd want to track if we'd overflow, wait for overflow to clear, and
> then restart that request?

No, the 2 line change in io_post_cqe() from the last email's
snippet is the only thing you'd need.

I probably don't understand why and what tracking you mean, but
fwiw we currently do track and account for overflows.

    /* For defered completions this is not as strict as it is otherwise,
     * however it's main job is to prevent unbounded posted completions,
     * and in that it works just as well.
     */
    if (test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq))
        return false;

which is being killed in the series.

> I think that sounds a bit involved, no?
> Particularly for a case like overflow, which generally should not occur.
> If it does, just terminate it, and have the user re-issue it. That seems
> like the simpler and better solution to me.
>
>>> As far as I'm concerned, the behavior with
>>> the patch looks correct. The last CQE is overflown, and that terminates
>>> it, and it doesn't have MORE set. The one before that has MORE set, but
>>> it has to, unless you aborted it early. But that seems impossible,
>>> because what if that was indeed the last current CQE, and we reap CQEs
>>> before the next one is posted.
>>>
>>> So unless I'm missing something, I don't think we can be doing any
>>> better.
>>
>> You can opportunistically try to avoid overflows, unreliably
>>
>> bool io_post_cqe() {
>>     // Not enough space in the CQ left, so if there is a next
>>     // completion pending we'd have to overflow. Avoid that by
>>     // terminating it now.
>>     //
>>     // If there are no more CQEs after this one, we might
>>     // terminate a bit earlier, but that better because
>>     // overflows are so expensive and unhandy and so on.
>>     if (cq_space_left() <= 1)
>>         return false;
>>     fill_cqe();
>>     return true;
>> }
>>
>> some_multishot_function(req) {
>>     if (!io_post_cqe(res))
>>         complete_req(req, res);
>> }
>>
>> Again, not suggesting the change for all the obvious reasons, but
>> I think semantically we should be able to do it.
>
> Yeah not convinced this is worth looking at. If it was the case that the
> hot path would often see overflows and it'd help to avoid it, then
> probably it'd make sense. But I don't think that's the case.

We're talking about different things. Seems you're discussing a
particular implementation, its constraints and performance. I care
purely about the semantics, the implicit uapi. And I define it as
"multishot requests may decide to terminate at any point, the user
should expect it and reissue when appropriate", not restricting it
to "can only (normally) terminate when CQ is full".

We're changing tests from time to time, but there is that
"behaviour defines semantics" problem, especially when it wasn't clear
in advance and breaks someone's app, and people might be using
assumptions in tests as the universal truth.
On 3/15/24 12:51 PM, Pavel Begunkov wrote:
> On 3/15/24 18:26, Jens Axboe wrote:
>> On 3/15/24 11:26 AM, Pavel Begunkov wrote:
>>> On 3/15/24 16:49, Jens Axboe wrote:
>>>> On 3/15/24 10:44 AM, Pavel Begunkov wrote:
>>>>> On 3/15/24 16:27, Jens Axboe wrote:
>>>>>> On 3/15/24 10:25 AM, Jens Axboe wrote:
>>>>>>> On 3/15/24 10:23 AM, Pavel Begunkov wrote:
>>>>>>>> On 3/15/24 16:20, Jens Axboe wrote:
>>>>>>>>> On 3/15/24 9:30 AM, Pavel Begunkov wrote:
>>>>>>>>>> io_post_aux_cqe(), which is used for multishot requests, delays
>>>>>>>>>> completions by putting CQEs into a temporary array for the purpose
>>>>>>>>>> completion lock/flush batching.
>>>>>>>>>>
>>>>>>>>>> DEFER_TASKRUN doesn't need any locking, so for it we can put completions
>>>>>>>>>> directly into the CQ and defer post completion handling with a flag.
>>>>>>>>>> That leaves !DEFER_TASKRUN, which is not that interesting / hot for
>>>>>>>>>> multishot requests, so have conditional locking with deferred flush
>>>>>>>>>> for them.
>>>>>>>>>
>>>>>>>>> This breaks the read-mshot test case, looking into what is going on
>>>>>>>>> there.
>>>>>>>>
>>>>>>>> I forgot to mention, yes it does, the test makes odd assumptions about
>>>>>>>> overflows, IIRC it expects that the kernel allows one and only one aux
>>>>>>>> CQE to be overflown. Let me double check
>>>>>>>
>>>>>>> Yeah this is very possible, the overflow checking could be broken in
>>>>>>> there. I'll poke at it and report back.
>>>>>>
>>>>>> It does, this should fix it:
>>>>>>
>>>>>>
>>>>>> diff --git a/test/read-mshot.c b/test/read-mshot.c
>>>>>> index 8fcb79857bf0..501ca69a98dc 100644
>>>>>> --- a/test/read-mshot.c
>>>>>> +++ b/test/read-mshot.c
>>>>>> @@ -236,7 +236,7 @@ static int test(int first_good, int async, int overflow)
>>>>>>          }
>>>>>>          if (!(cqe->flags & IORING_CQE_F_MORE)) {
>>>>>>              /* we expect this on overflow */
>>>>>> -            if (overflow && (i - 1 == NR_OVERFLOW))
>>>>>> +            if (overflow && i >= NR_OVERFLOW)
>>>>>
>>>>> Which is not ideal either, e.g. I wouldn't mind if the kernel stops
>>>>> one entry before CQ is full, so that the request can complete w/o
>>>>> overflowing. Not supposing the change because it's a marginal
>>>>> case, but we shouldn't limit ourselves.
>>>>
>>>> But if the event keeps triggering we have to keep posting CQEs,
>>>> otherwise we could get stuck.
>>>
>>> Or we can complete the request, then the user consumes CQEs
>>> and restarts as usual
>>
>> So you'd want to track if we'd overflow, wait for overflow to clear, and
>> then restart that request?
>
> No, the 2 line change in io_post_cqe() from the last email's
> snippet is the only thing you'd need.

Ah now I follow, so you're still terminating it, just one before
overflow rather than letting it overflow. Yes I agree that makes more
sense! It could still terminate early though, if the application reaps
CQEs before another one is posted. So I think the opportunistic early
termination is probably best ignored. Or the app could always be using
the full size of the CQ ring, and never above. With the change, it'd
still terminate every CQ-ring-size requests, even though it need not.
IOW, I think we just leave it as-is, no? Neither of these should cause
an app issue, as CQE_F_MORE is the one driving terminations. But you
could have more unneeded terminations, even if that may be far fetched.
Though you never know, opportunistically terminating with a known racy
check seems like something we should not do.

> I probably don't understand why and what tracking you mean, but
> fwiw we currently do track and account for overflows.

Misunderstanding, I thought you'd still post with CQE_F_MORE and need to
restart it. But that wasn't the case.

>> I think that sounds a bit involved, no?
>> Particularly for a case like overflow, which generally should not occur.
>> If it does, just terminate it, and have the user re-issue it. That seems
>> like the simpler and better solution to me.
>>
>>>> As far as I'm concerned, the behavior with
>>>> the patch looks correct. The last CQE is overflown, and that terminates
>>>> it, and it doesn't have MORE set. The one before that has MORE set, but
>>>> it has to, unless you aborted it early. But that seems impossible,
>>>> because what if that was indeed the last current CQE, and we reap CQEs
>>>> before the next one is posted.
>>>>
>>>> So unless I'm missing something, I don't think we can be doing any
>>>> better.
>>>
>>> You can opportunistically try to avoid overflows, unreliably
>>>
>>> bool io_post_cqe() {
>>>     // Not enough space in the CQ left, so if there is a next
>>>     // completion pending we'd have to overflow. Avoid that by
>>>     // terminating it now.
>>>     //
>>>     // If there are no more CQEs after this one, we might
>>>     // terminate a bit earlier, but that better because
>>>     // overflows are so expensive and unhandy and so on.
>>>     if (cq_space_left() <= 1)
>>>         return false;
>>>     fill_cqe();
>>>     return true;
>>> }
>>>
>>> some_multishot_function(req) {
>>>     if (!io_post_cqe(res))
>>>         complete_req(req, res);
>>> }
>>>
>>> Again, not suggesting the change for all the obvious reasons, but
>>> I think semantically we should be able to do it.
>>
>> Yeah not convinced this is worth looking at. If it was the case that the
>> hot path would often see overflows and it'd help to avoid it, then
>> probably it'd make sense. But I don't think that's the case.
>
> We're talking about different things. Seems you're discussing a
> particular implementation, its constraints and performance. I care
> purely about the semantics, the implicit uapi. And I define it as
> "multishot requests may decide to terminate at any point, the user
> should expect it and reissue when appropriate", not restricting it
> to "can only (normally) terminate when CQ is full".

Yep fully agree, I think it was largely a talking past each other on
exactly what it'd do.

> We're changing tests from time to time, but the there is that
> "behaviour defines semantics", especially when it wasn't clear
> in advance and breaks someone's app, and people might be using
> assumptions in tests as the universal truth.

Agree, any time a test needs changing, it should be cause for extra
thinking in terms of whether this will have application impacts as
well. In general, the tests are overly anal, and sometimes they do end
up testing implementation details.

The API is pretty clear in this regard, if you see CQE_F_MORE, then you
get more completions. If you don't, the request has terminated. The
change doesn't really impact that.
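(Restating the uapi rule the thread converges on as userspace code: a hedged
sketch, using multishot poll only as a convenient example op. The ring/fd
setup, the user_data scheme and the empty event handler are assumptions, not
something from this thread. The one point it illustrates is that the absence
of IORING_CQE_F_MORE, for whatever kernel-side reason, just means "reissue
when appropriate".)

#include <liburing.h>
#include <poll.h>

static void arm_multishot_poll(struct io_uring *ring, int fd)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    io_uring_prep_poll_multishot(sqe, fd, POLLIN);
    io_uring_sqe_set_data64(sqe, (__u64)fd);
    io_uring_submit(ring);
}

static void handle_cqe(struct io_uring *ring, struct io_uring_cqe *cqe)
{
    int fd = (int)io_uring_cqe_get_data64(cqe);

    if (cqe->res >= 0) {
        /* consume the readiness event here */
    }

    /*
     * No IORING_CQE_F_MORE: the multishot request terminated, whether
     * due to CQ overflow or anything else. Don't assume why, simply
     * rearm it when the application is ready for more events.
     */
    if (!(cqe->flags & IORING_CQE_F_MORE))
        arm_multishot_poll(ring, fd);
}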
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 5a2afbc93887..ea7e5488b3be 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -205,6 +205,7 @@ struct io_submit_state {
 
     bool            plug_started;
     bool            need_plug;
+    bool            cq_flush;
     unsigned short        submit_nr;
     unsigned int        cqes_count;
     struct blk_plug        plug;
@@ -342,8 +343,6 @@ struct io_ring_ctx {
         unsigned        cq_last_tm_flush;
     } ____cacheline_aligned_in_smp;
 
-    struct io_uring_cqe    completion_cqes[16];
-
     spinlock_t        completion_lock;
 
     /* IRQ completion list, under ->completion_lock */
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 167a3429a056..023fcf5d52c1 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -176,7 +176,7 @@ static struct ctl_table kernel_io_uring_disabled_table[] = {
 
 static inline void io_submit_flush_completions(struct io_ring_ctx *ctx)
 {
     if (!wq_list_empty(&ctx->submit_state.compl_reqs) ||
-        ctx->submit_state.cqes_count)
+        ctx->submit_state.cq_flush)
         __io_submit_flush_completions(ctx);
 }
@@ -636,6 +636,12 @@ static inline void __io_cq_lock(struct io_ring_ctx *ctx)
         spin_lock(&ctx->completion_lock);
 }
 
+static inline void __io_cq_unlock(struct io_ring_ctx *ctx)
+{
+    if (!ctx->lockless_cq)
+        spin_unlock(&ctx->completion_lock);
+}
+
 static inline void io_cq_lock(struct io_ring_ctx *ctx)
     __acquires(ctx->completion_lock)
 {
@@ -888,31 +894,6 @@ static bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res,
     return false;
 }
 
-static void __io_flush_post_cqes(struct io_ring_ctx *ctx)
-    __must_hold(&ctx->uring_lock)
-{
-    struct io_submit_state *state = &ctx->submit_state;
-    unsigned int i;
-
-    lockdep_assert_held(&ctx->uring_lock);
-    for (i = 0; i < state->cqes_count; i++) {
-        struct io_uring_cqe *cqe = &ctx->completion_cqes[i];
-
-        if (!io_fill_cqe_aux(ctx, cqe->user_data, cqe->res, cqe->flags)) {
-            if (ctx->lockless_cq) {
-                spin_lock(&ctx->completion_lock);
-                io_cqring_event_overflow(ctx, cqe->user_data,
-                            cqe->res, cqe->flags, 0, 0);
-                spin_unlock(&ctx->completion_lock);
-            } else {
-                io_cqring_event_overflow(ctx, cqe->user_data,
-                            cqe->res, cqe->flags, 0, 0);
-            }
-        }
-    }
-    state->cqes_count = 0;
-}
-
 bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags)
 {
     bool filled;
@@ -933,31 +914,16 @@ bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags
 bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
 {
     struct io_ring_ctx *ctx = req->ctx;
-    u64 user_data = req->cqe.user_data;
-    struct io_uring_cqe *cqe;
+    bool posted;
 
     lockdep_assert(!io_wq_current_is_worker());
     lockdep_assert_held(&ctx->uring_lock);
 
-    if (ctx->submit_state.cqes_count == ARRAY_SIZE(ctx->completion_cqes)) {
-        __io_cq_lock(ctx);
-        __io_flush_post_cqes(ctx);
-        /* no need to flush - flush is deferred */
-        __io_cq_unlock_post(ctx);
-    }
-
-    /* For defered completions this is not as strict as it is otherwise,
-     * however it's main job is to prevent unbounded posted completions,
-     * and in that it works just as well.
-     */
-    if (test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq))
-        return false;
-
-    cqe = &ctx->completion_cqes[ctx->submit_state.cqes_count++];
-    cqe->user_data = user_data;
-    cqe->res = res;
-    cqe->flags = cflags;
-    return true;
+    __io_cq_lock(ctx);
+    posted = io_fill_cqe_aux(ctx, req->cqe.user_data, res, cflags);
+    ctx->submit_state.cq_flush = true;
+    __io_cq_unlock_post(ctx);
+    return posted;
 }
 
 static void __io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
@@ -1551,9 +1517,6 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
     struct io_wq_work_node *node;
 
     __io_cq_lock(ctx);
-    /* must come first to preserve CQE ordering in failure cases */
-    if (state->cqes_count)
-        __io_flush_post_cqes(ctx);
     __wq_list_for_each(node, &state->compl_reqs) {
         struct io_kiocb *req = container_of(node, struct io_kiocb,
                         comp_list);
@@ -1575,6 +1538,7 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
         io_free_batch_list(ctx, state->compl_reqs.first);
         INIT_WQ_LIST(&state->compl_reqs);
     }
+    ctx->submit_state.cq_flush = false;
 }
 
 static unsigned io_cqring_events(struct io_ring_ctx *ctx)
io_post_aux_cqe(), which is used for multishot requests, delays
completions by putting CQEs into a temporary array for the purpose of
completion lock/flush batching.

DEFER_TASKRUN doesn't need any locking, so for it we can put completions
directly into the CQ and defer post completion handling with a flag.
That leaves !DEFER_TASKRUN, which is not that interesting / hot for
multishot requests, so have conditional locking with deferred flush
for them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/linux/io_uring_types.h |  3 +-
 io_uring/io_uring.c            | 64 ++++++++--------------------------
 2 files changed, 15 insertions(+), 52 deletions(-)
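(A hypothetical caller, not part of this patch, to illustrate how the new
io_req_post_cqe() return value is meant to be consumed by a multishot opcode.
The function name, the event plumbing and the exact return-code choices are
assumptions; only io_req_post_cqe(), io_req_set_res() and the IOU_OK /
IOU_ISSUE_SKIP_COMPLETE convention come from the existing io_uring code.)

static int some_multishot_op_event(struct io_kiocb *req, int res)
{
    /* aux CQE was filled into the CQ: keep the multishot request armed */
    if (io_req_post_cqe(req, res, IORING_CQE_F_MORE))
        return IOU_ISSUE_SKIP_COMPLETE;

    /*
     * No room in the CQ: terminate the multishot. The final completion
     * goes through the normal path (and may overflow there), its CQE
     * carries no IORING_CQE_F_MORE, and userspace reissues the request.
     */
    io_req_set_res(req, res, 0);
    return IOU_OK;
}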