
[GIT,PULL] Add support for epoll min wait time

Message ID b0901cba-3cb8-a309-701e-7b8cb13f0e8a@kernel.dk (mailing list archive)

Pull-request

git://git.kernel.dk/linux.git tags/epoll-min_ts-2022-12-08

Message

Jens Axboe Dec. 10, 2022, 3:36 p.m. UTC
Hi Linus,

I've had this done for months and posted it a few times, but it has
received little attention. I'm sending it out for inclusion now, as
having it caught up in upstream limbo is preventing further use cases
of it at Meta. We upstream every feature that we develop, and we don't
put any features into our kernel that aren't already upstream, or on
the way upstream. This is obviously especially important when an API
is involved.

This adds an epoll_ctl method for setting the minimum wait time for
retrieving events. In production, workloads don't just run mostly idle
or mostly full tilt; a common pattern is medium load. epoll_wait and
friends accept a cap on the max number of events and the max time we
want to wait for them, but there's no notion of min events or min time.
This leads to services getting only a single event per wakeup, even if
they are perfectly happy to wait e.g. 200 usec for more events to
arrive. Handling more events per wakeup is more efficient.
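The effect being described can be approximated in userspace, at the cost of the extra sleep/wakeup cycle that the kernel-side implementation avoids. `epoll_wait_minwait` is a hypothetical helper for illustration, not part of any API:

```c
#include <sys/epoll.h>
#include <time.h>

/*
 * Userspace sketch of the min-wait idea: after the first batch of
 * events arrives, linger for min_wait_us so more events can accumulate,
 * then reap again. The proposed kernel support achieves this without
 * the extra sleep and second syscall; this only illustrates the
 * intended behavior. Note that with level-triggered registrations the
 * second reap can report still-ready fds a second time, so this sketch
 * pairs best with EPOLLET.
 */
static int epoll_wait_minwait(int epfd, struct epoll_event *events,
                              int maxevents, int timeout_ms,
                              long min_wait_us)
{
    int n = epoll_wait(epfd, events, maxevents, timeout_ms);

    if (n <= 0 || n == maxevents)
        return n;

    /* Got a partial batch; wait min_wait_us for more to arrive. */
    struct timespec ts = { min_wait_us / 1000000,
                           (min_wait_us % 1000000) * 1000 };
    nanosleep(&ts, NULL);

    int more = epoll_wait(epfd, events + n, maxevents - n, 0);
    return more > 0 ? n + more : n;
}
```

The kernel version amortizes this over a single sleep: the task stays asleep during the min-wait window instead of waking once per event.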

The main patch has some numbers, but the tl;dr is that we see a nice
reduction in context switches per second, and a reduction in busy time
on such systems.

It has been suggested that a syscall should be available for this as
well; there are two main reasons why that wasn't pursued (though it
was investigated):

- This most likely should've been done as epoll_pwait3(), as we already
  have epoll_wait, epoll_pwait, and epoll_pwait2. The latter two are
  already at the max number of syscall arguments, so a new method would
  have to be done where a struct would define the API. With some
  arguments being optional, this could get inefficient or ugly (or both).

- The main reason is that Meta doesn't need it. By using epoll_ctl, the
  check-for-support-of-feature can be relegated to setup time rather
  than in the fast path, and the workloads we are looking at would not
  need different min wait settings within a single epoll context.

Please pull!


The following changes since commit ef4d3ea40565a781c25847e9cb96c1bd9f462bc6:

  afs: Fix server->active leak in afs_put_server (2022-11-30 10:02:37 -0800)

are available in the Git repository at:

  git://git.kernel.dk/linux.git tags/epoll-min_ts-2022-12-08

for you to fetch changes up to 73b9320234c0ad1b5e6f576abb796221eb088c64:

  eventpoll: ensure we pass back -EBADF for a bad file descriptor (2022-12-08 07:05:42 -0700)

----------------------------------------------------------------
epoll-min_ts-2022-12-08

----------------------------------------------------------------
Jens Axboe (8):
      eventpoll: cleanup branches around sleeping for events
      eventpoll: don't pass in 'timed_out' to ep_busy_loop()
      eventpoll: split out wait handling
      eventpoll: move expires to epoll_wq
      eventpoll: move file checking earlier for epoll_ctl()
      eventpoll: add support for min-wait
      eventpoll: add method for configuring minimum wait on epoll context
      eventpoll: ensure we pass back -EBADF for a bad file descriptor

 fs/eventpoll.c                 | 192 +++++++++++++++++++++++++++++++++--------
 include/linux/eventpoll.h      |   2 +-
 include/uapi/linux/eventpoll.h |   1 +
 3 files changed, 158 insertions(+), 37 deletions(-)

Comments

Willy Tarreau Dec. 10, 2022, 3:58 p.m. UTC | #1
Hi Jens,

On Sat, Dec 10, 2022 at 08:36:11AM -0700, Jens Axboe wrote:
> Hi Linus,
> 
> I've had this done for months and posted it a few times, but it has
> received little attention.

I personally think this is particularly cool, for having faced the
same needs in the past. I'm just wondering how long we'll avoid the
need for marking certain FDs as urgent (i.e. for inter-thread wakeup)
which would bypass the min delay.

I'm just seeing something a bit odd in this series:

> ----------------------------------------------------------------
> epoll-min_ts-2022-12-08
> 
> ----------------------------------------------------------------
> Jens Axboe (8):
>       eventpoll: cleanup branches around sleeping for events
>       eventpoll: don't pass in 'timed_out' to ep_busy_loop()
>       eventpoll: split out wait handling
>       eventpoll: move expires to epoll_wq
>       eventpoll: move file checking earlier for epoll_ctl()
>       eventpoll: add support for min-wait
>       eventpoll: add method for configuring minimum wait on epoll context
>       eventpoll: ensure we pass back -EBADF for a bad file descriptor

This last patch fixes a bug introduced by the 5th one. Why not squash it
instead of purposely introducing a bug then its fix ? Or maybe it was
just overlooked when you sent the PR ?

Thanks,
Willy
Jens Axboe Dec. 10, 2022, 4:05 p.m. UTC | #2
On 12/10/22 8:58 AM, Willy Tarreau wrote:
> Hi Jens,
> 
> On Sat, Dec 10, 2022 at 08:36:11AM -0700, Jens Axboe wrote:
>> Hi Linus,
>>
>> I've had this done for months and posted it a few times, but it has
>> received little attention.
> 
> I personally think this is particularly cool, for having faced the
> same needs in the past. I'm just wondering how long we'll avoid the
> need for marking certain FDs as urgent (i.e. for inter-thread wakeup)
> which would bypass the min delay.

Thanks! No opinion on urgent fds, it's not something I have looked
into...

> I'm just seeing something a bit odd in this series:
> 
>> ----------------------------------------------------------------
>> epoll-min_ts-2022-12-08
>>
>> ----------------------------------------------------------------
>> Jens Axboe (8):
>>       eventpoll: cleanup branches around sleeping for events
>>       eventpoll: don't pass in 'timed_out' to ep_busy_loop()
>>       eventpoll: split out wait handling
>>       eventpoll: move expires to epoll_wq
>>       eventpoll: move file checking earlier for epoll_ctl()
>>       eventpoll: add support for min-wait
>>       eventpoll: add method for configuring minimum wait on epoll context
>>       eventpoll: ensure we pass back -EBADF for a bad file descriptor
> 
> This last patch fixes a bug introduced by the 5th one. Why not squash it
> instead of purposely introducing a bug then its fix ? Or maybe it was
> just overlooked when you sent the PR ?

I didn't want to rebase it, so I just put the fix at the end. Not that
important imho, only issue there was an ltp case getting a wrong error
value. Hence didn't deem it important enough to warrant a rebase.
Willy Tarreau Dec. 10, 2022, 4:17 p.m. UTC | #3
On Sat, Dec 10, 2022 at 09:05:02AM -0700, Jens Axboe wrote:
> On 12/10/22 8:58 AM, Willy Tarreau wrote:
> > Hi Jens,
> > 
> > On Sat, Dec 10, 2022 at 08:36:11AM -0700, Jens Axboe wrote:
> >> Hi Linus,
> >>
> >> I've had this done for months and posted it a few times, but it has
> >> received little attention.
> > 
> > I personally think this is particularly cool, for having faced the
> > same needs in the past. I'm just wondering how long we'll avoid the
> > need for marking certain FDs as urgent (i.e. for inter-thread wakeup)
> > which would bypass the min delay.
> 
> Thanks! No opinion on urgent fds, it's not something I have looked
> into...

We'll see over time anyway :-)

> > This last patch fixes a bug introduced by the 5th one. Why not squash it
> > instead of purposely introducing a bug then its fix ? Or maybe it was
> > just overlooked when you sent the PR ?
> 
> I didn't want to rebase it, so I just put the fix at the end. Not that
> important imho, only issue there was an ltp case getting a wrong error
> value. Hence didn't deem it important enough to warrant a rebase.

OK. I tend to prefer making sure that a bisect session can never end up
in the middle of a patch set for a reason other than a yet-undiscovered
bug, that's why I was asking.

Thanks,
Willy
Jens Axboe Dec. 10, 2022, 4:23 p.m. UTC | #4
>>> This last patch fixes a bug introduced by the 5th one. Why not squash it
>>> instead of purposely introducing a bug then its fix ? Or maybe it was
>>> just overlooked when you sent the PR ?
>>
>> I didn't want to rebase it, so I just put the fix at the end. Not that
>> important imho, only issue there was an ltp case getting a wrong error
>> value. Hence didn't deem it important enough to warrant a rebase.
> 
> OK. I tend to prefer making sure that a bisect session can never end up
> in the middle of a patch set for a reason other than a yet-undiscovered
> bug, that's why I was asking.

If the bug in question were a complete malfunction, or a crash for
example, then I would certainly have squashed and rebased. But since
this one is really minor (checking for the return value in an error
condition), I didn't see it as important enough to do that. It's not
something you'd run into at runtime, unless you were running LTP...
Linus Torvalds Dec. 10, 2022, 6:51 p.m. UTC | #5
On Sat, Dec 10, 2022 at 7:36 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> This adds an epoll_ctl method for setting the minimum wait time for
> retrieving events.

So this is something very close to what the TTY layer has had forever,
and is useful (well... *was* useful) for pretty much the same reason.

However, let's learn from successful past interfaces: the tty layer
doesn't have just VTIME, it has VMIN too.

And I think they very much go hand in hand: you want for at least VMIN
events or for at most VTIME after the last event.

Yes, yes, you have that 'maxevents' thing, but that's not at all the
same as VMIN. That's just the buffer size.

Also note that the tty layer VTIME is *different* from what I think
your "minimum wait time" is. VTIME is an "inter-event timer", not a
"minimum total time". If new events keep on coming, the timer resets -
until either things time out, or you hit VMIN events.

I get the feeling that the tty layer did this right, and this epoll
series did not. The tty model certainly feels more flexible, and does
have decades of experience. tty traffic *used* to be just about the
lowest-latency traffic machines handled back when, so I think it might
be worth looking at as a model.

So I get the feeling that if you are adding some new "timeout for
multiple events" model to epoll, you should look at previous users.

And btw, the tty layer most definitely doesn't handle every possible case.

There are at least three different valid timeouts:

 (a) the "final timeout" that epoll already has (ie "in no case wait
more than this, even if there are no events")

 (b) the "max time we wait if we have at least one event" (your new "min_wait")

 (c) the "inter-event timeout" (tty layer VTIME)

and in addition to the timers, there's that whole "if I have gotten X
events, I have enough, so stop timing out" (tty layer VMIN).

And again, that "at least X events" should not be "this is my buffer
size". You may well want to have a *big* buffer for when there are
events queued up or the machine is just under very heavy load, but may
well feel like "if I got N events, I have enough to deal with, and
don't want to time out for any more".

Now, maybe there is some reason why the tty like VMIN/VTIME just isn't
relevant, but I do think that people have successfully used VMIN/VTIME
for long enough that it should be at least given some thought.

Terminal traffic may not be very relevant any more as a hard load to
deal with well. But it really used to be very much an area that had to
balance both throughput and latency concerns and had exactly the kinds
of issues you describe (ie "returning after one single character is
*much* too inefficient").

Hmm?

              Linus
Linus Torvalds Dec. 10, 2022, 7:26 p.m. UTC | #6
On Sat, Dec 10, 2022 at 10:51 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Now, maybe there is some reason why the tty like VMIN/VTIME just isn't
> relevant, but I do think that people have successfully used VMIN/VTIME
> for long enough that it should be at least given some thought.

Side note: another thing the tty layer model does is to make this be a
per-tty thing.

That's actually noticeable in regular 'poll()/select()' usage, so it
has interesting semantics: if VTIME is 0 (ie there is no inter-event
timeout), then poll/select will return "readable" only once you hit
VMIN characters.

Maybe this isn't relevant for the epoll() situation, but it might be
worth thinking about.

It's most definitely not obvious that any epoll() timeout should be
the same for different file descriptors.

Willy already mentioned "urgent file descriptors", and making these
things be per-fd would very naturally solve that whole situation too.

Again: I don't want to in any way force a "tty-like" solution. I'm
just saying that this kind of thing does have a long history, and I do
get the feeling that the tty solution is the more flexible one.

And while the tty model is "per tty" (it's obviously hidden in the
termios structure), any epoll equivalent would have to be different
(presumably per-event or something).

So I'm also not advocating some 1:1 equivalence, just bringing up the
whole "ttys do this similar thing but they seem to have a more
flexible model".

            Linus
Jens Axboe Dec. 11, 2022, 1:58 a.m. UTC | #7
On 12/10/22 11:51 AM, Linus Torvalds wrote:
> On Sat, Dec 10, 2022 at 7:36 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> This adds an epoll_ctl method for setting the minimum wait time for
>> retrieving events.
> 
> So this is something very close to what the TTY layer has had forever,
> and is useful (well... *was* useful) for pretty much the same reason.
> 
> However, let's learn from successful past interfaces: the tty layer
> doesn't have just VTIME, it has VMIN too.
> 
> And I think they very much go hand in hand: you want for at least VMIN
> events or for at most VTIME after the last event.

It has been suggested before too. A more modern example is how IRQ
coalescing works on eg nvme or nics. Those generally are of the nature
of "wait for X time, or until Y events are available". We can certainly
do something like that here too, it's just adding a minevents and
passing them in together.

I'll add that, really should be trivial, and resend later in the merge
window once we're happy with that.

> Yes, yes, you have that 'maxevents' thing, but that's not at all the
> same as VMIN. That's just the buffer size.

Right, the fact that maxevents is all we have means it's not very useful
for anything but "don't write beyond the size of the array I have for
events". io_uring has minevents as the general interface, and doesn't
really care about maxevents as it obviously doesn't need to copy events
anywhere.

> Also note that the tty layer VTIME is *different* from what I think
> your "minimum wait time" is. VTIME is an "inter-event timer", not a
> "minimum total time". If new events keep on coming, the timer resets -
> until either things time out, or you hit VMIN events.

Right, and I don't think that's what we want here. Some of the hw
coalescing works the same way, basically triggering the timeout once we
have received one event. But that makes it hard to manage the latency
if your budget is XX usec: now it becomes YY usec + XX usec instead, if
an event isn't immediately available.

> I get the feeling that the tty layer did this right, and this epoll
> series did not. The tty model certainly feels more flexible, and does
> have decades of experience. tty traffic *used* to be just about the
> lowest-latency traffic machines handled back when, so I think it might
> be worth looking at as a model.
> 
> So I get the feeling that if you are adding some new "timeout for
> multiple events" model to epoll, you should look at previous users.
> 
> And btw, the tty layer most definitely doesn't handle every possible case.
> 
> There are at least three different valid timeouts:
> 
>  (a) the "final timeout" that epoll already has (ie "in no case wait
> more than this, even if there are no events")
> 
>  (b) the "max time we wait if we have at least one event" (your new "min_wait")
> 
>  (c) the "inter-event timeout" (tty layer VTIME)
> 
> and in addition to the timers, there's that whole "if I have gotten X
> events, I have enough, so stop timing out" (tty layer VMIN).

I do like the VMIN idea and I think it makes sense. Using VMIN == 0 and
VTIME != 0 would give you the same behavior it has now; having both be
non-zero would make it an OR condition for when to exit.
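The exit condition being sketched here (with hypothetical minevents/min_wait naming; neither is an actual kernel API) would reduce to something like:

```c
#include <stdbool.h>

/*
 * Sketch of the combined exit test from the thread, not an actual
 * kernel interface. minevents == 0 with a nonzero min wait gives the
 * behavior of the posted series; both nonzero makes it an OR: stop on
 * whichever condition is met first.
 */
static bool min_wait_done(int nr_events, int minevents,
                          bool min_wait_expired)
{
    if (minevents && nr_events >= minevents)
        return true;            /* VMIN-style: enough events reaped */
    return min_wait_expired;    /* VTIME-style: min wait elapsed */
}
```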

> And again, that "at least X events" should not be "this is my buffer
> size". You may well want to have a *big* buffer for when there are
> events queued up or the machine is just under very heavy load, but may
> well feel like "if I got N events, I have enough to deal with, and
> don't want to time out for any more".

For epoll, maxevents already exists for this. Any call should reap
anything up to maxevents, not stop if minevents is met but more events
are available. Only maxevents should terminate the reaping of available
events.

> Now, maybe there is some reason why the tty like VMIN/VTIME just isn't
> relevant, but I do think that people have successfully used VMIN/VTIME
> for long enough that it should be at least given some thought.

Ah, it's close enough to things that are available now. As you
mentioned, there are really only so many different ways to do this once
you factor in a VMIN-style event count as well.
Jens Axboe Dec. 11, 2022, 2:03 a.m. UTC | #8
On 12/10/22 12:26 PM, Linus Torvalds wrote:
> On Sat, Dec 10, 2022 at 10:51 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Now, maybe there is some reason why the tty like VMIN/VTIME just isn't
>> relevant, but I do think that people have successfully used VMIN/VTIME
>> for long enough that it should be at least given some thought.
> 
> Side note: another thing the tty layer model does is to make this be a
> per-tty thing.
> 
> That's actually noticeable in regular 'poll()/select()' usage, so it
> has interesting semantics: if VTIME is 0 (ie there is no inter-event
> timeout), then poll/select will return "readable" only once you hit
> VMIN characters.
> 
> Maybe this isn't relevant for the epoll() situation, but it might be
> worth thinking about.

It really has to be per wait-index for epoll, which is the epoll
context...

> It's most definitely not obvious that any epoll() timeout should be
> the same for different file descriptors.

Certainly not, and that's where the syscall vs epoll context specific
discussion comes in. But I don't think you'll find many use cases where
this isn't a per epoll context kind of thing for networking.
Applications just don't mix and match like that and have wildly
different file descriptors in there. It's generally tens to hundreds of
thousands of sockets.

> Willy already mentioned "urgent file descriptors", and making these
> things be per-fd would very naturally solve that whole situation too.
> 
> Again: I don't want to in any way force a "tty-like" solution. I'm
> just saying that this kind of thing does have a long history, and I do
> get the feeling that the tty solution is the more flexible one.
> 
> And while the tty model is "per tty" (it's obviously hidden in the
> termios structure), any epoll equivalent would have to be different
> (presumably per-event or something).
> 
> So I'm also not advocating some 1:1 equivalence, just bringing up the
> whole "ttys do this similar thing but they seem to have a more
> flexible model".

Maybe this can be per-fd down the line when we have something like
urgent file descriptors. My hope there would be that we just use
io_uring for that; this series is very much just about eking out some
more performance from epoll until that transition can be made. I don't
have a lot of vested personal interest in improving epoll outside of
that, but this is a really big win that would be silly to throw away
while other, longer term transitions are happening.
Jens Axboe Dec. 11, 2022, 2:20 a.m. UTC | #9
On 12/10/22 6:58 PM, Jens Axboe wrote:
> On 12/10/22 11:51 AM, Linus Torvalds wrote:
>> On Sat, Dec 10, 2022 at 7:36 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> This adds an epoll_ctl method for setting the minimum wait time for
>>> retrieving events.
>>
>> So this is something very close to what the TTY layer has had forever,
>> and is useful (well... *was* useful) for pretty much the same reason.
>>
>> However, let's learn from successful past interfaces: the tty layer
>> doesn't have just VTIME, it has VMIN too.
>>
>> And I think they very much go hand in hand: you want for at least VMIN
>> events or for at most VTIME after the last event.
> 
> It has been suggested before too. A more modern example is how IRQ
> coalescing works on eg nvme or nics. Those generally are of the nature
> of "wait for X time, or until Y events are available". We can certainly
> do something like that here too, it's just adding a minevents and
> passing them in together.
> 
> I'll add that, really should be trivial, and resend later in the merge
> window once we're happy with that.

Took a quick look, and it's not that trivial. The problem is that you
have to wake the task to reap events anyway; this cannot be checked at
wakeup time. And then you lose the nice benefit of reducing the
context switch rate, which was a good chunk of the win here...

This can obviously very easily be done with io_uring, since that's
how it already works in terms of waiting. The min-wait part was done
separately there, though hasn't been posted or included upstream yet.

So now we're a bit stuck...
Jens Axboe Dec. 11, 2022, 2:31 a.m. UTC | #10
On 12/10/22 7:20 PM, Jens Axboe wrote:
> On 12/10/22 6:58 PM, Jens Axboe wrote:
>> On 12/10/22 11:51 AM, Linus Torvalds wrote:
>>> On Sat, Dec 10, 2022 at 7:36 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> This adds an epoll_ctl method for setting the minimum wait time for
>>>> retrieving events.
>>>
>>> So this is something very close to what the TTY layer has had forever,
>>> and is useful (well... *was* useful) for pretty much the same reason.
>>>
>>> However, let's learn from successful past interfaces: the tty layer
>>> doesn't have just VTIME, it has VMIN too.
>>>
>>> And I think they very much go hand in hand: you want for at least VMIN
>>> events or for at most VTIME after the last event.
>>
>> It has been suggested before too. A more modern example is how IRQ
>> coalescing works on eg nvme or nics. Those generally are of the nature
>> of "wait for X time, or until Y events are available". We can certainly
>> do something like that here too, it's just adding a minevents and
>> passing them in together.
>>
>> I'll add that, really should be trivial, and resend later in the merge
>> window once we're happy with that.
> 
> Took a quick look, and it's not that trivial. The problem is that you
> have to wake the task to reap events anyway; this cannot be checked at
> wakeup time. And then you lose the nice benefit of reducing the
> context switch rate, which was a good chunk of the win here...

One approximation we could make is that once we've done the first reap
of events (say we get N events, where N could be zero), the number of
wakeups after that point is a rough approximation of the number of
events that have arrived. We already use this to break out of min_wait
if we think we'll exceed maxevents. We could use that same metric to
estimate whether we've hit minevents as well. It would not be
guaranteed accurate, but probably good enough. Even if we didn't quite
hit minevents, we'd return rather than do another sleep and wakeup
cycle.
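As a sketch of that approximation (all names here are hypothetical; the real accounting would live in the epoll wait path):

```c
#include <stdbool.h>

/*
 * Sketch of the estimate described above: after the first reap returns
 * n_reaped events, count subsequent wakeups and treat each one as
 * roughly one newly arrived event. Stop early once the estimate
 * reaches minevents. This can be off (one wakeup may cover several
 * events), but it avoids extra sleep/wake cycles just to recount.
 */
struct minwait_state {
    int estimated;   /* events we believe have arrived */
    int minevents;   /* caller's minimum batch size */
};

static void minwait_first_reap(struct minwait_state *st, int n_reaped,
                               int minevents)
{
    st->estimated = n_reaped;
    st->minevents = minevents;
}

/*
 * Called on each wakeup during the min_wait window. Returns true when
 * we estimate minevents has been met and the task should stop waiting.
 */
static bool minwait_wakeup(struct minwait_state *st)
{
    st->estimated++;
    return st->estimated >= st->minevents;
}
```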