mbox series

[for-next,00/12] io_uring: retarget rsrc nodes periodically

Message ID 20221031134126.82928-1-dylany@meta.com (mailing list archive)
Headers show
Series io_uring: retarget rsrc nodes periodically | expand

Message

Dylan Yudaken Oct. 31, 2022, 1:41 p.m. UTC
There is a problem with long running io_uring requests and rsrc node
cleanup, where a single long running request can block cleanup of all
subsequent nodes. For example a network server may have both long running
accepts and also use registered files for connections. When sockets are
closed and returned (either through close_direct, or through
register_files_update) the underlying file will not be freed until that
original accept completes. For the case of multishot this may be the
lifetime of the application, which will cause file numbers to grow
unbounded - resulting in either OOMs or ENFILE errors.

To fix this I propose retargeting the rsrc nodes from ongoing requests to
the current main request of the io_uring. This only needs to happen for
long running requests types, and specifically those that happen as a
result of some external event. For this reason I exclude write/send style
ops for the time being as even though these can cause this issue in
reality it would be unexpected to have a write block for hours. This
support can obviously be added later if needed.

In order to retarget nodes all the outstanding requests (in both poll
tables and io-wq) need to be iterated and the request needs to be checked
to make sure the retargeting is valid. For example for FIXED_FILE requests
this involves ensuring the file is still referenced in the current node.
This O(N) operation seems to take ~1ms locally for 30k outstanding
requests. Note it locks the io_uring while it happens and so no other work
can occur. In order to amortize this cost slightly, I propose running this
operation at most every 60 seconds. It is hard coded currently, but would
be happy to take suggestions if this should be customizable (and how to do
such a thing).

Without customizable retargeting period, it's a bit difficult to submit
tests for this. I have a test but it obviously takes a many minutes to run
which is not going to be acceptable for liburing.

Patches 1-5 are the basic io_uring infrastructure
Patch 6 is a helper function used in the per op customisations
Patch 7 splits out the zerocopy specific field in io_sr_msg
Patches 8-12 are opcode implementations for retargeting

Dylan Yudaken (12):
  io_uring: infrastructure for retargeting rsrc nodes
  io_uring: io-wq helper to iterate all work
  io_uring: support retargeting rsrc on requests in the io-wq
  io_uring: reschedule retargeting at shutdown of ring
  io_uring: add tracing for io_uring_rsrc_retarget
  io_uring: add fixed file peeking function
  io_uring: split send_zc specific struct out of io_sr_msg
  io_uring: recv/recvmsg retarget_rsrc support
  io_uring: accept retarget_rsrc support
  io_uring: read retarget_rsrc support
  io_uring: read_fixed retarget_rsrc support
  io_uring: poll_add retarget_rsrc support

 include/linux/io_uring_types.h  |   2 +
 include/trace/events/io_uring.h |  30 +++++++
 io_uring/io-wq.c                |  49 +++++++++++
 io_uring/io-wq.h                |   3 +
 io_uring/io_uring.c             |  28 ++++--
 io_uring/io_uring.h             |   1 +
 io_uring/net.c                  | 114 ++++++++++++++++--------
 io_uring/net.h                  |   2 +
 io_uring/opdef.c                |   7 ++
 io_uring/opdef.h                |   1 +
 io_uring/poll.c                 |  12 +++
 io_uring/poll.h                 |   2 +
 io_uring/rsrc.c                 | 148 ++++++++++++++++++++++++++++++++
 io_uring/rsrc.h                 |   2 +
 io_uring/rw.c                   |  29 +++++++
 io_uring/rw.h                   |   2 +
 16 files changed, 390 insertions(+), 42 deletions(-)


base-commit: 30209debe98b6f66b13591e59e5272cb65b3945e

Comments

Pavel Begunkov Nov. 2, 2022, 11:44 a.m. UTC | #1
On 10/31/22 13:41, Dylan Yudaken wrote:
> There is a problem with long running io_uring requests and rsrc node
> cleanup, where a single long running request can block cleanup of all
> subsequent nodes. For example a network server may have both long running
> accepts and also use registered files for connections. When sockets are
> closed and returned (either through close_direct, or through
> register_files_update) the underlying file will not be freed until that
> original accept completes. For the case of multishot this may be the
> lifetime of the application, which will cause file numbers to grow
> unbounded - resulting in either OOMs or ENFILE errors.
> 
> To fix this I propose retargeting the rsrc nodes from ongoing requests to
> the current main request of the io_uring. This only needs to happen for
> long running requests types, and specifically those that happen as a
> result of some external event. For this reason I exclude write/send style
> ops for the time being as even though these can cause this issue in
> reality it would be unexpected to have a write block for hours. This
> support can obviously be added later if needed.

Is there a particular reason why it tries to retarget instead of
downgrading? Taking a file ref / etc. sounds more robust, e.g.
what if we send a lingering request and then remove the file
from the table? It also doesn't need caching the file index.


> In order to retarget nodes all the outstanding requests (in both poll
> tables and io-wq) need to be iterated and the request needs to be checked
> to make sure the retargeting is valid. For example for FIXED_FILE requests
> this involves ensuring the file is still referenced in the current node.
> This O(N) operation seems to take ~1ms locally for 30k outstanding
> requests. Note it locks the io_uring while it happens and so no other work
> can occur. In order to amortize this cost slightly, I propose running this
> operation at most every 60 seconds. It is hard coded currently, but would
> be happy to take suggestions if this should be customizable (and how to do
> such a thing).
> 
> Without customizable retargeting period, it's a bit difficult to submit
> tests for this. I have a test but it obviously takes a many minutes to run
> which is not going to be acceptable for liburing.

We may also want to trigger it if there are too many rsrc nodes queued

> Patches 1-5 are the basic io_uring infrastructure
> Patch 6 is a helper function used in the per op customisations
> Patch 7 splits out the zerocopy specific field in io_sr_msg
> Patches 8-12 are opcode implementations for retargeting
> 
> Dylan Yudaken (12):
>    io_uring: infrastructure for retargeting rsrc nodes
>    io_uring: io-wq helper to iterate all work
>    io_uring: support retargeting rsrc on requests in the io-wq
>    io_uring: reschedule retargeting at shutdown of ring
>    io_uring: add tracing for io_uring_rsrc_retarget
>    io_uring: add fixed file peeking function
>    io_uring: split send_zc specific struct out of io_sr_msg
>    io_uring: recv/recvmsg retarget_rsrc support
>    io_uring: accept retarget_rsrc support
>    io_uring: read retarget_rsrc support
>    io_uring: read_fixed retarget_rsrc support
>    io_uring: poll_add retarget_rsrc support
> 
>   include/linux/io_uring_types.h  |   2 +
>   include/trace/events/io_uring.h |  30 +++++++
>   io_uring/io-wq.c                |  49 +++++++++++
>   io_uring/io-wq.h                |   3 +
>   io_uring/io_uring.c             |  28 ++++--
>   io_uring/io_uring.h             |   1 +
>   io_uring/net.c                  | 114 ++++++++++++++++--------
>   io_uring/net.h                  |   2 +
>   io_uring/opdef.c                |   7 ++
>   io_uring/opdef.h                |   1 +
>   io_uring/poll.c                 |  12 +++
>   io_uring/poll.h                 |   2 +
>   io_uring/rsrc.c                 | 148 ++++++++++++++++++++++++++++++++
>   io_uring/rsrc.h                 |   2 +
>   io_uring/rw.c                   |  29 +++++++
>   io_uring/rw.h                   |   2 +
>   16 files changed, 390 insertions(+), 42 deletions(-)
> 
> 
> base-commit: 30209debe98b6f66b13591e59e5272cb65b3945e
Dylan Yudaken Nov. 2, 2022, 1:08 p.m. UTC | #2
On Wed, 2022-11-02 at 11:44 +0000, Pavel Begunkov wrote:
> On 10/31/22 13:41, Dylan Yudaken wrote:
> > There is a problem with long running io_uring requests and rsrc
> > node
> > cleanup, where a single long running request can block cleanup of
> > all
> > subsequent nodes. For example a network server may have both long
> > running
> > accepts and also use registered files for connections. When sockets
> > are
> > closed and returned (either through close_direct, or through
> > register_files_update) the underlying file will not be freed until
> > that
> > original accept completes. For the case of multishot this may be
> > the
> > lifetime of the application, which will cause file numbers to grow
> > unbounded - resulting in either OOMs or ENFILE errors.
> > 
> > To fix this I propose retargeting the rsrc nodes from ongoing
> > requests to
> > the current main request of the io_uring. This only needs to happen
> > for
> > long running requests types, and specifically those that happen as
> > a
> > result of some external event. For this reason I exclude write/send
> > style
> > ops for the time being as even though these can cause this issue in
> > reality it would be unexpected to have a write block for hours.
> > This
> > support can obviously be added later if needed.
> 
> Is there a particular reason why it tries to retarget instead of
> downgrading? Taking a file ref / etc. 

Downgrading could work - but it is less general as it will not work for
buffers (and whatever future resources get added to this system). If it
could avoid periodic work that would be good but I don't really see how
that would happen

> sounds more robust, e.g.
> what if we send a lingering request and then remove the file
> from the table? 

This will work as the file will no longer match, and so cannot
retarget. 


> It also doesn't need caching the file index.

Yeah that would be a big benefit. Perhaps we should do that for some
ops (accept/poll) and retarget for others that can have fixed buffers?
It will make the code a bit simpler too for the simple ops. 

That being said removing the REQ_F_FIXED_FILE flag from the req sounds
like it might be problematic as flags had historically been constant?

> 
> 
> > In order to retarget nodes all the outstanding requests (in both
> > poll
> > tables and io-wq) need to be iterated and the request needs to be
> > checked
> > to make sure the retargeting is valid. For example for FIXED_FILE
> > requests
> > this involves ensuring the file is still referenced in the current
> > node.
> > This O(N) operation seems to take ~1ms locally for 30k outstanding
> > requests. Note it locks the io_uring while it happens and so no
> > other work
> > can occur. In order to amortize this cost slightly, I propose
> > running this
> > operation at most every 60 seconds. It is hard coded currently, but
> > would
> > be happy to take suggestions if this should be customizable (and
> > how to do
> > such a thing).
> > 
> > Without customizable retargeting period, it's a bit difficult to
> > submit
> > tests for this. I have a test but it obviously takes a many minutes
> > to run
> > which is not going to be acceptable for liburing.
> 
> We may also want to trigger it if there are too many rsrc nodes
> queued

In my testing this hasn't really been necessary as the individual nodes
do not take up too much space. But I am happy to add this if you think
it will help.


Also - thanks for the patch reviews, will get those into a v2 once the
general approach is ironed out.

Dylan