Message ID | 20240603003306.2030491-1-kent.overstreet@linux.dev (mailing list archive) |
---|---|
Series | sys_ringbuffer |
On Sun, Jun 02, 2024 at 08:32:57PM -0400, Kent Overstreet wrote:
> New syscall for mapping generic ringbuffers for arbitrary (supported)
> file descriptors.
>
> Ringbuffers can be created either when requested or at file open time,
> and can be mapped into multiple address spaces (naturally, since files
> can be shared as well).
>
> Initial motivation is for fuse, but I plan on adding support to pipes
> and possibly sockets as well - pipes are a particularly interesting use
> case, because if both the sender and receiver of a pipe opt in to the
> new ringbuffer interface, we can make them the _same_ ringbuffer for
> true zero copy IO, while being backwards compatible with existing pipes.

Hi Kent,
I recently came across a similar use case where the ability to "upgrade"
an fd into a more efficient interface would be useful, as in the pipe
scenario you describe.

My use case is a block device using the ublk driver. ublk lets userspace
servers implement block devices. ublk is great when compatibility is
required with applications that expect block device fds, but when an
application is willing to implement a shared memory interface to
communicate directly with the ublk server, going through a block device
is inefficient.

In my case the application is QEMU, where the virtual machine runs a
virtio-blk driver that could talk directly to the ublk server via
vhost-user-blk. vhost-user-blk is a protocol that allows the virtual
machine to talk directly to the ublk server via shared memory without
going through QEMU or the host kernel block layer.

QEMU would need a way to upgrade from a ublk block device file to a
vhost-user socket. Just like in your pipe example, this approach relies
on being able to go gracefully from a "compatibility" fd to a more
efficient interface when both sides support the feature.

The generic ringbuffer approach in this series would not work for the
vhost-user protocol because the client must be able to provide its own
memory, and file descriptor passing is needed in general. The protocol
spec is here:
https://gitlab.com/qemu-project/qemu/-/blob/master/docs/interop/vhost-user.rst

A different way to approach the fd upgrading problem is to treat this as
an AF_UNIX connectivity feature rather than a new ring buffer API.
Imagine adding a new address type to AF_UNIX for looking up connections
in a struct file (e.g. the pipe fd) instead of on the file system (or
the other AF_UNIX address types).

The first program creates the pipe and also an AF_UNIX socket. It calls
bind(2) on the socket with the sockaddr_un path
"/dev/self/fd/<fd>/<discriminator>" where fd is a pipe fd and
discriminator is a string like "ring-buffer" that describes the
service/protocol. The AF_UNIX kernel code parses this special path and
stores an association with the pipe file for future connect(2) calls.
The program listens on the AF_UNIX socket and then continues doing its
stuff.

The second program runs and inherits the pipe fd on stdin. It creates an
AF_UNIX socket and attempts to connect(2) to
"/dev/self/fd/0/ring-buffer". The AF_UNIX kernel code parses this
special path and establishes a connection between the connecting and
listening sockets inside the pipe fd's struct file. If connect(2) fails,
the second program knows that this is an ordinary pipe that does not
support upgrading to ring buffer operation.

Now the AF_UNIX socket can be used to pass shared memory for the ring
buffer and futexes.

This AF_UNIX approach also works for my ublk block device to
vhost-user-blk upgrade use case. It does not require a new ring buffer
API but instead involves extending AF_UNIX.

You have more use cases than just the pipe scenario, and maybe my
half-baked idea won't cover all of them, but I wanted to see what you
think.

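The last step of the flow above (passing shared memory over the AF_UNIX
socket once the two sides are connected) can be sketched in userspace
today. This is only an illustration under stated assumptions: the
special "/dev/self/fd/..." address type is a proposal and does not
exist, so socketpair(2) stands in for the bind/connect handshake, and an
unlinked temporary file stands in for whatever memory the client would
provide (memfd_create(2) would be the more usual choice on Linux). The
names send_fd, recv_fd, demo_upgrade and RING_BYTES are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

#define RING_BYTES 4096 /* hypothetical size of the shared region */

/* Control-message buffer, aligned for struct cmsghdr. */
union fd_ctrl {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send one file descriptor over a connected AF_UNIX socket (SCM_RIGHTS). */
static int send_fd(int sock, int fd)
{
	char byte = 'F'; /* at least one data byte must accompany the fd */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_ctrl ctrl;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&ctrl, 0, sizeof(ctrl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor; returns the new fd or -1 on failure. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_ctrl ctrl;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&ctrl, 0, sizeof(ctrl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}

/*
 * Both "programs" live in one process here to keep the sketch short.
 * Returns 0 if data written through one mapping is visible through a
 * mapping of the fd that was passed over the socket.
 */
static int demo_upgrade(void)
{
	int sv[2], memfd, peer_fd, ok;
	char path[] = "/tmp/ring-XXXXXX";
	char *server, *client;

	/* Stand-in for the proposed bind(2)/connect(2) on the pipe fd. */
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;

	/* Stand-in for client-provided shared memory. */
	memfd = mkstemp(path);
	if (memfd < 0)
		return -1;
	unlink(path);
	if (ftruncate(memfd, RING_BYTES) < 0)
		return -1;

	if (send_fd(sv[0], memfd) < 0)
		return -1;
	peer_fd = recv_fd(sv[1]);
	if (peer_fd < 0)
		return -1;

	server = mmap(NULL, RING_BYTES, PROT_READ | PROT_WRITE, MAP_SHARED, memfd, 0);
	client = mmap(NULL, RING_BYTES, PROT_READ | PROT_WRITE, MAP_SHARED, peer_fd, 0);
	if (server == MAP_FAILED || client == MAP_FAILED)
		return -1;

	strcpy(server, "hello");
	ok = strcmp(client, "hello") == 0;

	munmap(server, RING_BYTES);
	munmap(client, RING_BYTES);
	close(memfd);
	close(peer_fd);
	close(sv[0]);
	close(sv[1]);
	return ok ? 0 : -1;
}
```

Calling demo_upgrade() from main() is enough to try it. In the proposed
scheme, a failed connect(2) would take the place of the error returns
here, and the caller would fall back to ordinary read(2)/write(2) on the
pipe.
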
Stefan

> The ringbuffer_wait and ringbuffer_wakeup syscalls are probably going
> away in a future iteration, in favor of just using futexes.
>
> In my testing, reading/writing from the ringbuffer 16 bytes at a time is
> ~7x faster than using read/write syscalls - and I was testing with
> mitigations off, real world benefit will be even higher.
>
> Kent Overstreet (5):
>   darray: lift from bcachefs
>   darray: Fix darray_for_each_reverse() when darray is empty
>   fs: sys_ringbuffer
>   ringbuffer: Test device
>   ringbuffer: Userspace test helper
>
>  MAINTAINERS                             |   7 +
>  arch/x86/entry/syscalls/syscall_32.tbl  |   3 +
>  arch/x86/entry/syscalls/syscall_64.tbl  |   3 +
>  fs/Makefile                             |   2 +
>  fs/bcachefs/Makefile                    |   1 -
>  fs/bcachefs/btree_types.h               |   2 +-
>  fs/bcachefs/btree_update.c              |   2 +
>  fs/bcachefs/btree_write_buffer_types.h  |   2 +-
>  fs/bcachefs/fsck.c                      |   2 +-
>  fs/bcachefs/journal_io.h                |   2 +-
>  fs/bcachefs/journal_sb.c                |   2 +-
>  fs/bcachefs/sb-downgrade.c              |   3 +-
>  fs/bcachefs/sb-errors_types.h           |   2 +-
>  fs/bcachefs/sb-members.h                |   3 +-
>  fs/bcachefs/subvolume.h                 |   1 -
>  fs/bcachefs/subvolume_types.h           |   2 +-
>  fs/bcachefs/thread_with_file_types.h    |   2 +-
>  fs/bcachefs/util.h                      |  28 +-
>  fs/ringbuffer.c                         | 474 ++++++++++++++++++++++++
>  fs/ringbuffer_test.c                    | 209 +++++++++++
>  {fs/bcachefs => include/linux}/darray.h |  61 +--
>  include/linux/darray_types.h            |  22 ++
>  include/linux/fs.h                      |   2 +
>  include/linux/mm_types.h                |   4 +
>  include/linux/ringbuffer_sys.h          |  18 +
>  include/uapi/linux/futex.h              |   1 +
>  include/uapi/linux/ringbuffer_sys.h     |  40 ++
>  init/Kconfig                            |   9 +
>  kernel/fork.c                           |   2 +
>  lib/Kconfig.debug                       |   5 +
>  lib/Makefile                            |   2 +-
>  {fs/bcachefs => lib}/darray.c           |  12 +-
>  tools/ringbuffer/Makefile               |   3 +
>  tools/ringbuffer/ringbuffer-test.c      | 254 +++++++++++++
>  34 files changed, 1125 insertions(+), 62 deletions(-)
>  create mode 100644 fs/ringbuffer.c
>  create mode 100644 fs/ringbuffer_test.c
>  rename {fs/bcachefs => include/linux}/darray.h (63%)
>  create mode 100644 include/linux/darray_types.h
>  create mode 100644 include/linux/ringbuffer_sys.h
>  create mode 100644 include/uapi/linux/ringbuffer_sys.h
>  rename {fs/bcachefs => lib}/darray.c (56%)
>  create mode 100644 tools/ringbuffer/Makefile
>  create mode 100644 tools/ringbuffer/ringbuffer-test.c
>
> --
> 2.45.1
>