[v3,00/55] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES)

Message ID 20230331160914.1608208-1-dhowells@redhat.com (mailing list archive)

David Howells March 31, 2023, 4:08 p.m. UTC
Hi Willy, Dave, et al.,

I've been looking at how to make pipes handle the splicing in of multipage
folios and also looking to see if I could implement a suggestion from Willy
that pipe_buffers could perhaps hold a list of pages (which could make
splicing simpler - an entire splice segment would go in a single
pipe_buffer).

There are a couple of issues here:

 (1) Gifting/stealing a multipage folio is really tricky.  I think that if
     a multipage folio is gifted, the gift flag should be quietly dropped.
     Userspace has no control over what splice() and vmsplice() will see in
     the pagecache.

 (2) The sendpage op expects to be given a single page and various network
     protocols just attach that to a socket buffer.

This patchset aims to deal with the second by removing the ->sendpage()
operation and replacing it with sendmsg() and a new internal flag
MSG_SPLICE_PAGES.  As sendmsg() takes an I/O iterator, this also affords
the opportunity to pass a slew of pages in one go, rather than one at a
time.

If MSG_SPLICE_PAGES is set, the protocol sendmsg() instance will attempt to
splice the pages out of the buffer.  Any page that can't be spliced (e.g.
because it belongs to the slab) is copied into an individual fragment
instead.
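
As a rough illustration of that per-page fallback, here is a userspace toy
model, not kernel code.  The types src_page and frag and the functions
build_frags() and release_frags() are hypothetical names invented for this
sketch; the real series works on pages, bio_vecs and socket buffers.  The
only thing the sketch is meant to show is the decision: reference the
source memory when it may be spliced, make a private copy when it may not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; none of these are kernel types. */
struct src_page {
	const char *data;
	size_t len;
	bool spliceable;	/* false models e.g. a slab-backed page */
};

struct frag {
	const char *data;
	size_t len;
	bool copied;		/* true if data is a private copy owned by the frag */
};

/* Free the private copies held by the first n fragments. */
void release_frags(struct frag *frags, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (frags[i].copied)
			free((void *)frags[i].data);
}

/*
 * Build one fragment per source page.  Spliceable pages are referenced in
 * place (zero-copy); the others are copied.  Returns 0 on success or -1 if
 * an allocation fails, in which case nothing is left allocated.
 */
int build_frags(const struct src_page *pages, size_t n, struct frag *frags)
{
	assert(pages && frags);

	for (size_t i = 0; i < n; i++) {
		frags[i].len = pages[i].len;
		if (pages[i].spliceable) {
			frags[i].data = pages[i].data;
			frags[i].copied = false;
			continue;
		}

		/* Avoid malloc(0), which may legitimately return NULL. */
		char *copy = malloc(pages[i].len ? pages[i].len : 1);
		if (!copy) {
			release_frags(frags, i);
			return -1;
		}
		memcpy(copy, pages[i].data, pages[i].len);
		frags[i].data = copy;
		frags[i].copied = true;
	}
	return 0;
}
```

The fallback is decided page by page, so one unspliceable page in a batch
does not force the rest of the batch to be copied.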

The patchset consists of the following parts:

 (1) A couple of fixes.

 (2) Define the MSG_SPLICE_PAGES flag.

 (3) The page_frag_alloc_align() allocator is overhauled:

     (a) Split it out from mm/page_alloc.c into its own file,
     mm/page_frag_alloc.c.

     (b) Make it use multipage folios rather than compound pages.

     (c) Give it per-cpu buckets to allocate from so no locking is
     required.

     (d) The netdev_alloc_cache and the napi fragment cache are then cast
     in terms of this and some private allocators are removed.

     I'm not sure that the existing allocator is 100% multithread safe.

 (4) Implement MSG_SPLICE_PAGES support in TCP.

 (5) Make MSG_SPLICE_PAGES copy unspliceable pages (e.g. slab pages).

 (6) Make do_tcp_sendpages() just wrap sendmsg() and then fold it in to its
     various callers.

 (7) Implement MSG_SPLICE_PAGES support in IP and make udp_sendpage() just
     a wrapper around sendmsg().

 (8) Make IP/UDP copy unspliceable pages.

 (9) Implement MSG_SPLICE_PAGES support in AF_UNIX.

(10) Make AF_UNIX copy unspliceable pages.

(11) Make AF_ALG use netfs_extract_iter_to_sg().

(12) Make AF_ALG implement MSG_SPLICE_PAGES and make af_alg_sendpage() just
     a wrapper around sendmsg().

(13) Make AF_ALG/hash implement MSG_SPLICE_PAGES.

(14) Make TLS implement MSG_SPLICE_PAGES and make its sendpage
     implementations just a wrapper.

     [!] Note that tls_sw_sendpage_locked() appears to have the wrong
         locking upstream.  I think the caller will only hold the socket
         lock, but it should hold tls_ctx->tx_lock too.

(15) Make Chelsio's chtls implement MSG_SPLICE_PAGES.

(16) Make AF_KCM implement MSG_SPLICE_PAGES.

(17) Rename pipe_to_sendpage() to pipe_to_sendmsg() and make it a wrapper
     around sendmsg().

(18) Replace splice_to_socket() with an implementation that doesn't use
     splice_from_pipe() to push one page at a time, but rather something
     that splices up to 16 pages at once.  This absorbs pipe_to_sendmsg().

(19) Remove sendpage file operation.

(20) Convert siw, ceph, iscsi and tcp_bpf to use sendmsg() instead of
     tcp_sendpage().

(21) Make skb_send_sock() use sendmsg().

(22) Convert ceph, rds, dlm, sunrpc, nvme, kcm, smc, ocfs2 and drbd to use
     sendmsg().

(23) Make drbd pass an entire bio's bvec to sendmsg at a time and delegate
     copying of unspliceable pages (such as slab pages) to TCP.

(24) Remove the sendpage socket operation.

I've killed off all uses of kernel_sendpage() and all uses of sendpage_ok()
outside of the protocols.

I have tested AF_UNIX splicing - which, surprisingly, seems nearly twice as
fast - TCP splicing, the siw driver (softIWarp RDMA with nfs and cifs),
sunrpc (with nfsd), UDP (using a patched rxrpc) and TLS/sw.
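
For anyone wanting to poke at the pipe-to-socket path from userspace, a
minimal sketch of such a check might look like the following.  It is not
the test program used for the results above; splice_roundtrip() is a
hypothetical helper written for this illustration.  It writes a small
buffer into a pipe, splices the pipe into one end of an AF_UNIX stream
socketpair and compares what the other end receives.  It assumes the
payload is small enough (at most 4096 bytes here) that neither the pipe nor
the socket buffer fills up and blocks.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Send len bytes of buf through pipe -> splice() -> AF_UNIX socket and read
 * them back from the peer.  Returns 0 if the peer saw the same bytes, -1 on
 * any error or mismatch.
 */
int splice_roundtrip(const char *buf, size_t len)
{
	int pfd[2], sv[2];
	char out[4096];
	size_t sent = 0, got = 0;
	int ret = -1;

	assert(buf && len > 0 && len <= sizeof(out));

	if (pipe(pfd) < 0)
		return -1;
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
		close(pfd[0]);
		close(pfd[1]);
		return -1;
	}

	/* Load the data into the pipe. */
	if (write(pfd[1], buf, len) != (ssize_t)len)
		goto out;

	/* Move it from the pipe into the socket without a userspace copy. */
	while (sent < len) {
		ssize_t n = splice(pfd[0], NULL, sv[0], NULL, len - sent, 0);

		if (n <= 0)
			goto out;
		sent += (size_t)n;
	}

	/* Drain the peer end and compare. */
	while (got < len) {
		ssize_t n = read(sv[1], out + got, len - got);

		if (n <= 0)
			goto out;
		got += (size_t)n;
	}

	if (memcmp(buf, out, len) == 0)
		ret = 0;
out:
	close(pfd[0]);
	close(pfd[1]);
	close(sv[0]);
	close(sv[1]);
	return ret;
}
```

Calling splice_roundtrip("hello, splice", 13) should return 0 on a working
kernel.  This only checks correctness; it says nothing about throughput.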

I've pushed the patches here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-sendpage

David

Changes
=======
ver #3)
 - Dropped the iterator-of-iterators patch.
 - Only expunge MSG_SPLICE_PAGES in sys_send[m]msg, not sys_recv[m]msg.
 - Split MSG_SPLICE_PAGES code in __ip_append_data() out into helper
   functions.
 - Implement MSG_SPLICE_PAGES support in __ip6_append_data() using the
   above helper functions.
 - Rename 'xlength' to 'initial_length'.
 - Minimise the changes to sunrpc for the moment.
 - Don't give -EOPNOTSUPP if NETIF_F_SG not available, just copy instead.
 - Implemented MSG_SPLICE_PAGES support in the TLS, Chelsio-TLS and AF_KCM
   code.
 
ver #2)
 - Overhauled the page_frag_alloc() allocator: large folios and per-cpu.
   - Got rid of my own zerocopy allocator.
 - Use iov_iter_extract_pages() rather than poking in iter->bvec.
 - Made page splicing fall back to page copying on a page-by-page basis.
 - Made splice_to_socket() pass 16 pipe buffers at a time.
 - Made AF_ALG/hash use finup/digest where possible in sendmsg.
 - Added an iterator-of-iterators, ITER_ITERLIST.
 - Made sunrpc use the iterator-of-iterators.
 - Converted more drivers.

Link: https://lore.kernel.org/r/20230316152618.711970-1-dhowells@redhat.com/ # v1
Link: https://lore.kernel.org/r/20230329141354.516864-1-dhowells@redhat.com/ # v2

David Howells (55):
  netfs: Fix netfs_extract_iter_to_sg() for ITER_UBUF/IOVEC
  iov_iter: Remove last_offset member
  net: Declare MSG_SPLICE_PAGES internal sendmsg() flag
  mm: Move the page fragment allocator from page_alloc.c into its own
    file
  mm: Make the page_frag_cache allocator use multipage folios
  mm: Make the page_frag_cache allocator use per-cpu
  tcp: Support MSG_SPLICE_PAGES
  tcp: Make sendmsg(MSG_SPLICE_PAGES) copy unspliceable data
  tcp: Convert do_tcp_sendpages() to use MSG_SPLICE_PAGES
  tcp_bpf: Inline do_tcp_sendpages as it's now a wrapper around
    tcp_sendmsg
  espintcp: Inline do_tcp_sendpages()
  tls: Inline do_tcp_sendpages()
  siw: Inline do_tcp_sendpages()
  tcp: Fold do_tcp_sendpages() into tcp_sendpage_locked()
  ip, udp: Support MSG_SPLICE_PAGES
  ip, udp: Make sendmsg(MSG_SPLICE_PAGES) copy unspliceable data
  ip6, udp6: Support MSG_SPLICE_PAGES
  udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES
  af_unix: Support MSG_SPLICE_PAGES
  af_unix: Make sendmsg(MSG_SPLICE_PAGES) copy unspliceable data
  crypto: af_alg: Pin pages rather than ref'ing if appropriate
  crypto: af_alg: Use netfs_extract_iter_to_sg() to create scatterlists
  crypto: af_alg: Indent the loop in af_alg_sendmsg()
  crypto: af_alg: Support MSG_SPLICE_PAGES
  crypto: af_alg: Convert af_alg_sendpage() to use MSG_SPLICE_PAGES
  crypto: af_alg/hash: Support MSG_SPLICE_PAGES
  tls/device: Support MSG_SPLICE_PAGES
  tls/device: Convert tls_device_sendpage() to use MSG_SPLICE_PAGES
  tls/sw: Support MSG_SPLICE_PAGES
  tls/sw: Convert tls_sw_sendpage() to use MSG_SPLICE_PAGES
  chelsio: Support MSG_SPLICE_PAGES
  chelsio: Convert chtls_sendpage() to use MSG_SPLICE_PAGES
  kcm: Support MSG_SPLICE_PAGES
  kcm: Convert kcm_sendpage() to use MSG_SPLICE_PAGES
  splice, net: Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage()
  splice, net: Reimplement splice_to_socket() to pass multiple bufs to
    sendmsg()
  Remove file->f_op->sendpage
  siw: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage to transmit
  ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
  iscsi: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
  iscsi: Assume "sendpage" is okay in iscsi_tcp_segment_map()
  tcp_bpf: Make tcp_bpf_sendpage() go through
    tcp_bpf_sendmsg(MSG_SPLICE_PAGES)
  net: Use sendmsg(MSG_SPLICE_PAGES) not sendpage in skb_send_sock()
  algif: Remove hash_sendpage*()
  ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage()
  rds: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
  dlm: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
  sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage
  nvme: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage
  kcm: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage
  smc: Drop smc_sendpage() in favour of smc_sendmsg() + MSG_SPLICE_PAGES
  ocfs2: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage()
  drbd: Use sendmsg(MSG_SPLICE_PAGES) rather than sendmsg()
  drdb: Send an entire bio in a single sendmsg
  sock: Remove ->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES)

 Documentation/networking/scaling.rst          |   4 +-
 crypto/Kconfig                                |   1 +
 crypto/af_alg.c                               | 194 +++++--------
 crypto/algif_aead.c                           |  52 ++--
 crypto/algif_hash.c                           | 171 +++++------
 crypto/algif_rng.c                            |   2 -
 crypto/algif_skcipher.c                       |  24 +-
 drivers/block/drbd/drbd_main.c                |  86 ++----
 drivers/infiniband/sw/siw/siw_qp_tx.c         | 227 +++------------
 .../chelsio/inline_crypto/chtls/chtls.h       |   2 -
 .../chelsio/inline_crypto/chtls/chtls_io.c    | 169 ++++-------
 .../chelsio/inline_crypto/chtls/chtls_main.c  |   1 -
 drivers/net/ethernet/mediatek/mtk_wed_wo.c    |  19 +-
 drivers/net/ethernet/mediatek/mtk_wed_wo.h    |   2 -
 drivers/nvme/host/tcp.c                       |  63 ++--
 drivers/nvme/target/tcp.c                     |  69 +++--
 drivers/scsi/iscsi_tcp.c                      |  31 +-
 drivers/scsi/iscsi_tcp.h                      |   2 +-
 drivers/scsi/libiscsi_tcp.c                   |  13 +-
 drivers/target/iscsi/iscsi_target_util.c      |  14 +-
 fs/dlm/lowcomms.c                             |  10 +-
 fs/netfs/iterator.c                           |   2 +-
 fs/ocfs2/cluster/tcp.c                        | 107 +++----
 fs/splice.c                                   | 158 ++++++++--
 include/crypto/if_alg.h                       |   7 +-
 include/linux/fs.h                            |   3 -
 include/linux/gfp.h                           |  17 +-
 include/linux/mm_types.h                      |  13 +-
 include/linux/net.h                           |   8 -
 include/linux/socket.h                        |   3 +
 include/linux/splice.h                        |   2 +
 include/linux/sunrpc/svc.h                    |  11 +-
 include/linux/uio.h                           |   5 +-
 include/net/inet_common.h                     |   2 -
 include/net/ip.h                              |   4 +
 include/net/sock.h                            |   6 -
 include/net/tcp.h                             |   2 -
 include/net/tls.h                             |   2 +-
 mm/Makefile                                   |   2 +-
 mm/page_alloc.c                               | 126 --------
 mm/page_frag_alloc.c                          | 201 +++++++++++++
 net/appletalk/ddp.c                           |   1 -
 net/atm/pvc.c                                 |   1 -
 net/atm/svc.c                                 |   1 -
 net/ax25/af_ax25.c                            |   1 -
 net/caif/caif_socket.c                        |   2 -
 net/can/bcm.c                                 |   1 -
 net/can/isotp.c                               |   1 -
 net/can/j1939/socket.c                        |   1 -
 net/can/raw.c                                 |   1 -
 net/ceph/messenger_v1.c                       |  58 ++--
 net/ceph/messenger_v2.c                       |  89 ++----
 net/core/skbuff.c                             |  81 +++---
 net/core/sock.c                               |  35 +--
 net/dccp/ipv4.c                               |   1 -
 net/dccp/ipv6.c                               |   1 -
 net/ieee802154/socket.c                       |   2 -
 net/ipv4/af_inet.c                            |  21 --
 net/ipv4/ip_output.c                          | 122 +++++++-
 net/ipv4/tcp.c                                | 274 ++++++------------
 net/ipv4/tcp_bpf.c                            |  72 +----
 net/ipv4/tcp_ipv4.c                           |   1 -
 net/ipv4/udp.c                                |  54 ----
 net/ipv4/udp_impl.h                           |   2 -
 net/ipv4/udplite.c                            |   1 -
 net/ipv6/af_inet6.c                           |   3 -
 net/ipv6/ip6_output.c                         |  28 +-
 net/ipv6/raw.c                                |   1 -
 net/ipv6/tcp_ipv6.c                           |   1 -
 net/kcm/kcmsock.c                             | 249 ++++++----------
 net/key/af_key.c                              |   1 -
 net/l2tp/l2tp_ip.c                            |   1 -
 net/l2tp/l2tp_ip6.c                           |   1 -
 net/llc/af_llc.c                              |   1 -
 net/mctp/af_mctp.c                            |   1 -
 net/mptcp/protocol.c                          |   2 -
 net/netlink/af_netlink.c                      |   1 -
 net/netrom/af_netrom.c                        |   1 -
 net/packet/af_packet.c                        |   2 -
 net/phonet/socket.c                           |   2 -
 net/qrtr/af_qrtr.c                            |   1 -
 net/rds/af_rds.c                              |   1 -
 net/rds/tcp_send.c                            |  86 +++---
 net/rose/af_rose.c                            |   1 -
 net/rxrpc/af_rxrpc.c                          |   1 -
 net/sctp/protocol.c                           |   1 -
 net/smc/af_smc.c                              |  29 --
 net/smc/smc_stats.c                           |   2 +-
 net/smc/smc_stats.h                           |   1 -
 net/smc/smc_tx.c                              |  16 -
 net/smc/smc_tx.h                              |   2 -
 net/socket.c                                  |  76 +----
 net/sunrpc/svcsock.c                          |  38 +--
 net/tipc/socket.c                             |   3 -
 net/tls/tls_device.c                          |  91 +++---
 net/tls/tls_main.c                            |  31 +-
 net/tls/tls_sw.c                              | 215 ++++++--------
 net/unix/af_unix.c                            | 254 +++++++---------
 net/vmw_vsock/af_vsock.c                      |   3 -
 net/x25/af_x25.c                              |   1 -
 net/xdp/xsk.c                                 |   1 -
 net/xfrm/espintcp.c                           |  10 +-
 102 files changed, 1519 insertions(+), 2301 deletions(-)
 create mode 100644 mm/page_frag_alloc.c

Comments

Christoph Hellwig April 3, 2023, 9:30 a.m. UTC | #1
On Fri, Mar 31, 2023 at 05:08:19PM +0100, David Howells wrote:
> Hi Willy, Dave, et al.,

Can we please finish the previous big API transitions before starting
yet another one?  We still have 10 callers of iov_iter_get_pages2 and
iov_iter_get_pages_alloc2 that need conversion to
iov_iter_extract_pages.  Except for the legacy direct I/O code they
should be easy _and_ largely overlap what this series touches.

I'll gladly take on direct-io.c.