From: Arseniy Krasnov
To: Stefano Garzarella, Stefan Hajnoczi, "Michael S. Tsirkin",
    Jason Wang, "David S. Miller", "Jakub Kicinski", Paolo Abeni
CC: "linux-kernel@vger.kernel.org", "kvm@vger.kernel.org",
    "virtualization@lists.linux-foundation.org", "netdev@vger.kernel.org",
    kernel, Krasnov Arseniy
Subject: [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive
Date: Fri, 3 Jun 2022 05:27:56 +0000
X-Patchwork-Submitter: Arseniy Krasnov
X-Patchwork-Id: 12868533
X-Mailing-List: kvm@vger.kernel.org

                              INTRODUCTION

Hello,

this is an experimental implementation of virtio vsock zerocopy
receive. It was inspired by the TCP zerocopy receive by Eric Dumazet.
This API uses the same idea: call 'mmap()' on the socket's descriptor,
then every 'getsockopt()' call fills the provided vma area with pages
of virtio RX buffers. After the received data has been processed by the
user, the pages must be freed by a 'madvise()' call with the
MADV_DONTNEED flag set (if the user doesn't call 'madvise()', the next
'getsockopt()' will fail).

                                 DETAILS

Here is exactly how the mapping with mapped pages looks: the first page
of the mapping contains an array of trimmed virtio vsock packet headers
(each contains only the length of the data on the corresponding pages
and the 'flags' field):

    struct virtio_vsock_usr_hdr {
        uint32_t length;
        uint32_t flags;
        uint32_t copy_len;
    };

The 'length' field tells the user the exact size of the payload within
each sequence of pages, and 'flags' allows the user to handle
SOCK_SEQPACKET flags (such as message bounds or record bounds). The
'copy_len' field is described below in the 'v1->v2' part. All other
pages are data pages from the RX queue.

         Page 0          Page 1      Page N

    [ hdr1 .. hdrN ][ data ] .. [ data ]
       |        |       ^           ^
       |        |       |           |
       |        *-------------------*
       |                |
       |                |
       *----------------*

Of course, a single header can represent an array of pages (when the
packet's buffer is bigger than one page). So here is an example of a
detailed mapping layout for some set of packets. Let's consider that we
have the following sequence of packets: 56 bytes, 4096 bytes and 8200
bytes. All pages (0, 1, 2, 3, 4 and 5) will be inserted into the user's
vma (the vma is large enough).

    Page 0: [[ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]
    Page 1: [ 56 ]
    Page 2: [ 4096 ]
    Page 3: [ 4096 ]
    Page 4: [ 4096 ]
    Page 5: [ 8 ]

Page 0 contains only the array of headers:
    'hdr0' has 56 in the length field.
    'hdr1' has 4096 in the length field.
    'hdr2' has 8200 in the length field.
    'hdr3' has 0 in the length field (this is the end of data marker).

Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
Page 2 corresponds to 'hdr1' and is filled with data.
Page 3 corresponds to 'hdr2' and is filled with data.
Page 4 corresponds to 'hdr2' and is filled with data.
Page 5 corresponds to 'hdr2' and has only 8 bytes of data.

This patchset also changes the way packets are allocated: the current
implementation uses only 'kmalloc()' to create the data buffer.
The problem appears when we try to map such 'kmalloc()' buffers into the
user's vma: the kernel forbids mapping slab pages into a user's vma
(pages of "not large" 'kmalloc()' allocations are marked with the
PageSlab flag, and "not large" can be bigger than one page). To avoid
this, data buffers are now allocated with the 'alloc_pages()' call.

                                  TESTS

This patchset updates the 'vsock_test' utility: two tests for the new
feature were added. The first test covers invalid cases. The second
checks the valid transmission case.

                              BENCHMARKING

For benchmarking I've added a small utility, 'rx_zerocopy'. It works in
client/server mode. When the client connects to the server, the server
starts sending an exact amount of data to the client (the amount is set
as an input argument). The client reads the data and waits for the next
portion of it. The client works in two modes: copy and zero-copy. In
copy mode the client uses the 'read()' call, while in zerocopy mode the
sequence 'mmap()'/'getsockopt()'/'madvise()' is used. A smaller
transmission time is better. For the server, we can set the size of the
tx buffer, and for the client we can set the size of the rx buffer or
the rx mapping size (in zerocopy mode). Usage of this utility is quite
simple:

For client mode:

    ./rx_zerocopy --mode client [--zerocopy] [--rx]

For server mode:

    ./rx_zerocopy --mode server [--mb] [--tx]

[--mb] sets the number of megabytes to transfer.
[--rx] sets the size of the receive buffer/mapping in pages.
[--tx] sets the size of the transmit buffer in pages.

I checked the transmission of 4000 MB of data.
Here are some results:

                           size of rx/tx buffers in pages
               *---------------------------------------------------*
               |    8   |    32    |    64   |   256    |   512    |
*--------------*--------*----------*---------*----------*----------*
|   zerocopy   |   24   |   10.6   |  12.2   |   23.6   |    21    | secs to
*--------------*---------------------------------------------------- process
| non-zerocopy |   13   |   16.4   |  24.7   |   27.2   |   23.9   | 4000 MB
*--------------*----------------------------------------------------

The result in the first column (where non-zerocopy works better than
zerocopy) happens because the time spent in the 'read()' system call is
smaller than the time spent in 'getsockopt()' + 'madvise()'. I've
checked that.

I think the results are not that impressive, but apart from the
smallest buffer size zerocopy is not worse than copy mode, and there is
no need to allocate memory for processing the data.

                                PROBLEMS

The updated packet allocation logic creates a problem: when the host
gets data from the guest (in vhost-vsock), it allocates at least one
page for each packet (even if the packet has a 1 byte payload). I think
this could be resolved in several ways:

 1) Make zerocopy rx mode disabled by default, so if the user didn't
    enable it, the current 'kmalloc()' way will be used.
    (IMPLEMENTED IN V2)

 2) Use 'kmalloc()' for "small" packets, else call the page allocator.
    But in this case we have a mix of packets allocated in two
    different ways, so during zerocopying to the user (e.g. mapping
    pages to the vma) such small packets would have to be handled in a
    clumsy way: we need to allocate one page for the user, copy the
    data to it and then insert the page into the user's vma.

v1 -> v2:

 1) Zerocopy receive mode can be enabled/disabled (disabled by
    default). I didn't use the generic SO_ZEROCOPY flag, because in the
    virtio-vsock case this feature depends on transport support.
    Instead of SO_ZEROCOPY, an AF_VSOCK layer flag was added:
    SO_VM_SOCKETS_ZEROCOPY, while the previous meaning of
    SO_VM_SOCKETS_ZEROCOPY (insert receive buffers into the user's vm
    area) is now renamed to SO_VM_SOCKETS_MAP_RX.
 2) The packet header which is exported to the user now gets a new
    field: 'copy_len'. This field handles a special case: the user
    reads data from the socket in the non-zerocopy way (with zerocopy
    disabled) and then enables the zerocopy feature. In this case the
    vhost part will switch the data buffer allocation logic from
    'kmalloc()' to direct calls to the buddy allocator. But there could
    be some pending 'kmalloc()' allocated packets in the socket's rx
    list, and if the user then tries to read such packets in the
    zerocopy way, dequeue will fail, because SLAB pages cannot be
    inserted into the user's vm area. So when such a packet is found
    during zerocopy dequeue, the dequeue loop will break and 'copy_len'
    will show the size of such a "bad" packet. After the user detects
    this case, it must use 'read()/recv()' calls to dequeue such a
    packet.

 3) Also, maybe move this feature under a config option?

Arseniy Krasnov (8):
  virtio/vsock: rework packet allocation logic
  vhost/vsock: rework packet allocation logic
  af_vsock: add zerocopy receive logic
  virtio/vsock: add transport zerocopy callback
  vhost/vsock: enable zerocopy callback
  virtio/vsock: enable zerocopy callback
  test/vsock: add receive zerocopy tests
  test/vsock: vsock rx zerocopy utility

 drivers/vhost/vsock.c                   | 121 +++++++++--
 include/linux/virtio_vsock.h            |   5 +
 include/net/af_vsock.h                  |   7 +
 include/uapi/linux/virtio_vsock.h       |   6 +
 include/uapi/linux/vm_sockets.h         |   3 +
 net/vmw_vsock/af_vsock.c                | 100 +++++++++
 net/vmw_vsock/virtio_transport.c        |  51 ++++-
 net/vmw_vsock/virtio_transport_common.c | 211 ++++++++++++++++++-
 tools/include/uapi/linux/virtio_vsock.h |  11 +
 tools/include/uapi/linux/vm_sockets.h   |   8 +
 tools/testing/vsock/Makefile            |   1 +
 tools/testing/vsock/control.c           |  34 +++
 tools/testing/vsock/control.h           |   2 +
 tools/testing/vsock/rx_zerocopy.c       | 356 ++++++++++++++++++++++++++++++++
 tools/testing/vsock/vsock_test.c        | 295 ++++++++++++++++++++++++++
 15 files changed, 1196 insertions(+), 15 deletions(-)
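
As an illustration of the mapping layout from the DETAILS section, here
is a minimal userspace sketch of how the header array on page 0 could be
walked to find the data pages of each packet. It is only a sketch under
assumptions: 'walk_headers()', 'struct pkt_pages' and EXAMPLE_PAGE_SIZE
are hypothetical names that are not part of this patchset, a 4096 byte
page is assumed (as in the example layout), and 'flags' and 'copy_len'
are ignored here.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, matching the example layout in DETAILS. */
#define EXAMPLE_PAGE_SIZE 4096u

/* Header layout as given in DETAILS. */
struct virtio_vsock_usr_hdr {
	uint32_t length;
	uint32_t flags;
	uint32_t copy_len;
};

/* Hypothetical result record, used only for this illustration. */
struct pkt_pages {
	unsigned int first_page; /* index of the first data page in the mapping */
	unsigned int npages;     /* number of data pages used by the packet */
	uint32_t last_len;       /* payload bytes on the last data page */
};

/*
 * Hypothetical helper: walk the header array from page 0 until the
 * end-of-data marker (length == 0) or until max_hdrs headers were seen.
 * Data pages start at page 1, because page 0 holds the headers.
 * Returns the number of packets described; 'out' must have room for
 * max_hdrs entries.
 */
unsigned int walk_headers(const struct virtio_vsock_usr_hdr *hdrs,
			  unsigned int max_hdrs, struct pkt_pages *out)
{
	unsigned int page = 1;
	unsigned int i;

	for (i = 0; i < max_hdrs && hdrs[i].length != 0; i++) {
		uint32_t len = hdrs[i].length;
		/* Round up to whole pages without risking overflow. */
		unsigned int npages = len / EXAMPLE_PAGE_SIZE +
				      (len % EXAMPLE_PAGE_SIZE != 0);

		out[i].first_page = page;
		out[i].npages = npages;
		out[i].last_len = len - (npages - 1) * EXAMPLE_PAGE_SIZE;
		page += npages;
	}

	return i;
}
```

For the sequence from DETAILS (56, 4096 and 8200 bytes followed by the
zero-length marker), this walk reports three packets: 'hdr0' on page 1,
'hdr1' on page 2, and 'hdr2' on pages 3 to 5 with 8 bytes on the last
page.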