
[RFC,v1,0/8] virtio/vsock: experimental zerocopy receive

Message ID 7cdcb1e1-7c97-c054-19cf-5caeacae981d@sberdevices.ru (mailing list archive)

Message

Arseniy Krasnov May 12, 2022, 5:04 a.m. UTC
                              INTRODUCTION

	Hello, this is an experimental implementation of virtio vsock zerocopy
receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
the same idea: call 'mmap()' on the socket's descriptor, then every
'getsockopt()' will fill the provided vma area with pages of virtio RX buffers.
After the received data has been processed by the user, the pages must be freed
by an 'madvise()' call with the MADV_DONTNEED flag set (if the user doesn't
call 'madvise()', the next 'getsockopt()' will fail).
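
Below is a minimal user-space sketch of this receive flow. It is illustration
only: the getsockopt() level/option used here are placeholders, not necessarily
the exact UAPI added by this series.

	/* Sketch of the rx flow described above. The SOCK_ZC_* names are
	 * placeholders; see the actual patches for the real UAPI.
	 */
	#include <sys/mman.h>
	#include <sys/socket.h>

	#define SOCK_ZC_LEVEL	0	/* placeholder getsockopt() level   */
	#define SOCK_ZC_OPT	0	/* placeholder getsockopt() optname */
	#define RX_MAP_PAGES	33	/* e.g. 1 header page + 32 data pages */

	static int rx_zerocopy_once(int fd, long page_size)
	{
		size_t map_len = RX_MAP_PAGES * page_size;
		socklen_t optlen = sizeof(map_len);
		void *map;

		/* 1) map a window that will receive the RX buffer pages */
		map = mmap(NULL, map_len, PROT_READ, MAP_SHARED, fd, 0);
		if (map == MAP_FAILED)
			return -1;

		/* 2) ask the kernel to fill the vma with RX buffer pages */
		if (getsockopt(fd, SOCK_ZC_LEVEL, SOCK_ZC_OPT, &map_len, &optlen))
			goto out;

		/* 3) parse the headers in page 0, consume data pages (omitted) */

		/* 4) return the pages; otherwise the next getsockopt() fails */
		madvise(map, map_len, MADV_DONTNEED);
	out:
		munmap(map, map_len);
		return 0;
	}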

                                 DETAILS

	Here is how the mapping looks exactly: the first page of the mapping
contains an array of trimmed virtio vsock packet headers (each contains only
the length of the data on the corresponding page(s) and a 'flags' field):

	struct virtio_vsock_usr_hdr {
		uint32_t length;
		uint32_t flags;
	};

The 'length' field lets the user know the exact size of the payload within each
sequence of pages, and 'flags' lets the user handle SOCK_SEQPACKET flags (such
as message bounds or record bounds). All other pages are data pages from the RX
queue.

             Page 0      Page 1      Page N

	[ hdr1 .. hdrN ][ data ] .. [ data ]
           |        |       ^           ^
           |        |       |           |
           |        *-------------------*
           |                |
           |                |
           *----------------*

	Of course, a single header could represent an array of pages (when the
packet's buffer is bigger than one page). So here is an example of the detailed
mapping layout for some set of packets. Let's consider that we have the
following sequence of packets: 56 bytes, 4096 bytes and 8200 bytes. All pages
0,1,2,3,4 and 5 will be inserted into the user's vma (the vma is large enough).

	Page 0: [[ hdr0 ][ hdr1 ][ hdr2 ][ hdr3 ] ... ]
	Page 1: [ 56 ]
	Page 2: [ 4096 ]
	Page 3: [ 4096 ]
	Page 4: [ 4096 ]
	Page 5: [ 8 ]

	Page 0 contains only the array of headers:
	'hdr0' has 56 in its length field.
	'hdr1' has 4096 in its length field.
	'hdr2' has 8200 in its length field.
	'hdr3' has 0 in its length field (this is the end-of-data marker).

	Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
	Page 2 corresponds to 'hdr1' and is filled with data.
	Page 3 corresponds to 'hdr2' and is filled with data.
	Page 4 corresponds to 'hdr2' and is filled with data.
	Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
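
For illustration, a user could walk such a layout roughly like this (a sketch
only; 'map' is the area filled by 'getsockopt()' and 'page_size' is the system
page size):

	/* Sketch: page 0 holds the trimmed headers, the following pages hold
	 * the payloads, one or more pages per header; length 0 marks the end.
	 */
	#include <stdint.h>
	#include <stdio.h>

	struct virtio_vsock_usr_hdr {
		uint32_t length;
		uint32_t flags;
	};

	static void walk_rx_mapping(void *map, size_t page_size)
	{
		struct virtio_vsock_usr_hdr *hdr = map;	/* page 0 */
		char *data = (char *)map + page_size;	/* page 1 */

		for (; hdr->length != 0; hdr++) {
			size_t pages = (hdr->length + page_size - 1) / page_size;

			printf("packet: %u bytes, flags 0x%x, %zu page(s)\n",
			       hdr->length, hdr->flags, pages);

			/* ... process hdr->length bytes starting at 'data' ... */
			data += pages * page_size; /* next payload is page aligned */
		}
	}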

	This patchset also changes the way packets are allocated: the current
implementation uses only 'kmalloc()' to create the data buffer. A problem
appears when we try to map such buffers into the user's vma - the kernel
forbids mapping slab pages into a user's vma (pages of "not large" 'kmalloc()'
allocations are marked with the PageSlab flag, and "not large" could still be
bigger than one page). So to avoid this, data buffers are now allocated with
'alloc_pages()'.
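
A rough sketch of that allocation change (not the exact patch code): the
payload is taken from whole pages so that the pages can later be inserted into
a user vma, which slab memory cannot.

	/* Sketch only: allocate an RX data buffer from whole pages instead of
	 * kmalloc(), so the resulting pages can be mapped into a user vma.
	 */
	#include <linux/gfp.h>
	#include <linux/mm.h>

	static void *vsock_alloc_rx_buf(size_t len, struct page **first_page)
	{
		unsigned int order = get_order(len);
		struct page *page = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);

		if (!page)
			return NULL;

		*first_page = page;
		return page_address(page); /* kernel virtual address of the buffer */
	}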

                                   TESTS

	This patchset updates the 'vsock_test' utility: two tests for the new
feature were added. The first test covers invalid cases. The second checks a
valid transmission case.

                                BENCHMARKING

	For benchmarking I've added a small utility, 'rx_zerocopy'. It works in
client/server mode. When the client connects to the server, the server starts
sending an exact amount of data to the client (the amount is set as an input
argument). The client reads the data and waits for the next portion of it. The
client works in two modes: copy and zero-copy. In copy mode the client uses the
'read()' call, while in zerocopy mode the 'mmap()'/'getsockopt()'/'madvise()'
sequence is used. A smaller transmission time is better. For the server we can
set the size of the tx buffer, and for the client we can set the size of the rx
buffer or rx mapping size (in zerocopy mode). Usage of this utility is quite
simple (an example invocation follows the option list below):

For client mode:

./rx_zerocopy --mode client [--zerocopy] [--rx]

For server mode:

./rx_zerocopy --mode server [--mb] [--tx]

[--mb] sets the number of megabytes to transfer.
[--rx] sets the size of the receive buffer/mapping in pages.
[--tx] sets the size of the transmit buffer in pages.
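
For example, to reproduce the 32-page cell of the table below (values chosen to
match the benchmark; the exact invocation may differ):

./rx_zerocopy --mode server --mb 4000 --tx 32
./rx_zerocopy --mode client --zerocopy --rx 32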

I measured the transmission of 4000 MB of data. Here are some results:

                           size of rx/tx buffers in pages
               *--------*----------*---------*----------*----------*
               |    8   |    32    |    64   |   256    |   512    |
*--------------*--------*----------*---------*----------*----------*
|   zerocopy   |   24   |   10.6   |  12.2   |   23.6   |    21    | secs to
*--------------*--------*----------*---------*----------*----------* process
| non-zerocopy |   13   |   16.4   |  24.7   |   27.2   |   23.9   | 4000 MB
*--------------*--------*----------*---------*----------*----------*

I think the results are not so impressive, but for most buffer sizes it is not
worse than copy mode, and there is no need to allocate memory for processing
the data.

                                 PROBLEMS

	The updated packet allocation logic creates one problem: when the host
gets data from the guest (in vhost-vsock), it allocates at least one page for
each packet (even if the packet has a 1 byte payload). I think this could be
resolved in several ways:
	1) Make zerocopy rx mode disabled by default, so if the user didn't
enable it, the current 'kmalloc()' way will be used.
	2) Use 'kmalloc()' for "small" packets, otherwise call the page
allocator. But in this case we have a mix of packets allocated in two different
ways, so during zerocopy to user space (i.e. mapping pages into the vma) such
small packets have to be handled in a clumsy way: we need to allocate one page
for the user, copy the data to it and then insert the page into the user's vma
(a rough sketch of this split is shown below).
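
A rough sketch of the allocation split in option 2; the threshold value is
arbitrary and only for illustration, not code from this series.

	/* Sketch of option 2: small payloads keep using kmalloc(), larger
	 * ones get whole pages.
	 */
	#include <linux/gfp.h>
	#include <linux/mm.h>
	#include <linux/slab.h>

	#define VSOCK_SMALL_PKT_LEN	256	/* arbitrary threshold */

	static void *vsock_alloc_pkt_buf(size_t len, bool *is_slab)
	{
		if (len <= VSOCK_SMALL_PKT_LEN) {
			*is_slab = true; /* must be copied if later mapped to a vma */
			return kmalloc(len, GFP_KERNEL);
		}

		*is_slab = false;
		return (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
						get_order(len));
	}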

P.S.: of course this is an experimental RFC, so what do you think, guys?

Arseniy Krasnov (8):
 virtio/vsock: rework packet allocation logic
 vhost/vsock: rework packet allocation logic
 af_vsock: add zerocopy receive logic
 virtio/vsock: add transport zerocopy callback
 vhost/vsock: enable zerocopy callback.
 virtio/vsock: enable zerocopy callback.
 test/vsock: add receive zerocopy tests
 test/vsock: vsock rx zerocopy utility

 drivers/vhost/vsock.c                   |  50 ++++-
 include/linux/virtio_vsock.h            |   4 +
 include/net/af_vsock.h                  |   4 +
 include/uapi/linux/virtio_vsock.h       |   5 +
 include/uapi/linux/vm_sockets.h         |   2 +
 net/vmw_vsock/af_vsock.c                |  61 ++++++
 net/vmw_vsock/virtio_transport.c        |   1 +
 net/vmw_vsock/virtio_transport_common.c | 195 ++++++++++++++++-
 tools/include/uapi/linux/virtio_vsock.h |  10 +
 tools/include/uapi/linux/vm_sockets.h   |   7 +
 tools/testing/vsock/Makefile            |   1 +
 tools/testing/vsock/control.c           |  34 +++
 tools/testing/vsock/control.h           |   2 +
 tools/testing/vsock/rx_zerocopy.c       | 356 ++++++++++++++++++++++++++++++++
 tools/testing/vsock/vsock_test.c        | 284 +++++++++++++++++++++++++
 15 files changed, 1005 insertions(+), 11 deletions(-)

Comments

Stefano Garzarella May 17, 2022, 3:14 p.m. UTC | #1
Hi Arseniy,

On Thu, May 12, 2022 at 05:04:11AM +0000, Arseniy Krasnov wrote:
>                              INTRODUCTION
>
>	Hello, this is experimental implementation of virtio vsock zerocopy
>receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
>same idea: call 'mmap()' on socket's descriptor, then every 'getsockopt()' will
>fill provided vma area with pages of virtio RX buffers. After received data was
>processed by user, pages must be freed by 'madvise()'  call with MADV_DONTNEED
>flag set(if user won't call 'madvise()', next 'getsockopt()' will fail).

Sounds cool, but maybe we would need some socket/net experts here for 
review.

Could we do something similar for the sending path as well?

>
>                                 DETAILS
>
>	Here is how mapping with mapped pages looks exactly: first page mapping
>contains array of trimmed virtio vsock packet headers (in contains only length
>of data on the corresponding page and 'flags' field):
>
>	struct virtio_vsock_usr_hdr {
>		uint32_t length;
>		uint32_t flags;
>	};
>
>Field  'length' allows user to know exact size of payload within each sequence
>of pages and 'flags' allows user to handle SOCK_SEQPACKET flags(such as message
>bounds or record bounds). All other pages are data pages from RX queue.
>
>             Page 0      Page 1      Page N
>
>	[ hdr1 .. hdrN ][ data ] .. [ data ]
>           |        |       ^           ^
>           |        |       |           |
>           |        *-------------------*
>           |                |
>           |                |
>           *----------------*
>
>	Of course, single header could represent array of pages (when packet's
>buffer is bigger than one page).So here is example of detailed mapping layout
>for some set of packages. Lets consider that we have the following sequence  of
>packages: 56 bytes, 4096 bytes and 8200 bytes. All pages: 0,1,2,3,4 and 5 will
>be inserted to user's vma(vma is large enough).
>
>	Page 0: [[ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]
>	Page 1: [ 56 ]
>	Page 2: [ 4096 ]
>	Page 3: [ 4096 ]
>	Page 4: [ 4096 ]
>	Page 5: [ 8 ]
>
>	Page 0 contains only array of headers:
>	'hdr0' has 56 in length field.
>	'hdr1' has 4096 in length field.
>	'hdr2' has 8200 in length field.
>	'hdr3' has 0 in length field(this is end of data marker).
>
>	Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
>	Page 2 corresponds to 'hdr1' and filled with data.
>	Page 3 corresponds to 'hdr2' and filled with data.
>	Page 4 corresponds to 'hdr2' and filled with data.
>	Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
>
>	This patchset also changes packets allocation way: today implementation
>uses only 'kmalloc()' to create data buffer. Problem happens when we try to map
>such buffers to user's vma - kernel forbids to map slab pages to user's vma(as
>pages of "not large" 'kmalloc()' allocations are marked with PageSlab flag and
>"not large" could be bigger than one page). So to avoid this, data buffers now
>allocated using 'alloc_pages()' call.
>
>                                   TESTS
>
>	This patchset updates 'vsock_test' utility: two tests for new feature
>were added. First test covers invalid cases. Second checks valid transmission
>case.

Thanks for adding the test!

>
>                                BENCHMARKING
>
>	For benchmakring I've added small utility 'rx_zerocopy'. It works in
>client/server mode. When client connects to server, server starts sending exact
>amount of data to client(amount is set as input argument).Client reads data and
>waits for next portion of it. Client works in two modes: copy and zero-copy. In
>copy mode client uses 'read()' call while in zerocopy mode sequence of 'mmap()'
>/'getsockopt()'/'madvise()' are used. Smaller amount of time for transmission
>is better. For server, we can set size of tx buffer and for client we can set
>size of rx buffer or rx mapping size(in zerocopy mode). Usage of this utility
>is quiet simple:
>
>For client mode:
>
>./rx_zerocopy --mode client [--zerocopy] [--rx]
>
>For server mode:
>
>./rx_zerocopy --mode server [--mb] [--tx]
>
>[--mb] sets number of megabytes to transfer.
>[--rx] sets size of receive buffer/mapping in pages.
>[--tx] sets size of transmit buffer in pages.
>
>I checked for transmission of 4000mb of data. Here are some results:
>
>                           size of rx/tx buffers in pages
>               *---------------------------------------------------*
>               |    8   |    32    |    64   |   256    |   512    |
>*--------------*--------*----------*---------*----------*----------*
>|   zerocopy   |   24   |   10.6   |  12.2   |   23.6   |    21    | secs to
>*--------------*---------------------------------------------------- process
>| non-zerocopy |   13   |   16.4   |  24.7   |   27.2   |   23.9   | 4000 mb
>*--------------*----------------------------------------------------
>
>I think, that results are not so impressive, but at least it is not worse than
>copy mode and there is no need to allocate memory for processing date.

Why is it twice as slow in the first column?

>
>                                 PROBLEMS
>
>	Updated packet's allocation logic creates some problem: when host gets
>data from guest(in vhost-vsock), it allocates at least one page for each packet
>(even if packet has 1 byte payload). I think this could be resolved in several
>ways:

Can we somehow copy the incoming packets into the payload of the already 
queued packet?

This reminds me that we have yet to fix a similar problem with kmalloc() 
as well...

https://bugzilla.kernel.org/show_bug.cgi?id=215329

>	1) Make zerocopy rx mode disabled by default, so if user didn't enable
>it, current 'kmalloc()' way will be used.

That sounds reasonable to me, I guess also TCP needs a setsockopt() call 
to enable the feature, right?

>	2) Use 'kmalloc()' for "small" packets, else call page allocator. But
>in this case, we have mix of packets, allocated in two different ways thus
>during zerocopying to user(e.g. mapping pages to vma), such small packets will
>be handled in some stupid way: we need to allocate one page for user, copy data
>to it and then insert page to user's vma.

It seems more difficult to me, but at the same time doable. I would go 
more on option 1, though.

>
>P.S: of course this is experimental RFC, so what do You think guys?

It seems cool :-)

But I would like some feedback from the net guys to have some TCP-like 
things.

Thanks,
Stefano
Arseniy Krasnov May 18, 2022, 11:04 a.m. UTC | #2
Hello Stefano,

On 17.05.2022 18:14, Stefano Garzarella wrote:
> Hi Arseniy,
> 
> On Thu, May 12, 2022 at 05:04:11AM +0000, Arseniy Krasnov wrote:
>>                              INTRODUCTION
>>
>>     Hello, this is experimental implementation of virtio vsock zerocopy
>> receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
>> same idea: call 'mmap()' on socket's descriptor, then every 'getsockopt()' will
>> fill provided vma area with pages of virtio RX buffers. After received data was
>> processed by user, pages must be freed by 'madvise()'  call with MADV_DONTNEED
>> flag set(if user won't call 'madvise()', next 'getsockopt()' will fail).
> 
> Sounds cool, but maybe we would need some socket/net experts here for review.

Yes, that would be great

> 
> Could we do something similar for the sending path as well?

Here are my thoughts about zerocopy transmission:

I tried to implement this feature in the following way: the user creates
some page-aligned buffer, then during tx packet allocation, instead of
creating the data buffer with 'kmalloc()', I tried to add the user's buffer
to the virtio queue. But I found a problem: as the kernel virtio API uses
virtual addresses to add new buffers, deep inside the virtio subsystem
'virt_to_phys()' is called to get the physical address of the buffer, so the
user's virtual address won't be translated correctly to a physical address.
(In theory, I could perform a page walk for such a user va, get the physical
address and pass some "fake" virtual address to the virtio API in order to
make 'virt_to_phys()' return a valid physical address, but I think this is
ugly.)


If we are talking about the 'mmap()' way, I think we can do the following:
the user calls 'mmap()' on the socket, and the kernel fills the newly created
mapping with allocated pages (all pages have rw permissions). Now the user
can use the pages of this mapping (e.g. fill them with data). Finally, to
start transmission, the user calls 'getsockopt()' or some 'ioctl()' and the
kernel processes the data of this mapping. Also, as this call will return
immediately (i.e. it is asynchronous), some completion logic must be
implemented. For example, use the same way MSG_ZEROCOPY does - poll the
socket's error queue to get a message that the pages can be reused - or don't
allow the user to work with these pages at all: unmap them, perform the
transmission and finally free the pages. To start a new transmission the user
needs to call 'mmap()' again.

                            OR

I think there is another, unusual way for zerocopy tx: let's use 'vmsplice()'
/'splice()'. In this approach, to transmit something the user does the
following steps:
1) Creates a pipe.
2) Calls 'vmsplice(SPLICE_F_GIFT)' on this pipe, inserting the data pages
   into it. SPLICE_F_GIFT allows the user to forget about the allocated
   pages - the kernel will free them.
3) Calls 'splice(SPLICE_F_MOVE)' from the pipe to the socket. SPLICE_F_MOVE
   will move the pages from the pipe to the socket (i.e. in a special socket
   callback we get the set of the pipe's pages as an input argument and all
   pages are inserted into the virtio queue).

But as SPLICE_F_MOVE support is currently disabled, it would have to be
restored first.
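
From user space this flow would look roughly like the sketch below (assuming
SPLICE_F_MOVE were honored on this path; the function name is made up for
illustration):

	/* Sketch of the vmsplice()/splice() tx flow above (user space). */
	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sys/uio.h>
	#include <unistd.h>

	static ssize_t splice_tx(int vsock_fd, void *buf, size_t len)
	{
		struct iovec iov = { .iov_base = buf, .iov_len = len };
		ssize_t ret = -1;
		int pipefd[2];

		if (pipe(pipefd))
			return -1;

		/* steps 1) + 2): gift the page-aligned buffer to the pipe */
		if (vmsplice(pipefd[1], &iov, 1, SPLICE_F_GIFT) < 0)
			goto out;

		/* step 3): move the pipe's pages towards the socket */
		ret = splice(pipefd[0], NULL, vsock_fd, NULL, len,
			     SPLICE_F_MOVE);
	out:
		close(pipefd[0]);
		close(pipefd[1]);
		return ret;
	}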
                                                                  
> 
>>
>>                                 DETAILS
>>
>>     Here is how mapping with mapped pages looks exactly: first page mapping
>> contains array of trimmed virtio vsock packet headers (in contains only length
>> of data on the corresponding page and 'flags' field):
>>
>>     struct virtio_vsock_usr_hdr {
>>         uint32_t length;
>>         uint32_t flags;
>>     };
>>
>> Field  'length' allows user to know exact size of payload within each sequence
>> of pages and 'flags' allows user to handle SOCK_SEQPACKET flags(such as message
>> bounds or record bounds). All other pages are data pages from RX queue.
>>
>>             Page 0      Page 1      Page N
>>
>>     [ hdr1 .. hdrN ][ data ] .. [ data ]
>>           |        |       ^           ^
>>           |        |       |           |
>>           |        *-------------------*
>>           |                |
>>           |                |
>>           *----------------*
>>
>>     Of course, single header could represent array of pages (when packet's
>> buffer is bigger than one page).So here is example of detailed mapping layout
>> for some set of packages. Lets consider that we have the following sequence  of
>> packages: 56 bytes, 4096 bytes and 8200 bytes. All pages: 0,1,2,3,4 and 5 will
>> be inserted to user's vma(vma is large enough).
>>
>>     Page 0: [[ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]
>>     Page 1: [ 56 ]
>>     Page 2: [ 4096 ]
>>     Page 3: [ 4096 ]
>>     Page 4: [ 4096 ]
>>     Page 5: [ 8 ]
>>
>>     Page 0 contains only array of headers:
>>     'hdr0' has 56 in length field.
>>     'hdr1' has 4096 in length field.
>>     'hdr2' has 8200 in length field.
>>     'hdr3' has 0 in length field(this is end of data marker).
>>
>>     Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
>>     Page 2 corresponds to 'hdr1' and filled with data.
>>     Page 3 corresponds to 'hdr2' and filled with data.
>>     Page 4 corresponds to 'hdr2' and filled with data.
>>     Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
>>
>>     This patchset also changes packets allocation way: today implementation
>> uses only 'kmalloc()' to create data buffer. Problem happens when we try to map
>> such buffers to user's vma - kernel forbids to map slab pages to user's vma(as
>> pages of "not large" 'kmalloc()' allocations are marked with PageSlab flag and
>> "not large" could be bigger than one page). So to avoid this, data buffers now
>> allocated using 'alloc_pages()' call.
>>
>>                                   TESTS
>>
>>     This patchset updates 'vsock_test' utility: two tests for new feature
>> were added. First test covers invalid cases. Second checks valid transmission
>> case.
> 
> Thanks for adding the test!
> 
>>
>>                                BENCHMARKING
>>
>>     For benchmakring I've added small utility 'rx_zerocopy'. It works in
>> client/server mode. When client connects to server, server starts sending exact
>> amount of data to client(amount is set as input argument).Client reads data and
>> waits for next portion of it. Client works in two modes: copy and zero-copy. In
>> copy mode client uses 'read()' call while in zerocopy mode sequence of 'mmap()'
>> /'getsockopt()'/'madvise()' are used. Smaller amount of time for transmission
>> is better. For server, we can set size of tx buffer and for client we can set
>> size of rx buffer or rx mapping size(in zerocopy mode). Usage of this utility
>> is quiet simple:
>>
>> For client mode:
>>
>> ./rx_zerocopy --mode client [--zerocopy] [--rx]
>>
>> For server mode:
>>
>> ./rx_zerocopy --mode server [--mb] [--tx]
>>
>> [--mb] sets number of megabytes to transfer.
>> [--rx] sets size of receive buffer/mapping in pages.
>> [--tx] sets size of transmit buffer in pages.
>>
>> I checked for transmission of 4000mb of data. Here are some results:
>>
>>                           size of rx/tx buffers in pages
>>               *---------------------------------------------------*
>>               |    8   |    32    |    64   |   256    |   512    |
>> *--------------*--------*----------*---------*----------*----------*
>> |   zerocopy   |   24   |   10.6   |  12.2   |   23.6   |    21    | secs to
>> *--------------*---------------------------------------------------- process
>> | non-zerocopy |   13   |   16.4   |  24.7   |   27.2   |   23.9   | 4000 mb
>> *--------------*----------------------------------------------------
>>
>> I think, that results are not so impressive, but at least it is not worse than
>> copy mode and there is no need to allocate memory for processing date.
> 
> Why is it twice as slow in the first column?

Maybe this is because memory copying for small buffers is very fast... I'll
analyze it more deeply.

> 
>>
>>                                 PROBLEMS
>>
>>     Updated packet's allocation logic creates some problem: when host gets
>> data from guest(in vhost-vsock), it allocates at least one page for each packet
>> (even if packet has 1 byte payload). I think this could be resolved in several
>> ways:
> 
> Can we somehow copy the incoming packets into the payload of the already queued packet?

Maybe, I'll analyze it...

> 
> This reminds me that we have yet to fix a similar problem with kmalloc() as well...
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=215329

Yes, but it is a slightly different case: IIUC this bug happens because
'kmalloc()' uses memory chunks of some preallocated size.

> 
>>     1) Make zerocopy rx mode disabled by default, so if user didn't enable
>> it, current 'kmalloc()' way will be used.
> 
> That sounds reasonable to me, I guess also TCP needs a setsockopt() call to enable the feature, right?

Yes, you're right. I think I'll add this to v2.

> 
>>     2) Use 'kmalloc()' for "small" packets, else call page allocator. But
>> in this case, we have mix of packets, allocated in two different ways thus
>> during zerocopying to user(e.g. mapping pages to vma), such small packets will
>> be handled in some stupid way: we need to allocate one page for user, copy data
>> to it and then insert page to user's vma.
> 
> It seems more difficult to me, but at the same time doable. I would go more on option 1, though.
> 
>>
>> P.S: of course this is experimental RFC, so what do You think guys?
> 
> It seems cool :-)
> 
> But I would like some feedback from the net guys to have some TCP-like things.

Ok, I'll prepare v2 anyway: I need to analyze performance, maybe add more test
coverage, rebase over the latest kernel and work on the packet allocation
problem (from above).

> 
> Thanks,
> Stefano
> 

Thanks
Stefano Garzarella May 19, 2022, 7:42 a.m. UTC | #3
On Wed, May 18, 2022 at 11:04:30AM +0000, Arseniy Krasnov wrote:
>Hello Stefano,
>
>On 17.05.2022 18:14, Stefano Garzarella wrote:
>> Hi Arseniy,
>>
>> On Thu, May 12, 2022 at 05:04:11AM +0000, Arseniy Krasnov wrote:
>>>                              INTRODUCTION
>>>
>>>     Hello, this is experimental implementation of virtio vsock zerocopy
>>> receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
>>> same idea: call 'mmap()' on socket's descriptor, then every 'getsockopt()' will
>>> fill provided vma area with pages of virtio RX buffers. After received data was
>>> processed by user, pages must be freed by 'madvise()'  call with MADV_DONTNEED
>>> flag set(if user won't call 'madvise()', next 'getsockopt()' will fail).
>>
>> Sounds cool, but maybe we would need some socket/net experts here for review.
>
>Yes, that would be great
>
>>
>> Could we do something similar for the sending path as well?
>
>Here are thoughts about zerocopy transmission:
>
>I tried to implement this feature in the following way: user creates
>some page aligned buffer, then during tx packet allocation instead of
>creating data buffer with 'kmalloc()', i tried to add user's buffer
>to virtio queue. But found problem: as kernel virtio API uses virtual
>addresses to add new buffers, in the deep of virtio subsystem
>'virt_to_phys()' is called to get physical address of buffer, so user's
>virtual address won't be translated correctly to physical address(in
>theory, i can perform page walk for such user's va, get physical address
>and pass some "fake" virtual address to virtio API in order to make
>'virt_to_phys()' return valid physical address(but i think this is ugly).

And maybe we should also pin the pages to prevent them from being 
replaced.

I think we should do something similar to what we do in vhost-vdpa.
Take a look at vhost_vdpa_pa_map() in drivers/vhost/vdpa.c

>
>
>If we are talking about 'mmap()' way, i think we can do the following:
>user calls 'mmap()' on socket, kernel fills newly created mapping with
>allocated pages(all pages have rw permissions). Now user can use pages
>of this mapping(e.g. fill it with data). Finally, to start transmission,
>user calls 'getsockopt()' or some 'ioctl()' and kernel processes data of
>this mapping. Also as this call will return immediately(e.g. it is
>asynchronous), some completion logic must be implemented. For example
>use same way as MSG_ZEROCOPY uses - poll error queue of socket to get
>message that pages could be reused, or don't allow user to work with
>these pages: unmap it, perform transmission and finally free pages.
>To start new transmission user need to call 'mmap()' again.
>
>                            OR
>
>I think there is another unusual way for zerocopy tx: let's use 'vmsplice()'
>/'splice()'. In this approach to transmit something, user does the following
>steps:
>1) Creates pipe.
>2) Calls 'vmsplice(SPLICE_F_GIFT)' on this pipe, insert data pages to it.
>   SPLICE_F_GIFT allows user to forget about allocated pages - kernel will
>   free it.
>3) Calls 'splice(SPLICE_F_MOVE)' from pipe to socket. SPLICE_F_MOVE will
>   move pages from pipe to socket(e.g. in special socket callback we got
>   set of pipe's pages as input argument and all pages will be inserted
>   to virtio queue).
>
>But as SPLICE_F_MOVE support is disabled, it must be repaired first.

Splice seems interesting, but it would be nice if we did something similar to
TCP. IIUC they use a flag for send(2):

     send(fd, buf, sizeof(buf), MSG_ZEROCOPY);

  
>
>>
>>>
>>>                                 DETAILS
>>>
>>>     Here is how mapping with mapped pages looks exactly: first page mapping
>>> contains array of trimmed virtio vsock packet headers (in contains only length
>>> of data on the corresponding page and 'flags' field):
>>>
>>>     struct virtio_vsock_usr_hdr {
>>>         uint32_t length;
>>>         uint32_t flags;
>>>     };
>>>
>>> Field  'length' allows user to know exact size of payload within each sequence
>>> of pages and 'flags' allows user to handle SOCK_SEQPACKET flags(such as message
>>> bounds or record bounds). All other pages are data pages from RX queue.
>>>
>>>             Page 0      Page 1      Page N
>>>
>>>     [ hdr1 .. hdrN ][ data ] .. [ data ]
>>>           |        |       ^           ^
>>>           |        |       |           |
>>>           |        *-------------------*
>>>           |                |
>>>           |                |
>>>           *----------------*
>>>
>>>     Of course, single header could represent array of pages (when packet's
>>> buffer is bigger than one page).So here is example of detailed mapping layout
>>> for some set of packages. Lets consider that we have the following sequence  of
>>> packages: 56 bytes, 4096 bytes and 8200 bytes. All pages: 0,1,2,3,4 and 5 will
>>> be inserted to user's vma(vma is large enough).
>>>
>>>     Page 0: [[ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]
>>>     Page 1: [ 56 ]
>>>     Page 2: [ 4096 ]
>>>     Page 3: [ 4096 ]
>>>     Page 4: [ 4096 ]
>>>     Page 5: [ 8 ]
>>>
>>>     Page 0 contains only array of headers:
>>>     'hdr0' has 56 in length field.
>>>     'hdr1' has 4096 in length field.
>>>     'hdr2' has 8200 in length field.
>>>     'hdr3' has 0 in length field(this is end of data marker).
>>>
>>>     Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
>>>     Page 2 corresponds to 'hdr1' and filled with data.
>>>     Page 3 corresponds to 'hdr2' and filled with data.
>>>     Page 4 corresponds to 'hdr2' and filled with data.
>>>     Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
>>>
>>>     This patchset also changes packets allocation way: today implementation
>>> uses only 'kmalloc()' to create data buffer. Problem happens when we try to map
>>> such buffers to user's vma - kernel forbids to map slab pages to user's vma(as
>>> pages of "not large" 'kmalloc()' allocations are marked with PageSlab flag and
>>> "not large" could be bigger than one page). So to avoid this, data buffers now
>>> allocated using 'alloc_pages()' call.
>>>
>>>                                   TESTS
>>>
>>>     This patchset updates 'vsock_test' utility: two tests for new feature
>>> were added. First test covers invalid cases. Second checks valid transmission
>>> case.
>>
>> Thanks for adding the test!
>>
>>>
>>>                                BENCHMARKING
>>>
>>>     For benchmakring I've added small utility 'rx_zerocopy'. It works in
>>> client/server mode. When client connects to server, server starts sending exact
>>> amount of data to client(amount is set as input argument).Client reads data and
>>> waits for next portion of it. Client works in two modes: copy and zero-copy. In
>>> copy mode client uses 'read()' call while in zerocopy mode sequence of 'mmap()'
>>> /'getsockopt()'/'madvise()' are used. Smaller amount of time for transmission
>>> is better. For server, we can set size of tx buffer and for client we can set
>>> size of rx buffer or rx mapping size(in zerocopy mode). Usage of this utility
>>> is quiet simple:
>>>
>>> For client mode:
>>>
>>> ./rx_zerocopy --mode client [--zerocopy] [--rx]
>>>
>>> For server mode:
>>>
>>> ./rx_zerocopy --mode server [--mb] [--tx]
>>>
>>> [--mb] sets number of megabytes to transfer.
>>> [--rx] sets size of receive buffer/mapping in pages.
>>> [--tx] sets size of transmit buffer in pages.
>>>
>>> I checked for transmission of 4000mb of data. Here are some results:
>>>
>>>                           size of rx/tx buffers in pages
>>>               *---------------------------------------------------*
>>>               |    8   |    32    |    64   |   256    |   512    |
>>> *--------------*--------*----------*---------*----------*----------*
>>> |   zerocopy   |   24   |   10.6   |  12.2   |   23.6   |    21    | secs to
>>> *--------------*---------------------------------------------------- process
>>> | non-zerocopy |   13   |   16.4   |  24.7   |   27.2   |   23.9   | 4000 mb
>>> *--------------*----------------------------------------------------
>>>
>>> I think, that results are not so impressive, but at least it is not worse than
>>> copy mode and there is no need to allocate memory for processing date.
>>
>> Why is it twice as slow in the first column?
>
>May be this is because memory copying for small buffers is very fast... i'll
>analyze it deeply.

Maybe I misunderstood, by small buffers here what do you mean?

I thought 8 was the number of pages, so 32KB buffers.

>
>>
>>>
>>>                                 PROBLEMS
>>>
>>>     Updated packet's allocation logic creates some problem: when host gets
>>> data from guest(in vhost-vsock), it allocates at least one page for each packet
>>> (even if packet has 1 byte payload). I think this could be resolved in several
>>> ways:
>>
>> Can we somehow copy the incoming packets into the payload of the already queued packet?
>
>May be, i'll analyze it...

Thanks!

>
>>
>> This reminds me that we have yet to fix a similar problem with kmalloc() as well...
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=215329
>
>Yes, but it is a little bit different case: IIUC this bug happens because 'kmalloc()'
>uses memory chunks of some preallocated size.

Yep, I mean I think the problem is that when we receive 1-byte packets, 
we have all the header queued up that we don't consider in the credit 
mechanism. A little bit different.

>
>>
>>>     1) Make zerocopy rx mode disabled by default, so if user didn't enable
>>> it, current 'kmalloc()' way will be used.
>>
>> That sounds reasonable to me, I guess also TCP needs a setsockopt() call to enable the feature, right?
>
>Yes, You're right. I think i'll add this to v2.
>
>>
>>>     2) Use 'kmalloc()' for "small" packets, else call page allocator. But
>>> in this case, we have mix of packets, allocated in two different ways thus
>>> during zerocopying to user(e.g. mapping pages to vma), such small packets will
>>> be handled in some stupid way: we need to allocate one page for user, copy data
>>> to it and then insert page to user's vma.
>>
>> It seems more difficult to me, but at the same time doable. I would go more on option 1, though.
>>
>>>
>>> P.S: of course this is experimental RFC, so what do You think guys?
>>
>> It seems cool :-)
>>
>> But I would like some feedback from the net guys to have some TCP-like things.
>
>Ok, i'll prepare v2 anyway: i need to analyze performance, may be more test coverage, rebase
>over latest kernel and work on packet allocation problem(from above).

If you have time, it would be cool to modify some performance tool that 
already supports vsock to take advantage of this feature and look better 
at performance.

We currently have both iperf3 (I have a modified fork for vsock [1]) and 
uperf (they have merged upstream the vsock support).

Perhaps the easiest to tweak is iperf-vsock, it should already have a 
--zerocopy option.

Thanks,
Stefano

[1] https://github.com/stefano-garzarella/iperf-vsock
Arseniy Krasnov May 20, 2022, 11:09 a.m. UTC | #4
Hello Stefano,

On 19.05.2022 10:42, Stefano Garzarella wrote:
> On Wed, May 18, 2022 at 11:04:30AM +0000, Arseniy Krasnov wrote:
>> Hello Stefano,
>>
>> On 17.05.2022 18:14, Stefano Garzarella wrote:
>>> Hi Arseniy,
>>>
>>> On Thu, May 12, 2022 at 05:04:11AM +0000, Arseniy Krasnov wrote:
>>>>                              INTRODUCTION
>>>>
>>>>     Hello, this is experimental implementation of virtio vsock zerocopy
>>>> receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
>>>> same idea: call 'mmap()' on socket's descriptor, then every 'getsockopt()' will
>>>> fill provided vma area with pages of virtio RX buffers. After received data was
>>>> processed by user, pages must be freed by 'madvise()'  call with MADV_DONTNEED
>>>> flag set(if user won't call 'madvise()', next 'getsockopt()' will fail).
>>>
>>> Sounds cool, but maybe we would need some socket/net experts here for review.
>>
>> Yes, that would be great
>>
>>>
>>> Could we do something similar for the sending path as well?
>>
>> Here are thoughts about zerocopy transmission:
>>
>> I tried to implement this feature in the following way: user creates
>> some page aligned buffer, then during tx packet allocation instead of
>> creating data buffer with 'kmalloc()', i tried to add user's buffer
>> to virtio queue. But found problem: as kernel virtio API uses virtual
>> addresses to add new buffers, in the deep of virtio subsystem
>> 'virt_to_phys()' is called to get physical address of buffer, so user's
>> virtual address won't be translated correctly to physical address(in
>> theory, i can perform page walk for such user's va, get physical address
>> and pass some "fake" virtual address to virtio API in order to make
>> 'virt_to_phys()' return valid physical address(but i think this is ugly).
> 
> And maybe we should also pin the pages to prevent them from being replaced.
> 
> I think we should do something similar to what we do in vhost-vdpa.
> Take a look at vhost_vdpa_pa_map() in drivers/vhost/vdpa.c

Hm, ok. I'll read about vdpa...

> 
>>
>>
>> If we are talking about 'mmap()' way, i think we can do the following:
>> user calls 'mmap()' on socket, kernel fills newly created mapping with
>> allocated pages(all pages have rw permissions). Now user can use pages
>> of this mapping(e.g. fill it with data). Finally, to start transmission,
>> user calls 'getsockopt()' or some 'ioctl()' and kernel processes data of
>> this mapping. Also as this call will return immediately(e.g. it is
>> asynchronous), some completion logic must be implemented. For example
>> use same way as MSG_ZEROCOPY uses - poll error queue of socket to get
>> message that pages could be reused, or don't allow user to work with
>> these pages: unmap it, perform transmission and finally free pages.
>> To start new transmission user need to call 'mmap()' again.
>>
>>                            OR
>>
>> I think there is another unusual way for zerocopy tx: let's use 'vmsplice()'
>> /'splice()'. In this approach to transmit something, user does the following
>> steps:
>> 1) Creates pipe.
>> 2) Calls 'vmsplice(SPLICE_F_GIFT)' on this pipe, insert data pages to it.
>>   SPLICE_F_GIFT allows user to forget about allocated pages - kernel will
>>   free it.
>> 3) Calls 'splice(SPLICE_F_MOVE)' from pipe to socket. SPLICE_F_MOVE will
>>   move pages from pipe to socket(e.g. in special socket callback we got
>>   set of pipe's pages as input argument and all pages will be inserted
>>   to virtio queue).
>>
>> But as SPLICE_F_MOVE support is disabled, it must be repaired first.
> 
> Splice seems interesting, but it would be nice If we do something similar to TCP. IIUC they use a flag for send(2):
> 
>     send(fd, buf, sizeof(buf), MSG_ZEROCOPY);
> 

Yes, but with this approach I think:
1) What is 'buf'? It can't be the user's address, since this buffer must be inserted into
   the tx queue. I.e. it must be allocated by the kernel and then returned to the user for
   tx purposes. In the TCP case, 'buf' is the user's address (of course page aligned)
   because the TCP logic uses sk_buff, which allows such memory to be used as a data buffer.
2) To wait until the tx process is done (i.e. the pages can be used again), this API
   (send + MSG_ZEROCOPY) uses the socket's error queue - it polls for events that the tx
   is finished. So the same mechanism would have to be implemented for virtio vsock (the
   TCP completion pattern is sketched below for reference).
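
For reference, a rough sketch of the TCP MSG_ZEROCOPY usage/completion pattern
(following Documentation/networking/msg_zerocopy.rst; not vsock code, error
handling and cmsg parsing omitted):

	/* TCP-style MSG_ZEROCOPY: send, then reap the completion from the
	 * socket's error queue. vsock would need the same completion path.
	 */
	#include <linux/errqueue.h>
	#include <poll.h>
	#include <sys/socket.h>

	#ifndef SO_ZEROCOPY
	#define SO_ZEROCOPY	60
	#endif
	#ifndef MSG_ZEROCOPY
	#define MSG_ZEROCOPY	0x4000000
	#endif

	static void zerocopy_send_and_wait(int fd, const void *buf, size_t len)
	{
		int one = 1;
		char control[128];
		struct msghdr msg = { .msg_control = control,
				      .msg_controllen = sizeof(control) };
		struct pollfd pfd = { .fd = fd, .events = 0 };

		setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
		send(fd, buf, len, MSG_ZEROCOPY);

		/* completion: the error queue signals when pages are reusable */
		poll(&pfd, 1, -1);		/* POLLERR is set when it arrives */
		recvmsg(fd, &msg, MSG_ERRQUEUE);
		/* the cmsg carries a struct sock_extended_err with
		 * ee_origin == SO_EE_ORIGIN_ZEROCOPY and the completed
		 * notification range in ee_info..ee_data (parsing omitted)
		 */
	}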

>  
>>
>>>
>>>>
>>>>                                 DETAILS
>>>>
>>>>     Here is how mapping with mapped pages looks exactly: first page mapping
>>>> contains array of trimmed virtio vsock packet headers (in contains only length
>>>> of data on the corresponding page and 'flags' field):
>>>>
>>>>     struct virtio_vsock_usr_hdr {
>>>>         uint32_t length;
>>>>         uint32_t flags;
>>>>     };
>>>>
>>>> Field  'length' allows user to know exact size of payload within each sequence
>>>> of pages and 'flags' allows user to handle SOCK_SEQPACKET flags(such as message
>>>> bounds or record bounds). All other pages are data pages from RX queue.
>>>>
>>>>             Page 0      Page 1      Page N
>>>>
>>>>     [ hdr1 .. hdrN ][ data ] .. [ data ]
>>>>           |        |       ^           ^
>>>>           |        |       |           |
>>>>           |        *-------------------*
>>>>           |                |
>>>>           |                |
>>>>           *----------------*
>>>>
>>>>     Of course, single header could represent array of pages (when packet's
>>>> buffer is bigger than one page).So here is example of detailed mapping layout
>>>> for some set of packages. Lets consider that we have the following sequence  of
>>>> packages: 56 bytes, 4096 bytes and 8200 bytes. All pages: 0,1,2,3,4 and 5 will
>>>> be inserted to user's vma(vma is large enough).
>>>>
>>>>     Page 0: [[ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]
>>>>     Page 1: [ 56 ]
>>>>     Page 2: [ 4096 ]
>>>>     Page 3: [ 4096 ]
>>>>     Page 4: [ 4096 ]
>>>>     Page 5: [ 8 ]
>>>>
>>>>     Page 0 contains only array of headers:
>>>>     'hdr0' has 56 in length field.
>>>>     'hdr1' has 4096 in length field.
>>>>     'hdr2' has 8200 in length field.
>>>>     'hdr3' has 0 in length field(this is end of data marker).
>>>>
>>>>     Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
>>>>     Page 2 corresponds to 'hdr1' and filled with data.
>>>>     Page 3 corresponds to 'hdr2' and filled with data.
>>>>     Page 4 corresponds to 'hdr2' and filled with data.
>>>>     Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
>>>>
>>>>     This patchset also changes packets allocation way: today implementation
>>>> uses only 'kmalloc()' to create data buffer. Problem happens when we try to map
>>>> such buffers to user's vma - kernel forbids to map slab pages to user's vma(as
>>>> pages of "not large" 'kmalloc()' allocations are marked with PageSlab flag and
>>>> "not large" could be bigger than one page). So to avoid this, data buffers now
>>>> allocated using 'alloc_pages()' call.
>>>>
>>>>                                   TESTS
>>>>
>>>>     This patchset updates 'vsock_test' utility: two tests for new feature
>>>> were added. First test covers invalid cases. Second checks valid transmission
>>>> case.
>>>
>>> Thanks for adding the test!
>>>
>>>>
>>>>                                BENCHMARKING
>>>>
>>>>     For benchmakring I've added small utility 'rx_zerocopy'. It works in
>>>> client/server mode. When client connects to server, server starts sending exact
>>>> amount of data to client(amount is set as input argument).Client reads data and
>>>> waits for next portion of it. Client works in two modes: copy and zero-copy. In
>>>> copy mode client uses 'read()' call while in zerocopy mode sequence of 'mmap()'
>>>> /'getsockopt()'/'madvise()' are used. Smaller amount of time for transmission
>>>> is better. For server, we can set size of tx buffer and for client we can set
>>>> size of rx buffer or rx mapping size(in zerocopy mode). Usage of this utility
>>>> is quiet simple:
>>>>
>>>> For client mode:
>>>>
>>>> ./rx_zerocopy --mode client [--zerocopy] [--rx]
>>>>
>>>> For server mode:
>>>>
>>>> ./rx_zerocopy --mode server [--mb] [--tx]
>>>>
>>>> [--mb] sets number of megabytes to transfer.
>>>> [--rx] sets size of receive buffer/mapping in pages.
>>>> [--tx] sets size of transmit buffer in pages.
>>>>
>>>> I checked for transmission of 4000mb of data. Here are some results:
>>>>
>>>>                           size of rx/tx buffers in pages
>>>>               *---------------------------------------------------*
>>>>               |    8   |    32    |    64   |   256    |   512    |
>>>> *--------------*--------*----------*---------*----------*----------*
>>>> |   zerocopy   |   24   |   10.6   |  12.2   |   23.6   |    21    | secs to
>>>> *--------------*---------------------------------------------------- process
>>>> | non-zerocopy |   13   |   16.4   |  24.7   |   27.2   |   23.9   | 4000 mb
>>>> *--------------*----------------------------------------------------
>>>>
>>>> I think, that results are not so impressive, but at least it is not worse than
>>>> copy mode and there is no need to allocate memory for processing date.
>>>
>>> Why is it twice as slow in the first column?
>>
>> May be this is because memory copying for small buffers is very fast... i'll
>> analyze it deeply.
> 
> Maybe I misunderstood, by small buffers here what do you mean?
> 
> I thought 8 was the number of pages, so 32KB buffers.

Yes, 8 is the size in pages. Anyway, I need to check it more deeply.

> 
>>
>>>
>>>>
>>>>                                 PROBLEMS
>>>>
>>>>     Updated packet's allocation logic creates some problem: when host gets
>>>> data from guest(in vhost-vsock), it allocates at least one page for each packet
>>>> (even if packet has 1 byte payload). I think this could be resolved in several
>>>> ways:
>>>
>>> Can we somehow copy the incoming packets into the payload of the already queued packet?
>>
>> May be, i'll analyze it...
> 
> Thanks!
> 
>>
>>>
>>> This reminds me that we have yet to fix a similar problem with kmalloc() as well...
>>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=215329
>>
>> Yes, but it is a little bit different case: IIUC this bug happens because 'kmalloc()'
>> uses memory chunks of some preallocated size.
> 
> Yep, I mean I think the problem is that when we receive 1-byte packets, we have all the header queued up that we don't consider in the credit mechanism. A little bit different.
> 
>>
>>>
>>>>     1) Make zerocopy rx mode disabled by default, so if user didn't enable
>>>> it, current 'kmalloc()' way will be used.
>>>
>>> That sounds reasonable to me, I guess also TCP needs a setsockopt() call to enable the feature, right?
>>
>> Yes, You're right. I think i'll add this to v2.
>>
>>>
>>>>     2) Use 'kmalloc()' for "small" packets, else call page allocator. But
>>>> in this case, we have mix of packets, allocated in two different ways thus
>>>> during zerocopying to user(e.g. mapping pages to vma), such small packets will
>>>> be handled in some stupid way: we need to allocate one page for user, copy data
>>>> to it and then insert page to user's vma.
>>>
>>> It seems more difficult to me, but at the same time doable. I would go more on option 1, though.
>>>
>>>>
>>>> P.S: of course this is experimental RFC, so what do You think guys?
>>>
>>> It seems cool :-)
>>>
>>> But I would like some feedback from the net guys to have some TCP-like things.
>>
>> Ok, i'll prepare v2 anyway: i need to analyze performance, may be more test coverage, rebase
>> over latest kernel and work on packet allocation problem(from above).
> 
> If you have time, it would be cool to modify some performance tool that already supports vsock to take advantage of this feature and look better at performance.
> 
> We currently have both iperf3 (I have a modified fork for vsock [1]) and uperf (they have merged upstream the vsock support).
> 
> Perhaps the easiest to tweak is iperf-vsock, it should already have a --zerocopy option.

Ack

> 
> Thanks,
> Stefano
> 
> [1] https://github.com/stefano-garzarella/iperf-vsock
>
Stefano Garzarella May 24, 2022, 7:32 a.m. UTC | #5
On Fri, May 20, 2022 at 11:09:11AM +0000, Arseniy Krasnov wrote:
>Hello Stefano,
>
>On 19.05.2022 10:42, Stefano Garzarella wrote:
>> On Wed, May 18, 2022 at 11:04:30AM +0000, Arseniy Krasnov wrote:
>>> Hello Stefano,
>>>
>>> On 17.05.2022 18:14, Stefano Garzarella wrote:
>>>> Hi Arseniy,
>>>>
>>>> On Thu, May 12, 2022 at 05:04:11AM +0000, Arseniy Krasnov wrote:
>>>>>                              INTRODUCTION
>>>>>
>>>>>     Hello, this is experimental implementation of virtio vsock zerocopy
>>>>> receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
>>>>> same idea: call 'mmap()' on socket's descriptor, then every 'getsockopt()' will
>>>>> fill provided vma area with pages of virtio RX buffers. After received data was
>>>>> processed by user, pages must be freed by 'madvise()'  call with MADV_DONTNEED
>>>>> flag set(if user won't call 'madvise()', next 'getsockopt()' will fail).
>>>>
>>>> Sounds cool, but maybe we would need some socket/net experts here for review.
>>>
>>> Yes, that would be great
>>>
>>>>
>>>> Could we do something similar for the sending path as well?
>>>
>>> Here are thoughts about zerocopy transmission:
>>>
>>> I tried to implement this feature in the following way: user creates
>>> some page aligned buffer, then during tx packet allocation instead of
>>> creating data buffer with 'kmalloc()', i tried to add user's buffer
>>> to virtio queue. But found problem: as kernel virtio API uses virtual
>>> addresses to add new buffers, in the deep of virtio subsystem
>>> 'virt_to_phys()' is called to get physical address of buffer, so user's
>>> virtual address won't be translated correctly to physical address(in
>>> theory, i can perform page walk for such user's va, get physical address
>>> and pass some "fake" virtual address to virtio API in order to make
>>> 'virt_to_phys()' return valid physical address(but i think this is ugly).
>>
>> And maybe we should also pin the pages to prevent them from being replaced.
>>
>> I think we should do something similar to what we do in vhost-vdpa.
>> Take a look at vhost_vdpa_pa_map() in drivers/vhost/vdpa.c
>
>Hm, ok. I'll read about vdpa...
>
>>
>>>
>>>
>>> If we are talking about 'mmap()' way, i think we can do the following:
>>> user calls 'mmap()' on socket, kernel fills newly created mapping with
>>> allocated pages(all pages have rw permissions). Now user can use pages
>>> of this mapping(e.g. fill it with data). Finally, to start transmission,
>>> user calls 'getsockopt()' or some 'ioctl()' and kernel processes data of
>>> this mapping. Also as this call will return immediately(e.g. it is
>>> asynchronous), some completion logic must be implemented. For example
>>> use same way as MSG_ZEROCOPY uses - poll error queue of socket to get
>>> message that pages could be reused, or don't allow user to work with
>>> these pages: unmap it, perform transmission and finally free pages.
>>> To start new transmission user need to call 'mmap()' again.
>>>
>>>                            OR
>>>
>>> I think there is another unusual way for zerocopy tx: let's use 'vmsplice()'
>>> /'splice()'. In this approach to transmit something, user does the following
>>> steps:
>>> 1) Creates pipe.
>>> 2) Calls 'vmsplice(SPLICE_F_GIFT)' on this pipe, insert data pages to it.
>>>   SPLICE_F_GIFT allows user to forget about allocated pages - kernel will
>>>   free it.
>>> 3) Calls 'splice(SPLICE_F_MOVE)' from pipe to socket. SPLICE_F_MOVE will
>>>   move pages from pipe to socket(e.g. in special socket callback we got
>>>   set of pipe's pages as input argument and all pages will be inserted
>>>   to virtio queue).
>>>
>>> But as SPLICE_F_MOVE support is disabled, it must be repaired first.
>>
>> Splice seems interesting, but it would be nice If we do something similar to TCP. IIUC they use a flag for send(2):
>>
>>     send(fd, buf, sizeof(buf), MSG_ZEROCOPY);
>>
>
>Yes, but in this way i think:
>1) What is 'buf'? It can't be user's address, since this buffer must be inserted to tx queue.
>   E.g. it must be allocated by kernel and then returned to user for tx purposes. In TCP
>   case, 'buf' is user's address(of course page aligned) because TCP logic uses sk_buff which
>   allows to use such memory as data buffer.

IIUC we can pin that buffer like we do in vhost-vdpa, and use it in the 
VQ.

>2) To wait tx process is done(e.g. pages can be used again), such 
>API(send + MSG_ZEROCOPY),
>   uses socket's error queue - poll events that tx is finished. So same 
>   way must be
>   implemented for virtio vsock.

Yeah, I think so.

>
>>  
>>>
>>>>
>>>>>
>>>>>                                 DETAILS
>>>>>
>>>>>     Here is how mapping with mapped pages looks exactly: first page mapping
>>>>> contains array of trimmed virtio vsock packet headers (in contains only length
>>>>> of data on the corresponding page and 'flags' field):
>>>>>
>>>>>     struct virtio_vsock_usr_hdr {
>>>>>         uint32_t length;
>>>>>         uint32_t flags;
>>>>>     };
>>>>>
>>>>> Field  'length' allows user to know exact size of payload within each sequence
>>>>> of pages and 'flags' allows user to handle SOCK_SEQPACKET flags(such as message
>>>>> bounds or record bounds). All other pages are data pages from RX queue.
>>>>>
>>>>>             Page 0      Page 1      Page N
>>>>>
>>>>>     [ hdr1 .. hdrN ][ data ] .. [ data ]
>>>>>           |        |       ^           ^
>>>>>           |        |       |           |
>>>>>           |        *-------------------*
>>>>>           |                |
>>>>>           |                |
>>>>>           *----------------*
>>>>>
>>>>>     Of course, single header could represent array of pages (when packet's
>>>>> buffer is bigger than one page).So here is example of detailed mapping layout
>>>>> for some set of packages. Lets consider that we have the following sequence  of
>>>>> packages: 56 bytes, 4096 bytes and 8200 bytes. All pages: 0,1,2,3,4 and 5 will
>>>>> be inserted to user's vma(vma is large enough).
>>>>>
>>>>>     Page 0: [[ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]
>>>>>     Page 1: [ 56 ]
>>>>>     Page 2: [ 4096 ]
>>>>>     Page 3: [ 4096 ]
>>>>>     Page 4: [ 4096 ]
>>>>>     Page 5: [ 8 ]
>>>>>
>>>>>     Page 0 contains only array of headers:
>>>>>     'hdr0' has 56 in length field.
>>>>>     'hdr1' has 4096 in length field.
>>>>>     'hdr2' has 8200 in length field.
>>>>>     'hdr3' has 0 in length field(this is end of data marker).
>>>>>
>>>>>     Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
>>>>>     Page 2 corresponds to 'hdr1' and filled with data.
>>>>>     Page 3 corresponds to 'hdr2' and filled with data.
>>>>>     Page 4 corresponds to 'hdr2' and filled with data.
>>>>>     Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
>>>>>
>>>>>     This patchset also changes packets allocation way: today implementation
>>>>> uses only 'kmalloc()' to create data buffer. Problem happens when we try to map
>>>>> such buffers to user's vma - kernel forbids to map slab pages to user's vma(as
>>>>> pages of "not large" 'kmalloc()' allocations are marked with PageSlab flag and
>>>>> "not large" could be bigger than one page). So to avoid this, data buffers now
>>>>> allocated using 'alloc_pages()' call.
>>>>>
>>>>>                                   TESTS
>>>>>
>>>>>     This patchset updates 'vsock_test' utility: two tests for new feature
>>>>> were added. First test covers invalid cases. Second checks valid transmission
>>>>> case.
>>>>
>>>> Thanks for adding the test!
>>>>
>>>>>
>>>>>                                BENCHMARKING
>>>>>
>>>>>     For benchmakring I've added small utility 'rx_zerocopy'. It works in
>>>>> client/server mode. When client connects to server, server starts sending exact
>>>>> amount of data to client(amount is set as input argument).Client reads data and
>>>>> waits for next portion of it. Client works in two modes: copy and zero-copy. In
>>>>> copy mode client uses 'read()' call while in zerocopy mode sequence of 'mmap()'
>>>>> /'getsockopt()'/'madvise()' are used. Smaller amount of time for transmission
>>>>> is better. For server, we can set size of tx buffer and for client we can set
>>>>> size of rx buffer or rx mapping size(in zerocopy mode). Usage of this utility
>>>>> is quiet simple:
>>>>>
>>>>> For client mode:
>>>>>
>>>>> ./rx_zerocopy --mode client [--zerocopy] [--rx]
>>>>>
>>>>> For server mode:
>>>>>
>>>>> ./rx_zerocopy --mode server [--mb] [--tx]
>>>>>
>>>>> [--mb] sets number of megabytes to transfer.
>>>>> [--rx] sets size of receive buffer/mapping in pages.
>>>>> [--tx] sets size of transmit buffer in pages.
>>>>>
>>>>> I checked for transmission of 4000mb of data. Here are some results:
>>>>>
>>>>>                           size of rx/tx buffers in pages
>>>>>               *---------------------------------------------------*
>>>>>               |    8   |    32    |    64   |   256    |   512    |
>>>>> *--------------*--------*----------*---------*----------*----------*
>>>>> |   zerocopy   |   24   |   10.6   |  12.2   |   23.6   |    21    | secs to
>>>>> *--------------*---------------------------------------------------- process
>>>>> | non-zerocopy |   13   |   16.4   |  24.7   |   27.2   |   23.9   | 4000 mb
>>>>> *--------------*----------------------------------------------------
>>>>>
>>>>> I think, that results are not so impressive, but at least it is not worse than
>>>>> copy mode and there is no need to allocate memory for processing date.
>>>>
>>>> Why is it twice as slow in the first column?
>>>
>>> May be this is because memory copying for small buffers is very fast... i'll
>>> analyze it deeply.
>>
>> Maybe I misunderstood, by small buffers here what do you mean?
>>
>> I thought 8 was the number of pages, so 32KB buffers.
>
>Yes, 8 is size in pages. Anyway, i need to check it more deeply.

Okay, thanks!

Stefano
Arseniy Krasnov June 7, 2022, 10:26 a.m. UTC | #6
On 24.05.2022 10:32, Stefano Garzarella wrote:
> On Fri, May 20, 2022 at 11:09:11AM +0000, Arseniy Krasnov wrote:
>> Hello Stefano,
>>
>> On 19.05.2022 10:42, Stefano Garzarella wrote:
>>> On Wed, May 18, 2022 at 11:04:30AM +0000, Arseniy Krasnov wrote:
>>>> Hello Stefano,
>>>>
>>>> On 17.05.2022 18:14, Stefano Garzarella wrote:
>>>>> Hi Arseniy,
>>>>>
>>>>> On Thu, May 12, 2022 at 05:04:11AM +0000, Arseniy Krasnov wrote:
>>>>>>                              INTRODUCTION
>>>>>>
>>>>>>     Hello, this is an experimental implementation of virtio vsock zerocopy
>>>>>> receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
>>>>>> the same idea: call 'mmap()' on the socket's descriptor, then every 'getsockopt()'
>>>>>> will fill the provided vma area with pages of virtio RX buffers. After the
>>>>>> received data has been processed by the user, the pages must be freed by an
>>>>>> 'madvise()' call with the MADV_DONTNEED flag set (if the user doesn't call
>>>>>> 'madvise()', the next 'getsockopt()' will fail).
>>>>>
>>>>> Sounds cool, but maybe we would need some socket/net experts here for review.
>>>>
>>>> Yes, that would be great
>>>>
>>>>>
>>>>> Could we do something similar for the sending path as well?
>>>>
>>>> Here are some thoughts about zerocopy transmission:
>>>>
>>>> I tried to implement this feature in the following way: the user creates a
>>>> page-aligned buffer, then during tx packet allocation, instead of creating the
>>>> data buffer with 'kmalloc()', I tried to add the user's buffer to the virtio
>>>> queue. But I found a problem: since the kernel virtio API uses virtual addresses
>>>> to add new buffers, deep inside the virtio subsystem 'virt_to_phys()' is called
>>>> to get the physical address of the buffer, so a user virtual address won't be
>>>> translated correctly to a physical address. In theory, I could perform a page
>>>> walk for such a user va, get the physical address, and pass some "fake" virtual
>>>> address to the virtio API so that 'virt_to_phys()' returns a valid physical
>>>> address, but I think this is ugly.
>>>
>>> And maybe we should also pin the pages to prevent them from being replaced.
>>>
>>> I think we should do something similar to what we do in vhost-vdpa.
>>> Take a look at vhost_vdpa_pa_map() in drivers/vhost/vdpa.c
>>
>> Hm, ok. I'll read about vdpa...
>>
>>>
>>>>
>>>>
>>>> If we are talking about the 'mmap()' way, I think we can do the following:
>>>> the user calls 'mmap()' on the socket, and the kernel fills the newly created
>>>> mapping with allocated pages (all pages have rw permissions). Now the user can
>>>> use the pages of this mapping (e.g. fill them with data). Finally, to start
>>>> transmission, the user calls 'getsockopt()' or some 'ioctl()' and the kernel
>>>> processes the data in this mapping. Also, as this call returns immediately
>>>> (i.e. it is asynchronous), some completion logic must be implemented. For
>>>> example, use the same approach as MSG_ZEROCOPY - poll the socket's error queue
>>>> for a message that the pages may be reused - or don't allow the user to work
>>>> with these pages at all: unmap them, perform the transmission and finally free
>>>> the pages. To start a new transmission the user needs to call 'mmap()' again.
>>>>
>>>>                            OR
>>>>
>>>> I think there is another, unusual way for zerocopy tx: let's use 'vmsplice()'
>>>> /'splice()'. In this approach, to transmit something the user performs the
>>>> following steps (see the sketch after this list):
>>>> 1) Creates a pipe.
>>>> 2) Calls 'vmsplice(SPLICE_F_GIFT)' on this pipe to insert the data pages into
>>>>   it. SPLICE_F_GIFT allows the user to forget about the allocated pages - the
>>>>   kernel will free them.
>>>> 3) Calls 'splice(SPLICE_F_MOVE)' from the pipe to the socket. SPLICE_F_MOVE
>>>>   will move the pages from the pipe to the socket (i.e. in a special socket
>>>>   callback we get the set of the pipe's pages as an input argument, and all
>>>>   pages will be inserted into the virtio queue).
>>>>
>>>> But as SPLICE_F_MOVE support is currently disabled, it must be repaired first.
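
Purely as an illustration of steps 1)-3) above (user side only, with
error/cleanup handling omitted; note again that the socket side of
SPLICE_F_MOVE does not exist today, so this cannot actually run against a
vsock socket yet):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/uio.h>
    #include <unistd.h>

    static int tx_via_splice(int sock_fd, void *page_aligned_buf, size_t len)
    {
            int pipefd[2];
            struct iovec iov = {
                    .iov_base = page_aligned_buf,
                    .iov_len  = len,
            };

            if (pipe(pipefd) < 0)
                    return -1;

            /* Gift the pages to the kernel: the caller must not reuse them. */
            if (vmsplice(pipefd[1], &iov, 1, SPLICE_F_GIFT) < 0)
                    return -1;

            /* Move the gifted pages from the pipe into the socket. */
            if (splice(pipefd[0], NULL, sock_fd, NULL, len, SPLICE_F_MOVE) < 0)
                    return -1;

            close(pipefd[0]);
            close(pipefd[1]);
            return 0;
    }
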
>>>
>>> Splice seems interesting, but it would be nice if we did something similar to TCP. IIUC they use a flag for send(2):
>>>
>>>     send(fd, buf, sizeof(buf), MSG_ZEROCOPY);
>>>
>>
>> Yes, but with this approach I think:
>> 1) What is 'buf'? It can't be a user address, since this buffer must be inserted into the tx queue,
>>   i.e. it must be allocated by the kernel and then returned to the user for tx purposes. In the TCP
>>   case, 'buf' is a user address (page aligned, of course) because the TCP logic uses sk_buff, which
>>   allows such memory to be used as a data buffer.
> 
> IIUC we can pin that buffer like we do in vhost-vdpa, and use it in the VQ.

Now I see how to use 'pin_user_pages()'. I'm going to implement zerocopy tx with the same
API as TCP MSG_ZEROCOPY.
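
Roughly in this direction (an illustrative sketch only, with made-up names and
simplified handling of lengths/offsets - not the actual series):

    #include <linux/kernel.h>
    #include <linux/mm.h>
    #include <linux/scatterlist.h>

    /* Pin the user's pages and build scatterlist entries from the struct
     * page pointers, so the virtqueue never sees a user virtual address.
     * 'uaddr' is assumed page aligned; the partial last page is ignored. */
    static int vsock_pin_tx_pages(unsigned long uaddr, size_t len,
                                  struct page **pages,
                                  struct scatterlist *sg)
    {
            unsigned long npages = DIV_ROUND_UP(len, PAGE_SIZE);
            int pinned, i;

            /* The device only reads TX data, so no FOLL_WRITE is needed. */
            pinned = pin_user_pages_fast(uaddr, npages, 0, pages);
            if (pinned < 0)
                    return pinned;

            sg_init_table(sg, pinned);
            for (i = 0; i < pinned; i++)
                    sg_set_page(&sg[i], pages[i], PAGE_SIZE, 0);

            /* The sg array can now go to virtqueue_add_outbuf(); on TX
             * completion the pages are released with unpin_user_pages(). */
            return pinned;
    }
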

> 
>> 2) To wait until the tx process is done (i.e. the pages can be used again), this API
>>   (send + MSG_ZEROCOPY) uses the socket's error queue - the user polls for events saying
>>   that tx has finished. So the same mechanism must be implemented for virtio vsock.
> 
> Yeah, I think so.
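
For reference, this is how TCP MSG_ZEROCOPY completions are consumed today
(see Documentation/networking/msg_zerocopy.rst); a vsock variant would
presumably mirror it, though the exact cmsg level/type for vsock is not
defined yet, so only the notification origin is checked in this sketch:

    #include <errno.h>
    #include <sys/socket.h>
    #include <linux/errqueue.h>

    /* Poll the error queue once; returns 0 when a zerocopy completion
     * notification was found, meaning the TX pages may be reused. */
    static int wait_tx_done(int fd)
    {
            char control[128];
            struct msghdr msg = {
                    .msg_control    = control,
                    .msg_controllen = sizeof(control),
            };
            struct cmsghdr *cm;

            if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
                    return -errno;

            for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                    struct sock_extended_err *serr = (void *)CMSG_DATA(cm);

                    if (serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY)
                            return 0;
            }

            return -EAGAIN;
    }
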
> 
>>
>>>  
>>>>
>>>>>
>>>>>>
>>>>>>                                 DETAILS
>>>>>>
>>>>>>     Here is how the mapping with mapped pages looks exactly: the first page of the
>>>>>> mapping contains an array of trimmed virtio vsock packet headers (each contains only
>>>>>> the length of the data on the corresponding page and a 'flags' field):
>>>>>>
>>>>>>     struct virtio_vsock_usr_hdr {
>>>>>>         uint32_t length;
>>>>>>         uint32_t flags;
>>>>>>     };
>>>>>>
>>>>>> The 'length' field lets the user know the exact size of the payload within each
>>>>>> sequence of pages, and 'flags' lets the user handle SOCK_SEQPACKET flags (such as
>>>>>> message bounds or record bounds). All other pages are data pages from the RX queue.
>>>>>>
>>>>>>             Page 0      Page 1      Page N
>>>>>>
>>>>>>     [ hdr1 .. hdrN ][ data ] .. [ data ]
>>>>>>           |        |       ^           ^
>>>>>>           |        |       |           |
>>>>>>           |        *-------------------*
>>>>>>           |                |
>>>>>>           |                |
>>>>>>           *----------------*
>>>>>>
>>>>>>     Of course, a single header could represent an array of pages (when the packet's
>>>>>> buffer is bigger than one page). So here is an example of a detailed mapping layout
>>>>>> for some set of packets. Let's consider that we have the following sequence of
>>>>>> packets: 56 bytes, 4096 bytes and 8200 bytes. All pages 0, 1, 2, 3, 4 and 5 will
>>>>>> be inserted into the user's vma (the vma is large enough).
>>>>>>
>>>>>>     Page 0: [[ hdr0 ][ hdr1 ][ hdr2 ][ hdr3 ] ... ]
>>>>>>     Page 1: [ 56 ]
>>>>>>     Page 2: [ 4096 ]
>>>>>>     Page 3: [ 4096 ]
>>>>>>     Page 4: [ 4096 ]
>>>>>>     Page 5: [ 8 ]
>>>>>>
>>>>>>     Page 0 contains only the array of headers:
>>>>>>     'hdr0' has 56 in its length field.
>>>>>>     'hdr1' has 4096 in its length field.
>>>>>>     'hdr2' has 8200 in its length field.
>>>>>>     'hdr3' has 0 in its length field (this is the end-of-data marker).
>>>>>>
>>>>>>     Page 1 corresponds to 'hdr0' and holds only 56 bytes of data.
>>>>>>     Page 2 corresponds to 'hdr1' and is filled with data.
>>>>>>     Page 3 corresponds to 'hdr2' and is filled with data.
>>>>>>     Page 4 corresponds to 'hdr2' and is filled with data.
>>>>>>     Page 5 corresponds to 'hdr2' and holds only 8 bytes of data.
>>>>>>
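
To make the layout above concrete, here is a minimal, hypothetical user-space
consumer of such a mapping (the getsockopt() level/optname values and the
read-only mapping protection are placeholders/assumptions - the patchset
defines the real interface - and error handling is omitted):

    #include <stdint.h>
    #include <sys/mman.h>
    #include <sys/socket.h>
    #include <unistd.h>

    struct virtio_vsock_usr_hdr {
            uint32_t length;
            uint32_t flags;
    };

    #define ZC_SOCKOPT_LEVEL 0 /* placeholder */
    #define ZC_SOCKOPT_NAME  0 /* placeholder */

    static void rx_zerocopy_once(int fd, size_t map_pages)
    {
            size_t page = (size_t)sysconf(_SC_PAGESIZE);
            size_t map_len = map_pages * page;
            /* Map the receive window over the socket's descriptor. */
            void *map = mmap(NULL, map_len, PROT_READ, MAP_SHARED, fd, 0);
            socklen_t optlen = map_len;
            struct virtio_vsock_usr_hdr *hdr = map;
            char *data = (char *)map + page; /* page 0 holds the headers */

            /* Ask the kernel to fill the mapping with RX buffer pages. */
            getsockopt(fd, ZC_SOCKOPT_LEVEL, ZC_SOCKOPT_NAME, map, &optlen);

            for (; hdr->length != 0; hdr++) { /* length == 0 ends the array */
                    /* process hdr->length bytes at 'data', honoring hdr->flags */
                    data += ((hdr->length + page - 1) / page) * page;
            }

            /* Return the pages, otherwise the next getsockopt() will fail. */
            madvise(map, map_len, MADV_DONTNEED);
            munmap(map, map_len);
    }
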
>>>>>>     This patchset also changes the packet allocation scheme: the current
>>>>>> implementation uses only 'kmalloc()' to create data buffers. A problem appears
>>>>>> when we try to map such buffers into a user's vma - the kernel forbids mapping
>>>>>> slab pages into a user's vma (pages of "not large" 'kmalloc()' allocations are
>>>>>> marked with the PageSlab flag, and "not large" can still span more than one
>>>>>> page). To avoid this, data buffers are now allocated with 'alloc_pages()'.
>>>>>>
>>>>>>                                   TESTS
>>>>>>
>>>>>>     This patchset updates the 'vsock_test' utility: two tests for the new
>>>>>> feature were added. The first test covers invalid cases, the second checks a
>>>>>> valid transmission case.
>>>>>
>>>>> Thanks for adding the test!
>>>>>
>>>>>>
>>>>>>                                BENCHMARKING
>>>>>>
>>>>>>     For benchmarking I've added a small utility, 'rx_zerocopy'. It works in
>>>>>> client/server mode. When the client connects to the server, the server starts
>>>>>> sending an exact amount of data to the client (the amount is set as an input
>>>>>> argument). The client reads the data and waits for the next portion. The client
>>>>>> works in two modes: copy and zero-copy. In copy mode the client uses the 'read()'
>>>>>> call, while in zerocopy mode the 'mmap()'/'getsockopt()'/'madvise()' sequence is
>>>>>> used. A smaller transmission time is better. For the server we can set the size
>>>>>> of the tx buffer, and for the client we can set the size of the rx buffer or the
>>>>>> rx mapping size (in zerocopy mode). Usage of this utility is quite simple:
>>>>>>
>>>>>> For client mode:
>>>>>>
>>>>>> ./rx_zerocopy --mode client [--zerocopy] [--rx]
>>>>>>
>>>>>> For server mode:
>>>>>>
>>>>>> ./rx_zerocopy --mode server [--mb] [--tx]
>>>>>>
>>>>>> [--mb] sets number of megabytes to transfer.
>>>>>> [--rx] sets size of receive buffer/mapping in pages.
>>>>>> [--tx] sets size of transmit buffer in pages.
>>>>>>
>>>>>> I checked the transmission of 4000 MB of data. Here are some results:
>>>>>>
>>>>>>                           size of rx/tx buffers in pages
>>>>>>               *--------*----------*---------*----------*----------*
>>>>>>               |    8   |    32    |    64   |   256    |   512    |
>>>>>> *--------------*--------*----------*---------*----------*----------*
>>>>>> |   zerocopy   |   24   |   10.6   |  12.2   |   23.6   |    21    | secs to
>>>>>> *--------------*--------*----------*---------*----------*----------* process
>>>>>> | non-zerocopy |   13   |   16.4   |  24.7   |   27.2   |   23.9   | 4000 MB
>>>>>> *--------------*--------*----------*---------*----------*----------*
>>>>>>
>>>>>> I think the results are not that impressive, but at least zerocopy is not worse
>>>>>> than copy mode, and there is no need to allocate memory for processing the data.
>>>>>
>>>>> Why is it twice as slow in the first column?
>>>>
>>>> Maybe this is because memory copying for small buffers is very fast... I'll
>>>> analyze it more deeply.
>>>
>>> Maybe I misunderstood, by small buffers here what do you mean?
>>>
>>> I thought 8 was the number of pages, so 32KB buffers.
>>
>> Yes, 8 is the size in pages. Anyway, I need to check it more deeply.
> 
> Okay, thanks!
> 
> Stefano
>